Hi Vitalli,

DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
this isDateCorrect flag have for each possiblity, especially for
META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
same things, or different?

Thanks.

Jinfeng



On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers <[email protected]> wrote:
> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the 
> writer, files are labeled as having that new version. No fuzziness during a 
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should be 
> keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather 
> surprising if they didn’t.
>
> - Paul
>
>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <[email protected]> 
>> wrote:
>>
>> I agree that it would be good if the approach of parquet date correctness
>> detection will be upgraded. So I created the jira for it DRILL-4980
>> <https://issues.apache.org/jira/browse/DRILL-4980>.
>>
>> But now we have two ideas:
>> 1. To add checking of the drill version additionally, so later we can
>> delete isDateCorrect label from parquet metadata.
>> 2. To add parquet writer version to the parquet metadata and check this
>> value instead of isDateCorrect and drillVersion.
>>
>> So which way, we should prefer now?
>>
>> Kind regards
>> Vitalii
>>
>> 2016-10-27 23:54 GMT+00:00 Paul Rogers <[email protected]>:
>>
>>> FWIW: back on the magic flag issue…
>>>
>>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>>> grained for our needs.
>>>
>>> A typical solution is include the version of the Parquet writer in
>>> addition to that of Drill. Each time we change something in the writer,
>>> increment the version number. If we number changes, we can easily handle
>>> two changes in the same Drill release, or differentiate between the “early
>>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>>
>>> Since we have no version now, start it at some arbitrary point (2?).
>>>
>>> Now, if the Parquet file has a Drill Writer version in the header, and
>>> that version is 2 or greater, the date is in the “correct” format. Anything
>>> written by Drill before writer version 2, the date is wrong. The “check the
>>> data to see if it is sane” approach is needed only for files were we can’t
>>> tell if an older Drill wrote it.
>>>
>>> Do other tools label the data? Does Hive say that it wrote the file? If
>>> so, we don’t need to do the sanity check if we can tell the data comes from
>>> Hive (or Impala, or anything other than old Drill.)
>>>
>>> - Paul
>>>
>>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <[email protected]> wrote:
>>>>
>>>> Vitalii -- are you still planning to open a ticket and pull request for
>>> the
>>>> fix you've noted below?
>>>>
>>>> -- Zelaine
>>>>
>>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>>> [email protected]>
>>>> wrote:
>>>>
>>>>> @Paul Rogers
>>>>> It may be the undefined case when the file is generated with
>>> drill.version
>>>>> = 1.9-SNAPSHOT.
>>>>> It is more easy to determine corrupted date with this flag and there is
>>> no
>>>>> need to wait the end of release to merge these changes.
>>>>>
>>>>> @Jinfeng NI
>>>>> It looks like you are right.
>>>>> With consistent mode (isDateCorrect = true) all tests are passed. So I
>>> am
>>>>> going to open a jira ticket for it with next changes
>>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
>>>>> 3410cac5d3
>>>>> Thanks.
>>>>>
>>>>> Kind regards
>>>>> Vitalii
>>>>>
>>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <[email protected]>:
>>>>>
>>>>>> I'm not sure if I fully understand your answers. The bottom line is
>>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>>>>> instance constructed in Drill should have identical value for
>>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet
>>>>>> metadata cache, or whether there is partition pruning or not. However,
>>>>>> the code shows that this flag is not in consistent mode across
>>>>>> different cases.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>>>>> <[email protected]> wrote:
>>>>>>> Hi Jinfeng,
>>>>>>>
>>>>>>> 1.If the parquet files are generated with Drill after Drill-4203 these
>>>>>>> files have "isDateCorrect = true" property.
>>>>>>> Drill serializes this property from metadata now. When we set this
>>>>>> property
>>>>>>> in the first constructor we will hide the value from metadata.
>>>>>>> IsDateCorrect will be false only if this value equals to the false (no
>>>>>> case
>>>>>>> for it now) or absent in parquet metadata footer.
>>>>>>>
>>>>>>>
>>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata property
>>>>> when
>>>>>>> the user disable dates correction.
>>>>>>> If you have some use case it would be great if you provide it.
>>>>>>>
>>>>>>> 3. Maybe you are right regarding to when Parquet metadata is cloned.
>>>>>>> Here I added the property in the same manner as Jason's new property
>>>>>>> "drillVersion. So need it a separate unit test?
>>>>>>>
>>>>>>>
>>>>>>> Kind regards
>>>>>>> Vitalii
>>>>>>>
>>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <[email protected]>:
>>>>>>>
>>>>>>>> Forgot to copy the link to the code.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>>> Metadata.java#L950-L955
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <[email protected]> wrote:
>>>>>>>>> @Jason, @Vitalli,
>>>>>>>>>
>>>>>>>>> Any thoughts on this question, since both you worked on fix of
>>>>>>>> DRILL-4203?
>>>>>>>>>
>>>>>>>>> Looking through the code, there is a third case [1], where this flag
>>>>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>>>>> pruning, etc).  That means, for the 2nd case where the flag is set
>>>>> to
>>>>>>>>> true, if there is pruning happening, the new parquet metadata will
>>>>> see
>>>>>>>>> the flag is flipped to false. This does not make sense to me.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <[email protected]> wrote:
>>>>>>>>>> Hello All,
>>>>>>>>>>
>>>>>>>>>> DRILL-4203 addressed the date field issue.  In the fix, it
>>>>> introduced
>>>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have
>>>>> some
>>>>>>>>>> difficulty in understanding the meaning of this field.
>>>>>>>>>>
>>>>>>>>>> According to [1], this field is set to false, when Drill gets
>>>>> parquet
>>>>>>>>>> metadata from parquet footer.  This field is  set to true in code
>>>>>> flow
>>>>>>>>>> of [2] and [3], when Drill gets parquet metadata from meta data
>>>>>> cache.
>>>>>>>>>>
>>>>>>>>>> Questions I have:
>>>>>>>>>> 1.  If the parquet files are generated with Drill after DRILL-4203,
>>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect =
>>>>> false)?
>>>>>>>>>> 2.  Why does this filed have nothing to do with "autoCorrection"
>>>>> flag
>>>>>>>>>> [4]?  If someone turns off autoCorrection, will it have impact on
>>>>>> this
>>>>>>>>>> "isDateCorrect" flag ?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for any input,
>>>>>>>>>>
>>>>>>>>>> Jinfeng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>> Metadata.java#L932
>>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>> Metadata.java#L936
>>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>> Metadata.java#L187
>>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>>> Metadata.java#L354-L355
>>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>

Reply via email to