Hi Vitalii,

DateCorruptionStatus has three possible values: META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES. What value will the isDateCorrect flag have for each possibility, especially for META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect the same thing, or different?
Thanks.

Jinfeng

On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers wrote:
> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the
> writer, files are labeled as having that new version. No fuzziness during a
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should be
> keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather
> surprising if they didn’t.
>
> - Paul
>
>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka wrote:
>>
>> I agree that it would be good to upgrade the approach to parquet date
>> correctness detection. So I created a jira for it, DRILL-4980
>> <https://issues.apache.org/jira/browse/DRILL-4980>.
>>
>> But now we have two ideas:
>> 1. To additionally check the drill version, so that later we can
>> delete the isDateCorrect label from parquet metadata.
>> 2. To add the parquet writer version to the parquet metadata and check this
>> value instead of isDateCorrect and drillVersion.
>>
>> So which way should we prefer now?
>>
>> Kind regards
>> Vitalii
>>
>> 2016-10-27 23:54 GMT+00:00 Paul Rogers:
>>
>>> FWIW: back on the magic flag issue…
>>>
>>> I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too
>>> coarse-grained for our needs.
>>>
>>> A typical solution is to include the version of the Parquet writer in
>>> addition to that of Drill. Each time we change something in the writer,
>>> increment the version number. If we number changes, we can easily handle
>>> two changes in the same Drill release, or differentiate between the “early
>>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>>
>>> Since we have no version now, start it at some arbitrary point (2?).
>>>
>>> Now, if the Parquet file has a Drill Writer version in the header, and
>>> that version is 2 or greater, the date is in the “correct” format. For
>>> anything written by Drill before writer version 2, the date is wrong. The
>>> “check the data to see if it is sane” approach is needed only for files
>>> where we can’t tell if an older Drill wrote it.
>>>
>>> Do other tools label the data? Does Hive say that it wrote the file? If
>>> so, we don’t need to do the sanity check if we can tell the data comes from
>>> Hive (or Impala, or anything other than old Drill.)
>>>
>>> - Paul
>>>
>>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong wrote:
>>>>
>>>> Vitalii -- are you still planning to open a ticket and pull request for
>>>> the fix you've noted below?
>>>>
>>>> -- Zelaine
>>>>
>>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka wrote:
>>>>
>>>>> @Paul Rogers
>>>>> It may be the undefined case when the file is generated with
>>>>> drill.version = 1.9-SNAPSHOT.
>>>>> It is easier to determine a corrupted date with this flag, and there is
>>>>> no need to wait for the end of the release to merge these changes.
>>>>>
>>>>> @Jinfeng Ni
>>>>> It looks like you are right.
>>>>> With consistent mode (isDateCorrect = true) all tests pass. So I am
>>>>> going to open a jira ticket for it with the following changes:
>>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
>>>>> Thanks.
>>>>>
>>>>> Kind regards
>>>>> Vitalii
>>>>>
>>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni:
>>>>>
>>>>>> I'm not sure if I fully understand your answers. The bottom line is
>>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>>>>> instance constructed in Drill should have an identical value for
>>>>>> "isDateCorrect", whether it comes from the parquet footer or the parquet
>>>>>> metadata cache, and whether there is partition pruning or not. However,
>>>>>> the code shows that this flag is not consistent across these
>>>>>> different cases.
>>>>>>
>>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka wrote:
>>>>>>> Hi Jinfeng,
>>>>>>>
>>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203, these
>>>>>>> files have the "isDateCorrect = true" property.
>>>>>>> Drill serializes this property from metadata now. If we set this
>>>>>>> property in the first constructor, we will hide the value from metadata.
>>>>>>> isDateCorrect will be false only if this value equals false (no
>>>>>>> case for it now) or is absent from the parquet metadata footer.
>>>>>>>
>>>>>>> 2. I'm not sure of the reason to change the isDateCorrect metadata
>>>>>>> property when the user disables dates correction.
>>>>>>> If you have some use case, it would be great if you could provide it.
>>>>>>>
>>>>>>> 3. Maybe you are right regarding when Parquet metadata is cloned.
>>>>>>> Here I added the property in the same manner as Jason's new property
>>>>>>> "drillVersion". So does it need a separate unit test?
>>>>>>>
>>>>>>> Kind regards
>>>>>>> Vitalii
>>>>>>>
>>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni:
>>>>>>>
>>>>>>>> Forgot to copy the link to the code.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955
>>>>>>>>
>>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni wrote:
>>>>>>>>> @Jason, @Vitalii,
>>>>>>>>>
>>>>>>>>> Any thoughts on this question, since you both worked on the fix for
>>>>>>>>> DRILL-4203?
>>>>>>>>>
>>>>>>>>> Looking through the code, there is a third case [1], where this flag
>>>>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>>>>> pruning, etc). That means, for the 2nd case where the flag is set to
>>>>>>>>> true, if there is pruning happening, the new parquet metadata will see
>>>>>>>>> the flag flipped to false. This does not make sense to me.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni wrote:
>>>>>>>>>> Hello All,
>>>>>>>>>>
>>>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it introduced
>>>>>>>>>> a new field in ParquetTableMetadata_v2: isDateCorrect. I have some
>>>>>>>>>> difficulty in understanding the meaning of this field.
>>>>>>>>>>
>>>>>>>>>> According to [1], this field is set to false when Drill gets parquet
>>>>>>>>>> metadata from the parquet footer. This field is set to true in the
>>>>>>>>>> code flow of [2] and [3], when Drill gets parquet metadata from the
>>>>>>>>>> metadata cache.
>>>>>>>>>>
>>>>>>>>>> Questions I have:
>>>>>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203,
>>>>>>>>>> does Drill still think the date field is NOT correct (isDateCorrect =
>>>>>>>>>> false)?
>>>>>>>>>> 2. Why does this field have nothing to do with the "autoCorrection"
>>>>>>>>>> flag [4]? If someone turns off autoCorrection, will it have an impact
>>>>>>>>>> on this "isDateCorrect" flag?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for any input,
>>>>>>>>>>
>>>>>>>>>> Jinfeng
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355
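For reference, the writer-version rule proposed in the quoted thread could be sketched roughly as below. This is only an illustration, not Drill's actual code: the class name, the footer key "drill-writer.version", and the use of a plain string map for the footer key/value metadata are assumptions made for the example, and the enum is a local stand-in that reuses the three DateCorruptionStatus names mentioned at the top of this message. The thread leaves open how files from other writers (Hive, Impala) would be recognized, so the sketch sends them to the "test the values" path.

```java
import java.util.Map;

final class WriterVersionCheck {

    // Local stand-in for the three statuses named in this thread.
    enum DateCorruptionStatus {
        META_SHOWS_CORRUPTION,
        META_SHOWS_NO_CORRUPTION,
        META_UNCLEAR_TEST_VALUES
    }

    // Hypothetical footer key for the proposed writer version.
    static final String WRITER_VERSION_KEY = "drill-writer.version";
    // Key assumed to hold the Drill version (the thread refers to drill.version).
    static final String DRILL_VERSION_KEY = "drill.version";
    // The thread suggests starting the writer version at 2.
    static final int FIRST_CORRECT_WRITER_VERSION = 2;

    static DateCorruptionStatus classify(Map<String, String> footer) {
        String writerVersion = footer.get(WRITER_VERSION_KEY);
        if (writerVersion != null) {
            try {
                int version = Integer.parseInt(writerVersion.trim());
                // Writer version 2 or greater: dates are in the correct format.
                return version >= FIRST_CORRECT_WRITER_VERSION
                        ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
                        : DateCorruptionStatus.META_SHOWS_CORRUPTION;
            } catch (NumberFormatException e) {
                // Unreadable version: fall back to inspecting the data.
                return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
            }
        }
        if (footer.containsKey(DRILL_VERSION_KEY)) {
            // Written by Drill, but before any writer version existed.
            return DateCorruptionStatus.META_SHOWS_CORRUPTION;
        }
        // Cannot tell whether an older Drill wrote it: sanity-check the values.
        return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
    }

    public static void main(String[] args) {
        System.out.println(classify(Map.of(WRITER_VERSION_KEY, "2")));
        System.out.println(classify(Map.of(DRILL_VERSION_KEY, "1.9-SNAPSHOT")));
        System.out.println(classify(Map.of()));
    }
}
```

With this rule, an "early 1.9" file and a "late 1.9" file are told apart by the writer version alone, which is the case the drill.version = 1.9-SNAPSHOT check cannot decide.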
