I like the proposal. The Parquet Writer version should just be 2 (no .0.0 as we won’t have major or minor versions.) With things like writer versions (or RPC versions, etc.) the usual rule is to use increasing integers.
I am surprised that the other tools don’t include more detail about the version; if they ever change their writers, they’ll have the same vagueness problem that we’re trying to address for Drill… Thanks, - Paul > On Oct 28, 2016, at 1:03 PM, Vitalii Diravka <[email protected]> > wrote: > > I explored metadata of parquet files generated from different tools: > > * Impala:* > creator: impala version 2.2.0-cdh5.4.5 (build > 4a81c1d04c39961ef14ff6131d543dd96ef60e6e) > > *Hive:* > creator: parquet-mr version 1.6.0 > > *Pig:* > creator: parquet-mr version 1.5.1-SNAPSHOT > extra: pig.schema = recipe: chararray,ingredients: {(name: > chararray)},inventor: (name: chararray,age: int) > > *Spark:* > creator: parquet-mr > extra: parquet.proto.descriptor = name: > "ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label: > LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2 > label: LABEL_REPEATED type: TYPE_MESSAGE type_name: > ".TestProtobuf.MyComplex" } > extra: parquet.proto.class = > parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray > > > *Drill (now):*creator: parquet-mr version 1.8.1-drill-r0 (build > 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) > extra: drill.version = 1.9.0-SNAPSHOT > extra: is.date.correct = true > > So we can replace second extra with "parquet-writer.version = 2.0.0". > Thoughts? > > > Kind regards > Vitalii > > 2016-10-28 16:26 GMT+00:00 Paul Rogers <[email protected]>: > >> Thanks Vitalii. >> >> The Parquet Writer solution “just works”. As soon as someone upgrades the >> writer, files are labeled as having that new version. No fuzziness during a >> release as in 1.9. >> >> It is fine to also include the Drill version. But, format decisions should >> be keyed off of the writer version. >> >> By the way, do other tools happen to already do this? It would be rather >> surprising if they didn’t. >> >> - Paul >> >>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <[email protected]> >> wrote: >>> >>> I agree that it would be good if the approach of parquet date correctness >>> detection will be upgraded. So I created the jira for it DRILL-4980 >>> <https://issues.apache.org/jira/browse/DRILL-4980>. >>> >>> But now we have two ideas: >>> 1. To add checking of the drill version additionally, so later we can >>> delete isDateCorrect label from parquet metadata. >>> 2. To add parquet writer version to the parquet metadata and check this >>> value instead of isDateCorrect and drillVersion. >>> >>> So which way, we should prefer now? >>> >>> Kind regards >>> Vitalii >>> >>> 2016-10-27 23:54 GMT+00:00 Paul Rogers <[email protected]>: >>> >>>> FWIW: back on the magic flag issue… >>>> >>>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course >>>> grained for our needs. >>>> >>>> A typical solution is include the version of the Parquet writer in >>>> addition to that of Drill. Each time we change something in the writer, >>>> increment the version number. If we number changes, we can easily handle >>>> two changes in the same Drill release, or differentiate between the >> “early >>>> 1.9” files with old-style dates and “late 1.9” files with correct dates. >>>> >>>> Since we have no version now, start it at some arbitrary point (2?). >>>> >>>> Now, if the Parquet file has a Drill Writer version in the header, and >>>> that version is 2 or greater, the date is in the “correct” format. >> Anything >>>> written by Drill before writer version 2, the date is wrong. The “check >> the >>>> data to see if it is sane” approach is needed only for files were we >> can’t >>>> tell if an older Drill wrote it. >>>> >>>> Do other tools label the data? Does Hive say that it wrote the file? If >>>> so, we don’t need to do the sanity check if we can tell the data comes >> from >>>> Hive (or Impala, or anything other than old Drill.) >>>> >>>> - Paul >>>> >>>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <[email protected]> wrote: >>>>> >>>>> Vitalii -- are you still planning to open a ticket and pull request for >>>> the >>>>> fix you've noted below? >>>>> >>>>> -- Zelaine >>>>> >>>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka < >>>> [email protected]> >>>>> wrote: >>>>> >>>>>> @Paul Rogers >>>>>> It may be the undefined case when the file is generated with >>>> drill.version >>>>>> = 1.9-SNAPSHOT. >>>>>> It is more easy to determine corrupted date with this flag and there >> is >>>> no >>>>>> need to wait the end of release to merge these changes. >>>>>> >>>>>> @Jinfeng NI >>>>>> It looks like you are right. >>>>>> With consistent mode (isDateCorrect = true) all tests are passed. So I >>>> am >>>>>> going to open a jira ticket for it with next changes >>>>>> https://github.com/vdiravka/drill/commit/ >> ff8d5c7d601915f760d1b0e9618730 >>>>>> 3410cac5d3 >>>>>> Thanks. >>>>>> >>>>>> Kind regards >>>>>> Vitalii >>>>>> >>>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <[email protected]>: >>>>>> >>>>>>> I'm not sure if I fully understand your answers. The bottom line is >>>>>>> quite simple: given a set of parquet files, the ParquetTableMeta >>>>>>> instance constructed in Drill should have identical value for >>>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet >>>>>>> metadata cache, or whether there is partition pruning or not. >> However, >>>>>>> the code shows that this flag is not in consistent mode across >>>>>>> different cases. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka >>>>>>> <[email protected]> wrote: >>>>>>>> Hi Jinfeng, >>>>>>>> >>>>>>>> 1.If the parquet files are generated with Drill after Drill-4203 >> these >>>>>>>> files have "isDateCorrect = true" property. >>>>>>>> Drill serializes this property from metadata now. When we set this >>>>>>> property >>>>>>>> in the first constructor we will hide the value from metadata. >>>>>>>> IsDateCorrect will be false only if this value equals to the false >> (no >>>>>>> case >>>>>>>> for it now) or absent in parquet metadata footer. >>>>>>>> >>>>>>>> >>>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata property >>>>>> when >>>>>>>> the user disable dates correction. >>>>>>>> If you have some use case it would be great if you provide it. >>>>>>>> >>>>>>>> 3. Maybe you are right regarding to when Parquet metadata is cloned. >>>>>>>> Here I added the property in the same manner as Jason's new property >>>>>>>> "drillVersion. So need it a separate unit test? >>>>>>>> >>>>>>>> >>>>>>>> Kind regards >>>>>>>> Vitalii >>>>>>>> >>>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <[email protected]>: >>>>>>>> >>>>>>>>> Forgot to copy the link to the code. >>>>>>>>> >>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- >>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >>>>>>>>> Metadata.java#L950-L955 >>>>>>>>> >>>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <[email protected]> >> wrote: >>>>>>>>>> @Jason, @Vitalli, >>>>>>>>>> >>>>>>>>>> Any thoughts on this question, since both you worked on fix of >>>>>>>>> DRILL-4203? >>>>>>>>>> >>>>>>>>>> Looking through the code, there is a third case [1], where this >> flag >>>>>>>>>> is set to false when Parquet metadata is cloned (after partition >>>>>>>>>> pruning, etc). That means, for the 2nd case where the flag is set >>>>>> to >>>>>>>>>> true, if there is pruning happening, the new parquet metadata will >>>>>> see >>>>>>>>>> the flag is flipped to false. This does not make sense to me. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <[email protected]> >> wrote: >>>>>>>>>>> Hello All, >>>>>>>>>>> >>>>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it >>>>>> introduced >>>>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect. I have >>>>>> some >>>>>>>>>>> difficulty in understanding the meaning of this field. >>>>>>>>>>> >>>>>>>>>>> According to [1], this field is set to false, when Drill gets >>>>>> parquet >>>>>>>>>>> metadata from parquet footer. This field is set to true in code >>>>>>> flow >>>>>>>>>>> of [2] and [3], when Drill gets parquet metadata from meta data >>>>>>> cache. >>>>>>>>>>> >>>>>>>>>>> Questions I have: >>>>>>>>>>> 1. If the parquet files are generated with Drill after >> DRILL-4203, >>>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect = >>>>>> false)? >>>>>>>>>>> 2. Why does this filed have nothing to do with "autoCorrection" >>>>>> flag >>>>>>>>>>> [4]? If someone turns off autoCorrection, will it have impact on >>>>>>> this >>>>>>>>>>> "isDateCorrect" flag ? >>>>>>>>>>> >>>>>>>>>>> Thanks in advance for any input, >>>>>>>>>>> >>>>>>>>>>> Jinfeng >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- >>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >>>>>>> Metadata.java#L932 >>>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java- >>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >>>>>>> Metadata.java#L936 >>>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java- >>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >>>>>>> Metadata.java#L187 >>>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java- >>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >>>>>>>>> Metadata.java#L354-L355 >>>>>>>>> >>>>>>> >>>>>> >>>> >>>> >> >>
