I like the proposal. The Parquet Writer version should just be 2 (no .0.0 as we 
won’t have major or minor versions.) With things like writer versions (or RPC 
versions, etc.) the usual rule is to use increasing integers.

I am surprised that the other tools don’t include more detail about the 
version; if they ever change their writers, they’ll have the same vagueness 
problem that we’re trying to address for Drill…

Thanks,

- Paul

> On Oct 28, 2016, at 1:03 PM, Vitalii Diravka <[email protected]> 
> wrote:
> 
> I explored metadata of parquet files generated from different tools:
> 
> * Impala:*
> creator:         impala version 2.2.0-cdh5.4.5 (build
> 4a81c1d04c39961ef14ff6131d543dd96ef60e6e)
> 
> *Hive:*
> creator:       parquet-mr version 1.6.0
> 
> *Pig:*
> creator:     parquet-mr version 1.5.1-SNAPSHOT
> extra:       pig.schema = recipe: chararray,ingredients: {(name:
> chararray)},inventor: (name: chararray,age: int)
> 
> *Spark:*
> creator:          parquet-mr
> extra:            parquet.proto.descriptor = name:
> "ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
> LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
> label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
> ".TestProtobuf.MyComplex" }
> extra:            parquet.proto.class =
> parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray
> 
> 
> *Drill (now):*creator:     parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra:       drill.version = 1.9.0-SNAPSHOT
> extra:       is.date.correct = true
> 
> So we can replace second extra with "parquet-writer.version = 2.0.0".
> Thoughts?
> 
> 
> Kind regards
> Vitalii
> 
> 2016-10-28 16:26 GMT+00:00 Paul Rogers <[email protected]>:
> 
>> Thanks Vitalii.
>> 
>> The Parquet Writer solution “just works”. As soon as someone upgrades the
>> writer, files are labeled as having that new version. No fuzziness during a
>> release as in 1.9.
>> 
>> It is fine to also include the Drill version. But, format decisions should
>> be keyed off of the writer version.
>> 
>> By the way, do other tools happen to already do this? It would be rather
>> surprising if they didn’t.
>> 
>> - Paul
>> 
>>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <[email protected]>
>> wrote:
>>> 
>>> I agree that it would be good if the approach of parquet date correctness
>>> detection will be upgraded. So I created the jira for it DRILL-4980
>>> <https://issues.apache.org/jira/browse/DRILL-4980>.
>>> 
>>> But now we have two ideas:
>>> 1. To add checking of the drill version additionally, so later we can
>>> delete isDateCorrect label from parquet metadata.
>>> 2. To add parquet writer version to the parquet metadata and check this
>>> value instead of isDateCorrect and drillVersion.
>>> 
>>> So which way, we should prefer now?
>>> 
>>> Kind regards
>>> Vitalii
>>> 
>>> 2016-10-27 23:54 GMT+00:00 Paul Rogers <[email protected]>:
>>> 
>>>> FWIW: back on the magic flag issue…
>>>> 
>>>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>>>> grained for our needs.
>>>> 
>>>> A typical solution is include the version of the Parquet writer in
>>>> addition to that of Drill. Each time we change something in the writer,
>>>> increment the version number. If we number changes, we can easily handle
>>>> two changes in the same Drill release, or differentiate between the
>> “early
>>>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>>> 
>>>> Since we have no version now, start it at some arbitrary point (2?).
>>>> 
>>>> Now, if the Parquet file has a Drill Writer version in the header, and
>>>> that version is 2 or greater, the date is in the “correct” format.
>> Anything
>>>> written by Drill before writer version 2, the date is wrong. The “check
>> the
>>>> data to see if it is sane” approach is needed only for files were we
>> can’t
>>>> tell if an older Drill wrote it.
>>>> 
>>>> Do other tools label the data? Does Hive say that it wrote the file? If
>>>> so, we don’t need to do the sanity check if we can tell the data comes
>> from
>>>> Hive (or Impala, or anything other than old Drill.)
>>>> 
>>>> - Paul
>>>> 
>>>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <[email protected]> wrote:
>>>>> 
>>>>> Vitalii -- are you still planning to open a ticket and pull request for
>>>> the
>>>>> fix you've noted below?
>>>>> 
>>>>> -- Zelaine
>>>>> 
>>>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>>>> [email protected]>
>>>>> wrote:
>>>>> 
>>>>>> @Paul Rogers
>>>>>> It may be the undefined case when the file is generated with
>>>> drill.version
>>>>>> = 1.9-SNAPSHOT.
>>>>>> It is more easy to determine corrupted date with this flag and there
>> is
>>>> no
>>>>>> need to wait the end of release to merge these changes.
>>>>>> 
>>>>>> @Jinfeng NI
>>>>>> It looks like you are right.
>>>>>> With consistent mode (isDateCorrect = true) all tests are passed. So I
>>>> am
>>>>>> going to open a jira ticket for it with next changes
>>>>>> https://github.com/vdiravka/drill/commit/
>> ff8d5c7d601915f760d1b0e9618730
>>>>>> 3410cac5d3
>>>>>> Thanks.
>>>>>> 
>>>>>> Kind regards
>>>>>> Vitalii
>>>>>> 
>>>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <[email protected]>:
>>>>>> 
>>>>>>> I'm not sure if I fully understand your answers. The bottom line is
>>>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>>>>>> instance constructed in Drill should have identical value for
>>>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet
>>>>>>> metadata cache, or whether there is partition pruning or not.
>> However,
>>>>>>> the code shows that this flag is not in consistent mode across
>>>>>>> different cases.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>>>>>> <[email protected]> wrote:
>>>>>>>> Hi Jinfeng,
>>>>>>>> 
>>>>>>>> 1.If the parquet files are generated with Drill after Drill-4203
>> these
>>>>>>>> files have "isDateCorrect = true" property.
>>>>>>>> Drill serializes this property from metadata now. When we set this
>>>>>>> property
>>>>>>>> in the first constructor we will hide the value from metadata.
>>>>>>>> IsDateCorrect will be false only if this value equals to the false
>> (no
>>>>>>> case
>>>>>>>> for it now) or absent in parquet metadata footer.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata property
>>>>>> when
>>>>>>>> the user disable dates correction.
>>>>>>>> If you have some use case it would be great if you provide it.
>>>>>>>> 
>>>>>>>> 3. Maybe you are right regarding to when Parquet metadata is cloned.
>>>>>>>> Here I added the property in the same manner as Jason's new property
>>>>>>>> "drillVersion. So need it a separate unit test?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Kind regards
>>>>>>>> Vitalii
>>>>>>>> 
>>>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <[email protected]>:
>>>>>>>> 
>>>>>>>>> Forgot to copy the link to the code.
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>>>> Metadata.java#L950-L955
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <[email protected]>
>> wrote:
>>>>>>>>>> @Jason, @Vitalli,
>>>>>>>>>> 
>>>>>>>>>> Any thoughts on this question, since both you worked on fix of
>>>>>>>>> DRILL-4203?
>>>>>>>>>> 
>>>>>>>>>> Looking through the code, there is a third case [1], where this
>> flag
>>>>>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>>>>>> pruning, etc).  That means, for the 2nd case where the flag is set
>>>>>> to
>>>>>>>>>> true, if there is pruning happening, the new parquet metadata will
>>>>>> see
>>>>>>>>>> the flag is flipped to false. This does not make sense to me.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <[email protected]>
>> wrote:
>>>>>>>>>>> Hello All,
>>>>>>>>>>> 
>>>>>>>>>>> DRILL-4203 addressed the date field issue.  In the fix, it
>>>>>> introduced
>>>>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have
>>>>>> some
>>>>>>>>>>> difficulty in understanding the meaning of this field.
>>>>>>>>>>> 
>>>>>>>>>>> According to [1], this field is set to false, when Drill gets
>>>>>> parquet
>>>>>>>>>>> metadata from parquet footer.  This field is  set to true in code
>>>>>>> flow
>>>>>>>>>>> of [2] and [3], when Drill gets parquet metadata from meta data
>>>>>>> cache.
>>>>>>>>>>> 
>>>>>>>>>>> Questions I have:
>>>>>>>>>>> 1.  If the parquet files are generated with Drill after
>> DRILL-4203,
>>>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect =
>>>>>> false)?
>>>>>>>>>>> 2.  Why does this filed have nothing to do with "autoCorrection"
>>>>>> flag
>>>>>>>>>>> [4]?  If someone turns off autoCorrection, will it have impact on
>>>>>>> this
>>>>>>>>>>> "isDateCorrect" flag ?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks in advance for any input,
>>>>>>>>>>> 
>>>>>>>>>>> Jinfeng
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>> Metadata.java#L932
>>>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>> Metadata.java#L936
>>>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>> Metadata.java#L187
>>>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>>>> Metadata.java#L354-L355
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

Reply via email to