Re: isDateCorrect field in ParquetTableMetadata

2016-11-03 Thread Vitalii Diravka
I have opened a PR  for the
DRILL-4980 .

@Jinfeng I removed is.date.correct flag from the writer.
And I left the checking of it in detectCorruptDates() method for the reason
if somebody has already generated the parquet files with this property.

@Jason I have added the drill version checking as you suggested to do.

@Paul I have added parquet-writer.version property to parquet metadata
footer.
If you have other thoughts about variables names, please let me know I will
try to respond as fast as possible.
Kind regards
Vitalii

2016-10-31 7:07 GMT+02:00 Paul Rogers :

> Choosing a good property name should solve the confusion issue. Perhaps
> drill.writer as the name.
>
> The writer version is not needed if we feel that we’ll never again change
> the writer or introduce bugs. Since that is hard to predict, adding a
> writer version is very low cost insurance. Indeed, including a writer
> version is a very common technique. The question we should answer is why we
> don’t need to use this technique here…
>
> One can imagine that, as we evolve from 1.x to the 2.1 (or later) format,
> we may do so in fits and starts, perhaps based on community contributions.
> A version will help us know what capabilities were supported in the writer
> that wrote a particular file.
>
> Thanks,
>
> - Paul
>
> > On Oct 28, 2016, at 3:03 PM, Jason Altekruse  wrote:
> >
> > The only worry I have about declaring a writer version is possible
> > confusion with the Parquet format version itself. The format is already
> > defined through version 2.1 or something like that, but we are currently
> > only writing files based on the 1.x version of the format.
> >
> > My preferred solution to this problem would be to just make point
> releases
> > for problems like this (like in this case we could have made a 1.8.1
> > release, and then all of the 1.8.0-SNAPSHOT would all known to be bad and
> > everything after would be 1.8.1-SNAPSHOT and could have been known to be
> > correct).
> >
> > I'm open to to hearing other opinions on this, I just generally feel like
> > these bugs should be rare, and fixing them should be done with a lot of
> > care (and in this case I missed a few things). I don't think it would be
> > crazy to say that we should only merge these kinds of patches if we are
> > willing to say the fix is ready for a release.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka <
> vitalii.dira...@gmail.com>
> > wrote:
> >
> >> Jinfeng,
> >>
> >> isDateCorrect will be false in the code when isDateCorrect property is
> >> absent in the parquet metadata.
> >>
> >> Anyway I am going to implement the mentioned approach with the
> >> parquet-writer.version instead of isDateCorrect property.
> >>
>
>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-30 Thread Paul Rogers
Choosing a good property name should solve the confusion issue. Perhaps 
drill.writer as the name.

The writer version is not needed if we feel that we’ll never again change the 
writer or introduce bugs. Since that is hard to predict, adding a writer 
version is very low cost insurance. Indeed, including a writer version is a 
very common technique. The question we should answer is why we don’t need to 
use this technique here…

One can imagine that, as we evolve from 1.x to the 2.1 (or later) format, we 
may do so in fits and starts, perhaps based on community contributions. A 
version will help us know what capabilities were supported in the writer that 
wrote a particular file.

Thanks,

- Paul

> On Oct 28, 2016, at 3:03 PM, Jason Altekruse  wrote:
> 
> The only worry I have about declaring a writer version is possible
> confusion with the Parquet format version itself. The format is already
> defined through version 2.1 or something like that, but we are currently
> only writing files based on the 1.x version of the format.
> 
> My preferred solution to this problem would be to just make point releases
> for problems like this (like in this case we could have made a 1.8.1
> release, and then all of the 1.8.0-SNAPSHOT would all known to be bad and
> everything after would be 1.8.1-SNAPSHOT and could have been known to be
> correct).
> 
> I'm open to to hearing other opinions on this, I just generally feel like
> these bugs should be rare, and fixing them should be done with a lot of
> care (and in this case I missed a few things). I don't think it would be
> crazy to say that we should only merge these kinds of patches if we are
> willing to say the fix is ready for a release.
> 
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
> 
> On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka 
> wrote:
> 
>> Jinfeng,
>> 
>> isDateCorrect will be false in the code when isDateCorrect property is
>> absent in the parquet metadata.
>> 
>> Anyway I am going to implement the mentioned approach with the
>> parquet-writer.version instead of isDateCorrect property.
>> 



Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The only worry I have about declaring a writer version is possible
confusion with the Parquet format version itself. The format is already
defined through version 2.1 or something like that, but we are currently
only writing files based on the 1.x version of the format.

My preferred solution to this problem would be to just make point releases
for problems like this (like in this case we could have made a 1.8.1
release, and then all of the 1.8.0-SNAPSHOT would all known to be bad and
everything after would be 1.8.1-SNAPSHOT and could have been known to be
correct).

I'm open to to hearing other opinions on this, I just generally feel like
these bugs should be rare, and fixing them should be done with a lot of
care (and in this case I missed a few things). I don't think it would be
crazy to say that we should only merge these kinds of patches if we are
willing to say the fix is ready for a release.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka 
wrote:

> Jinfeng,
>
> isDateCorrect will be false in the code when isDateCorrect property is
> absent in the parquet metadata.
>
> Anyway I am going to implement the mentioned approach with the
> parquet-writer.version instead of isDateCorrect property.
>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Vitalli,

Just to confirm, you will "remove" isDateCorrect flag, and use
parquet-writer version in stead, correct?




On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka
 wrote:
> Jinfeng,
>
> isDateCorrect will be false in the code when isDateCorrect property is
> absent in the parquet metadata.
>
> Anyway I am going to implement the mentioned approach with the
> parquet-writer.version instead of isDateCorrect property.


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
Jinfeng,

isDateCorrect will be false in the code when isDateCorrect property is
absent in the parquet metadata.

Anyway I am going to implement the mentioned approach with the
parquet-writer.version instead of isDateCorrect property.


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
I like the proposal. The Parquet Writer version should just be 2 (no .0.0 as we 
won’t have major or minor versions.) With things like writer versions (or RPC 
versions, etc.) the usual rule is to use increasing integers.

I am surprised that the other tools don’t include more detail about the 
version; if they ever change their writers, they’ll have the same vagueness 
problem that we’re trying to address for Drill…

Thanks,

- Paul

> On Oct 28, 2016, at 1:03 PM, Vitalii Diravka  
> wrote:
> 
> I explored metadata of parquet files generated from different tools:
> 
> * Impala:*
> creator: impala version 2.2.0-cdh5.4.5 (build
> 4a81c1d04c39961ef14ff6131d543dd96ef60e6e)
> 
> *Hive:*
> creator:   parquet-mr version 1.6.0
> 
> *Pig:*
> creator: parquet-mr version 1.5.1-SNAPSHOT
> extra:   pig.schema = recipe: chararray,ingredients: {(name:
> chararray)},inventor: (name: chararray,age: int)
> 
> *Spark:*
> creator:  parquet-mr
> extra:parquet.proto.descriptor = name:
> "ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
> LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
> label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
> ".TestProtobuf.MyComplex" }
> extra:parquet.proto.class =
> parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray
> 
> 
> *Drill (now):*creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra:   drill.version = 1.9.0-SNAPSHOT
> extra:   is.date.correct = true
> 
> So we can replace second extra with "parquet-writer.version = 2.0.0".
> Thoughts?
> 
> 
> Kind regards
> Vitalii
> 
> 2016-10-28 16:26 GMT+00:00 Paul Rogers :
> 
>> Thanks Vitalii.
>> 
>> The Parquet Writer solution “just works”. As soon as someone upgrades the
>> writer, files are labeled as having that new version. No fuzziness during a
>> release as in 1.9.
>> 
>> It is fine to also include the Drill version. But, format decisions should
>> be keyed off of the writer version.
>> 
>> By the way, do other tools happen to already do this? It would be rather
>> surprising if they didn’t.
>> 
>> - Paul
>> 
>>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka 
>> wrote:
>>> 
>>> I agree that it would be good if the approach of parquet date correctness
>>> detection will be upgraded. So I created the jira for it DRILL-4980
>>> .
>>> 
>>> But now we have two ideas:
>>> 1. To add checking of the drill version additionally, so later we can
>>> delete isDateCorrect label from parquet metadata.
>>> 2. To add parquet writer version to the parquet metadata and check this
>>> value instead of isDateCorrect and drillVersion.
>>> 
>>> So which way, we should prefer now?
>>> 
>>> Kind regards
>>> Vitalii
>>> 
>>> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
>>> 
 FWIW: back on the magic flag issue…
 
 I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
 grained for our needs.
 
 A typical solution is include the version of the Parquet writer in
 addition to that of Drill. Each time we change something in the writer,
 increment the version number. If we number changes, we can easily handle
 two changes in the same Drill release, or differentiate between the
>> “early
 1.9” files with old-style dates and “late 1.9” files with correct dates.
 
 Since we have no version now, start it at some arbitrary point (2?).
 
 Now, if the Parquet file has a Drill Writer version in the header, and
 that version is 2 or greater, the date is in the “correct” format.
>> Anything
 written by Drill before writer version 2, the date is wrong. The “check
>> the
 data to see if it is sane” approach is needed only for files were we
>> can’t
 tell if an older Drill wrote it.
 
 Do other tools label the data? Does Hive say that it wrote the file? If
 so, we don’t need to do the sanity check if we can tell the data comes
>> from
 Hive (or Impala, or anything other than old Drill.)
 
 - Paul
 
> On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> 
> Vitalii -- are you still planning to open a ticket and pull request for
 the
> fix you've noted below?
> 
> -- Zelaine
> 
> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
 vitalii.dira...@gmail.com>
> wrote:
> 
>> @Paul Rogers
>> It may be the undefined case when the file is generated with
 drill.version
>> = 1.9-SNAPSHOT.
>> It is more easy to determine corrupted date with this flag and there
>> is
 no
>> need to wait the end of release to merge these changes.
>> 
>> @Jinfeng NI
>> It looks like you are right.
>> With consistent mode (isDateCorrect = true) all tests are passed. So I
 am
>> 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Thanks for the explanation, Jason.

The three different values for DateCorruptionStatus make sense to me.

The isDataCorrect flag = true, means that the values are known to be correct.
The isDataCorrect flag = false, means that the values are know to be
incorrect, or unclear?


On Fri, Oct 28, 2016 at 12:59 PM, Jason Altekruse  wrote:
> The isDataCorrect flag means that the values are known to be correct, and
> there is no need to auto-detect corruption or correct anything.
>
> META_SHOWS_CORRUPTION can be set either when we have a known old version of
> Drill written in the metadata, or we have older files that might have been
> written by Drill that we have checked the values in the statistics and
> found corrupt looking values. Really old files without any statistics don't
> have information that allows us to identify them as Drill-produced, so we
> have to test the values during actual page reads, this is where
> META_UNCLEAR_TEST_VALUES is used.
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>
> On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni  wrote:
>
>> Hi Vitalli,
>>
>> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
>> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
>> this isDateCorrect flag have for each possiblity, especially for
>> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
>> same things, or different?
>>
>> Thanks.
>>
>> Jinfeng
>>
>>
>>
>> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers  wrote:
>> > Thanks Vitalii.
>> >
>> > The Parquet Writer solution “just works”. As soon as someone upgrades
>> the writer, files are labeled as having that new version. No fuzziness
>> during a release as in 1.9.
>> >
>> > It is fine to also include the Drill version. But, format decisions
>> should be keyed off of the writer version.
>> >
>> > By the way, do other tools happen to already do this? It would be rather
>> surprising if they didn’t.
>> >
>> > - Paul
>> >
>> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka 
>> wrote:
>> >>
>> >> I agree that it would be good if the approach of parquet date
>> correctness
>> >> detection will be upgraded. So I created the jira for it DRILL-4980
>> >> .
>> >>
>> >> But now we have two ideas:
>> >> 1. To add checking of the drill version additionally, so later we can
>> >> delete isDateCorrect label from parquet metadata.
>> >> 2. To add parquet writer version to the parquet metadata and check this
>> >> value instead of isDateCorrect and drillVersion.
>> >>
>> >> So which way, we should prefer now?
>> >>
>> >> Kind regards
>> >> Vitalii
>> >>
>> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
>> >>
>> >>> FWIW: back on the magic flag issue…
>> >>>
>> >>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too
>> course
>> >>> grained for our needs.
>> >>>
>> >>> A typical solution is include the version of the Parquet writer in
>> >>> addition to that of Drill. Each time we change something in the writer,
>> >>> increment the version number. If we number changes, we can easily
>> handle
>> >>> two changes in the same Drill release, or differentiate between the
>> “early
>> >>> 1.9” files with old-style dates and “late 1.9” files with correct
>> dates.
>> >>>
>> >>> Since we have no version now, start it at some arbitrary point (2?).
>> >>>
>> >>> Now, if the Parquet file has a Drill Writer version in the header, and
>> >>> that version is 2 or greater, the date is in the “correct” format.
>> Anything
>> >>> written by Drill before writer version 2, the date is wrong. The
>> “check the
>> >>> data to see if it is sane” approach is needed only for files were we
>> can’t
>> >>> tell if an older Drill wrote it.
>> >>>
>> >>> Do other tools label the data? Does Hive say that it wrote the file? If
>> >>> so, we don’t need to do the sanity check if we can tell the data comes
>> from
>> >>> Hive (or Impala, or anything other than old Drill.)
>> >>>
>> >>> - Paul
>> >>>
>>  On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
>> 
>>  Vitalii -- are you still planning to open a ticket and pull request
>> for
>> >>> the
>>  fix you've noted below?
>> 
>>  -- Zelaine
>> 
>>  On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>> >>> vitalii.dira...@gmail.com>
>>  wrote:
>> 
>> > @Paul Rogers
>> > It may be the undefined case when the file is generated with
>> >>> drill.version
>> > = 1.9-SNAPSHOT.
>> > It is more easy to determine corrupted date with this flag and there
>> is
>> >>> no
>> > need to wait the end of release to merge these changes.
>> >
>> > @Jinfeng NI
>> > It looks like you are right.
>> > With consistent mode (isDateCorrect = true) all tests are passed. So
>> I
>> >>> am
>> > going to open a jira ticket for it with 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I explored metadata of parquet files generated from different tools:

* Impala:*
creator: impala version 2.2.0-cdh5.4.5 (build
4a81c1d04c39961ef14ff6131d543dd96ef60e6e)

*Hive:*
creator:   parquet-mr version 1.6.0

*Pig:*
creator: parquet-mr version 1.5.1-SNAPSHOT
extra:   pig.schema = recipe: chararray,ingredients: {(name:
chararray)},inventor: (name: chararray,age: int)

*Spark:*
creator:  parquet-mr
extra:parquet.proto.descriptor = name:
"ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
".TestProtobuf.MyComplex" }
extra:parquet.proto.class =
parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray


*Drill (now):*creator: parquet-mr version 1.8.1-drill-r0 (build
6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra:   drill.version = 1.9.0-SNAPSHOT
extra:   is.date.correct = true

So we can replace second extra with "parquet-writer.version = 2.0.0".
Thoughts?


Kind regards
Vitalii

2016-10-28 16:26 GMT+00:00 Paul Rogers :

> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the
> writer, files are labeled as having that new version. No fuzziness during a
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should
> be keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather
> surprising if they didn’t.
>
> - Paul
>
> > On Oct 28, 2016, at 8:30 AM, Vitalii Diravka 
> wrote:
> >
> > I agree that it would be good if the approach of parquet date correctness
> > detection will be upgraded. So I created the jira for it DRILL-4980
> > .
> >
> > But now we have two ideas:
> > 1. To add checking of the drill version additionally, so later we can
> > delete isDateCorrect label from parquet metadata.
> > 2. To add parquet writer version to the parquet metadata and check this
> > value instead of isDateCorrect and drillVersion.
> >
> > So which way, we should prefer now?
> >
> > Kind regards
> > Vitalii
> >
> > 2016-10-27 23:54 GMT+00:00 Paul Rogers :
> >
> >> FWIW: back on the magic flag issue…
> >>
> >> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
> >> grained for our needs.
> >>
> >> A typical solution is include the version of the Parquet writer in
> >> addition to that of Drill. Each time we change something in the writer,
> >> increment the version number. If we number changes, we can easily handle
> >> two changes in the same Drill release, or differentiate between the
> “early
> >> 1.9” files with old-style dates and “late 1.9” files with correct dates.
> >>
> >> Since we have no version now, start it at some arbitrary point (2?).
> >>
> >> Now, if the Parquet file has a Drill Writer version in the header, and
> >> that version is 2 or greater, the date is in the “correct” format.
> Anything
> >> written by Drill before writer version 2, the date is wrong. The “check
> the
> >> data to see if it is sane” approach is needed only for files were we
> can’t
> >> tell if an older Drill wrote it.
> >>
> >> Do other tools label the data? Does Hive say that it wrote the file? If
> >> so, we don’t need to do the sanity check if we can tell the data comes
> from
> >> Hive (or Impala, or anything other than old Drill.)
> >>
> >> - Paul
> >>
> >>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> >>>
> >>> Vitalii -- are you still planning to open a ticket and pull request for
> >> the
> >>> fix you've noted below?
> >>>
> >>> -- Zelaine
> >>>
> >>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> >> vitalii.dira...@gmail.com>
> >>> wrote:
> >>>
>  @Paul Rogers
>  It may be the undefined case when the file is generated with
> >> drill.version
>  = 1.9-SNAPSHOT.
>  It is more easy to determine corrupted date with this flag and there
> is
> >> no
>  need to wait the end of release to merge these changes.
> 
>  @Jinfeng NI
>  It looks like you are right.
>  With consistent mode (isDateCorrect = true) all tests are passed. So I
> >> am
>  going to open a jira ticket for it with next changes
>  https://github.com/vdiravka/drill/commit/
> ff8d5c7d601915f760d1b0e9618730
>  3410cac5d3
>  Thanks.
> 
>  Kind regards
>  Vitalii
> 
>  2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
> 
> > I'm not sure if I fully understand your answers. The bottom line is
> > quite simple: given a set of parquet files, the ParquetTableMeta
> > instance constructed in Drill should have identical value for
> > "isDateCorrect", whether it comes from parquet footer, or parquet
> > metadata cache, or whether there is partition pruning or 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The isDataCorrect flag means that the values are known to be correct, and
there is no need to auto-detect corruption or correct anything.

META_SHOWS_CORRUPTION can be set either when we have a known old version of
Drill written in the metadata, or we have older files that might have been
written by Drill that we have checked the values in the statistics and
found corrupt looking values. Really old files without any statistics don't
have information that allows us to identify them as Drill-produced, so we
have to test the values during actual page reads, this is where
META_UNCLEAR_TEST_VALUES is used.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni  wrote:

> Hi Vitalli,
>
> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
> this isDateCorrect flag have for each possiblity, especially for
> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
> same things, or different?
>
> Thanks.
>
> Jinfeng
>
>
>
> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers  wrote:
> > Thanks Vitalii.
> >
> > The Parquet Writer solution “just works”. As soon as someone upgrades
> the writer, files are labeled as having that new version. No fuzziness
> during a release as in 1.9.
> >
> > It is fine to also include the Drill version. But, format decisions
> should be keyed off of the writer version.
> >
> > By the way, do other tools happen to already do this? It would be rather
> surprising if they didn’t.
> >
> > - Paul
> >
> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka 
> wrote:
> >>
> >> I agree that it would be good if the approach of parquet date
> correctness
> >> detection will be upgraded. So I created the jira for it DRILL-4980
> >> .
> >>
> >> But now we have two ideas:
> >> 1. To add checking of the drill version additionally, so later we can
> >> delete isDateCorrect label from parquet metadata.
> >> 2. To add parquet writer version to the parquet metadata and check this
> >> value instead of isDateCorrect and drillVersion.
> >>
> >> So which way, we should prefer now?
> >>
> >> Kind regards
> >> Vitalii
> >>
> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
> >>
> >>> FWIW: back on the magic flag issue…
> >>>
> >>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too
> course
> >>> grained for our needs.
> >>>
> >>> A typical solution is include the version of the Parquet writer in
> >>> addition to that of Drill. Each time we change something in the writer,
> >>> increment the version number. If we number changes, we can easily
> handle
> >>> two changes in the same Drill release, or differentiate between the
> “early
> >>> 1.9” files with old-style dates and “late 1.9” files with correct
> dates.
> >>>
> >>> Since we have no version now, start it at some arbitrary point (2?).
> >>>
> >>> Now, if the Parquet file has a Drill Writer version in the header, and
> >>> that version is 2 or greater, the date is in the “correct” format.
> Anything
> >>> written by Drill before writer version 2, the date is wrong. The
> “check the
> >>> data to see if it is sane” approach is needed only for files were we
> can’t
> >>> tell if an older Drill wrote it.
> >>>
> >>> Do other tools label the data? Does Hive say that it wrote the file? If
> >>> so, we don’t need to do the sanity check if we can tell the data comes
> from
> >>> Hive (or Impala, or anything other than old Drill.)
> >>>
> >>> - Paul
> >>>
>  On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> 
>  Vitalii -- are you still planning to open a ticket and pull request
> for
> >>> the
>  fix you've noted below?
> 
>  -- Zelaine
> 
>  On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> >>> vitalii.dira...@gmail.com>
>  wrote:
> 
> > @Paul Rogers
> > It may be the undefined case when the file is generated with
> >>> drill.version
> > = 1.9-SNAPSHOT.
> > It is more easy to determine corrupted date with this flag and there
> is
> >>> no
> > need to wait the end of release to merge these changes.
> >
> > @Jinfeng NI
> > It looks like you are right.
> > With consistent mode (isDateCorrect = true) all tests are passed. So
> I
> >>> am
> > going to open a jira ticket for it with next changes
> > https://github.com/vdiravka/drill/commit/
> ff8d5c7d601915f760d1b0e9618730
> > 3410cac5d3
> > Thanks.
> >
> > Kind regards
> > Vitalii
> >
> > 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
> >
> >> I'm not sure if I fully understand your answers. The bottom line is
> >> quite simple: given a set of parquet files, the ParquetTableMeta
> >> instance constructed in Drill should have identical value for
> >> 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Hi Vitalli,

DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
this isDateCorrect flag have for each possiblity, especially for
META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
same things, or different?

Thanks.

Jinfeng



On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers  wrote:
> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the 
> writer, files are labeled as having that new version. No fuzziness during a 
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should be 
> keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather 
> surprising if they didn’t.
>
> - Paul
>
>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka  
>> wrote:
>>
>> I agree that it would be good if the approach of parquet date correctness
>> detection will be upgraded. So I created the jira for it DRILL-4980
>> .
>>
>> But now we have two ideas:
>> 1. To add checking of the drill version additionally, so later we can
>> delete isDateCorrect label from parquet metadata.
>> 2. To add parquet writer version to the parquet metadata and check this
>> value instead of isDateCorrect and drillVersion.
>>
>> So which way, we should prefer now?
>>
>> Kind regards
>> Vitalii
>>
>> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
>>
>>> FWIW: back on the magic flag issue…
>>>
>>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>>> grained for our needs.
>>>
>>> A typical solution is include the version of the Parquet writer in
>>> addition to that of Drill. Each time we change something in the writer,
>>> increment the version number. If we number changes, we can easily handle
>>> two changes in the same Drill release, or differentiate between the “early
>>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>>
>>> Since we have no version now, start it at some arbitrary point (2?).
>>>
>>> Now, if the Parquet file has a Drill Writer version in the header, and
>>> that version is 2 or greater, the date is in the “correct” format. Anything
>>> written by Drill before writer version 2, the date is wrong. The “check the
>>> data to see if it is sane” approach is needed only for files were we can’t
>>> tell if an older Drill wrote it.
>>>
>>> Do other tools label the data? Does Hive say that it wrote the file? If
>>> so, we don’t need to do the sanity check if we can tell the data comes from
>>> Hive (or Impala, or anything other than old Drill.)
>>>
>>> - Paul
>>>
 On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:

 Vitalii -- are you still planning to open a ticket and pull request for
>>> the
 fix you've noted below?

 -- Zelaine

 On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>>> vitalii.dira...@gmail.com>
 wrote:

> @Paul Rogers
> It may be the undefined case when the file is generated with
>>> drill.version
> = 1.9-SNAPSHOT.
> It is more easy to determine corrupted date with this flag and there is
>>> no
> need to wait the end of release to merge these changes.
>
> @Jinfeng NI
> It looks like you are right.
> With consistent mode (isDateCorrect = true) all tests are passed. So I
>>> am
> going to open a jira ticket for it with next changes
> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> 3410cac5d3
> Thanks.
>
> Kind regards
> Vitalii
>
> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
>
>> I'm not sure if I fully understand your answers. The bottom line is
>> quite simple: given a set of parquet files, the ParquetTableMeta
>> instance constructed in Drill should have identical value for
>> "isDateCorrect", whether it comes from parquet footer, or parquet
>> metadata cache, or whether there is partition pruning or not. However,
>> the code shows that this flag is not in consistent mode across
>> different cases.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>  wrote:
>>> Hi Jinfeng,
>>>
>>> 1.If the parquet files are generated with Drill after Drill-4203 these
>>> files have "isDateCorrect = true" property.
>>> Drill serializes this property from metadata now. When we set this
>> property
>>> in the first constructor we will hide the value from metadata.
>>> IsDateCorrect will be false only if this value equals to the false (no
>> case
>>> for it now) or absent in parquet metadata footer.
>>>
>>>
>>> 2. I'm not sure the reason to change isDateCorrect metadata property
> when
>>> the user disable dates correction.
>>> If 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
Thanks Vitalii.

The Parquet Writer solution “just works”. As soon as someone upgrades the 
writer, files are labeled as having that new version. No fuzziness during a 
release as in 1.9.

It is fine to also include the Drill version. But, format decisions should be 
keyed off of the writer version.

By the way, do other tools happen to already do this? It would be rather 
surprising if they didn’t.

- Paul

> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka  
> wrote:
> 
> I agree that it would be good if the approach of parquet date correctness
> detection will be upgraded. So I created the jira for it DRILL-4980
> .
> 
> But now we have two ideas:
> 1. To add checking of the drill version additionally, so later we can
> delete isDateCorrect label from parquet metadata.
> 2. To add parquet writer version to the parquet metadata and check this
> value instead of isDateCorrect and drillVersion.
> 
> So which way, we should prefer now?
> 
> Kind regards
> Vitalii
> 
> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
> 
>> FWIW: back on the magic flag issue…
>> 
>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>> grained for our needs.
>> 
>> A typical solution is include the version of the Parquet writer in
>> addition to that of Drill. Each time we change something in the writer,
>> increment the version number. If we number changes, we can easily handle
>> two changes in the same Drill release, or differentiate between the “early
>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>> 
>> Since we have no version now, start it at some arbitrary point (2?).
>> 
>> Now, if the Parquet file has a Drill Writer version in the header, and
>> that version is 2 or greater, the date is in the “correct” format. Anything
>> written by Drill before writer version 2, the date is wrong. The “check the
>> data to see if it is sane” approach is needed only for files were we can’t
>> tell if an older Drill wrote it.
>> 
>> Do other tools label the data? Does Hive say that it wrote the file? If
>> so, we don’t need to do the sanity check if we can tell the data comes from
>> Hive (or Impala, or anything other than old Drill.)
>> 
>> - Paul
>> 
>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
>>> 
>>> Vitalii -- are you still planning to open a ticket and pull request for
>> the
>>> fix you've noted below?
>>> 
>>> -- Zelaine
>>> 
>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>> vitalii.dira...@gmail.com>
>>> wrote:
>>> 
 @Paul Rogers
 It may be the undefined case when the file is generated with
>> drill.version
 = 1.9-SNAPSHOT.
 It is more easy to determine corrupted date with this flag and there is
>> no
 need to wait the end of release to merge these changes.
 
 @Jinfeng NI
 It looks like you are right.
 With consistent mode (isDateCorrect = true) all tests are passed. So I
>> am
 going to open a jira ticket for it with next changes
 https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
 3410cac5d3
 Thanks.
 
 Kind regards
 Vitalii
 
 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
 
> I'm not sure if I fully understand your answers. The bottom line is
> quite simple: given a set of parquet files, the ParquetTableMeta
> instance constructed in Drill should have identical value for
> "isDateCorrect", whether it comes from parquet footer, or parquet
> metadata cache, or whether there is partition pruning or not. However,
> the code shows that this flag is not in consistent mode across
> different cases.
> 
> 
> 
> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>  wrote:
>> Hi Jinfeng,
>> 
>> 1.If the parquet files are generated with Drill after Drill-4203 these
>> files have "isDateCorrect = true" property.
>> Drill serializes this property from metadata now. When we set this
> property
>> in the first constructor we will hide the value from metadata.
>> IsDateCorrect will be false only if this value equals to the false (no
> case
>> for it now) or absent in parquet metadata footer.
>> 
>> 
>> 2. I'm not sure the reason to change isDateCorrect metadata property
 when
>> the user disable dates correction.
>> If you have some use case it would be great if you provide it.
>> 
>> 3. Maybe you are right regarding to when Parquet metadata is cloned.
>> Here I added the property in the same manner as Jason's new property
>> "drillVersion. So need it a separate unit test?
>> 
>> 
>> Kind regards
>> Vitalii
>> 
>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
>> 
>>> Forgot to copy the link to the code.
>>> 
>>> [1] 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I agree that it would be good if the approach of parquet date correctness
detection will be upgraded. So I created the jira for it DRILL-4980
.

But now we have two ideas:
1. To add checking of the drill version additionally, so later we can
delete isDateCorrect label from parquet metadata.
2. To add parquet writer version to the parquet metadata and check this
value instead of isDateCorrect and drillVersion.

So which way, we should prefer now?

Kind regards
Vitalii

2016-10-27 23:54 GMT+00:00 Paul Rogers :

> FWIW: back on the magic flag issue…
>
> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
> grained for our needs.
>
> A typical solution is include the version of the Parquet writer in
> addition to that of Drill. Each time we change something in the writer,
> increment the version number. If we number changes, we can easily handle
> two changes in the same Drill release, or differentiate between the “early
> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>
> Since we have no version now, start it at some arbitrary point (2?).
>
> Now, if the Parquet file has a Drill Writer version in the header, and
> that version is 2 or greater, the date is in the “correct” format. Anything
> written by Drill before writer version 2, the date is wrong. The “check the
> data to see if it is sane” approach is needed only for files were we can’t
> tell if an older Drill wrote it.
>
> Do other tools label the data? Does Hive say that it wrote the file? If
> so, we don’t need to do the sanity check if we can tell the data comes from
> Hive (or Impala, or anything other than old Drill.)
>
> - Paul
>
> > On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> >
> > Vitalii -- are you still planning to open a ticket and pull request for
> the
> > fix you've noted below?
> >
> > -- Zelaine
> >
> > On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> vitalii.dira...@gmail.com>
> > wrote:
> >
> >> @Paul Rogers
> >> It may be the undefined case when the file is generated with
> drill.version
> >> = 1.9-SNAPSHOT.
> >> It is more easy to determine corrupted date with this flag and there is
> no
> >> need to wait the end of release to merge these changes.
> >>
> >> @Jinfeng NI
> >> It looks like you are right.
> >> With consistent mode (isDateCorrect = true) all tests are passed. So I
> am
> >> going to open a jira ticket for it with next changes
> >> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> >> 3410cac5d3
> >> Thanks.
> >>
> >> Kind regards
> >> Vitalii
> >>
> >> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
> >>
> >>> I'm not sure if I fully understand your answers. The bottom line is
> >>> quite simple: given a set of parquet files, the ParquetTableMeta
> >>> instance constructed in Drill should have identical value for
> >>> "isDateCorrect", whether it comes from parquet footer, or parquet
> >>> metadata cache, or whether there is partition pruning or not. However,
> >>> the code shows that this flag is not in consistent mode across
> >>> different cases.
> >>>
> >>>
> >>>
> >>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> >>>  wrote:
>  Hi Jinfeng,
> 
>  1.If the parquet files are generated with Drill after Drill-4203 these
>  files have "isDateCorrect = true" property.
>  Drill serializes this property from metadata now. When we set this
> >>> property
>  in the first constructor we will hide the value from metadata.
>  IsDateCorrect will be false only if this value equals to the false (no
> >>> case
>  for it now) or absent in parquet metadata footer.
> 
> 
>  2. I'm not sure the reason to change isDateCorrect metadata property
> >> when
>  the user disable dates correction.
>  If you have some use case it would be great if you provide it.
> 
>  3. Maybe you are right regarding to when Parquet metadata is cloned.
>  Here I added the property in the same manner as Jason's new property
>  "drillVersion. So need it a separate unit test?
> 
> 
>  Kind regards
>  Vitalii
> 
>  2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> 
> > Forgot to copy the link to the code.
> >
> > [1] https://github.com/apache/drill/blob/master/exec/java-
> > exec/src/main/java/org/apache/drill/exec/store/parquet/
> > Metadata.java#L950-L955
> >
> > On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> >> @Jason, @Vitalli,
> >>
> >> Any thoughts on this question, since both you worked on fix of
> > DRILL-4203?
> >>
> >> Looking through the code, there is a third case [1], where this flag
> >> is set to false when Parquet metadata is cloned (after partition
> >> pruning, etc).  That means, for the 2nd case where the flag is set
> >> to
> >> true, if there 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Paul Rogers
FWIW: back on the magic flag issue…

I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course 
grained for our needs.

A typical solution is include the version of the Parquet writer in addition to 
that of Drill. Each time we change something in the writer, increment the 
version number. If we number changes, we can easily handle two changes in the 
same Drill release, or differentiate between the “early 1.9” files with 
old-style dates and “late 1.9” files with correct dates.

Since we have no version now, start it at some arbitrary point (2?).

Now, if the Parquet file has a Drill Writer version in the header, and that 
version is 2 or greater, the date is in the “correct” format. Anything written 
by Drill before writer version 2, the date is wrong. The “check the data to see 
if it is sane” approach is needed only for files were we can’t tell if an older 
Drill wrote it.

Do other tools label the data? Does Hive say that it wrote the file? If so, we 
don’t need to do the sanity check if we can tell the data comes from Hive (or 
Impala, or anything other than old Drill.)

- Paul
 
> On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> 
> Vitalii -- are you still planning to open a ticket and pull request for the
> fix you've noted below?
> 
> -- Zelaine
> 
> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka 
> wrote:
> 
>> @Paul Rogers
>> It may be the undefined case when the file is generated with drill.version
>> = 1.9-SNAPSHOT.
>> It is more easy to determine corrupted date with this flag and there is no
>> need to wait the end of release to merge these changes.
>> 
>> @Jinfeng NI
>> It looks like you are right.
>> With consistent mode (isDateCorrect = true) all tests are passed. So I am
>> going to open a jira ticket for it with next changes
>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
>> 3410cac5d3
>> Thanks.
>> 
>> Kind regards
>> Vitalii
>> 
>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
>> 
>>> I'm not sure if I fully understand your answers. The bottom line is
>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>> instance constructed in Drill should have identical value for
>>> "isDateCorrect", whether it comes from parquet footer, or parquet
>>> metadata cache, or whether there is partition pruning or not. However,
>>> the code shows that this flag is not in consistent mode across
>>> different cases.
>>> 
>>> 
>>> 
>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>>  wrote:
 Hi Jinfeng,
 
 1.If the parquet files are generated with Drill after Drill-4203 these
 files have "isDateCorrect = true" property.
 Drill serializes this property from metadata now. When we set this
>>> property
 in the first constructor we will hide the value from metadata.
 IsDateCorrect will be false only if this value equals to the false (no
>>> case
 for it now) or absent in parquet metadata footer.
 
 
 2. I'm not sure the reason to change isDateCorrect metadata property
>> when
 the user disable dates correction.
 If you have some use case it would be great if you provide it.
 
 3. Maybe you are right regarding to when Parquet metadata is cloned.
 Here I added the property in the same manner as Jason's new property
 "drillVersion. So need it a separate unit test?
 
 
 Kind regards
 Vitalii
 
 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
 
> Forgot to copy the link to the code.
> 
> [1] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L950-L955
> 
> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
>> @Jason, @Vitalli,
>> 
>> Any thoughts on this question, since both you worked on fix of
> DRILL-4203?
>> 
>> Looking through the code, there is a third case [1], where this flag
>> is set to false when Parquet metadata is cloned (after partition
>> pruning, etc).  That means, for the 2nd case where the flag is set
>> to
>> true, if there is pruning happening, the new parquet metadata will
>> see
>> the flag is flipped to false. This does not make sense to me.
>> 
>> 
>> 
>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
>>> Hello All,
>>> 
>>> DRILL-4203 addressed the date field issue.  In the fix, it
>> introduced
>>> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have
>> some
>>> difficulty in understanding the meaning of this field.
>>> 
>>> According to [1], this field is set to false, when Drill gets
>> parquet
>>> metadata from parquet footer.  This field is  set to true in code
>>> flow
>>> of [2] and [3], when Drill gets parquet metadata from meta data
>>> cache.
>>> 
>>> Questions I have:

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Jason Altekruse
Vitalli,

Thank you for looking into this, sorry I missed it in the review.  When you
open up a request to fix this issue could you update the check for
correctness in the metadata to check for the is.date.correct flag, or a
version greater than or equal to 1.9.0 (no snapshot)? This will allow us to
stop writing the flag into the metadata at the release or shortly
thereafter.

It might be worth looking at how we can catch issues like this related to
plan serialization. There were pretty thorough tests with the patch, but we
still have code paths that only come up with remote fragment usage that we
could test in other ways to avoid bugs like this.

- Jason

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Thu, Oct 27, 2016 at 4:03 PM, Zelaine Fong  wrote:

> Vitalii -- are you still planning to open a ticket and pull request for the
> fix you've noted below?
>
> -- Zelaine
>
> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> vitalii.dira...@gmail.com>
> wrote:
>
> > @Paul Rogers
> > It may be the undefined case when the file is generated with
> drill.version
> > = 1.9-SNAPSHOT.
> > It is more easy to determine corrupted date with this flag and there is
> no
> > need to wait the end of release to merge these changes.
> >
> > @Jinfeng NI
> > It looks like you are right.
> > With consistent mode (isDateCorrect = true) all tests are passed. So I am
> > going to open a jira ticket for it with next changes
> > https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> > 3410cac5d3
> > Thanks.
> >
> > Kind regards
> > Vitalii
> >
> > 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
> >
> > > I'm not sure if I fully understand your answers. The bottom line is
> > > quite simple: given a set of parquet files, the ParquetTableMeta
> > > instance constructed in Drill should have identical value for
> > > "isDateCorrect", whether it comes from parquet footer, or parquet
> > > metadata cache, or whether there is partition pruning or not. However,
> > > the code shows that this flag is not in consistent mode across
> > > different cases.
> > >
> > >
> > >
> > > On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> > >  wrote:
> > > > Hi Jinfeng,
> > > >
> > > > 1.If the parquet files are generated with Drill after Drill-4203
> these
> > > > files have "isDateCorrect = true" property.
> > > > Drill serializes this property from metadata now. When we set this
> > > property
> > > > in the first constructor we will hide the value from metadata.
> > > > IsDateCorrect will be false only if this value equals to the false
> (no
> > > case
> > > > for it now) or absent in parquet metadata footer.
> > > >
> > > >
> > > > 2. I'm not sure the reason to change isDateCorrect metadata property
> > when
> > > > the user disable dates correction.
> > > > If you have some use case it would be great if you provide it.
> > > >
> > > > 3. Maybe you are right regarding to when Parquet metadata is cloned.
> > > > Here I added the property in the same manner as Jason's new property
> > > > "drillVersion. So need it a separate unit test?
> > > >
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > > 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> > > >
> > > >> Forgot to copy the link to the code.
> > > >>
> > > >> [1] https://github.com/apache/drill/blob/master/exec/java-
> > > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > > >> Metadata.java#L950-L955
> > > >>
> > > >> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> > > >> > @Jason, @Vitalli,
> > > >> >
> > > >> > Any thoughts on this question, since both you worked on fix of
> > > >> DRILL-4203?
> > > >> >
> > > >> > Looking through the code, there is a third case [1], where this
> flag
> > > >> > is set to false when Parquet metadata is cloned (after partition
> > > >> > pruning, etc).  That means, for the 2nd case where the flag is set
> > to
> > > >> > true, if there is pruning happening, the new parquet metadata will
> > see
> > > >> > the flag is flipped to false. This does not make sense to me.
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni 
> wrote:
> > > >> >> Hello All,
> > > >> >>
> > > >> >> DRILL-4203 addressed the date field issue.  In the fix, it
> > introduced
> > > >> >> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have
> > some
> > > >> >> difficulty in understanding the meaning of this field.
> > > >> >>
> > > >> >> According to [1], this field is set to false, when Drill gets
> > parquet
> > > >> >> metadata from parquet footer.  This field is  set to true in code
> > > flow
> > > >> >> of [2] and [3], when Drill gets parquet metadata from meta data
> > > cache.
> > > >> >>
> > > >> >> Questions I have:
> > > >> >> 1.  If the parquet files are generated with Drill after
> DRILL-4203,
> > > >> >> Drill still thinks date field is NOT correct (isDateCorrect 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Zelaine Fong
Vitalii -- are you still planning to open a ticket and pull request for the
fix you've noted below?

-- Zelaine

On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka 
wrote:

> @Paul Rogers
> It may be the undefined case when the file is generated with drill.version
> = 1.9-SNAPSHOT.
> It is more easy to determine corrupted date with this flag and there is no
> need to wait the end of release to merge these changes.
>
> @Jinfeng NI
> It looks like you are right.
> With consistent mode (isDateCorrect = true) all tests are passed. So I am
> going to open a jira ticket for it with next changes
> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> 3410cac5d3
> Thanks.
>
> Kind regards
> Vitalii
>
> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
>
> > I'm not sure if I fully understand your answers. The bottom line is
> > quite simple: given a set of parquet files, the ParquetTableMeta
> > instance constructed in Drill should have identical value for
> > "isDateCorrect", whether it comes from parquet footer, or parquet
> > metadata cache, or whether there is partition pruning or not. However,
> > the code shows that this flag is not in consistent mode across
> > different cases.
> >
> >
> >
> > On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> >  wrote:
> > > Hi Jinfeng,
> > >
> > > 1.If the parquet files are generated with Drill after Drill-4203 these
> > > files have "isDateCorrect = true" property.
> > > Drill serializes this property from metadata now. When we set this
> > property
> > > in the first constructor we will hide the value from metadata.
> > > IsDateCorrect will be false only if this value equals to the false (no
> > case
> > > for it now) or absent in parquet metadata footer.
> > >
> > >
> > > 2. I'm not sure the reason to change isDateCorrect metadata property
> when
> > > the user disable dates correction.
> > > If you have some use case it would be great if you provide it.
> > >
> > > 3. Maybe you are right regarding to when Parquet metadata is cloned.
> > > Here I added the property in the same manner as Jason's new property
> > > "drillVersion. So need it a separate unit test?
> > >
> > >
> > > Kind regards
> > > Vitalii
> > >
> > > 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> > >
> > >> Forgot to copy the link to the code.
> > >>
> > >> [1] https://github.com/apache/drill/blob/master/exec/java-
> > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > >> Metadata.java#L950-L955
> > >>
> > >> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> > >> > @Jason, @Vitalli,
> > >> >
> > >> > Any thoughts on this question, since both you worked on fix of
> > >> DRILL-4203?
> > >> >
> > >> > Looking through the code, there is a third case [1], where this flag
> > >> > is set to false when Parquet metadata is cloned (after partition
> > >> > pruning, etc).  That means, for the 2nd case where the flag is set
> to
> > >> > true, if there is pruning happening, the new parquet metadata will
> see
> > >> > the flag is flipped to false. This does not make sense to me.
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
> > >> >> Hello All,
> > >> >>
> > >> >> DRILL-4203 addressed the date field issue.  In the fix, it
> introduced
> > >> >> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have
> some
> > >> >> difficulty in understanding the meaning of this field.
> > >> >>
> > >> >> According to [1], this field is set to false, when Drill gets
> parquet
> > >> >> metadata from parquet footer.  This field is  set to true in code
> > flow
> > >> >> of [2] and [3], when Drill gets parquet metadata from meta data
> > cache.
> > >> >>
> > >> >> Questions I have:
> > >> >> 1.  If the parquet files are generated with Drill after DRILL-4203,
> > >> >> Drill still thinks date field is NOT correct (isDateCorrect =
> false)?
> > >> >> 2.  Why does this filed have nothing to do with "autoCorrection"
> flag
> > >> >> [4]?  If someone turns off autoCorrection, will it have impact on
> > this
> > >> >> "isDateCorrect" flag ?
> > >> >>
> > >> >> Thanks in advance for any input,
> > >> >>
> > >> >> Jinfeng
> > >> >>
> > >> >>
> > >> >> [1] https://github.com/apache/drill/blob/master/exec/java-
> > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > Metadata.java#L932
> > >> >> [2] https://github.com/apache/drill/blob/master/exec/java-
> > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > Metadata.java#L936
> > >> >> [3] https://github.com/apache/drill/blob/master/exec/java-
> > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > Metadata.java#L187
> > >> >> [4] https://github.com/apache/drill/blob/master/exec/java-
> > >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> > >> Metadata.java#L354-L355
> > >>
> >
>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-26 Thread Vitalii Diravka
@Paul Rogers
It may be the undefined case when the file is generated with drill.version
= 1.9-SNAPSHOT.
It is more easy to determine corrupted date with this flag and there is no
need to wait the end of release to merge these changes.

@Jinfeng NI
It looks like you are right.
With consistent mode (isDateCorrect = true) all tests are passed. So I am
going to open a jira ticket for it with next changes
https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
Thanks.

Kind regards
Vitalii

2016-10-25 18:36 GMT+00:00 Jinfeng Ni :

> I'm not sure if I fully understand your answers. The bottom line is
> quite simple: given a set of parquet files, the ParquetTableMeta
> instance constructed in Drill should have identical value for
> "isDateCorrect", whether it comes from parquet footer, or parquet
> metadata cache, or whether there is partition pruning or not. However,
> the code shows that this flag is not in consistent mode across
> different cases.
>
>
>
> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>  wrote:
> > Hi Jinfeng,
> >
> > 1.If the parquet files are generated with Drill after Drill-4203 these
> > files have "isDateCorrect = true" property.
> > Drill serializes this property from metadata now. When we set this
> property
> > in the first constructor we will hide the value from metadata.
> > IsDateCorrect will be false only if this value equals to the false (no
> case
> > for it now) or absent in parquet metadata footer.
> >
> >
> > 2. I'm not sure the reason to change isDateCorrect metadata property when
> > the user disable dates correction.
> > If you have some use case it would be great if you provide it.
> >
> > 3. Maybe you are right regarding to when Parquet metadata is cloned.
> > Here I added the property in the same manner as Jason's new property
> > "drillVersion. So need it a separate unit test?
> >
> >
> > Kind regards
> > Vitalii
> >
> > 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> >
> >> Forgot to copy the link to the code.
> >>
> >> [1] https://github.com/apache/drill/blob/master/exec/java-
> >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >> Metadata.java#L950-L955
> >>
> >> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> >> > @Jason, @Vitalli,
> >> >
> >> > Any thoughts on this question, since both you worked on fix of
> >> DRILL-4203?
> >> >
> >> > Looking through the code, there is a third case [1], where this flag
> >> > is set to false when Parquet metadata is cloned (after partition
> >> > pruning, etc).  That means, for the 2nd case where the flag is set to
> >> > true, if there is pruning happening, the new parquet metadata will see
> >> > the flag is flipped to false. This does not make sense to me.
> >> >
> >> >
> >> >
> >> > On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
> >> >> Hello All,
> >> >>
> >> >> DRILL-4203 addressed the date field issue.  In the fix, it introduced
> >> >> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
> >> >> difficulty in understanding the meaning of this field.
> >> >>
> >> >> According to [1], this field is set to false, when Drill gets parquet
> >> >> metadata from parquet footer.  This field is  set to true in code
> flow
> >> >> of [2] and [3], when Drill gets parquet metadata from meta data
> cache.
> >> >>
> >> >> Questions I have:
> >> >> 1.  If the parquet files are generated with Drill after DRILL-4203,
> >> >> Drill still thinks date field is NOT correct (isDateCorrect = false)?
> >> >> 2.  Why does this filed have nothing to do with "autoCorrection" flag
> >> >> [4]?  If someone turns off autoCorrection, will it have impact on
> this
> >> >> "isDateCorrect" flag ?
> >> >>
> >> >> Thanks in advance for any input,
> >> >>
> >> >> Jinfeng
> >> >>
> >> >>
> >> >> [1] https://github.com/apache/drill/blob/master/exec/java-
> >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L932
> >> >> [2] https://github.com/apache/drill/blob/master/exec/java-
> >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L936
> >> >> [3] https://github.com/apache/drill/blob/master/exec/java-
> >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L187
> >> >> [4] https://github.com/apache/drill/blob/master/exec/java-
> >> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >> Metadata.java#L354-L355
> >>
>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
I'm not sure if I fully understand your answers. The bottom line is
quite simple: given a set of parquet files, the ParquetTableMeta
instance constructed in Drill should have identical value for
"isDateCorrect", whether it comes from parquet footer, or parquet
metadata cache, or whether there is partition pruning or not. However,
the code shows that this flag is not in consistent mode across
different cases.



On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
 wrote:
> Hi Jinfeng,
>
> 1.If the parquet files are generated with Drill after Drill-4203 these
> files have "isDateCorrect = true" property.
> Drill serializes this property from metadata now. When we set this property
> in the first constructor we will hide the value from metadata.
> IsDateCorrect will be false only if this value equals to the false (no case
> for it now) or absent in parquet metadata footer.
>
>
> 2. I'm not sure the reason to change isDateCorrect metadata property when
> the user disable dates correction.
> If you have some use case it would be great if you provide it.
>
> 3. Maybe you are right regarding to when Parquet metadata is cloned.
> Here I added the property in the same manner as Jason's new property
> "drillVersion. So need it a separate unit test?
>
>
> Kind regards
> Vitalii
>
> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
>
>> Forgot to copy the link to the code.
>>
>> [1] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> Metadata.java#L950-L955
>>
>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
>> > @Jason, @Vitalli,
>> >
>> > Any thoughts on this question, since both you worked on fix of
>> DRILL-4203?
>> >
>> > Looking through the code, there is a third case [1], where this flag
>> > is set to false when Parquet metadata is cloned (after partition
>> > pruning, etc).  That means, for the 2nd case where the flag is set to
>> > true, if there is pruning happening, the new parquet metadata will see
>> > the flag is flipped to false. This does not make sense to me.
>> >
>> >
>> >
>> > On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
>> >> Hello All,
>> >>
>> >> DRILL-4203 addressed the date field issue.  In the fix, it introduced
>> >> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
>> >> difficulty in understanding the meaning of this field.
>> >>
>> >> According to [1], this field is set to false, when Drill gets parquet
>> >> metadata from parquet footer.  This field is  set to true in code flow
>> >> of [2] and [3], when Drill gets parquet metadata from meta data cache.
>> >>
>> >> Questions I have:
>> >> 1.  If the parquet files are generated with Drill after DRILL-4203,
>> >> Drill still thinks date field is NOT correct (isDateCorrect = false)?
>> >> 2.  Why does this filed have nothing to do with "autoCorrection" flag
>> >> [4]?  If someone turns off autoCorrection, will it have impact on this
>> >> "isDateCorrect" flag ?
>> >>
>> >> Thanks in advance for any input,
>> >>
>> >> Jinfeng
>> >>
>> >>
>> >> [1] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>> >> [2] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>> >> [3] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>> >> [4] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> Metadata.java#L354-L355
>>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Paul Rogers
Would it make sense to write the Drill version into the Parquet metadata, and 
decide on date format based on the Drill version? This works if Drill versions 
after, say, 1.9 has the “correct” format and anything with an earlier version 
has “incorrect” dates. This is the typical way that folks handle format changes 
across versions.

- Paul

> On Oct 25, 2016, at 11:24 AM, Vitalii Diravka  
> wrote:
> 
> Hi Jinfeng,
> 
> 1.If the parquet files are generated with Drill after Drill-4203 these
> files have "isDateCorrect = true" property.
> Drill serializes this property from metadata now. When we set this property
> in the first constructor we will hide the value from metadata.
> IsDateCorrect will be false only if this value equals to the false (no case
> for it now) or absent in parquet metadata footer.
> 
> 
> 2. I'm not sure the reason to change isDateCorrect metadata property when
> the user disable dates correction.
> If you have some use case it would be great if you provide it.
> 
> 3. Maybe you are right regarding to when Parquet metadata is cloned.
> Here I added the property in the same manner as Jason's new property
> "drillVersion. So need it a separate unit test?
> 
> 
> Kind regards
> Vitalii
> 
> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> 
>> Forgot to copy the link to the code.
>> 
>> [1] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> Metadata.java#L950-L955
>> 
>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
>>> @Jason, @Vitalli,
>>> 
>>> Any thoughts on this question, since both you worked on fix of
>> DRILL-4203?
>>> 
>>> Looking through the code, there is a third case [1], where this flag
>>> is set to false when Parquet metadata is cloned (after partition
>>> pruning, etc).  That means, for the 2nd case where the flag is set to
>>> true, if there is pruning happening, the new parquet metadata will see
>>> the flag is flipped to false. This does not make sense to me.
>>> 
>>> 
>>> 
>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
 Hello All,
 
 DRILL-4203 addressed the date field issue.  In the fix, it introduced
 a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
 difficulty in understanding the meaning of this field.
 
 According to [1], this field is set to false, when Drill gets parquet
 metadata from parquet footer.  This field is  set to true in code flow
 of [2] and [3], when Drill gets parquet metadata from meta data cache.
 
 Questions I have:
 1.  If the parquet files are generated with Drill after DRILL-4203,
 Drill still thinks date field is NOT correct (isDateCorrect = false)?
 2.  Why does this filed have nothing to do with "autoCorrection" flag
 [4]?  If someone turns off autoCorrection, will it have impact on this
 "isDateCorrect" flag ?
 
 Thanks in advance for any input,
 
 Jinfeng
 
 
 [1] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
 [2] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
 [3] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
 [4] https://github.com/apache/drill/blob/master/exec/java-
>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> Metadata.java#L354-L355
>> 



Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Vitalii Diravka
Hi Jinfeng,

1.If the parquet files are generated with Drill after Drill-4203 these
files have "isDateCorrect = true" property.
Drill serializes this property from metadata now. When we set this property
in the first constructor we will hide the value from metadata.
IsDateCorrect will be false only if this value equals to the false (no case
for it now) or absent in parquet metadata footer.


2. I'm not sure the reason to change isDateCorrect metadata property when
the user disable dates correction.
If you have some use case it would be great if you provide it.

3. Maybe you are right regarding to when Parquet metadata is cloned.
Here I added the property in the same manner as Jason's new property
"drillVersion. So need it a separate unit test?


Kind regards
Vitalii

2016-10-25 16:23 GMT+00:00 Jinfeng Ni :

> Forgot to copy the link to the code.
>
> [1] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L950-L955
>
> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> > @Jason, @Vitalli,
> >
> > Any thoughts on this question, since both you worked on fix of
> DRILL-4203?
> >
> > Looking through the code, there is a third case [1], where this flag
> > is set to false when Parquet metadata is cloned (after partition
> > pruning, etc).  That means, for the 2nd case where the flag is set to
> > true, if there is pruning happening, the new parquet metadata will see
> > the flag is flipped to false. This does not make sense to me.
> >
> >
> >
> > On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
> >> Hello All,
> >>
> >> DRILL-4203 addressed the date field issue.  In the fix, it introduced
> >> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
> >> difficulty in understanding the meaning of this field.
> >>
> >> According to [1], this field is set to false, when Drill gets parquet
> >> metadata from parquet footer.  This field is  set to true in code flow
> >> of [2] and [3], when Drill gets parquet metadata from meta data cache.
> >>
> >> Questions I have:
> >> 1.  If the parquet files are generated with Drill after DRILL-4203,
> >> Drill still thinks date field is NOT correct (isDateCorrect = false)?
> >> 2.  Why does this filed have nothing to do with "autoCorrection" flag
> >> [4]?  If someone turns off autoCorrection, will it have impact on this
> >> "isDateCorrect" flag ?
> >>
> >> Thanks in advance for any input,
> >>
> >> Jinfeng
> >>
> >>
> >> [1] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
> >> [2] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
> >> [3] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
> >> [4] https://github.com/apache/drill/blob/master/exec/java-
> exec/src/main/java/org/apache/drill/exec/store/parquet/
> Metadata.java#L354-L355
>


Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
Forgot to copy the link to the code.

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955

On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> @Jason, @Vitalli,
>
> Any thoughts on this question, since both you worked on fix of DRILL-4203?
>
> Looking through the code, there is a third case [1], where this flag
> is set to false when Parquet metadata is cloned (after partition
> pruning, etc).  That means, for the 2nd case where the flag is set to
> true, if there is pruning happening, the new parquet metadata will see
> the flag is flipped to false. This does not make sense to me.
>
>
>
> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
>> Hello All,
>>
>> DRILL-4203 addressed the date field issue.  In the fix, it introduced
>> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
>> difficulty in understanding the meaning of this field.
>>
>> According to [1], this field is set to false, when Drill gets parquet
>> metadata from parquet footer.  This field is  set to true in code flow
>> of [2] and [3], when Drill gets parquet metadata from meta data cache.
>>
>> Questions I have:
>> 1.  If the parquet files are generated with Drill after DRILL-4203,
>> Drill still thinks date field is NOT correct (isDateCorrect = false)?
>> 2.  Why does this filed have nothing to do with "autoCorrection" flag
>> [4]?  If someone turns off autoCorrection, will it have impact on this
>> "isDateCorrect" flag ?
>>
>> Thanks in advance for any input,
>>
>> Jinfeng
>>
>>
>> [1] 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>> [2] 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>> [3] 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>> [4] 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355


Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
@Jason, @Vitalli,

Any thoughts on this question, since both you worked on fix of DRILL-4203?

Looking through the code, there is a third case [1], where this flag
is set to false when Parquet metadata is cloned (after partition
pruning, etc).  That means, for the 2nd case where the flag is set to
true, if there is pruning happening, the new parquet metadata will see
the flag is flipped to false. This does not make sense to me.



On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni  wrote:
> Hello All,
>
> DRILL-4203 addressed the date field issue.  In the fix, it introduced
> a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
> difficulty in understanding the meaning of this field.
>
> According to [1], this field is set to false, when Drill gets parquet
> metadata from parquet footer.  This field is  set to true in code flow
> of [2] and [3], when Drill gets parquet metadata from meta data cache.
>
> Questions I have:
> 1.  If the parquet files are generated with Drill after DRILL-4203,
> Drill still thinks date field is NOT correct (isDateCorrect = false)?
> 2.  Why does this filed have nothing to do with "autoCorrection" flag
> [4]?  If someone turns off autoCorrection, will it have impact on this
> "isDateCorrect" flag ?
>
> Thanks in advance for any input,
>
> Jinfeng
>
>
> [1] 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
> [2] 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
> [3] 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
> [4] 
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355


isDateCorrect field in ParquetTableMetadata

2016-10-24 Thread Jinfeng Ni
Hello All,

DRILL-4203 addressed the date field issue.  In the fix, it introduced
a new field in ParquetTableMetadata_v2 : isDateCorrect.  I have some
difficulty in understanding the meaning of this field.

According to [1], this field is set to false, when Drill gets parquet
metadata from parquet footer.  This field is  set to true in code flow
of [2] and [3], when Drill gets parquet metadata from meta data cache.

Questions I have:
1.  If the parquet files are generated with Drill after DRILL-4203,
Drill still thinks date field is NOT correct (isDateCorrect = false)?
2.  Why does this filed have nothing to do with "autoCorrection" flag
[4]?  If someone turns off autoCorrection, will it have impact on this
"isDateCorrect" flag ?

Thanks in advance for any input,

Jinfeng


[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
[2] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
[3] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
[4] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355