> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java
> > Line 72 (original), 72 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695509#file1695509line72>
> >
> >     How does this work? I don't understand this change.

The user.timezone system property is used to set the default timezone of the 
JVM. If this is set on the HS2 instance then we need to propagate it to the 
child VM spawned by a local task or timestamps read by the local task will be 
incorrect.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java
> > Line 181 (original), 181 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695511#file1695511line181>
> >
> >     Is this compatible with old parquet tables? if the property is not set, 
> > then the validateTimeZone    might fail, right? If so, do we want to fail 
> > reading tables that do not have a property set?
> >     
> >     Something else to consider, if a user sets a timezone improperly in a 
> > different tool or something      happened that we got an invalid timezone, 
> > then do we want to fail when reading those files? Just      wondering this 
> > scenario, no need to fix it right away.

At this point the timezone property had to be set by 
ParquetTableUtils#setParquetTimeZoneIfAbsent either from the table properties 
or using the default value TimeZone#getDefault. The core problem is that I 
found it very difficult to make sure that <every> execution path will check the 
table property.
- The FetchOperator works when we have a local task, but the 
MapRedParquetInputFormat does not (MapWork is null). 
- The FetchOperator will not work with a complex query or an order by clause, 
but the InputFormat should work in this case. 
- For statistics gathering only the StatNoJobTask is executed.
I wanted to make sure that if we have an execution path I forgot about, then we 
should rather fail than to read incorrect timestamp values silently.
Similarly in my opinion if the timezone value is incorrect (because it was set 
by another tool) then we should fail instead of reading illadjusted values.


> On May 2, 2017, 5:27 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java
> > Lines 35 (patched)
> > <https://reviews.apache.org/r/58501/diff/3/?file=1695512#file1695512line35>
> >
> >     Why is Map<?, ?> used instead of Map<String, String>? Aren't all table 
> > properties key, value string pairs?
> >     
> >     Also, the ensureTablePropertySet() name seems not related to what we 
> > want to do. I thought it was going to throw an exception if the property 
> > was not set, but it is setting the value on the JobConf. Should we use a 
> > different name, such as setParquetTimeZoneIfNotSet(),      
> > setParquetTimeZoneIfAbsent() or something like that helps us understand 
> > quickly without looking at the javadoc.

We are calling this method with Properties objects (i.e. from the 
FetchOperator) and using Map<String, String> objects (i.e. from the 
StatsNoJobTask) and the common ancestor for these two is the Map<?,?>. While it 
is true that the table properties can only be Strings so the Properties should 
only contain String pairs I wanted to avoid the explicit cast.


- Barna Zsombor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58501/#review173610
-----------------------------------------------------------


On May 3, 2017, 12:59 p.m., Barna Zsombor Klara wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58501/
> -----------------------------------------------------------
> 
> (Updated May 3, 2017, 12:59 p.m.)
> 
> 
> Review request for hive, Sergio Pena and Zoltan Ivanfi.
> 
> 
> Bugs: HIVE-16469
>     https://issues.apache.org/jira/browse/HIVE-16469
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-16469: Parquet timestamp table property is not always taken into account
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 
> 757b7fc0eaa39c956014aa446ab1b07fc4abf8d3 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java 
> 13750cdc34711d22f2adf2f483a6773ad05fb8d2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/StatsNoJobTask.java 
> 9c3a664b9aea2d6e050ffe2d7626127827dbc52a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapRedTask.java 
> 1bd4db7805689ae1f91921ffbb5ff7da59f4bf60 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java
>  f4fadbb61bf45f62945700284c0b050f0984b696 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java 
> 2954601ce5bb25905cdb29ca0ca4551c2ca12b95 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 
> 6413c5add6db2e8c9298285b15dba33ee74379a8 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java 
> b339cc4347eea143dca2f6d98f9aa7777afdc427 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java 
> dbd6fb3d0bc8c753abf86e99b52377617f248b5a 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/AbstractTestParquetDirect.java
>  c81499a91c84af3ba33f335506c1c44e7085f13d 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRowGroupFilter.java
>  bf363f32a3ac0a4d790e2925d802c6e210adfb4b 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java
>  f2d79cf9d215e9a6e2a5e88cfc78378be860fd1f 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestNanoTimeUtils.java
>  1e10dbf18742524982606f1e6c6d447d683b2dc3 
>   ql/src/test/queries/clientnegative/parquet_int96_alter_invalid_timezone.q 
> PRE-CREATION 
>   ql/src/test/queries/clientnegative/parquet_int96_create_invalid_timezone.q 
> PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q 
> 6eadd1b0a3313cbba7a798890b802baae302749e 
>   
> ql/src/test/results/clientnegative/parquet_int96_alter_invalid_timezone.q.out 
> PRE-CREATION 
>   
> ql/src/test/results/clientnegative/parquet_int96_create_invalid_timezone.q.out
>  PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out 
> b9a3664458a83f1856e4bc59eba5d56665df61cc 
>   ql/src/test/results/clientpositive/spark/parquet_int96_timestamp.q.out 
> PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58501/diff/4/
> 
> 
> Testing
> -------
> 
> Added qtests for the following cases:
> - order by clause
> - selfjoin
> - calling UDFs with the timestamp values
> - where clause with a constant cast as timestamp
> - test for HoS
> - implicit and explicit timestamp conversions in insert clause
> 
> Tested manually but no qtests:
> - join between 3 tables all parquet but with different/no timezone property
> - subselect in from/where clauses
> - exists / union / no exists
> 
> 
> Thanks,
> 
> Barna Zsombor Klara
> 
>

Reply via email to