> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java,
> >  line 53
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195960#file1195960line53>
> >
> >     This is a little odd. Normally, I would consider GMT/UTC to be the 
> > calendar that applies no adjustment. But what is happening in this code is 
> > that the UTC millisecond value is being constructed using values in that 
> > zone and then a new timestamp is created from it that is in the local zone. 
> > So local->local produces no adjustment, while GMT->local does.
> >     
> >     I was thinking that GMT would always be used to create the 
> > milliseconds, then the calendar would be used to adjust the value by some 
> > amount. UTC by 0, EST by -5, and PST by -8.
> >     
> >     I think it may be easier to have a method that does a single conversion 
> > to UTC in milliseconds. Then, adjust the value using a static offset like 
> > the one from `TimeZone.getOffset(utc_ms)`. That would result in a bit less 
> > work done here and I think would make the adjustment easier to reason about.

I can fix this in a follow-up JIRA so we keep the current code for now without 
risking extra bugs or test changes.
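For reference, a minimal sketch of the adjustment described above: build the UTC millisecond value once, then shift it by the target zone's offset from `TimeZone.getOffset(utc_ms)`, so GMT adjusts by 0, EST by -5 hours, PST by -8. The class and method names here are hypothetical, not the patch's actual code:

```java
import java.util.TimeZone;

// Hypothetical sketch (not the code in NanoTimeUtils): convert once to UTC
// milliseconds, then apply the zone's offset as a single static adjustment.
public class NanoTimeOffsetSketch {
    static long adjustToZone(long utcMillis, TimeZone zone) {
        // getOffset(long) returns the zone's offset from UTC at that
        // instant, including any DST that is in effect.
        return utcMillis + zone.getOffset(utcMillis);
    }

    public static void main(String[] args) {
        long utc = 1451610061000L; // 2016-01-01 01:01:01 UTC
        // GMT applies no adjustment.
        System.out.println(adjustToZone(utc, TimeZone.getTimeZone("GMT")) == utc);
        // PST (UTC-8, no DST on Jan 1) shifts the value by -8 hours.
        System.out.println(
            adjustToZone(utc, TimeZone.getTimeZone("PST")) - utc == -8L * 3600 * 1000);
    }
}
```

This keeps the zone-dependent work in one place, which is what makes the adjustment easier to reason about.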


> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java,
> >  line 232
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195963#file1195963line232>
> >
> >     It looks like this tests that the value written is the result of 
> > `getNanoTime(Timestamp,Calendar).toBinary()`, but doesn't test what that 
> > value is or should be. That is probably the intent, but I wanted to make 
> > sure.

It is testing that the Binary form of the Timestamp passed to addBinary() is 
the same as the one found in the ArrayWritable.

This is the ArrayWritable that Hive creates when writing to Parquet:
    ArrayWritable hiveRecord = createGroup(
      createTimestamp(Timestamp.valueOf("2016-01-01 01:01:01"))
    );
    
DataWritableWriter then gets the Timestamp value and converts it to Binary 
form. Here we convert the same Timestamp to Binary ourselves to check that 
the value passed to addBinary() matches:
    startMessage();
      startField("ts", 0);
        addBinary(NanoTimeUtils.getNanoTime(Timestamp.valueOf("2016-01-01 01:01:01"),
            Calendar.getInstance(TimeZone.getTimeZone("CST"))).toBinary());
      endField("ts", 0);
    endMessage();


> On Jan. 22, 2016, 11:15 p.m., Ryan Blue wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java,
> >  line 214
> > <https://reviews.apache.org/r/41821/diff/3/?file=1195966#file1195966line214>
> >
> >     This is a great test for round trip, but I think we should also have a 
> > test with a set of corrupt values from a file written with 5.5 in 
> > America/Los_Angeles (since that's convenient).

I added the negative tests to parquet_int96_timestamp_errors.q


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/#review115731
-----------------------------------------------------------


On Jan. 13, 2016, 8:36 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41821/
> -----------------------------------------------------------
> 
> (Updated Jan. 13, 2016, 8:36 p.m.)
> 
> 
> Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and 
> Szehon Ho.
> 
> 
> Bugs: HIVE-12767
>     https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following exit criteria are addressed in this patch:
> 
> * Hive will read Parquet MR int96 timestamp data and adjust values using a 
> time zone from a table property, if set, or using the local time zone if it 
> is absent. No adjustment will be applied to data written by Impala.
> 
> * Hive will write Parquet int96 timestamps using a time zone adjustment from 
> the same table property, if set, or using the local time zone if it is 
> absent. This keeps the data in the table consistent.
> 
> * New tables created by Hive will set the table property to UTC if the global 
> option to set the property for new tables is enabled.
>   * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set 
> the property unless the global setting to do so is enabled.
>   * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the 
> property of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties 
> ('parquet.mr.int96.write.zone'='PST');
>   
> To set UTC as default timezone table property on new tables created, use 
> this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
> 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef 
>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 
> deec8bba45c130c5dfdc482522c0825a71af9d2c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
>  bfb48a987ce89a373f3da63c9162546c6eda43a9 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java 
> ec0dd818f688ab92feb46be4fb6040ede5ac756a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
>  53f3b72b790d87a75a7cd1d77d8f011c29c41188 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
>  74a1a82047613189678716f765bfaa9ac39b7618 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java 
> aace48ee7d145d199163286d21e4ee7694140d6f 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
>  f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
>  69272dc41dbc5fe29ab4c98e730b591c28f3a297 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
> 70491390ba2b90f32ef9963be7b19e57672241f3 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>  PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>  ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>  PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41821/diff/
> 
> 
> Testing
> -------
> 
> Added unit and q-tests:
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>
