Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

Sergio Pena Sat, 27 Jun 2015 17:30:07 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
-----------------------------------------------------------


(Updated June 28, 2015, 12:29 a.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Bugs: HIVE-11131
    https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
-------

Implemented data type writers that are created before the first Hive row is
written to Parquet. Each writer holds the object inspector and schema
information for a specific data type, and calls the specific addXXXX() method
used by Parquet for that data type.
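
A minimal sketch of the idea, not the code from the patch: the per-type
dispatch is resolved once into an array of writers, and every row is then
written by plain iteration. The names `Sink`, `ColumnWriter`, `writerFor`,
`buildWriters`, and `RecordingSink` are hypothetical and exist only for this
illustration. `Sink` is a stand-in for the Parquet consumer that exposes the
addXXXX() methods, and type names replace the object inspectors to keep the
example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Parquet consumer; not the real API.
interface Sink {
    void addLong(long v);
    void addBoolean(boolean v);
    void addString(String v);
}

// One writer per column, chosen once from the schema.
interface ColumnWriter {
    void write(Object value, Sink sink);
}

final class PrebuiltWritersSketch {
    // Resolve a type name to its writer a single time, before any row is written.
    static ColumnWriter writerFor(String typeName) {
        switch (typeName) {
            case "bigint":  return (v, sink) -> sink.addLong((Long) v);
            case "boolean": return (v, sink) -> sink.addBoolean((Boolean) v);
            case "string":  return (v, sink) -> sink.addString((String) v);
            default:
                throw new IllegalArgumentException("unsupported type: " + typeName);
        }
    }

    // Build the writer array once per file from the column types.
    static ColumnWriter[] buildWriters(String[] schema) {
        ColumnWriter[] writers = new ColumnWriter[schema.length];
        for (int i = 0; i < schema.length; i++) {
            writers[i] = writerFor(schema[i]);
        }
        return writers;
    }

    // Per-row path: no type inspection, only calls to the prebuilt writers.
    static void writeRow(Object[] row, ColumnWriter[] writers, Sink sink) {
        for (int i = 0; i < writers.length; i++) {
            writers[i].write(row[i], sink);
        }
    }

    // Sink that records which method was called, for demonstration only.
    static final class RecordingSink implements Sink {
        final List<String> calls = new ArrayList<>();
        public void addLong(long v)       { calls.add("addLong(" + v + ")"); }
        public void addBoolean(boolean v) { calls.add("addBoolean(" + v + ")"); }
        public void addString(String v)   { calls.add("addString(" + v + ")"); }
    }

    public static void main(String[] args) {
        ColumnWriter[] writers = buildWriters(new String[] {"bigint", "boolean", "string"});
        RecordingSink sink = new RecordingSink();
        writeRow(new Object[] {42L, true, "a"}, writers, sink);
        System.out.println(sink.calls);
    }
}
```

The point of the design is that the `switch` runs once per column rather than
once per value, which is where a per-row saving would be expected to come from.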


Diffs
-----

  
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 
c195c3ec3ddae19bf255fc2c9633f8bf4390f428 

Diff: https://reviews.apache.org/r/35950/diff/


Testing (updated)
-------

Tests from TestDataWritableWriter run OK.

I also ran other tests with micro-benchmarks and got better results from the
new implementation:

Using repeated rows across the file, this is the throughput increase using 1 
million records:

bigint  boolean  double  float   int     string
7.598   7.491    7.488   7.588   7.53    0.270    (before)
10.137  11.511   10.155  10.297  10.242  0.286    (after)

Using random rows across the file, this is the throughput increase using 1
million records:

bigint  boolean  double  float   int     string
5.268   7.723    4.107   4.173   4.729   0.20     (before)
6.236   10.466   5.944   4.749   5.234   0.22     (after)


Thanks,

Sergio Pena
