> On Jan. 28, 2015, 5:23 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java,
> >  lines 218-225
> > <https://reviews.apache.org/r/30281/diff/2-3/?file=835466#file835466line218>
> >
> >     How about the following code snippet?
> >     recordConsumer.startField(fieldName, i);
> >     if (i % 2 == 0) {
> >       writeValue(keyElement, keyInspector, fieldType);
> >     } else {
> >       writeValue(valueElement, valueInspector, fieldType);
> >     }
> >     recordConsumer.endField(fieldName, i);

The Parquet API does not accept null values inside startField/endField. This is
why I had to check whether the key or value is null before starting the field.
Alternatively, in the change I made, we check for null values everywhere and
then call startField/endField inside writePrimitive. See the
TestDataWritableWriter.testMapType() method for how null values should be
handled.

This is how Parquet writes the map entry 'key3 = null' (the null value field is
simply omitted):

startGroup();
  startField("key", 0);
    addString("key3");
  endField("key", 0);
endGroup();
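
For contrast, here is a minimal sketch of the null guard this forces on the
map-writing side. Names such as writeValue, keyInspector, and the element
variables are assumptions based on this discussion, not the exact patch code:

recordConsumer.startGroup();
// A null key or value must be detected BEFORE startField, because
// Parquet rejects null values between startField/endField.
if (keyElement != null) {
  recordConsumer.startField("key", 0);
  writeValue(keyElement, keyInspector, keyType);
  recordConsumer.endField("key", 0);
}
if (valueElement != null) {
  recordConsumer.startField("value", 1);
  writeValue(valueElement, valueInspector, valueType);
  recordConsumer.endField("value", 1);
}
recordConsumer.endGroup();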


> On Jan. 28, 2015, 5:23 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java,
> >  line 76
> > <https://reviews.apache.org/r/30281/diff/2-3/?file=835466#file835466line76>
> >
> >     Hi Sergio, I am a little confused about the purpose of pushing
> > startField & endField down. As the method name "writeGroupFields" indicates,
> > it writes the fields of a group one by one. My suggestion is to move these
> > two lines back. If I missed anything, please tell me your reasoning for
> > this change.

See the comment regarding the writeMap() method. We can go back to the original
implementation to make it look better, but then writeMap() will not look very
clean. The point is that we cannot add null values inside startField/endField.
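
To make the trade-off concrete, here is a rough sketch of how
writeGroupFields() would look with startField/endField moved back up and the
null check hoisted into the loop. fieldName, fieldInspector, and fieldType
stand for hypothetical per-field lookups, not the actual patch code:

for (int i = 0; i < fields.size(); i++) {
  Object fieldValue = inspector.getStructFieldData(value, fields.get(i));
  if (fieldValue != null) {  // null fields must be skipped entirely
    recordConsumer.startField(fieldName, i);
    writeValue(fieldValue, fieldInspector, fieldType);
    recordConsumer.endField(fieldName, i);
  }
}

writeMap() cannot reuse this shape as cleanly, because each map entry is its
own group with separate key and value fields, and each side needs its own
null check.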


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review69935
-----------------------------------------------------------


On Jan. 27, 2015, 6:47 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
> 
> (Updated Jan. 27, 2015, 6:47 p.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
> 
> 
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This patch moves the ParquetHiveSerDe.serialize() implementation to the
> DataWritableWriter class in order to save the time spent materializing data
> on serialize().
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
>  ea4109d358f7c48d1e2042e5da299475de4a0a29 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 
> 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
>  060b1b722d32f3b2f88304a1a73eb249e150294b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
>  41b5f1c3b0ab43f734f8a211e3e03d5060c75434 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java
>  e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
> a693aff18516d133abf0aae4847d3fe00b9f1c96 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java
>  667d3671547190d363107019cd9a2d105d26d336 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 
> 007a665529857bcec612f638a157aa5043562a15 
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/30281/diff/
> 
> 
> Testing
> -------
> 
> The tests run were the following:
> 
> 1. JMH (Java microbenchmark)
> 
> This benchmark called the Parquet serialize/write methods using Text writable
> objects.
> 
> Class.method                 Before Change (ops/s)   After Change (ops/s)   Speedup
> -----------------------------------------------------------------------------------
> ParquetHiveSerDe.serialize                  19,113                249,528   ~13x faster
> DataWritableWriter.write                     5,033                  5,201   +3.34%
> 
> 
> 2. Write 20 million rows (~1GB file) from Text to Parquet
> 
> I wrote a ~1 GB file in TextFile format, then converted it to Parquet using
> the following statement:
> 
> CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
> 
> Time it took to write the whole file BEFORE the changes: 93.758 s
> Time it took to write the whole file AFTER the changes:  83.903 s
> 
> That is roughly a 10% speed increase.
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>
