> On Jan. 28, 2015, 5:23 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java, lines 218-225
> > <https://reviews.apache.org/r/30281/diff/2-3/?file=835466#file835466line218>
> >
> > How about the following code snippet?
> >
> >     recordConsumer.startField(fieldName, i);
> >     if (i % 2 == 0) {
> >       writeValue(keyElement, keyInspector, fieldType);
> >     } else {
> >       writeValue(valueElement, valueInspector, fieldType);
> >     }
> >     recordConsumer.endField(fieldName, i);
The Parquet API does not accept NULL values inside startField/endField. This is why I had to check whether the key or value is null before starting the field. Alternatively, with the change I made, we check for null values everywhere and then call startField/endField in writePrimitive. See the TestDataWritableWriter.testMapType() method for how null values should work. This is how Parquet adds the map value 'key3 = null':

    startGroup();
    startField("key", 0);
    addString("key3");
    endField("key", 0);
    endGroup();


> On Jan. 28, 2015, 5:23 a.m., cheng xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java, line 76
> > <https://reviews.apache.org/r/30281/diff/2-3/?file=835466#file835466line76>
> >
> > Hi Sergio, I am a little confused about the purpose of pushing startField & endField down. As the method name "writeGroupFields" indicates, it writes the fields of a group one by one. My suggestion is to move these two lines back. If I missed anything, please tell me your consideration about this change.

See the comment regarding the writeMap() method. We can go back to the original implementation to make it look better, but writeMap() will not look very clean. The thing is that we cannot add null values inside startField/endField.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review69935
-----------------------------------------------------------


On Jan. 27, 2015, 6:47 p.m., Sergio Pena wrote:

> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
>
> (Updated Jan. 27, 2015, 6:47 p.m.)
>
>
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
>
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class in order to save time in materializing data on serialize().
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15 
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION 
>
> Diff: https://reviews.apache.org/r/30281/diff/
>
>
> Testing
> -------
>
> The tests run were the following:
>
> 1. JMH (Java microbenchmark)
>
> This benchmark called parquet serialize/write methods using text writable objects.
>
>   Class.method                 Before Change (ops/s)   After Change (ops/s)
>   --------------------------------------------------------------------------
>   ParquetHiveSerDe.serialize:  19,113                  249,528  -> ~13x speed increase
>   DataWritableWriter.write:     5,033                    5,201  -> 3.34% speed increase
>
> 2. Write 20 million rows (~1GB file) from Text to Parquet
>
> I wrote a ~1GB file in TEXTFILE format, then converted it to the Parquet format using the following statement:
>
>   CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
>
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes:  83.903 s
>
> That is about a 10% speed increase.
>
>
> Thanks,
>
> Sergio Pena
>
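To make the null-handling point in the thread above concrete, here is a minimal runnable sketch. It does NOT use the real Parquet API: the `EventLog` class below is a hypothetical stand-in for parquet's `RecordConsumer`, chosen only so the pattern can run standalone. The idea it illustrates is the one discussed above: a map entry with a null value must skip startField/endField for the missing side entirely, rather than emitting a null inside them.

```java
import java.util.ArrayList;
import java.util.List;

public class MapNullSketch {
    // Stand-in for parquet's RecordConsumer (assumption for illustration):
    // it just records the sequence of write events.
    static class EventLog {
        final List<String> events = new ArrayList<>();
        void startGroup()                { events.add("startGroup"); }
        void endGroup()                  { events.add("endGroup"); }
        void startField(String n, int i) { events.add("startField(" + n + "," + i + ")"); }
        void endField(String n, int i)   { events.add("endField(" + n + "," + i + ")"); }
        void addString(String v)         { events.add("addString(" + v + ")"); }
    }

    // Each map entry is written as a group with a "key" and a "value" field.
    // A null key or value means its startField/endField pair is never emitted,
    // mirroring the 'key3 = null' event sequence quoted in the review.
    static void writeMapEntry(EventLog c, String key, String value) {
        c.startGroup();
        if (key != null) {
            c.startField("key", 0);
            c.addString(key);
            c.endField("key", 0);
        }
        if (value != null) {
            c.startField("value", 1);
            c.addString(value);
            c.endField("value", 1);
        }
        c.endGroup();
    }

    public static void main(String[] args) {
        EventLog log = new EventLog();
        writeMapEntry(log, "key3", null);
        // prints: startGroup; startField(key,0); addString(key3); endField(key,0); endGroup
        System.out.println(String.join("; ", log.events));
    }
}
```

Note how the null value produces no "value" events at all, which is why a scheme that always wraps writeValue in startField/endField (as in the snippet suggested at the top of the thread) cannot handle null map values.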