> On Nov. 11, 2014, 11:02 p.m., Ryan Blue wrote:
> > Great job! I think this is really close. I found one bug and some minor
> > stuff, all of which is noted below. Other than those, I have two major
> > comments.
> >
> > First, the organization duplicates code from writeFields: writeMap and
> > writeArray each write the fields of the inner group themselves, where
> > writeFields could be used instead. This also deviates from what the other
> > writer implementations do, which is to require the schema to match a
> > certain pattern and write exactly that pattern. See [1] for an example of
> > what I'm talking about. But I think this was actually structured this way
> > to work with multiple write schemas...
> >
> > The second problem is that this works with multiple write schemas. This
> > was probably a miscommunication on my part, but the backward-compatibility
> > rules in PARQUET-113 are for the read side, not the write side. Hive is
> > already producing the correct structure when writing lists and maps, so we
> > don't need to change that. The problem that PARQUET-113 addresses is
> > reading data written by parquet-avro or parquet-thrift (or similar
> > variants) that uses the 2-level list representation. We still need to fix
> > compatibility with that data, but it should probably be done as a separate
> > issue.
> >
> > [1] - https://github.com/rdblue/incubator-parquet-mr/blob/PARQUET-113-nested-types-avro/parquet-avro/src/main/java/parquet/avro/AvroWriteSupport.java#L118

Thanks for the comments. I got confused about writing different schema
formats. At some point during coding I was treating Hive like MySQL, where
you can insert arbitrary data into tables, so I imagined that the write
schemas could differ. But you're correct: that will never happen.

So, here's my plan for this:

- Remove the unit tests that use schemas Hive will never produce when
  writing.
- Fix DataWritableWriter.java to expect a single schema pattern for each of
  the different types.
- Check whether I can call writeFields or writeGroup from the writeMap and
  writeList functions to reuse code. By the way, writeMap can call writeGroup
  with no problems, but writeList differs slightly from writeFields; I'll let
  you know in the next patch. (A rough sketch of what I mean is below.)
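For reference, this is roughly the list-writing shape I have in mind (an
untested sketch against the parquet.io.api.RecordConsumer API; the "bag" and
"array_element" names and the writeValue() helper are illustrative here, not
the final DataWritableWriter code):

    // Assumes Hive's 3-level list schema, along the lines of:
    //   optional group my_list (LIST) { repeated group bag { optional <type> array_element; } }
    // A null list is handled at the field level (writeFields) before this is called.
    private void writeList(List<Object> elements, String fieldName, int fieldIndex) {
      recordConsumer.startField(fieldName, fieldIndex);
      recordConsumer.startGroup();                     // the outer LIST group
      if (!elements.isEmpty()) {
        recordConsumer.startField("bag", 0);
        for (Object element : elements) {
          recordConsumer.startGroup();                 // one group per element, so a
          if (element != null) {                       // null element stays detectable
            recordConsumer.startField("array_element", 0);
            writeValue(element);                       // existing per-type dispatch
            recordConsumer.endField("array_element", 0);
          }
          recordConsumer.endGroup();
        }
        recordConsumer.endField("bag", 0);
      }
      recordConsumer.endGroup();
      recordConsumer.endField(fieldName, fieldIndex);
    }

The per-element startGroup/endGroup wrapper is the part writeFields doesn't
have, which is why writeList can't reuse it directly.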
- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27404/#review60863
-----------------------------------------------------------


On Nov. 10, 2014, 7:04 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27404/
> -----------------------------------------------------------
> 
> (Updated Nov. 10, 2014, 7:04 p.m.)
> 
> 
> Review request for hive, Ryan Blue and Brock Noland.
> 
> 
> Bugs: HIVE-8359
>     https://issues.apache.org/jira/browse/HIVE-8359
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The patch changes the way the DataWritableWriter class writes an array of
> elements to the Parquet record. It wraps each array element in a new
> startGroup/endGroup block so that Parquet can detect null values on those
> optional fields.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c7078efe27482df0a11dd68ac068da27dbcf51b3 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_map_null.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_map_null.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27404/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 