> On Nov. 11, 2014, 11:02 p.m., Ryan Blue wrote:
> > Great job! I think this is really close. I found one bug and some minor
> > stuff, all of which is noted below. Other than those, I have two major
> > comments.
> >
> > First, the organization duplicates code from writeFields: writeMap and
> > writeArray each write the fields of the inner group themselves, where
> > writeFields could be used instead. This also deviates from what the other
> > writer implementations do, which is to require the schema to match a
> > certain pattern and write exactly that pattern. See [1] for an example of
> > what I'm talking about. But I think this was actually structured this way
> > to work with multiple write schemas...
> >
> > The second problem is that this works with multiple write schemas. This
> > was probably a miscommunication on my part, but the backward-compatibility
> > rules in PARQUET-113 are for the read side, not the write side. Hive is
> > already producing the correct structure when writing lists and maps, so we
> > don't need to change that. The problem that PARQUET-113 addresses is
> > reading data written by parquet-avro or parquet-thrift (or similar
> > variants) that uses the 2-level list representation. We still need to fix
> > compatibility with that data, but it should probably be done as a separate
> > issue.
> >
> > [1] - https://github.com/rdblue/incubator-parquet-mr/blob/PARQUET-113-nested-types-avro/parquet-avro/src/main/java/parquet/avro/AvroWriteSupport.java#L118

Thanks for the comments. I got confused about writing different schema
formats. At some point during coding I was treating Hive like MySQL, where
you can insert arbitrary data into tables, so I imagined that the write
schemas could differ. But you're correct: that will never happen.

So, here's my plan for this:

- Remove the unit tests that use schemas Hive will never produce when
  writing.
- Fix DataWritableWriter.java to expect a single schema pattern for each of
  the different types.
- Check whether I can call writeFields or writeGroup from the writeMap and
  writeList functions to reuse code. By the way, writeMap can call writeGroup
  with no problems, but writeList differs slightly from writeFields; I'll let
  you know in the next patch. (A rough sketch of what I mean is below.)
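For reference, this is roughly the list-writing shape I have in mind (an
untested sketch against the parquet.io.api.RecordConsumer API; the "bag" and
"array_element" names and the writeValue() helper are illustrative here, not
the final DataWritableWriter code):

    // Assumes Hive's 3-level list schema, along the lines of:
    //   optional group my_list (LIST) { repeated group bag { optional <type> array_element; } }
    // A null list is handled at the field level (writeFields) before this is called.
    private void writeList(List<Object> elements, String fieldName, int fieldIndex) {
      recordConsumer.startField(fieldName, fieldIndex);
      recordConsumer.startGroup();                     // the outer LIST group
      if (!elements.isEmpty()) {
        recordConsumer.startField("bag", 0);
        for (Object element : elements) {
          recordConsumer.startGroup();                 // one group per element, so a
          if (element != null) {                       // null element stays detectable
            recordConsumer.startField("array_element", 0);
            writeValue(element);                       // existing per-type dispatch
            recordConsumer.endField("array_element", 0);
          }
          recordConsumer.endGroup();
        }
        recordConsumer.endField("bag", 0);
      }
      recordConsumer.endGroup();
      recordConsumer.endField(fieldName, fieldIndex);
    }

The per-element startGroup/endGroup wrapper is the part writeFields doesn't
have, which is why writeList can't reuse it directly.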
- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27404/#review60863
-----------------------------------------------------------


On Nov. 10, 2014, 7:04 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27404/
> -----------------------------------------------------------
> 
> (Updated Nov. 10, 2014, 7:04 p.m.)
> 
> 
> Review request for hive, Ryan Blue and Brock Noland.
> 
> 
> Bugs: HIVE-8359
>     https://issues.apache.org/jira/browse/HIVE-8359
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The patch changes the way the DataWritableWriter class writes an array of
> elements to the Parquet record. It wraps each array element in a new
> startGroup/endGroup block so that Parquet can detect null values on those
> optional fields.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c7078efe27482df0a11dd68ac068da27dbcf51b3 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_map_null.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_map_null.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27404/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 