Hi,
I can implement an addNull method in the RecordConsumer (public void
addNull()). But if I do this, I have an issue when reading the value back: this
is expected, because I end up trying to read an INT where there is an EOF (I
had no way to say: skip it, it's null).
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [lstint, bag, array_element] INT32 at value 2 out of 2, 2 out of 2 in currentPage. repetition level: 1, definition level: 3
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:466)
    at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:368)
    at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:400)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:173)
    ... 23 more
Caused by: parquet.io.ParquetDecodingException: could not read int
[...]
Caused by: java.io.EOFException
    at parquet.bytes.LittleEndianDataInputStream.readInt(LittleEndianDataInputStream.java:352)
The thing is, how am I supposed to read a non-existing value? Do you think we
could add this feature (having null values inside an array)?
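
For reference, here is a minimal sketch of where the "repetition level: 1,
definition level: 3" in that stack trace comes from (assuming the parquet.schema
classes of the version I'm on; the ShowLevels class name is just for illustration):

import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class ShowLevels {
  public static void main(String[] args) {
    // The lstint part of the schema from my original mail below.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message hive_schema {"
      + "  optional group lstint (LIST) {"
      + "    repeated group bag {"
      + "      optional int32 array_element;"
      + "    }"
      + "  }"
      + "}");

    // Prints 3: the definition level the reader expects when a value is present,
    // i.e. the level reported in the exception above.
    System.out.println(schema.getMaxDefinitionLevel("lstint", "bag", "array_element"));

    // Prints 1: the repetition level of the second and later elements of the list.
    System.out.println(schema.getMaxRepetitionLevel("lstint", "bag", "array_element"));
  }
}

So a null element would have to be written with definition level 2 (bag present,
array_element absent) and no value in the data page, which is why, I think,
writing a value slot for it ends in the EOF on read.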
--
Mickaël Lacour
Senior Software Engineer
Analytics Infrastructure team @Scalability
________________________________________
From: Ryan Blue <[email protected]>
Sent: Friday, October 3, 2014 18:46
To: [email protected]
Cc: Julien Le Dem; Justin Coffey; Remy Pecqueur
Subject: Re: How to handle null values in an array and keeping the right size of this field?
Does it work to add a null value?
startField("lstint", 0)
startField("bag", 0)
addValue(7)
addValue(null)
endField("bag", 0)
endField("lstint", 0)
rb
On 10/03/2014 07:54 AM, Mickaël Lacour wrote:
> Hello all :)
>
>
> I'm working on this issue: https://issues.apache.org/jira/browse/HIVE-6994
>
> My dataset is very simple: 3 columns. Here is the schema (hadoop-tools schema):
>
>
> message hive_schema {
>   optional int32 id;
>   optional group lstint (LIST) {
>     repeated group bag {
>       optional int32 array_element;
>     }
>   }
>   optional group lststr (LIST) {
>     repeated group bag {
>       optional binary array_element;
>     }
>   }
> }
>
> And the content (hadoop-tools cat):
>
> id = 2
> lstint:
> .bag:
> ..array_element = 7
> lststr:
> .bag:
> ..array_element = e
> .bag:
> ..array_element = e
>
> And the original data that I wanted to write ("|" is the column delimiter,
> and "," is the element delimiter inside an array):
>
> 1|7,|e,e
>
> Here is my issue: the size of my array (the first one, called lstint) should be
> 2, but Parquet is only keeping one element (the other is null), so for Parquet
> the size of my array is 1.
> I want to keep this information and I don't know how to do it. Basically, I
> cannot ask my RecordConsumer to startField if I have no value to add. If I
> do, when I ask the RecordConsumer to endField, I get this error:
>
> throw new ParquetEncodingException("empty fields are illegal, the field
> should be ommitted completely instead");
>
> So I can't do this, and I don't have any method in the RecordConsumer to
> add an empty field inside a "column". Of course, if my array is null, Parquet
> is going to add the null field for this missing column.
>
> And another issue I have (related to this one): I cannot write an array with
> only null fields (|,,,|); I get the previous exception.
>
> Any advice? (Should we add a new method to be able to have empty fields?)
>
> @Julien: I'm adding you in CC because I didn't see the last mail I sent to
> the mailing list. Can you forward it in case I don't have the right
> permissions? Thanks!
>
> --
> Mickaël Lacour
> Senior Software Engineer
> Analytics Infrastructure team @Scalability
> Criteo
--
Ryan Blue
Software Engineer
Cloudera, Inc.