[jira] [Comment Edited] (PARQUET-459) Improve handling of null values

Wes McKinney (JIRA) Sat, 23 Jan 2016 08:51:11 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113835#comment-15113835
 ]


Wes McKinney edited comment on PARQUET-459 at 1/23/16 4:50 PM:
---------------------------------------------------------------

Do you have a patch for PARQUET-428 somewhere? 

Re: PARQUET-435: all of the data in Parquet is stored flat, so these 
lowest-level classes should probably concern themselves only with the file 
format itself, i.e. decoding the rep/def levels and values as fast as 
possible. The idea with PARQUET-435 was to read the raw decoded data in 
batches into C arrays, with the def/rep levels and the values read separately 
rather than together (so if you only want the values, you don't necessarily 
have to decode the levels). Determining the null values will be up to you in 
that case. 
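
To make "up to you" concrete, here is a rough sketch of how a caller might 
derive nulls from separately decoded arrays. It assumes a flat int32 column, 
that the values array holds only the non-null entries, and that a slot is 
null whenever its definition level is below the column's maximum. ExpandNulls 
is a hypothetical helper for illustration, not a parquet-cpp API.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): pairs each definition level
// with the next densely packed value, or with a null when the level is below
// max_def_level. Assumes `values` contains only the non-null entries.
std::vector<std::optional<int32_t>> ExpandNulls(
    const std::vector<int16_t>& def_levels,
    const std::vector<int32_t>& values,
    int16_t max_def_level) {
  std::vector<std::optional<int32_t>> out;
  out.reserve(def_levels.size());
  std::size_t next_value = 0;
  for (int16_t level : def_levels) {
    if (level == max_def_level) {
      // Fully defined slot: consume the next stored value.
      if (next_value >= values.size()) {
        throw std::invalid_argument("fewer values than defined levels");
      }
      out.push_back(values[next_value++]);
    } else {
      // Not fully defined: the slot is null and no value was stored for it.
      out.push_back(std::nullopt);
    }
  }
  if (next_value != values.size()) {
    throw std::invalid_argument("more values than defined levels");
  }
  return out;
}
```

A caller that only needs the values can skip this step entirely and never 
touch the levels, which is the point of reading them separately.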

Separately, I agree we should have code for interpreting the rep / def levels 
as a nested data structure. There are many different choices of data structures 
(record-oriented, column-oriented), so we should implement various options. I 
know that [~nongli] and others on the Spark team are working on columnar 
Parquet performance right now, so we should collaborate there on algorithms.
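
As one possible column-oriented option, here is a minimal sketch that turns 
rep/def levels into an offsets-plus-values layout. It assumes the simplest 
nested case, a top-level "repeated int32" field, where both maximum levels 
are 1: rep level 0 starts a new record, and def level 0 marks an empty list 
with no stored value. ListColumn and AssembleRepeatedInt32 are hypothetical 
names, not existing parquet-cpp code, and deeper nesting would need more 
bookkeeping than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical column-oriented result: record i owns
// values[offsets[i]] .. values[offsets[i + 1] - 1].
struct ListColumn {
  std::vector<int32_t> offsets;
  std::vector<int32_t> values;
};

// Hypothetical sketch for a top-level `repeated int32` field only
// (max repetition level 1, max definition level 1).
ListColumn AssembleRepeatedInt32(const std::vector<int16_t>& rep_levels,
                                 const std::vector<int16_t>& def_levels,
                                 const std::vector<int32_t>& values) {
  if (rep_levels.size() != def_levels.size()) {
    throw std::invalid_argument("level arrays must have equal length");
  }
  ListColumn out;
  std::size_t next_value = 0;
  for (std::size_t i = 0; i < rep_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // Repetition level 0 means a new record begins here.
      out.offsets.push_back(static_cast<int32_t>(out.values.size()));
    } else if (out.offsets.empty()) {
      throw std::invalid_argument("first repetition level must be 0");
    }
    if (def_levels[i] == 1) {
      // Definition level 1 means an element is present; consume its value.
      if (next_value >= values.size()) {
        throw std::invalid_argument("fewer values than defined levels");
      }
      out.values.push_back(values[next_value++]);
    }
    // Definition level 0 means an empty list: nothing to append.
  }
  // Closing offset so that every record has an end position.
  out.offsets.push_back(static_cast<int32_t>(out.values.size()));
  return out;
}
```

A record-oriented variant would build one small vector per record instead of 
shared offsets; which is better likely depends on the consumer.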


was (Author: wesmckinn):
Do you have a patch for PARQUET-428 somewhere? 

Re: PARQUET-435: all of the data in Parquet is stored flat, so these lowest 
level classes should probably just concern themselves with interacting with the 
file format itself: i.e. decoding the rep/def levels and values as fast as 
possible. The idea was reading the raw decoded data in batches into C arrays, 
and you can read the def/rep levels and values separately rather than together 
(so if you only want the values, you don't have to decode the levels 
necessarily). Determining the null values will be up to you in that case. 

Separately, I agree we should have code for interpreting the rep / def levels 
as a nested data structure. There are many different choices of data structures 
(record-oriented, column-oriented), so we should implement various options. I 
know that [~nongli] and others on the Spark team are working on columnar 
Parquet performance right now, so we should collaborate there on algorithms.

> Improve handling of null values
> -------------------------------
>
>                 Key: PARQUET-459
>                 URL: https://issues.apache.org/jira/browse/PARQUET-459
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Deepak Majeti
>
> Currently, the default value of the type is returned for NULL values, which 
> is incorrect.
> This JIRA will correctly identify a NULL value with the help of an additional 
> variable that will be set for NULL values. 
> This feature depends on reading the repetition level (PARQUET-169).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
