[ https://issues.apache.org/jira/browse/PARQUET-459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113835#comment-15113835 ]
Wes McKinney edited comment on PARQUET-459 at 1/23/16 4:50 PM:
---------------------------------------------------------------

Do you have a patch for PARQUET-428 somewhere?

Re: PARQUET-435: all of the data in Parquet is stored flat, so these lowest level classes should probably just concern themselves with interacting with the file format itself: i.e. decoding the rep/def levels and values as fast as possible. The idea with PARQUET-435 was reading the raw decoded data in batches into C arrays, and you can read the def/rep levels and values separately rather than together (so if you only want the values, you don't have to decode the levels necessarily). Determining the null values will be up to you in that case.

Separately, I agree we should have code for interpreting the rep / def levels as a nested data structure. There are many different choices of data structures (record-oriented, column-oriented), so we should implement various options.

I know that [~nongli] and others on the Spark team are working on columnar Parquet performance right now, so we should collaborate there on algorithms.


was (Author: wesmckinn):
Do you have a patch for PARQUET-428 somewhere?

Re: PARQUET-435: all of the data in Parquet is stored flat, so these lowest level classes should probably just concern themselves with interacting with the file format itself: i.e. decoding the rep/def levels and values as fast as possible. The idea was reading the raw decoded data in batches into C arrays, and you can read the def/rep levels and values separately rather than together (so if you only want the values, you don't have to decode the levels necessarily). Determining the null values will be up to you in that case.

Separately, I agree we should have code for interpreting the rep / def levels as a nested data structure. There are many different choices of data structures (record-oriented, column-oriented), so we should implement various options.

I know that [~nongli] and others on the Spark team are working on columnar Parquet performance right now, so we should collaborate there on algorithms.

> Improve handling of null values
> -------------------------------
>
> Key: PARQUET-459
> URL: https://issues.apache.org/jira/browse/PARQUET-459
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Deepak Majeti
>
> Currently, the default value of the type is returned for NULL values and is
> incorrect.
> This JIRA will correctly identify a NULL value with the help of an additional
> variable that will be set for NULL values.
> This feature depends on reading the repetition level (PARQUET-169).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
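
For illustration only: the comment above says that when levels and values are read separately in batches, determining the null values is left to the caller. A minimal sketch of that step for a flat (non-repeated) optional column might look like the following. It assumes the decoded values are densely packed (non-null entries only) and that a slot is non-null exactly when its definition level equals the column's maximum definition level. NullableBatch and SpreadValues are hypothetical names, not part of parquet-cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one slot per decoded definition level.
struct NullableBatch {
  std::vector<int32_t> values;  // placeholder 0 in slots that are null
  std::vector<bool> is_null;    // true where the slot holds no value
};

// Hypothetical helper: spreads densely packed values over all slots,
// using the definition levels to decide which slots are null.
NullableBatch SpreadValues(const std::vector<int16_t>& def_levels,
                           const std::vector<int32_t>& packed_values,
                           int16_t max_def_level) {
  NullableBatch out;
  out.values.reserve(def_levels.size());
  out.is_null.reserve(def_levels.size());

  std::size_t next = 0;  // index of the next unread packed value
  for (int16_t level : def_levels) {
    // A slot carries a value only when it is defined at the deepest level.
    // The bounds check guards against a short values batch.
    const bool present =
        (level == max_def_level) && (next < packed_values.size());
    out.is_null.push_back(!present);
    out.values.push_back(present ? packed_values[next++] : 0);
  }
  return out;
}
```

Keeping the null flags in a separate array, rather than returning the type's default value on its own, is what lets a caller tell a real 0 apart from a NULL, which is the problem the issue description raises.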