[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582918#comment-17582918 ]

Arthur Passos commented on ARROW-17459:
---------------------------------------

[~willjones127] at first glance, it seems to be working. The client code I had was something like the following:


{code:java}
std::shared_ptr<arrow::Table> table;
arrow::Status read_status = file_reader->ReadRowGroup(row_group_current, column_indices, &table);

if (!read_status.ok())
    throw ParsingException{
        "Error while reading Parquet data: " + read_status.ToString(),
        ErrorCodes::CANNOT_READ_ALL_DATA};

++row_group_current;
{code}
 

Now it is the following:
{code:java}
std::shared_ptr<arrow::Table> table;

std::unique_ptr<::arrow::RecordBatchReader> rbr;
std::vector<int> row_group_indices { row_group_current };
arrow::Status get_batch_reader_status =
    file_reader->GetRecordBatchReader(row_group_indices, column_indices, &rbr);

if (!get_batch_reader_status.ok())
    throw ParsingException{
        "Error while reading Parquet data: " + get_batch_reader_status.ToString(),
        ErrorCodes::CANNOT_READ_ALL_DATA};

arrow::Status read_status = rbr->ReadAll(&table);

if (!read_status.ok())
    throw ParsingException{
        "Error while reading Parquet data: " + read_status.ToString(),
        ErrorCodes::CANNOT_READ_ALL_DATA};

++row_group_current;
{code}
 

*Question: Should I expect any regressions or different behaviour by changing 
the code path to the latter?*
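One detail that may matter here: `ReadAll` still concatenates every batch of the row group into a single `arrow::Table`, so peak memory stays similar to `ReadRowGroup`. If that becomes a concern, the reader can instead be drained one batch at a time. A minimal sketch, assuming the same `file_reader`, `row_group_current`, `column_indices`, and `ParsingException` conventions as the snippets above:
{code:java}
std::unique_ptr<::arrow::RecordBatchReader> rbr;
std::vector<int> row_group_indices { row_group_current };
arrow::Status st =
    file_reader->GetRecordBatchReader(row_group_indices, column_indices, &rbr);
if (!st.ok())
    throw ParsingException{
        "Error while reading Parquet data: " + st.ToString(),
        ErrorCodes::CANNOT_READ_ALL_DATA};

std::shared_ptr<arrow::RecordBatch> batch;
while (true) {
    // ReadNext yields batches until it sets `batch` to nullptr at end of stream.
    st = rbr->ReadNext(&batch);
    if (!st.ok())
        throw ParsingException{
            "Error while reading Parquet data: " + st.ToString(),
            ErrorCodes::CANNOT_READ_ALL_DATA};
    if (!batch)
        break;
    // process `batch` here, then let it go out of scope before the next read
}
++row_group_current;
{code}
Since each `RecordBatch` carries contiguous (non-chunked) columns, this path should also sidestep the `ChunksToSingle` conversion that raises the error in this ticket, though I would defer to the Arrow folks on whether that holds in all cases.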

 

> [C++] Support nested data conversions for chunked array
> -------------------------------------------------------
>
>                 Key: ARROW-17459
>                 URL: https://issues.apache.org/jira/browse/ARROW-17459
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Arthur Passos
>            Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails in 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95].
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
