[jira] [Updated] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks
[ https://issues.apache.org/jira/browse/ARROW-18307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arthur Passos updated ARROW-18307: -- Description: I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question, although not limited to it, is of type Array(Int64). I want to convert this arrow column into an internal structure that holds a contiguous chunk of memory for the data and a vector of offsets, very similar to arrow's own structure. The code I have so far works in two "phases": 1. Get the nested arrow column data; in this case, get the Int64 data out of Array(Int64). 2. Get the offsets from Array(Int64). To achieve #1, I loop over the chunks and store arrow::Array::values into a new arrow::ChunkedArray:

{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
{code}

This does not work as expected, though. Even though there are multiple chunks, arrow::Array::values returns the very same buffer for all of them, which ends up duplicating the data on my side. One pattern I noticed: if I read only the Array(Int64) column, I get only one chunk; if I read both columns, I get two chunks. It looks like all columns will, inevitably, have the same number of chunks, even though their buffers are not chunked accordingly.
I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like this example assumes there is only one chunk and ignores the possibility of there being multiple chunks. It's probably just a detail and the test wasn't actually intended to cover multiple chunks. I managed to get the expected output doing something like the below:

{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();
{code}

This looks too hackish and I feel like there is a much better way. Hence, my question: how do I properly extract the data & offsets out of such a column? The more generic version: how do I extract the data out of a ChunkedArray with multiple chunks?
[jira] [Created] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks
Arthur Passos created ARROW-18307: - Summary: [C++] Read list/array data from ChunkedArray with multiple chunks Key: ARROW-18307 URL: https://issues.apache.org/jira/browse/ARROW-18307 Project: Apache Arrow Issue Type: Test Components: C++ Reporter: Arthur Passos

I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question, although not limited to it, is of type Array(Int64). I want to convert this arrow column into an internal structure that holds a contiguous chunk of memory for the data and a vector of offsets, very similar to arrow's own structure. The code I have so far works in two "phases": 1. Get the nested arrow column data; in this case, get the Int64 data out of Array(Int64). 2. Get the offsets from Array(Int64). To achieve #1, I loop over the chunks and store arrow::Array::values into a new arrow::ChunkedArray:

{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
{code}

This does not work as expected, though. Even though there are multiple chunks, arrow::Array::values returns the very same buffer for all of them, which ends up duplicating the data on my side. I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like this example assumes there is only one chunk and ignores the possibility of it having multiple chunks.
It's probably just a detail and the test wasn't actually intended to cover multiple chunks. I managed to get the expected output doing something like the below:

{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();
{code}

This looks too hackish and I feel like there is a much better way. Hence, my question: how do I properly extract the data & offsets out of such a column? The more generic version: how do I extract the data out of a ChunkedArray with multiple chunks? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629192#comment-17629192 ] Arthur Passos commented on ARROW-17459: --- Hi [~willjones127]. I have implemented your suggestion of GetRecordBatchReader and, at first, things seemed to work as expected. Recently, an issue regarding parquet data was reported, and reverting to the ReadRowGroup solution seems to address it. This might be a misuse of the arrow library on my side, even though I have read the API docs and it looks correct. My question is pretty much: should there be a difference in the output when using the two APIs? > [C++] Support nested data conversions for chunked array > --- > > Key: ARROW-17459 > URL: https://issues.apache.org/jira/browse/ARROW-17459 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Arthur Passos >Assignee: Arthur Passos >Priority: Blocker > > `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not > implemented for chunked array outputs". It fails on > [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]) > Data schema is: > {code:java} > optional group fields_map (MAP) = 217 { > repeated group key_value { > required binary key (STRING) = 218; > optional binary value (STRING) = 219; > } > } > fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047 > fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963 > {code} > Is there a way to work around this issue in the cpp lib? > In any case, I am willing to implement this, but I need some guidance. I am > very new to parquet (as in started reading about it yesterday). > > Probably related to: https://issues.apache.org/jira/browse/ARROW-10958
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599082#comment-17599082 ] Arthur Passos commented on ARROW-17459: --- I see. That seems like a long journey for someone who is not an arrow developer / parquet expert to go through. Given the timeline I am working on, in the short term I think I'll resort to the first suggestion by [~willjones127]. While it doesn't fix the second case, it fixes the one I originally shared. Which makes me curious: why does that fix the Map case but not the one generated by the above script?
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598993#comment-17598993 ] Arthur Passos commented on ARROW-17459: --- [~emkornfield] I have changed a few places to use LargeBinary/LargeString and also commented out [this type assertion|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L301]. After that, I am able to read the parquet file. Would a PR that forces the use of LargeBinary/LargeString by default be acceptable? Also, if you have any tips on how to work around that assertion without commenting it out, that would be great.
[jira] [Assigned] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arthur Passos reassigned ARROW-17459: - Assignee: Arthur Passos
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598581#comment-17598581 ] Arthur Passos commented on ARROW-17459: --- I am a bit lost right now. I have made some changes to use LargeBinaryBuilder, but there is always an inconsistency that throws an exception. Are you aware of any place in the code where it would take the String path instead of the LargeString path? I went all the way back to where the schema is read, hoping to find a place where I could change the DataType from STRING to LARGE_STRING, but couldn't do so.
[jira] [Comment Edited] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598383#comment-17598383 ] Arthur Passos edited comment on ARROW-17459 at 8/31/22 2:31 PM: [~emkornfield] if I understand correctly, this could help with the original case I shared. In the case [~willjones127] shared, where he creates a ChunkedArray and then serializes it, it wouldn't help. Is that correct? I am stating this based on my current understanding of the inner workings of `arrow`: the ChunkedArray data structure will be used in two or more situations: 1. The data in a row group exceeds the limit of INT_MAX (the case I initially shared). 2. The serialized data/table is a chunked array, so it makes sense to use a chunked array. edit: I have just tested the snippet shared by Will Jones using `type = pa.map_(pa.large_string(), pa.int64())` instead of `type = pa.map_(pa.string(), pa.int32())` and the issue persists.
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598383#comment-17598383 ] Arthur Passos commented on ARROW-17459: --- [~emkornfield] if I understand correctly, this could help with the original case I shared. In the case [~willjones127] shared, where he creates a ChunkedArray and then serializes it, it wouldn't help. Is that correct? I am stating this based on my current understanding of the inner workings of `arrow`: the ChunkedArray data structure will be used in two or more situations: 1. The data in a row group exceeds the limit of INT_MAX (the case I initially shared). 2. The serialized data/table is a chunked array, so it makes sense to use a chunked array.
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598050#comment-17598050 ] Arthur Passos commented on ARROW-17459: --- [~emkornfield] thank you for your answer. Can you clarify what you mean by "read back arrays to always use the Large* variant"? I don't know what "back array" and "large variant" refer to, though I can speculate what the latter means.
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598019#comment-17598019 ] Arthur Passos commented on ARROW-17459: --- Hi [~emkornfield]. I see you are one of the authors of [https://github.com/apache/arrow/pull/8177]. I see the following snippet was introduced in that PR:

{code:java}
// ARROW-3762(wesm): If item reader yields a chunked array, we reject as
// this is not yet implemented
return Status::NotImplemented(
    "Nested data conversions not implemented for chunked array outputs");
{code}

I wonder why this wasn't implemented. Is there a technical limitation, or was the approach just not very well defined? I am pretty new to Parquet and to the `arrow` library, so it's very hard for me to reason about all of these concepts and code. Off the top of my head, I have a couple of silly ideas: # Find a way to convert a ChunkedArray into a single Array. That requires a processing step that allocates a contiguous chunk of memory big enough to hold all chunks. Plus, there is no clear interface to do so. # Create a new ChunkedArray class that can hold ChunkedArrays. As of now, it can only hold raw Arrays. That would require a LOT of changes in other {{arrow}} classes and, of course, it's not guaranteed to work. # Make the chunk memory limit configurable (not sure it's feasible). Do you see any of these as a path forward? If not, what would be the path forward?
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597846#comment-17597846 ] Arthur Passos commented on ARROW-17459: --- [~willjones127] Thank you for sharing this! While your `GetRecordBatchReader` suggestion works for the use case I shared, it won't work for this one. Are there any docs I could read to understand the internals of the arrow lib in order to implement it? Any tips would be appreciated. The only thing that comes to mind right now is to somehow build a giant array with all the chunks, but that certainly has a set of implications.
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584808#comment-17584808 ] Arthur Passos commented on ARROW-17459: --- I am also trying to write a test to cover this case, but failing to do so. For some reason, the files I generate with the very same schema and size don't get chunked while being read. The original file was provided by a customer and contains confidential data, so it can't be used. All the files I generated contain the above mentioned schema; the differences are in the data length. Some had maps of 50~300 elements with keys of random strings of 20~50 characters and values of random strings of 50~5000 characters. I also tried a low cardinality example and a large string example (2^30 characters). I'd be very thankful if someone could give me some tips on how to generate a file that will trigger the exception.
[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582918#comment-17582918 ] Arthur Passos commented on ARROW-17459: --- [~willjones127] at first glance, it seems to be working. The client code I had was something like the below:

{code:java}
std::shared_ptr<arrow::Table> table;
arrow::Status read_status = file_reader->ReadRowGroup(row_group_current, column_indices, &table);
if (!read_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
++row_group_current;
{code}

Now it's the below:

{code:java}
std::shared_ptr<arrow::Table> table;
std::unique_ptr<::arrow::RecordBatchReader> rbr;
std::vector<int> row_group_indices { row_group_current };
arrow::Status get_batch_reader_status = file_reader->GetRecordBatchReader(row_group_indices, column_indices, &rbr);
if (!get_batch_reader_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + get_batch_reader_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
arrow::Status read_status = rbr->ReadAll(&table);
if (!read_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(), ErrorCodes::CANNOT_READ_ALL_DATA};
++row_group_current;
{code}

*Question: should I expect any regressions or different behaviour by changing the code path to the latter?*
[jira] [Updated] (ARROW-17459) [C++] Support nested data conversions for chunked array
[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arthur Passos updated ARROW-17459:
--

Description:
`FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not implemented for chunked array outputs". It fails on [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])

Data schema is:
{code:java}
optional group fields_map (MAP) = 217 {
  repeated group key_value {
    required binary key (STRING) = 218;
    optional binary value (STRING) = 219;
  }
}
fields_map.key_value.value -> Size In Bytes: 13243589 Size In Ratio: 0.20541047
fields_map.key_value.key -> Size In Bytes: 3008860 Size In Ratio: 0.046667963
{code}

Is there a way to work around this issue in the cpp lib?

In any case, I am willing to implement this, but I need some guidance. I am very new to parquet (as in started reading about it yesterday).

Probably related to: https://issues.apache.org/jira/browse/ARROW-10958
[jira] [Created] (ARROW-17459) [C++] Support nested data conversions for chunked array
Arthur Passos created ARROW-17459:
-

Summary: [C++] Support nested data conversions for chunked array
Key: ARROW-17459
URL: https://issues.apache.org/jira/browse/ARROW-17459
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Arthur Passos

`FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not implemented for chunked array outputs". It fails on [ChunksToSingle](https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95)

Data schema is:
{code:java}
optional group fields_map (MAP) = 217 {
  repeated group key_value {
    required binary key (STRING) = 218;
    optional binary value (STRING) = 219;
  }
}
fields_map.key_value.value -> Size In Bytes: 13243589 Size In Ratio: 0.20541047
fields_map.key_value.key -> Size In Bytes: 3008860 Size In Ratio: 0.046667963
{code}

Is there a way to work around this issue in the cpp lib?

In any case, I am willing to implement this, but I need some guidance. I am very new to parquet (as in started reading about it yesterday).