[jira] [Updated] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks

2022-11-11 Thread Arthur Passos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arthur Passos updated ARROW-18307:
--
Description: 
I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table
returned contains columns with multiple chunks (column->num_chunks() > 1). The
column in question, though the issue is not limited to it, is of type
Array(Int64).

 

I want to convert this arrow column into an internal structure that contains a 
contiguous chunk of memory for the data and a vector of offsets, very similar 
to arrow's structure. The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of
Array(Int64).
2. Get the offsets from Array(Int64).
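
For concreteness, the target looks roughly like the below (a simplified
sketch; the struct and field names are invented for illustration):
{code:java}
#include <cstdint>
#include <vector>

// hypothetical target layout: one contiguous buffer with all the nested
// Int64 values, plus per-row offsets, mirroring Arrow's List layout
struct InternalListColumn
{
    std::vector<int64_t> data;    // values of every row, back to back
    std::vector<size_t> offsets;  // offsets[i] = index of row i's first value
};{code}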

To achieve #1, I am looping over the chunks and storing each chunk's values()
result in a new arrow::ChunkedArray.

 
{code:java}
static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
         chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}
This does not work as expected, though. Even though there are multiple chunks,
the arrow::ListArray::values method returns the very same buffer for all of
them, which ends up duplicating the data on my side. One pattern I noticed: if
I read only the Array(Int64) column, I get a single chunk; if I read both
columns, I get two chunks. It looks like all columns will inevitably have the
same number of chunks, even though their buffers are not chunked accordingly.
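
Concretely, the check that surprised me looks something like the below (a
sketch; I am assuming comparing the chunks' underlying data buffers is a fair
way to test for buffer identity):
{code:java}
auto & c0 = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & c1 = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(1)));
// both chunks report the same underlying values buffer, so naively
// collecting values() per chunk duplicates the data
bool same_buffer =
    c0.values()->data()->buffers[1] == c1.values()->data()->buffers[1];{code}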

I then looked through more examples and came across the [ColumnarTableToVector 
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121].
 It looks like this example assumes there is only one chunk and ignores the
possibility of it having multiple chunks. It's probably just a detail, and the
example wasn't actually intended to cover multiple chunks.

I managed to get the expected output doing something like the below:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

// each chunk's offsets describe a window into the shared values buffer
auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

// slice the shared values buffer down to each chunk's own window
auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This looks too hackish and I feel like there is a much better way.
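
A cleaner generalization I am experimenting with is the below (a sketch; it
assumes arrow::ListArray::Flatten and arrow::Concatenate behave as documented,
i.e. Flatten applies each chunk's own offset window instead of returning the
whole shared buffer):
{code:java}
#include <arrow/array/concatenate.h>

static arrow::Result<std::shared_ptr<arrow::Array>>
getNestedColumnData(const std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector values_per_chunk;
    values_per_chunk.reserve(arrow_column->num_chunks());
    for (int chunk_i = 0; chunk_i < arrow_column->num_chunks(); ++chunk_i)
    {
        auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        // Flatten honours this chunk's offsets, unlike values()
        ARROW_ASSIGN_OR_RAISE(auto values, list_chunk.Flatten());
        values_per_chunk.emplace_back(std::move(values));
    }
    // one contiguous values array for the whole column
    return arrow::Concatenate(values_per_chunk, arrow::default_memory_pool());
}{code}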

Hence, my question: how do I properly extract the data & offsets out of such a
column? A more generic version: how do I extract the data out of ChunkedArrays
with multiple chunks?

  was:
I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table
returned contains columns with multiple chunks (column->num_chunks() > 1). The
column in question, though the issue is not limited to it, is of type
Array(Int64).

 

I want to convert this arrow column into an internal structure that contains a 
contiguous chunk of memory for the data and a vector of offsets, very similar 
to arrow's structure. The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of
Array(Int64).
2. Get the offsets from Array(Int64).

To achieve #1, I am looping over the chunks and storing each chunk's values()
result in a new arrow::ChunkedArray.



 
{code:java}
static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
         chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}

This does not work as expected, though. Even though there are multiple chunks,
the arrow::ListArray::values method returns the very same buffer for all of
them, which ends up duplicating the data on my side.

I then looked through more examples and came across the [ColumnarTableToVector 
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121].
 It looks like this example assumes there is only one chunk and ignores the
possibility of it having multiple chunks.

[jira] [Created] (ARROW-18307) [C++] Read list/array data from ChunkedArray with multiple chunks

2022-11-10 Thread Arthur Passos (Jira)
Arthur Passos created ARROW-18307:
-

 Summary: [C++] Read list/array data from ChunkedArray with 
multiple chunks
 Key: ARROW-18307
 URL: https://issues.apache.org/jira/browse/ARROW-18307
 Project: Apache Arrow
  Issue Type: Test
  Components: C++
Reporter: Arthur Passos


I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table
returned contains columns with multiple chunks (column->num_chunks() > 1). The
column in question, though the issue is not limited to it, is of type
Array(Int64).

 

I want to convert this arrow column into an internal structure that contains a 
contiguous chunk of memory for the data and a vector of offsets, very similar 
to arrow's structure. The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of
Array(Int64).
2. Get the offsets from Array(Int64).

To achieve #1, I am looping over the chunks and storing each chunk's values()
result in a new arrow::ChunkedArray.



 
{code:java}
static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
         chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}

This does not work as expected, though. Even though there are multiple chunks,
the arrow::ListArray::values method returns the very same buffer for all of
them, which ends up duplicating the data on my side.

I then looked through more examples and came across the [ColumnarTableToVector 
example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121].
 It looks like this example assumes there is only one chunk and ignores the
possibility of it having multiple chunks. It's probably just a detail, and the
example wasn't actually intended to cover multiple chunks.

I managed to get the expected output doing something like the below:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

// each chunk's offsets describe a window into the shared values buffer
auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.data()->length);
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.data()->length);

// slice the shared values buffer down to each chunk's own window
auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This looks too hackish and I feel like there is a much better way.

Hence, my question: how do I properly extract the data & offsets out of such a
column? A more generic version: how do I extract the data out of ChunkedArrays
with multiple chunks?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-11-04 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629192#comment-17629192
 ] 

Arthur Passos commented on ARROW-17459:
---

Hi [~willjones127]. I have implemented your suggestion of GetRecordBatchReader
and, at first, things seemed to work as expected. Recently, though, an issue
regarding parquet data was reported, and reverting to the ReadRowGroup solution
seems to address it. This might be a misuse of the arrow library on my side,
even though I have read the API docs and my usage looks correct.

 

My question is pretty much: should there be any difference in the output
between the two APIs?
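
For clarity, the equivalence I expected between the two paths is something
like the below (a sketch; CombineChunks is there because the chunk layouts may
legitimately differ between the two readers):
{code:java}
std::shared_ptr<arrow::Table> t1, t2;
ARROW_RETURN_NOT_OK(file_reader->ReadRowGroup(0, column_indices, &t1));

std::unique_ptr<::arrow::RecordBatchReader> rbr;
ARROW_RETURN_NOT_OK(file_reader->GetRecordBatchReader({0}, column_indices, &rbr));
ARROW_RETURN_NOT_OK(rbr->ReadAll(&t2));

// same logical contents expected, regardless of chunking
bool same = t1->CombineChunks().ValueOrDie()
                ->Equals(*t2->CombineChunks().ValueOrDie());{code}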

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-09-01 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599082#comment-17599082
 ] 

Arthur Passos commented on ARROW-17459:
---

I see. That seems like a long journey for a non-arrow developer / parquet
expert to go through. Given the timeline I am working on, in the short term I
think I'll resort to the first suggestion by [~willjones127]. While it doesn't
fix the second case, it fixes the one I originally shared. Which makes me
curious: why does that fix the Map case but not the file generated by the
above script?

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-09-01 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598993#comment-17598993
 ] 

Arthur Passos commented on ARROW-17459:
---

[~emkornfield] I have changed a few places to use LargeBinary/LargeString and 
also commented out [this type 
assertion|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L301].
 After that, I am able to read the parquet file. Would a PR that forces the use 
of LargeBinary/LargeString by default be acceptable? Plus, if you have any tips 
on how to work around that assertion without commenting it out, that would be 
great.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-09-01 Thread Arthur Passos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arthur Passos reassigned ARROW-17459:
-

Assignee: Arthur Passos

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598581#comment-17598581
 ] 

Arthur Passos commented on ARROW-17459:
---

I am a bit lost right now. I have made some changes to use LargeBinaryBuilder,
but there is always an inconsistency that throws an exception. Are you aware of
any place in the code where it would take the String path instead of the
LargeString path? I went all the way back to where the schema is read, hoping
to find a place where I could change the DataType from STRING to LARGE_STRING,
but couldn't find one.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383
 ] 

Arthur Passos edited comment on ARROW-17459 at 8/31/22 2:31 PM:


[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of
`arrow`: the ChunkedArray data structure will be used in at least two
situations:
1. The data in a row group exceeds the INT_MAX limit (the case I initially
shared).
2. The serialized data/table is a chunked array, so it makes sense to use a
chunked array.

 

edit:

I have just tested the snippet shared by Will Jones using `type = 
pa.map_(pa.large_string(), pa.int64())` instead of `type = pa.map_(pa.string(), 
pa.int32())` and the issue persists. 

 


was (Author: JIRAUSER294600):
[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of
`arrow`: the ChunkedArray data structure will be used in at least two
situations:
1. The data in a row group exceeds the INT_MAX limit (the case I initially
shared).
2. The serialized data/table is a chunked array, so it makes sense to use a
chunked array.

 

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598383#comment-17598383
 ] 

Arthur Passos commented on ARROW-17459:
---

[~emkornfield] if I understand correctly, this could help with the original 
case I shared. In the case [~willjones127] shared, where he creates a 
ChunkedArray and then serializes it, it wouldn't help. Is that correct?

I am stating this based on my current understanding of the inner workings of
`arrow`: the ChunkedArray data structure will be used in at least two
situations:
1. The data in a row group exceeds the INT_MAX limit (the case I initially
shared).
2. The serialized data/table is a chunked array, so it makes sense to use a
chunked array.

 

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598050#comment-17598050
 ] 

Arthur Passos commented on ARROW-17459:
---

[~emkornfield] thank you for your answer. Can you clarify what you mean by
"read back arrays to always use the Large* variant"? I don't know what "back
array" and "large variant" refer to, though I can speculate what the latter
means.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598019#comment-17598019
 ] 

Arthur Passos commented on ARROW-17459:
---

Hi [~emkornfield]. I see you are one of the authors of
[https://github.com/apache/arrow/pull/8177]. I see the following snippet was
introduced in that PR:


{code:java}
      // ARROW-3762(wesm): If item reader yields a chunked array, we reject as
      // this is not yet implemented
      return Status::NotImplemented(
          "Nested data conversions not implemented for chunked array outputs");{code}
I wonder why this wasn't implemented. Is there a technical limitation, or was
the approach just never settled?

I am pretty new to Parquet and to the `arrow` library, so it's very hard for
me to reason about all of these concepts and code. Off the top of my head, I
have a couple of silly ideas:
 # Find a way to convert a ChunkedArray into a single Array. That requires a
processing step that allocates a contiguous chunk of memory big enough to hold
all chunks, and there is no clear interface to do so (see the sketch below
this list).
 # Create a new ChunkedArray class that can hold ChunkedArrays. As of now, it
can only hold raw Arrays. That would require a LOT of changes in other
{{arrow}} classes and, of course, it's not guaranteed to work.
 # Make the chunk memory limit configurable (not sure it's feasible)
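
On idea #1, I wonder whether arrow::Concatenate (arrow/array/concatenate.h)
already is that interface. A sketch of what I mean (hedged: for binary columns
whose total data exceeds the 2 GiB offset limit, I would expect it to fail
with a capacity error rather than help):
{code:java}
#include <arrow/array/concatenate.h>
#include <arrow/chunked_array.h>

// sketch: collapse a ChunkedArray into one contiguous Array; this allocates
// a buffer big enough for all chunks, so it cannot work where a single Array
// physically cannot hold the data (e.g. > 2 GiB of string values)
arrow::Result<std::shared_ptr<arrow::Array>> ToSingleArray(const arrow::ChunkedArray & chunked)
{
    return arrow::Concatenate(chunked.chunks(), arrow::default_memory_pool());
}{code}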

Do you see any of these as a path forward? If not, what would be the path 
forward?

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597846#comment-17597846
 ] 

Arthur Passos commented on ARROW-17459:
---

[~willjones127] Thank you for sharing this!

 

While your `GetRecordBatchReader` suggestion works for the use case I shared,
it won't work for this one. Are there any docs I could read to understand the
internals of the arrow lib in order to implement it? Any tips would be
appreciated. The only thing that comes to mind right now is to somehow build a
giant array with all the chunks, but that certainly has a set of implications.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-25 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584808#comment-17584808
 ] 

Arthur Passos commented on ARROW-17459:
---

I am also trying to write a test to cover this case, but failing to do so. For
some reason, the files I generate with the very same schema and size don't get
chunked when read. The original file was provided by a customer and contains
confidential data, so it can't be used.

 

All the files I generated contain the above-mentioned schema. The differences
are in the data length. Some had maps of 50~300 elements with keys of random
strings of 20~50 characters and values of random strings of 50~5000
characters. I also tried a low-cardinality example and a large-string example
(2^30 characters).

 

I'd be very thankful if someone could give me some tips on how to generate a 
file that will trigger the exception.
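
For reference, the direction I have been attempting looks roughly like the
below (a sketch; my assumption, based on the comments in this thread, is that
serializing a ChunkedArray of several ~1 GiB map chunks into a single row
group should force the read path into a chunked output and trigger the error):
{code:java}
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

arrow::Result<std::shared_ptr<arrow::Array>> MakeBigMapChunk()
{
    arrow::Int32Builder offsets;
    arrow::StringBuilder keys, items;
    const std::string big_value(1 << 26, 'x'); // 64 MiB per map value
    int32_t pos = 0;
    for (int row = 0; row < 16; ++row) // ~1 GiB of values in this chunk
    {
        ARROW_RETURN_NOT_OK(offsets.Append(pos));
        ARROW_RETURN_NOT_OK(keys.Append("key" + std::to_string(row)));
        ARROW_RETURN_NOT_OK(items.Append(big_value));
        ++pos;
    }
    ARROW_RETURN_NOT_OK(offsets.Append(pos));
    std::shared_ptr<arrow::Array> offsets_arr, keys_arr, items_arr;
    ARROW_RETURN_NOT_OK(offsets.Finish(&offsets_arr));
    ARROW_RETURN_NOT_OK(keys.Finish(&keys_arr));
    ARROW_RETURN_NOT_OK(items.Finish(&items_arr));
    return arrow::MapArray::FromArrays(offsets_arr, keys_arr, items_arr);
}

arrow::Status WriteRepro(const std::string & path)
{
    // several ~1 GiB chunks so the row group holds > 2 GiB of string data
    // (needs a few GiB of RAM and disk)
    arrow::ArrayVector chunks;
    for (int i = 0; i < 3; ++i)
    {
        ARROW_ASSIGN_OR_RAISE(auto chunk, MakeBigMapChunk());
        chunks.push_back(std::move(chunk));
    }
    auto column = std::make_shared<arrow::ChunkedArray>(chunks);
    auto table = arrow::Table::Make(
        arrow::schema({arrow::field("fields_map", column->type())}), {column});
    ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
    // huge chunk_size so everything lands in a single row group
    return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                      /*chunk_size=*/1 << 30);
}{code}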

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-22 Thread Arthur Passos (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582918#comment-17582918
 ] 

Arthur Passos commented on ARROW-17459:
---

[~willjones127] at first glance, it seems to be working. The client code I had
was something like the below:


{code:java}
std::shared_ptr<arrow::Table> table;
arrow::Status read_status = file_reader->ReadRowGroup(row_group_current, column_indices, &table);
if (!read_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(),
                           ErrorCodes::CANNOT_READ_ALL_DATA};
++row_group_current;
{code}
 

Now it's the below:
{code:java}
std::shared_ptr<arrow::Table> table;

std::unique_ptr<::arrow::RecordBatchReader> rbr;
std::vector<int> row_group_indices { row_group_current };
arrow::Status get_batch_reader_status =
    file_reader->GetRecordBatchReader(row_group_indices, column_indices, &rbr);

if (!get_batch_reader_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + get_batch_reader_status.ToString(),
                           ErrorCodes::CANNOT_READ_ALL_DATA};

arrow::Status read_status = rbr->ReadAll(&table);

if (!read_status.ok())
    throw ParsingException{"Error while reading Parquet data: " + read_status.ToString(),
                           ErrorCodes::CANNOT_READ_ALL_DATA};

++row_group_current;{code}
 

*Question: Should I expect any regressions or different behaviour by changing 
the code path to the latter?*

 

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-18 Thread Arthur Passos (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arthur Passos updated ARROW-17459:
--
Description: 
`FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
implemented for chunked array outputs". It fails on 
[ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])

Data schema is: 
{code:java}
  optional group fields_map (MAP) = 217 {
    repeated group key_value {
      required binary key (STRING) = 218;
      optional binary value (STRING) = 219;
    }
  }
fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
{code}
Is there a way to work around this issue in the cpp lib?

In any case, I am willing to implement this, but I need some guidance. I am 
very new to parquet (as in started reading about it yesterday).

 

Probably related to: https://issues.apache.org/jira/browse/ARROW-10958

  was:
`FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
implemented for chunked array outputs". It fails on 
[ChunksToSingle](https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95)

Data schema is: 

 
{code:java}
  optional group fields_map (MAP) = 217 {
    repeated group key_value {
      required binary key (STRING) = 218;
      optional binary value (STRING) = 219;
    }
  }
fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
{code}
 

Is there a way to work around this issue in the cpp lib?

In any case, I am willing to implement this, but I need some guidance. I am 
very new to parquet (as in started reading about it yesterday).

 


> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-18 Thread Arthur Passos (Jira)
Arthur Passos created ARROW-17459:
-

 Summary: [C++] Support nested data conversions for chunked array
 Key: ARROW-17459
 URL: https://issues.apache.org/jira/browse/ARROW-17459
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Arthur Passos


`FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
implemented for chunked array outputs". It fails on 
[ChunksToSingle](https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95)

Data schema is: 

 
{code:java}
  optional group fields_map (MAP) = 217 {
    repeated group key_value {
      required binary key (STRING) = 218;
      optional binary value (STRING) = 219;
    }
  }
fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
{code}
 

Is there a way to work around this issue in the cpp lib?

In any case, I am willing to implement this, but I need some guidance. I am 
very new to parquet (as in started reading about it yesterday).

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)