[C++] [Parquet] Questions about batch reading byte arrays

McDonald, Ben Fri, 03 Nov 2023 11:30:25 -0700

Hello,



I have been using the C++ Parquet low-level interface to read Parquet files
into regular C arrays. This works well for columns whose types map directly to
C types, such as `int64`, but string columns are harder because they have to be
read into the Arrow `ByteArray` type.



Rather than reading the results into a `ByteArray`, I would like to read them
directly into a preallocated `uint8` character array. As it stands, I first
read into a `ByteArray` and then copy into the `uint8` array, which adds
overhead. Is there a way to read directly into a byte array using the
low-level Parquet API? For reference, here is the portion of code showing how I
currently read Arrow strings into my `uint8` array:
https://github.com/Bears-R-Us/arkouda/blob/a3419dd6774923d6ff6f75bdf62fb6e225d1a584/src/ArrowFunctions.cpp#L797-L814.
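
For illustration, the copy step looks roughly like the sketch below. It is not
the code at the link above: `ByteArrayView` and `append_value` are hypothetical
names, and `ByteArrayView` is only a stand-in for `parquet::ByteArray` (a
length plus a pointer) so that the snippet compiles without Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for parquet::ByteArray: a length plus a pointer
// to bytes owned by someone else. Not the real Arrow type.
struct ByteArrayView {
    uint32_t len;
    const uint8_t* ptr;
};

// Copies one value into `out` starting at byte offset `pos` and returns the
// offset just past the copied bytes. The caller must have sized `out` so
// that `pos + value.len` fits.
std::size_t append_value(const ByteArrayView& value, uint8_t* out,
                         std::size_t pos) {
    assert(out != nullptr);
    if (value.len > 0) {
        std::memcpy(out + pos, value.ptr, value.len);
    }
    return pos + value.len;
}
```

Calling `append_value` once per value read, feeding each returned offset into
the next call, packs the strings back to back in the `uint8` array. This is
the extra copy I would like to avoid.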



Additionally, while trying to optimize my string reading, I looked into calling
`ReadBatch` with a vector of `ByteArray`s to read multiple values at once,
instead of one at a time as I do now. With this approach I hit a segfault for
any batch size greater than 16, although batches of up to 16 values still give
a significant speedup over reading single values. Is there any reason why a
batch size larger than 16 would cause a segfault when `ReadBatch` on a
`parquet::ByteArrayReader` reads into a vector of `ByteArray`s?
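
The batched loop has roughly the following shape. This is a sketch under
stated assumptions, not my actual code: `MockByteArrayReader`, `ByteArrayView`
and `read_all` are hypothetical, so the snippet compiles without Arrow. The
mock omits the definition and repetition level pointers that the real
`ReadBatch` takes, and mimics it only in reporting through `values_read` how
many values a call produced, which may be fewer than the requested batch size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for parquet::ByteArray (length plus pointer).
struct ByteArrayView {
    uint32_t len;
    const uint8_t* ptr;
};

// Hypothetical mock of a byte-array column reader. One call returns at most
// `page_size` values, so a call can come back with fewer than requested.
class MockByteArrayReader {
public:
    MockByteArrayReader(std::vector<std::string> data, int64_t page_size)
        : data_(std::move(data)), page_size_(page_size) {
        assert(page_size_ > 0);
    }

    bool HasNext() const { return next_ < data_.size(); }

    // Fills `values` with up to `batch_size` entries, stores the number of
    // entries actually written in `values_read`, and returns that number.
    int64_t ReadBatch(int64_t batch_size, ByteArrayView* values,
                      int64_t* values_read) {
        int64_t remaining = static_cast<int64_t>(data_.size() - next_);
        int64_t n = std::min({batch_size, page_size_, remaining});
        for (int64_t i = 0; i < n; ++i) {
            const std::string& s = data_[next_ + static_cast<std::size_t>(i)];
            values[i] = {static_cast<uint32_t>(s.size()),
                         reinterpret_cast<const uint8_t*>(s.data())};
        }
        next_ += static_cast<std::size_t>(n);
        *values_read = n;
        return n;
    }

private:
    std::vector<std::string> data_;
    int64_t page_size_;
    std::size_t next_ = 0;
};

// Reads all values in batches and copies their bytes back to back into
// `out`, which the caller must have sized in advance. Returns bytes written.
std::size_t read_all(MockByteArrayReader& reader, int64_t batch_size,
                     uint8_t* out) {
    assert(batch_size > 0);
    std::vector<ByteArrayView> batch(static_cast<std::size_t>(batch_size));
    std::size_t pos = 0;
    while (reader.HasNext()) {
        int64_t values_read = 0;
        reader.ReadBatch(batch_size, batch.data(), &values_read);
        // Only the first `values_read` entries are valid after this call.
        for (int64_t i = 0; i < values_read; ++i) {
            if (batch[i].len > 0) {
                std::memcpy(out + pos, batch[i].ptr, batch[i].len);
            }
            pos += batch[i].len;
        }
    }
    return pos;
}
```

The sketch copies each batch before requesting the next one and loops over
`values_read` rather than over the requested batch size.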



One more question: because I need to allocate my array before storing the
values, I have to compute in advance how many bytes the column will need. From
the metadata I can get the number of strings in the column, but not the total
number of characters, so I have been reading the entire file once and summing
the `len` of each `ByteArray` to get the total number of characters needed to
store all of the values. Is there a simpler way to do that, possibly through
the metadata?
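
The two-pass approach amounts to the sketch below. The names `ByteArrayView`
and `flatten` are hypothetical, and the values are assumed to be already in
memory, whereas in my case each pass means reading the column from the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for parquet::ByteArray (length plus pointer).
struct ByteArrayView {
    uint32_t len;
    const uint8_t* ptr;
};

// Sizes the output with a first pass over the lengths, then fills it with a
// second pass over the same values.
std::vector<uint8_t> flatten(const std::vector<ByteArrayView>& values) {
    // Pass 1: sum `len` over all values to get the total byte count.
    std::size_t total = 0;
    for (const ByteArrayView& v : values) {
        total += v.len;
    }

    // Allocate the destination once, now that the size is known.
    std::vector<uint8_t> out(total);

    // Pass 2: copy the bytes back to back.
    std::size_t pos = 0;
    for (const ByteArrayView& v : values) {
        if (v.len > 0) {
            std::memcpy(out.data() + pos, v.ptr, v.len);
        }
        pos += v.len;
    }
    assert(pos == total);  // both passes must agree on the size
    return out;
}
```

The first pass is the part I would like to replace with a metadata lookup.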



Thank you!



Best,

Ben McDonald
