It seems that these options require reading into `ArrayData`. I have been using 
`ReadBatch` to read directly into a malloced C buffer to avoid having to create 
the additional copy, which is why I was hoping there would be a way to get this 
from the file metadata or some operation on the file rather than from the data 
that has already been read into an Arrow data structure.

So, the only way that I could do this today would be to read into an 
`ArrayData` and then call an `arrow::compute` function? There is no way to get 
the info from the file?

Best,
Ben McDonald

From: Niranda Perera <[email protected]>
Date: Friday, April 15, 2022 at 5:43 PM
To: [email protected] <[email protected]>
Subject: Re: [C++] Null indices and byte lengths of string columns

Hi Ben,

I believe you could use arrow::compute for this.

On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:
Hello,

I have been writing some code to read Parquet files and it would be useful if 
there was an easy way to get the number of bytes in a string column as well as 
the null indices of that column. I would have expected this to be available in 
metadata somewhere, but I have not seen any way to query that from the API and 
don’t see anything like this using `parquet-tools` to inspect the files.

Is there any way to get the null indices of a Parquet string column besides 
reading the whole file and manually checking for nulls?

There is an internal method for this [1], but unfortunately I don't think it is 
exposed to the outside. One possible solution is to call compute::is_null and 
pass the result to compute::indices_nonzero.


Is there any way to get the byte lengths of string columns without reading each 
string and summing the number of bytes of each string?

Do you want the non-null byte length?
If not, you can simply take the offsets buffer from ArrayData (int32 for 
string, int64 for large_string) and take the last value. That would be the 
full byte size of the string array.
If yes, I believe you can achieve this by using the VisitArrayDataInline/
VisitNullBitmapInline methods [2].
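
Both variants can be sketched against the raw buffers. This is a simplified 
illustration rather than the visitor-based approach in [2]: the function names 
are hypothetical, the offsets pointer is assumed to point at the array's first 
slot and to hold `length + 1` entries, and the validity bitmap is assumed to 
start at bit 0 for that same slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes spanned by all slots, nulls included.
// Subtracting the first offset keeps the result correct when the offsets
// do not start at zero.
template <typename OffsetT>
int64_t TotalStringBytes(const OffsetT* offsets, int64_t length) {
  assert(length >= 0);
  return static_cast<int64_t>(offsets[length]) -
         static_cast<int64_t>(offsets[0]);
}

// Hypothetical helper: bytes in valid (non-null) slots only.
// validity may be nullptr, meaning the array has no nulls.
template <typename OffsetT>
int64_t NonNullStringBytes(const OffsetT* offsets, const uint8_t* validity,
                           int64_t length) {
  assert(length >= 0);
  if (validity == nullptr) {
    return TotalStringBytes(offsets, length);
  }
  int64_t total = 0;
  for (int64_t i = 0; i < length; ++i) {
    // Validity bits are read least-significant bit first; set means valid.
    const bool is_valid = ((validity[i / 8] >> (i % 8)) & 1) != 0;
    if (is_valid) {
      total += static_cast<int64_t>(offsets[i + 1]) -
               static_cast<int64_t>(offsets[i]);
    }
  }
  return total;
}
```

The two results differ only when a null slot spans a non-empty byte range, 
which the format permits even though writers commonly emit zero-length nulls.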


Thank you.

Best,
Ben McDonald

[1] 
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
[2] 
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224

--
Niranda Perera
https://niranda.dev/
@n1r44 (https://twitter.com/N1R44)
