Hi Ben,

I believe you could use arrow::compute for this.

On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:

> Hello,
>
>
>
> I have been writing some code to read Parquet files and it would be useful
> if there was an easy way to get the number of bytes in a string column as
> well as the null indices of that column. I would have expected this to be
> available in metadata somewhere, but I have not seen any way to query that
> from the API and don’t see anything like this using `parquet-tools` to
> inspect the files.
>
>
>
> Is there any way to get the null indices of a Parquet string column
> besides reading the whole file and manually checking for nulls?
>
There is an internal method for this [1], but unfortunately I don't think
it is exposed to the outside. One possible solution is to call
compute::is_null and pass the result to compute::indices_nonzero.
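
To make the idea concrete, here is a rough standalone sketch in plain C++
(deliberately not using the Arrow API) of what those two steps amount to
on the underlying data. It assumes an Arrow-style validity bitmap
(least-significant-bit order, 1 = valid, 0 = null) and an array that is
not sliced. The name NullIndices is hypothetical and only for
illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the positions of the null slots, given an
// Arrow-style validity bitmap (LSB bit order, 1 = valid, 0 = null).
// Assumes the array offset is zero, i.e. the array is not sliced.
std::vector<int64_t> NullIndices(const uint8_t* validity, int64_t length) {
  std::vector<int64_t> out;
  if (validity == nullptr) {
    return out;  // no validity bitmap means there are no nulls
  }
  for (int64_t i = 0; i < length; ++i) {
    // Slot i is stored in bit (i % 8) of byte (i / 8).
    const bool is_valid = ((validity[i / 8] >> (i % 8)) & 1) != 0;
    if (!is_valid) {
      out.push_back(i);
    }
  }
  return out;
}
```

For a 5-element column whose bitmap byte is 0x0D (bits 0, 2 and 3 set),
this reports slots 1 and 4 as null.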


>
>
> Is there any way to get the byte lengths of string columns without reading
> each string and summing the number of bytes of each string?
>
Do you want the non-null byte length?
If not, you can simply take the offsets buffer from ArrayData (int32 for a
string array, int64 for a large string array) and take the last value. For
an array that is not sliced, that is the full byte size of the string array.
If yes, I believe you can achieve this by using the VisitArrayDataInline/
VisitNullBitmapInline methods [2].
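
As a rough illustration of both cases, here is a standalone sketch in
plain C++ (again not the Arrow API). It assumes int32 offsets with
length + 1 entries, an LSB-ordered validity bitmap (1 = valid), and no
slicing beyond what the offsets themselves encode. TotalBytes and
NonNullBytes are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper: full byte size of the string data, nulls included.
// `offsets` must hold length + 1 entries.
int64_t TotalBytes(const int32_t* offsets, int64_t length) {
  return static_cast<int64_t>(offsets[length]) - offsets[0];
}

// Hypothetical helper: byte size of the non-null strings only.
// A null `validity` pointer is treated as "no nulls".
int64_t NonNullBytes(const int32_t* offsets, const uint8_t* validity,
                     int64_t length) {
  if (validity == nullptr) {
    return TotalBytes(offsets, length);
  }
  int64_t total = 0;
  for (int64_t i = 0; i < length; ++i) {
    const bool is_valid = ((validity[i / 8] >> (i % 8)) & 1) != 0;
    if (is_valid) {
      total += static_cast<int64_t>(offsets[i + 1]) - offsets[i];
    }
  }
  return total;
}
```

The two results differ only when a null slot spans a non-empty byte
range, which the format allows even though writers usually emit
zero-length nulls.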


>
> Thank you.
>
>
>
> Best,
>
> Ben McDonald
>

[1]
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
[2]
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
