It seems that these options require reading into `ArrayData`. I have been using `ReadBatch` to read directly into a malloced C buffer to avoid having to create the additional copy, which is why I was hoping there would be a way to get this from the file metadata or some operation on the file rather than from the data that has already been read into an Arrow data structure.
So, the only way that I could do this today would be to read into an `ArrayData` and then call an `arrow::compute` function? There is no way to get the info from the file?

Best,
Ben McDonald

From: Niranda Perera <[email protected]>
Date: Friday, April 15, 2022 at 5:43 PM
To: [email protected] <[email protected]>
Subject: Re: [C++] Null indices and byte lengths of string columns

Hi Ben,

I believe you could use arrow::compute for this.

On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:

> Hello,
>
> I have been writing some code to read Parquet files and it would be useful if there was an easy way to get the number of bytes in a string column as well as the null indices of that column. I would have expected this to be available in metadata somewhere, but I have not seen any way to query that from the API and don’t see anything like this using `parquet-tools` to inspect the files.
>
> Is there any way to get the null indices of a Parquet string column besides reading the whole file and manually checking for nulls?

There is an internal method for this [1]. But unfortunately I don't think this is exposed to the outside. One possible solution is calling compute::is_null and passing the result to compute::indices_nonzero.

> Is there any way to get the byte lengths of string columns without reading each string and summing the number of bytes of each string?

Do you want the non-null byte length? If not, you can simply take the offsets int64 buffer from ArrayData and take the last value. That would be the full byte size of the string array. If yes, I believe you can achieve this by using the VisitArrayDataInline/VisitNullBitmapInline methods [2].

> Thank you.
>
> Best,
> Ben McDonald

[1] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
[2] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
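
For readers following the thread, the two suggestions above (null indices from the validity information, byte length from the offsets) can be sketched on plain buffers. This is only an illustration under stated assumptions and does not call the Arrow API: `StringColumnView`, `IsValid`, `NullIndices`, `TotalByteLength`, and `NonNullByteLength` are hypothetical names. The sketch assumes the Arrow string layout (a least-significant-bit-first validity bitmap plus `length + 1` offsets), uses 32-bit offsets as in Arrow's `string` type (64-bit offsets belong to `large_string`), and ignores sliced arrays with a non-zero array offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the buffers of an Arrow string array.
// In Arrow these would be ArrayData buffers 0 (validity) and 1 (offsets).
struct StringColumnView {
    std::vector<std::uint8_t> validity;  // bit i set => value i is non-null; empty => no nulls
    std::vector<std::int32_t> offsets;   // length + 1 entries; value i spans [offsets[i], offsets[i+1])
    std::int64_t length;                 // number of values in the column
};

// Returns true if value i is non-null (bits are read least-significant first).
inline bool IsValid(const StringColumnView& col, std::int64_t i) {
    if (col.validity.empty()) return true;
    const auto byte = col.validity[static_cast<std::size_t>(i / 8)];
    return ((byte >> (i % 8)) & 1) != 0;
}

// Collects the positions of all null values.
std::vector<std::int64_t> NullIndices(const StringColumnView& col) {
    std::vector<std::int64_t> out;
    for (std::int64_t i = 0; i < col.length; ++i) {
        if (!IsValid(col, i)) out.push_back(i);
    }
    return out;
}

// Full byte size of the character data: last offset minus first offset.
// This includes any bytes that happen to sit under null slots.
std::int64_t TotalByteLength(const StringColumnView& col) {
    assert(col.offsets.size() == static_cast<std::size_t>(col.length) + 1);
    return static_cast<std::int64_t>(col.offsets.back()) - col.offsets.front();
}

// Byte size counting only non-null values.
std::int64_t NonNullByteLength(const StringColumnView& col) {
    assert(col.offsets.size() == static_cast<std::size_t>(col.length) + 1);
    std::int64_t total = 0;
    for (std::int64_t i = 0; i < col.length; ++i) {
        if (IsValid(col, i)) {
            const auto idx = static_cast<std::size_t>(i);
            total += col.offsets[idx + 1] - col.offsets[idx];
        }
    }
    return total;
}
```

The two byte counts can differ because a null slot is allowed to cover a non-empty range of the data buffer, which is why the reply asks whether the non-null length is what is wanted.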
