From a pure metadata-only perspective you should be able to get the
size of the column and possibly a null count (for parquet files where
statistics are stored). However, you will not be able to get the
indices of the nulls.
The null count and column size are going to come from the parquet
metadata and you will need to use the parquet APIs to get this
information. In pyarrow this would be:
```
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).statistics.null_count
1
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_compressed_size
122
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_uncompressed_size
119
```
In the C++ API you will want to look at `parquet::ParquetFileReader::metadata`,
which returns the file metadata with the same per-row-group, per-column
information.
On Mon, Apr 18, 2022 at 6:18 AM McDonald, Ben <[email protected]> wrote:
>
> It seems that these options require reading into `ArrayData`. I have been
> using `ReadBatch` to read directly into a malloced C buffer to avoid having
> to create the additional copy, which is why I was hoping there would be a way
> to get this from the file metadata or some operation on the file rather than
> from the data that has already been read into an Arrow data structure.
>
>
>
> So, the only way that I could do this today would be to read into an
> `ArrayData` and then call an `arrow::compute` function? There is no way to
> get the info from the file?
>
>
>
> Best,
>
> Ben McDonald
>
>
>
> From: Niranda Perera <[email protected]>
> Date: Friday, April 15, 2022 at 5:43 PM
> To: [email protected] <[email protected]>
> Subject: Re: [C++] Null indices and byte lengths of string columns
>
> Hi Ben,
>
>
>
> I believe you could use arrow::compute for this.
>
>
>
> On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]> wrote:
>
> Hello,
>
>
>
> I have been writing some code to read Parquet files and it would be useful if
> there was an easy way to get the number of bytes in a string column as well
> as the null indices of that column. I would have expected this to be
> available in metadata somewhere, but I have not seen any way to query that
> from the API and don’t see anything like this using `parquet-tools` to
> inspect the files.
>
>
>
> Is there any way to get the null indices of a Parquet string column besides
> reading the whole file and manually checking for nulls?
>
> There is an internal method for this [1]. But unfortunately I don't think this
> is exposed to the outside. One possible solution is calling compute::is_null
> and passing the result to compute::indices_nonzero.
>
>
>
>
>
> Is there any way to get the byte lengths of string columns without reading
> each string and summing the number of bytes of each string?
>
> Do you want the non-null byte length?
>
> If not, you can simply take the offsets int64 buffer from ArrayData and take
> the last value. That would be the full bytesize of the string array.
>
> If yes, I believe you can achieve this by using VisitArrayDataInline/
> VisitNullBitmapInline methods [2].
>
>
>
>
>
> Thank you.
>
>
>
> Best,
>
> Ben McDonald
>
>
>
> [1]
> https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
>
> [2]
> https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224
>
>
> --
>
> Niranda Perera
> https://niranda.dev/
>
> @n1r44
>
>