Note that the uncompressed size is the encoded size, so it can be substantially
smaller than a simple concatenated string buffer.

On Monday, April 18, 2022, Weston Pace <[email protected]> wrote:

> From a pure metadata-only perspective you should be able to get the
> size of the column and possibly a null count (for parquet files where
> statistics are stored).  However, you will not be able to get the
> indices of the nulls.
>
> The null count and column size are going to come from the parquet
> metadata and you will need to use the parquet APIs to get this
> information.  In pyarrow this would be:
>
> ```
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).statistics.null_count
> 1
> >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_compressed_size
> 122
> >>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_uncompressed_size
> 119
> ```
>
> In the C++ API you will want to look at
> `parquet::ParquetFileReader::metadata`.
>
> On Mon, Apr 18, 2022 at 6:18 AM McDonald, Ben <[email protected]>
> wrote:
> >
> > It seems that these options require reading into `ArrayData`. I have
> been using `ReadBatch` to read directly into a malloced C buffer to avoid
> having to create the additional copy, which is why I was hoping there would
> be a way to get this from the file metadata or some operation on the file
> rather than from the data that has already been read into an Arrow data
> structure.
> >
> >
> >
> > So, the only way that I could do this today would be to read into an
> `ArrayData` and then call an `arrow::compute` function? There is no way to
> get the info from the file?
> >
> >
> >
> > Best,
> >
> > Ben McDonald
> >
> >
> >
> > From: Niranda Perera <[email protected]>
> > Date: Friday, April 15, 2022 at 5:43 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: [C++] Null indices and byte lengths of string columns
> >
> > Hi Ben,
> >
> >
> >
> > I believe you could use arrow::compute for this.
> >
> >
> >
> > On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <[email protected]>
> wrote:
> >
> > Hello,
> >
> >
> >
> > I have been writing some code to read Parquet files and it would be
> useful if there was an easy way to get the number of bytes in a string
> column as well as the null indices of that column. I would have expected
> this to be available in metadata somewhere, but I have not seen any way to
> query that from the API and don’t see anything like this using
> `parquet-tools` to inspect the files.
> >
> >
> >
> > Is there any way to get the null indices of a Parquet string column
> besides reading the whole file and manually checking for nulls?
> >
> > There is an internal method for this [1]. But unfortunately I don't think
> this is exposed to the outside. One possible solution is calling
> compute::is_null and passing the result to compute::indices_nonzero.
> >
> >
> >
> >
> >
> > Is there any way to get the byte lengths of string columns without
> reading each string and summing the number of bytes of each string?
> >
> > Do you want the non-null byte length?
> >
> > If not, you can simply take the offsets buffer from ArrayData (int32 for
> string, int64 for large_string) and take the last value. That would be the
> full byte size of the string array.
> >
> > If yes, I believe you can achieve this by using VisitArrayDataInline/
> VisitNullBitmapInline methods [2].
> >
> >
> >
> >
> >
> > Thank you.
> >
> >
> >
> > Best,
> >
> > Ben McDonald
> >
> >
> >
> > [1] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
> >
> > [2] https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224
> >
> >
> > --
> >
> > Niranda Perera
> > https://niranda.dev/
> >
> > @n1r44
> >
> >
>
