Hi Adam,

> "ReadBatchSpaced() in a loop isfaster than reading an entire record
> batch."


Could you elaborate on this?  Which code path were you using to read
record batches that was slower?  Did you also try setting the batch size to
~1,000 rows with ArrowReaderProperties [1]?  (By default it is 64K rows, so
I would expect a higher memory overhead.)  There could also be other
places where memory efficiency could be improved.
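
For reference, here is a minimal (untested) sketch of shrinking the batch
size with ArrowReaderProperties:

    #include <parquet/properties.h>

    // Shrink the Arrow record-batch size from the 64K-row default
    parquet::ArrowReaderProperties props =
        parquet::default_arrow_reader_properties();
    props.set_batch_size(1000);
    // ... then pass `props` to e.g. parquet::arrow::FileReaderBuilder ...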

> As I understand it, the function is deprecated because it has bugs
> concerning nested values. These bugs didn't affect me because I don't use
> nested values.


This is correct.  Even if the bugs don't affect you, I think it is
dangerous to keep this API around while it is not maintained and has
potential bugs.

> Does the C++ parquet reader support reading a batch of values and their
> validity bitmap?


No, but see below for using ReadBatch; reconstructing the null bitmap is
trivial for non-nested data (and probably isn't even necessary if you read
back the definition levels).


There are several potential options for the CSV use-case:
1.  The stream-reader API (
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/stream_reader.h
); there is a rough sketch after this list.

2.  Using ReadBatch.  The logic for determining nulls in non-nested data is
trivial: you simply compare the returned definition levels to the max
definition level (
https://github.com/apache/arrow/blob/d0de88d8384c7593fac1b1e82b276d4a0d364767/cpp/src/parquet/schema.h#L368).
Any definition level less than the max indicates a null.  This also has the
nice side effect of requiring less memory when data is null.  See the
second sketch after this list.

3.  Using a record batch reader (
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L179)
and the Arrow CSV writer (
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/writer.h).
The CSV writer doesn't support all types yet; supporting them requires a
cast-to-string kernel to be available.  If extreme memory efficiency is
your aim, this is probably not the best option, but speed-wise it is likely
to be pretty competitive and will probably see the most improvements for
"free" in the long run.  See the third sketch after this list.

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L571


On Tue, Jul 20, 2021 at 11:07 AM Adam Hooper <a...@adamhooper.com> wrote:

> Hi list,
>
> Updating some code to Arrow 4.0, I noticed
> https://issues.apache.org/jira/browse/PARQUET-1899 deprecated
> parquet::TypedColumnReader<T>::ReadBatchSpaced().
>
> I use this function in a parquet-to-csv converter. It reads batches of
> 1,000 values at a time, allowing nulls. ReadBatchSpaced() in a loop is
> faster than reading an entire record batch. It's also more RAM-friendly (so
> the program costs only a few megabytes, regardless of Parquet file
> size). I've spawned hundreds of concurrent parquet-to-csv processes,
> streaming to slow clients via Python+ASGI, with response times in the
> milliseconds. I commented my findings:
>
> https://github.com/CJWorkbench/parquet-to-arrow/blob/70253c7fdf0fc778e51f50b992c98b16e8864723/src/parquet-to-text-stream.cc#L73
>
> As I understand it, the function is deprecated because it has bugs
> concerning nested values. These bugs didn't affect me because I don't use
> nested values.
>
> Does the C++ parquet reader support reading a batch of values and their
> validity bitmap?
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
>
