[C++] [Arrow IPC] Efficient Multiple Reads

Ishbir Singh Tue, 07 Nov 2023 10:27:53 -0800

Apologies if this is the wrong place for this, but I'm looking to
repeatedly select a subset of columns from a wide Feather file (which has
~200k columns). What I find is that if I use RecordBatchFileReader::Open
with IpcReadOptions asking it to select the particular columns, it reads
the schema over and over (once per Open call). That is to be expected, as
there doesn't seem to be a way to pass in a pre-existing schema.


However, in my use case, I want the smaller queries to be fast and can't
have it re-parse the schema for every call. The input file thus has to be
an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only function
that can serve this purpose seems to be:

Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
    const Buffer& metadata, const std::shared_ptr<Schema>& schema,
    const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
    io::RandomAccessFile* file);

How do I efficiently read the file once to get the schema and metadata in
this case? My file does not have any dictionaries. Am I thinking about this
incorrectly?

Would appreciate any pointers.

Thanks,
Ishbir Singh
