In a probably too short answer, I think you want to do one of the following:

- write a single feather file with many batches
- write many feather files but using the dataset API to hopefully have arrow do some multi-file optimization for you (and hopefully still have multiple batches per file)
- write the schema in one file (or as few files as there are schemas) and write many (N) RecordBatches to fewer files (M) using the stream interface (instead of the file interface)

I use the third option. I chose it based on assumptions about data access that I have not validated. The main assumption is that writing a RecordBatch with the stream API does not rewrite the schema each time (and that there is no equivalent amplification on the read side).

Let me know if there's any approach you want more info on and I can follow up or maybe someone else can chime in/correct me.



On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <ish...@ishbir.com> wrote:
Apologies if this is the wrong place for this, but I'm looking to repeatedly select a subset of columns from a wide feather file (which has ~200k columns). What I find is that if I use RecordBatchReader::Open with the requisite arguments asking it to select the particular columns, it reads the schema over and over (once per Open call). Now that is to be expected as there doesn't seem to be a way to pass a pre-existing schema.

However, in my use case, I want the smaller queries to be fast and can't have it re-parse the schema for every call. The input file thus has to be an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only method that can serve this purpose seems to be:

Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
    const Buffer& metadata, const std::shared_ptr<Schema>& schema,
    const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
    io::RandomAccessFile* file);

How do I efficiently read the file once to get the schema and metadata in this case? My file does not have any dictionaries. Am I thinking about this incorrectly?

Would appreciate any pointers.

Thanks,
Ishbir Singh
