Well, [1] shows a way to get the metadata, but you'll have to follow the 
function chain to figure out whether there's a way to get just the metadata for 
a RecordBatch without reading its data (I couldn't work it out in ~5 minutes).
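
If it's useful, here is roughly how far the public API around [1] gets you. 
This is only a sketch: the path is made up, and I'm assuming that Open stops 
after parsing the footer and schema (batch data isn't touched until you call 
ReadRecordBatch(i)), which I haven't verified against the source.

#include <iostream>
#include <memory>
#include <string>

#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch: open the Feather/IPC file once and inspect metadata without
// materializing any column data.
arrow::Status InspectMetadata(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::ipc::RecordBatchFileReader::Open(
          file, arrow::ipc::IpcReadOptions::Defaults()));

  // Footer-level information only; no call to reader->ReadRecordBatch(i) yet.
  std::cout << "columns: " << reader->schema()->num_fields()
            << ", batches: " << reader->num_record_batches() << std::endl;
  return arrow::Status::OK();
}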


> I forgot to mention that the subsets of columns chosen are dynamic... It 
> wouldn’t make sense to rewrite the files for each query.



I'm just talking about writing the files to be less wide, and/or writing files 
that contain only the metadata (schema and schema metadata) and no actual data, 
and initializing a RecordBatchStreamReader [2] from those. Once the 
RecordBatchStreamReader is initialized, you can feed it binary data, and the 
process behaves like a reader with a pre-existing schema, except that you're 
managing the file access yourself (so you have to be more intentional about 
your file management).
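
To make that more concrete, here's a rough sketch of the split I'm describing, 
using the lower-level message functions rather than RecordBatchStreamReader 
itself. The file names, the one-encapsulated-message-per-batch layout, and the 
end-of-stream handling are assumptions on my part, not a vetted recipe:

#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write the schema once to its own file, and each RecordBatch as an
// encapsulated IPC message to a separate data file.
arrow::Status WriteSplit(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
  ARROW_ASSIGN_OR_RAISE(auto schema_out,
                        arrow::io::FileOutputStream::Open("schema.bin"));
  ARROW_ASSIGN_OR_RAISE(auto schema_buf, arrow::ipc::SerializeSchema(*schema));
  ARROW_RETURN_NOT_OK(schema_out->Write(schema_buf));
  ARROW_RETURN_NOT_OK(schema_out->Close());

  ARROW_ASSIGN_OR_RAISE(auto data_out,
                        arrow::io::FileOutputStream::Open("batches.bin"));
  auto write_options = arrow::ipc::IpcWriteOptions::Defaults();
  for (const auto& batch : batches) {
    ARROW_ASSIGN_OR_RAISE(
        auto buf, arrow::ipc::SerializeRecordBatch(*batch, write_options));
    ARROW_RETURN_NOT_OK(data_out->Write(buf));
  }
  return data_out->Close();
}

// Parse the schema once, then rebuild batches against it; the schema is not
// re-read per batch (that's the assumption this layout is built on).
arrow::Status ReadSplit() {
  arrow::ipc::DictionaryMemo memo;

  ARROW_ASSIGN_OR_RAISE(auto schema_in,
                        arrow::io::ReadableFile::Open("schema.bin"));
  ARROW_ASSIGN_OR_RAISE(auto schema,
                        arrow::ipc::ReadSchema(schema_in.get(), &memo));

  ARROW_ASSIGN_OR_RAISE(auto data_in,
                        arrow::io::ReadableFile::Open("batches.bin"));
  auto read_options = arrow::ipc::IpcReadOptions::Defaults();
  while (true) {
    // Assumption: ReadMessage yields a null message at end-of-stream.
    ARROW_ASSIGN_OR_RAISE(auto message, arrow::ipc::ReadMessage(data_in.get()));
    if (message == nullptr) break;
    ARROW_ASSIGN_OR_RAISE(
        auto batch,
        arrow::ipc::ReadRecordBatch(*message, schema, &memo, read_options));
    // ... hand `batch` to the query / web server layer ...
  }
  return arrow::Status::OK();
}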

To Weston's point: if you have wide feather files and processing the metadata 
(the schema, the RecordBatch metadata, or the feather file metadata; I'm not 
sure which one you're both referring to) is costly, then you probably need to 
change something in your process to get speed-ups.


[1]: 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/reader.cc#L867-L876
[2]: 
https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc23RecordBatchStreamReaderE



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Wednesday, November 8th, 2023 at 09:45, Weston Pace <weston.p...@gmail.com> 
wrote:


> You are correct that there is no existing capability to create an IPC reader 
> with precomputed metadata. I don't think anyone would be opposed to this 
> feature, it just hasn't been a priority.
> 

> If you wanted to avoid changing arrow then you could create your own 
> implementation of `RandomAccessFile` which is partially backed by an 
> in-memory buffer and fetches from file when the reads go out of the buffered 
> range. However, I'm not sure that I/O is the culprit. Are you reading from a 
> local file? If so, then the future reads would probably already be cached by 
> the OS (unless maybe you are under memory pressure).
> 
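
(For illustration, a wrapper along the lines Weston describes might look 
something like the sketch below; the class name, the single cached byte range, 
and delegating everything else to the wrapped file are my own assumptions, not 
an existing Arrow class.)

#include <memory>
#include <utility>

#include <arrow/buffer.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Serves ReadAt() from an in-memory buffer when the requested range falls
// inside [cache_offset, cache_offset + cached->size()); everything else is
// delegated to the wrapped file.
class PartiallyBufferedFile : public arrow::io::RandomAccessFile {
 public:
  PartiallyBufferedFile(std::shared_ptr<arrow::io::RandomAccessFile> file,
                        std::shared_ptr<arrow::Buffer> cached,
                        int64_t cache_offset)
      : file_(std::move(file)),
        cached_(std::move(cached)),
        cache_offset_(cache_offset) {}

  arrow::Status Close() override { return file_->Close(); }
  bool closed() const override { return file_->closed(); }
  arrow::Result<int64_t> Tell() const override { return file_->Tell(); }
  arrow::Status Seek(int64_t pos) override { return file_->Seek(pos); }
  arrow::Result<int64_t> GetSize() override { return file_->GetSize(); }

  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    return file_->Read(nbytes, out);
  }
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    return file_->Read(nbytes);
  }

  arrow::Result<std::shared_ptr<arrow::Buffer>> ReadAt(
      int64_t position, int64_t nbytes) override {
    if (position >= cache_offset_ &&
        position + nbytes <= cache_offset_ + cached_->size()) {
      // Zero-copy slice of the cached region (e.g. precomputed metadata).
      return arrow::SliceBuffer(cached_, position - cache_offset_, nbytes);
    }
    return file_->ReadAt(position, nbytes);
  }

 private:
  std::shared_ptr<arrow::io::RandomAccessFile> file_;
  std::shared_ptr<arrow::Buffer> cached_;
  int64_t cache_offset_;
};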

> Perhaps it is the CPU cost of processing the metadata that is slowing down 
> your reads. If that is the case then I think a code change is inevitable.
> 

> 

> On Wed, Nov 8, 2023 at 6:43 AM Ishbir Singh <ish...@ishbir.com> wrote:
> 

> > Thanks for the info, Aldrin. I forgot to mention that the subsets of 
> > columns chosen are dynamic. Basically, I have a web server serving columns 
> > from the file. It wouldn’t make sense to rewrite the files for each query.
> > 

> > I’m just looking for the easiest way to read the metadata as a buffer so I 
> > can pass it to the function below because I believe that should accomplish 
> > what I want.
> > 

> > Thanks,
> > Ishbir Singh
> > 

> > On Tue, Nov 7, 2023 at 20:56 Aldrin <octalene....@pm.me> wrote:
> > 

> > > In a probably too short answer, I think you want to do one of the 
> > > following:
> > > 

> > > - write a single feather file with many batches
> > > - write many feather files, but use the dataset API so that arrow can 
> > > hopefully do some multi-file optimization for you (and hopefully still 
> > > have multiple batches per file)
> > > - write the schema in one file (or in as few files as there are schemas) 
> > > and write many (N) recordbatches to fewer files (M) using the stream 
> > > interface (instead of the file interface)
> > > 

> > > I do the 3rd one, and I do it because of assumptions I made about data 
> > > access that I have not validated. The main assumption is that writing a 
> > > RecordBatch with the stream API does not rewrite the schema each time (or 
> > > incur equivalent amplification on the read side).
> > > 

> > > Let me know if there's any approach you want more info on and I can 
> > > follow up or maybe someone else can chime in/correct me.
> > > 

> > > Sent from Proton Mail for iOS
> > > 

> > > 

> > > On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <ish...@ishbir.com> wrote:
> > > 

> > > > Apologies if this is the wrong place for this, but I'm looking to 
> > > > repeatedly select a subset of columns from a wide feather file (which 
> > > > has ~200k columns). What I find is that if I use 
> > > > RecordBatchReader::Open with the requisite arguments asking it to 
> > > > select the particular columns, it reads the schema over and over (once 
> > > > per Open call). Now that is to be expected as there doesn't seem to be 
> > > > a way to pass a pre-existing schema.
> > > > 

> > > > 

> > > > However, in my use case, I want the smaller queries to be fast and 
> > > > can't have it re-parse the schema for every call. The input file thus 
> > > > has to be an io::RandomAccessFile. Looking at arrow/ipc/reader.h, the 
> > > > only method that can serve this purpose seems to be:
> > > > 

> > > > 

> > > > Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
> > > >     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
> > > >     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
> > > >     io::RandomAccessFile* file);
> > > > 

> > > > 

> > > > How do I efficiently read the file once to get the schema and metadata 
> > > > in this case? My file does not have any dictionaries. Am I thinking 
> > > > about this incorrectly?
> > > > 

> > > > 

> > > > Would appreciate any pointers.
> > > > 

> > > > 

> > > > Thanks,
> > > > Ishbir Singh
