Re: [C++] [Arrow IPC] Efficient Multiple Reads

Ishbir Singh Wed, 08 Nov 2023 13:42:41 -0800

I’m not opposed to making changes to Arrow itself, but just wanted to know
if it was possible to do what I want without that.


The bottleneck is definitely the repeated metadata processing (I profiled
it). In my mind, the file can be opened once and we should be able to just
seek within the file to read whichever columns we want without doing any
further work. I’m surprised that it’s not possible to do that with IPC
files yet.

Could you make recommendations as to where I could start making such a
change (read IPC files with precomputed metadata)? My initial thought was
that I could add a new method “SelectFields(std::vector<int> fieldIndices)”
which would update the field_inclusion_mask_ and out_schema_ based on the
passed in vector. How does that seem to you guys?
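A minimal sketch of the mask computation such a method might perform, assuming a flat list of top-level fields. The helper name `BuildFieldInclusionMask` is hypothetical and is not part of Arrow; it only illustrates turning a list of selected indices into a per-field mask. Treating an empty selection as "all fields" follows the documented behavior of `IpcReadOptions::included_fields`; whether the reader's internal mask is built this way is an assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not an Arrow API): builds a per-field inclusion mask
// from a list of selected top-level field indices.
// Assumption: an empty selection means "include every field".
std::vector<bool> BuildFieldInclusionMask(std::size_t num_fields,
                                          const std::vector<int>& field_indices) {
  if (field_indices.empty()) {
    return std::vector<bool>(num_fields, true);
  }
  std::vector<bool> mask(num_fields, false);
  for (int index : field_indices) {
    // Reject indices that do not name an existing field.
    if (index < 0 || static_cast<std::size_t>(index) >= num_fields) {
      throw std::out_of_range("field index out of range");
    }
    mask[static_cast<std::size_t>(index)] = true;
  }
  return mask;
}
```

With ~200k columns the mask is cheap to rebuild per query; the expensive part described above is parsing the schema, which a method like this would leave untouched.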

Thank you both for your assistance so far!

Ishbir Singh

On Wed, Nov 8, 2023 at 12:27, Aldrin <octalene....@pm.me> wrote:

> well, [1] shows a way to get the metadata, but you'll have to follow the
> function chain to figure out if there's a way to just get the metadata for
> a RecordBatch without reading the data for it (I couldn't do it in ~5 min).
>
>
>    - I forgot to mention that the subset of columns chosen is dynamic... It
>    wouldn’t make sense to rewrite the files for each query.
>
>
> I'm just talking about writing the files to be less wide, and/or writing
> files that contain only the metadata and no actual data (schema and schema
> metadata) to initialize a RecordBatchStreamReader [2] from. Once you
> initialize a RecordBatchStreamReader, you can feed it binary data and the
> process looks like a reader with a pre-existing schema but you're managing
> the file access (so you have to be more intentional in your file
> management).
>
> To Weston's point, if you have wide feather files and processing the
> metadata is costly (schema, record batch, or feather file metadata; I'm
> not sure which one you're both referring to), then you probably need to
> change something in your process to get speed-ups.
>
>
> [1]:
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/reader.cc#L867-L876
> [2]:
> https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc23RecordBatchStreamReaderE
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, November 8th, 2023 at 09:45, Weston Pace <
> weston.p...@gmail.com> wrote:
>
> You are correct that there is no existing capability to create an IPC
> reader with precomputed metadata. I don't think anyone would be opposed to
> this feature, it just hasn't been a priority.
>
> If you wanted to avoid changing arrow then you could create your own
> implementation of `RandomAccessFile` which is partially backed by an
> in-memory buffer and fetches from file when the reads go out of the
> buffered range. However, I'm not sure that I/O is the culprit. Are you
> reading from a local file? If so, then the future reads would probably
> already be cached by the OS (unless maybe you are under memory pressure).
>
> Perhaps it is the CPU cost of processing the metadata that is slowing down
> your reads. If that is the case then I think a code change is inevitable.
>
>
> On Wed, Nov 8, 2023 at 6:43 AM Ishbir Singh <ish...@ishbir.com> wrote:
>
>> Thanks for the info, Aldrin. I forgot to mention that the subset of
>> columns chosen is dynamic. Basically, I have a web server serving columns
>> from the file. It wouldn’t make sense to rewrite the files for each query.
>>
>> I’m just looking for the easiest way to read the metadata as a buffer so
>> I can pass it to the function below because I believe that should
>> accomplish what I want.
>>
>> Thanks,
>> Ishbir Singh
>>
>> On Tue, Nov 7, 2023 at 20:56, Aldrin <octalene....@pm.me> wrote:
>>
>>> In a probably too short answer, I think you want to do one of the
>>> following:
>>>
>>> - write a single feather file with many batches
>>> - write many feather files but using the dataset API to hopefully have
>>> arrow do some multi-file optimization for you (and hopefully still have
>>> multiple batches per file)
>>> - write the schema in one file (or as few files as there are schemas)
>>> and write many (N) recordbatches to fewer files (M) using the stream
>>> interface (instead of file)
>>>
>>> I do the 3rd one and I do it because I made assumptions about data
>>> accesses but I have not validated those assumptions. The main assumption
>>> being that writing a RecordBatch with the stream API is not rewriting the
>>> schema each time (or having equivalent amplification on the read side).
>>>
>>> Let me know if there's any approach you want more info on and I can
>>> follow up or maybe someone else can chime in/correct me.
>>>
>>>
>>> On Tue, Nov 7, 2023 at 10:27, Ishbir Singh <ish...@ishbir.com> wrote:
>>>
>>> Apologies if this is the wrong place for this, but I'm looking to
>>> repeatedly select a subset of columns from a wide feather file (which has
>>> ~200k columns). What I find is that if I use RecordBatchFileReader::Open with
>>> the requisite arguments asking it to select the particular columns, it
>>> reads the schema over and over (once per Open call). Now that is to be
>>> expected as there doesn't seem to be a way to pass a pre-existing schema.
>>>
>>> However, in my use case, I want the smaller queries to be fast and can't
>>> have it re-parse the schema for every call. The input file thus has to be an
>>> io::RandomAccessFile. Looking at arrow/ipc/reader.h, the only method that
>>> can serve this purpose seems to be:
>>>
>>> Result<std::shared_ptr<RecordBatch>> ReadRecordBatch(
>>>     const Buffer& metadata, const std::shared_ptr<Schema>& schema,
>>>     const DictionaryMemo* dictionary_memo, const IpcReadOptions& options,
>>>     io::RandomAccessFile* file);
>>>
>>> How do I efficiently read the file once to get the schema and metadata
>>> in this case? My file does not have any dictionaries. Am I thinking about
>>> this incorrectly?
>>>
>>> Would appreciate any pointers.
>>>
>>> Thanks,
>>> Ishbir Singh
>>>
>>>
>
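
Weston's suggestion of a file that is partially backed by an in-memory buffer can be sketched without Arrow as follows. This is a rough illustration under stated assumptions, not Arrow code: `PartiallyBufferedReader` and `FileReadFn` are hypothetical names, and a real version would presumably subclass `arrow::io::RandomAccessFile` and forward its other methods to the wrapped file. Since the IPC file footer sits at the end of the file, the buffered range would likely be the file's tail, but that choice is also an assumption.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the slow path: reads nbytes at `position` from
// the real file into `out` and returns the number of bytes read.
using FileReadFn =
    std::function<int64_t(int64_t position, int64_t nbytes, void* out)>;

// Hypothetical class (not an Arrow API): serves reads that fall entirely
// inside a cached byte range from memory and forwards all others to the file.
class PartiallyBufferedReader {
 public:
  PartiallyBufferedReader(int64_t buffer_offset, std::vector<uint8_t> buffer,
                          FileReadFn read_from_file)
      : buffer_offset_(buffer_offset),
        buffer_(std::move(buffer)),
        read_from_file_(std::move(read_from_file)) {}

  int64_t ReadAt(int64_t position, int64_t nbytes, void* out) {
    if (nbytes <= 0) return 0;
    const int64_t buffer_end =
        buffer_offset_ + static_cast<int64_t>(buffer_.size());
    // Fast path: the whole request lies inside the buffered range.
    if (position >= buffer_offset_ && position + nbytes <= buffer_end) {
      std::memcpy(out, buffer_.data() + (position - buffer_offset_),
                  static_cast<std::size_t>(nbytes));
      return nbytes;
    }
    // Slow path: anything outside (or straddling) the range hits the file.
    return read_from_file_(position, nbytes, out);
  }

 private:
  int64_t buffer_offset_;        // file offset of the first buffered byte
  std::vector<uint8_t> buffer_;  // cached bytes, e.g. the file's tail
  FileReadFn read_from_file_;
};
```

As Weston notes, this only removes repeated I/O for the buffered range. If the cost is the CPU time spent parsing the metadata, a wrapper like this would not help.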
