Re: storing per record batch metadata in arrow IPC file

Weston Pace Tue, 05 Apr 2022 23:41:16 -0700

Correct, the "ground truth" so to speak for these things is probably
the flatbuffers files[1] (Message.fbs, File.fbs, and Schema.fbs in
this case). There is a per-message custom metadata field that could be
used as you describe.  The C++ implementation does not expose this
today, as far as I can tell, so if you want to use it then some C++
changes will be needed.  There is already a JIRA ticket for this at
[2].


> the metadata may
> vary from batch to batch in an IPC file, and I can filter these batches
> quickly simply using metadata without looking into data in the arrays.

> There is a similar effort I can find on the web [1], but it stores all the
> record batches metadata in the IPC file footer's schema. I think the footer
> will be fully loaded for every access, which will introduce some
> unnecessary IO if only a few of the record batches are read each time.

I'm not sure the two above statements work together well.  If you want
to use the metadata to determine which batches to read then you will
need to read the metadata for every single batch.  So it doesn't make
sense to spread this information throughout the file.
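
To make that concrete, here is a toy sketch in plain Python. It does not
use the Arrow IPC format; the two layouts, the `CountingFile` helper, and
the field names are all hypothetical. It only counts how many reads are
needed to find the matching batches when the metadata sits next to each
batch versus in a single footer.

```python
import io
import json
import struct


class CountingFile:
    """Wraps a bytes buffer and counts reads, a rough stand-in for I/O requests."""

    def __init__(self, data):
        self._f = io.BytesIO(data)
        self.reads = 0

    def read_at(self, offset, size):
        self.reads += 1
        self._f.seek(offset)
        return self._f.read(size)


def write_inline(batches):
    """Toy layout A: [meta_len, body_len][meta][body] repeated for each batch."""
    out = io.BytesIO()
    for meta, body in batches:
        m = json.dumps(meta).encode()
        out.write(struct.pack("<II", len(m), len(body)))
        out.write(m)
        out.write(body)
    return out.getvalue()


def write_footer(batches):
    """Toy layout B: all bodies, then one JSON index, then the index length."""
    out = io.BytesIO()
    index = []
    for meta, body in batches:
        index.append({"offset": out.tell(), "size": len(body), "meta": meta})
        out.write(body)
    footer = json.dumps(index).encode()
    out.write(footer)
    out.write(struct.pack("<I", len(footer)))
    return out.getvalue()


def select_inline(f, total_size, pred):
    """Must hop through every batch to inspect its metadata."""
    hits, pos = [], 0
    while pos < total_size:
        m_len, b_len = struct.unpack("<II", f.read_at(pos, 8))
        meta = json.loads(f.read_at(pos + 8, m_len))
        if pred(meta):
            hits.append((pos + 8 + m_len, b_len))
        pos += 8 + m_len + b_len
    return hits


def select_footer(f, total_size, pred):
    """Reads the footer length, then the whole index in one go."""
    (f_len,) = struct.unpack("<I", f.read_at(total_size - 4, 4))
    index = json.loads(f.read_at(total_size - 4 - f_len, f_len))
    return [(e["offset"], e["size"]) for e in index if pred(e["meta"])]


# 50 fake batches, each covering a 100-unit time range with a 1000-byte body.
batches = [
    ({"host": f"h{i}", "min_ts": i * 100, "max_ts": i * 100 + 99}, bytes(1000))
    for i in range(50)
]


def pred(meta):
    return meta["min_ts"] <= 1234 <= meta["max_ts"]


inline, footer = write_inline(batches), write_footer(batches)
f_inline, f_footer = CountingFile(inline), CountingFile(footer)
hits_inline = select_inline(f_inline, len(inline), pred)
hits_footer = select_footer(f_footer, len(footer), pred)
print("inline reads:", f_inline.reads, "footer reads:", f_footer.reads)
```

In this toy setup the inline layout needs two small reads per batch before
any data is fetched, while the footer layout needs two reads in total, no
matter how many batches the file holds.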

On the other hand, if you already know what subset of batches you are
interested in, then I could maybe see some advantage in storing the
metadata separately but only if the metadata is quite large.  If the
metadata is relatively small (KBs) then I still think you'd be better
off storing it all in the footer in most cases (or there wouldn't be
much difference).

If you're doing streaming processing of the entire file then it
probably doesn't matter much either way.

So there might be some potential here but I wouldn't say it is a sure thing.

[1] https://github.com/apache/arrow/tree/master/format
[2] https://issues.apache.org/jira/browse/ARROW-6940

On Tue, Apr 5, 2022 at 7:26 PM Yue Ni <[email protected]> wrote:
>
> Hi Aldrin,
>
> Thanks for the pointers. I checked out the C++ source code of this part,
> and I think currently record batch specific metadata is not written into
> the IPC file probably due to a bug in the code. I logged a bug to track
> this issue (https://issues.apache.org/jira/browse/ARROW-16131), thanks so
> much for the help.
>
> On Wed, Apr 6, 2022 at 12:58 AM Aldrin <[email protected]> wrote:
>
> > Hm, I didn't think it was possible, but it looks like there may be some
> > things you can try?
> >
> > My understanding was that you create a writer for an IPC stream or file and
> > you pass a schema on construction which is used as "the schema" for the IPC
> > stream/file. So, RecordBatches written using that writer should/need to
> > match the given schema. This doesn't check the metadata, I don't think, but
> > it only writes an "IPC payload" if the equality check passes.
> >
> > That being said, I did some checking, and it seems like things are more
> > flexible now (but I could be wrong). I'm not sure what the dictionary
> > deltas are (maybe it's for dictionary arrays rather than metadata), but
> > the "emit_dictionary_deltas" IpcOption may be relevant [1]. Otherwise, the
> > `WriteRecordBatch` function appears to take a metadata length [2] and the
> > `WriteRecordBatchStream` function [3] seems to only check that a vector of
> > RecordBatches have matching schemas. Also, the `WritePayload` function
> > (from a RecordBatchWriter via MakeFileWriter) seems to be relevant for how
> > to write metadata that can be leveraged for a seek-based interface [4].
> >
> > But, ultimately, I am not sure these things are exposed at a higher level
> > (e.g. pyarrow), even though they're available for use. They're also not
> > exposed via the feather interface, as far as I know.
> >
> > [1]: https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions22emit_dictionary_deltasE
> > [2]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644
> > [3]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665
> > [4]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> >
> > On Tue, Apr 5, 2022 at 1:55 AM Yue Ni <[email protected]> wrote:
> >
> > > Hi there,
> > >
> > > I am investigating how to analyze time series data using Apache Arrow. I
> > > would like to store some record batch specific metadata, for example, some
> > > statistics/tags about data in a particular record batch. More specifically,
> > > I may use a single record batch to store metric samples for a certain time
> > > range, and would like to store the min/max time and some dimensional data
> > > like `host` and `aws_region` as metadata for a particular record batch so
> > > that when loading multiple record batches from IPC file, the metadata may
> > > vary from batch to batch in an IPC file, and I can filter these batches
> > > quickly simply using metadata without looking into data in the arrays. And
> > > I would like to know if it is possible to store such per record batch
> > > metadata in an arrow IPC file.
> > >
> > > There is a similar effort I can find on the web [1], but it stores all the
> > > record batches metadata in the IPC file footer's schema. I think the footer
> > > will be fully loaded for every access, which will introduce some
> > > unnecessary IO if only a few of the record batches are read each time.
> > >
> > > I read some docs/source code [2] [3], and if my understanding is correct,
> > > it is technically possible to store different metadata in different record
> > > batches since in the streaming format, each message has a `custom_metadata`
> > > associated with it. But I can't find any API (at least in pyarrow) allowing
> > > me to do this. APIs like `pyarrow.record_batch` do allow users to specify
> > > metadata when constructing a record batch, but it doesn't seem to be used
> > > if `RecordBatchFileWriter` has a schema provided (which of course doesn't
> > > have such record batch specific metadata).
> > >
> > > I haven't looked into the lower level C++ API yet, and it seems the
> > > assumption is that all the batches in the IPC file should share the same
> > > schema, but do we allow them to have different metadata if the schema
> > > (field names and their types) is the same? If we don't allow such usage
> > > currently, do you think it is a valid use case to support this kind of
> > > usage? Thanks.
> > >
> > > [1] https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
> > > [2] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
> > > [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
> > >
> >
