Actually, if you are doing streaming processing, you would have to store it with the record batch since there is no footer :)
On Tue, Apr 5, 2022 at 8:40 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> Correct, the "ground truth" so to speak for these things is probably
> the flatbuffers files[1] (Message.fbs, Schema.fbs, and File.fbs in
> this case). There is a per-message custom metadata field that could be
> used as you describe. The C++ implementation does not expose this
> today that I can tell. So if you want to use this then some C++
> changes will be needed. There is already a JIRA ticket for this at
> [2].
>
> > the metadata may
> > vary from batch to batch in an IPC file, and I can filter these batches
> > quickly simply using metadata without looking into data in the arrays.
>
> > There is a similar effort I can find on the web [1], but it stores all the
> > record batches metadata in the IPC file footer's schema. I think the footer
> > will be fully loaded for every access, which will introduce some
> > unnecessary IO if only a few of the record batches are read each time.
>
> I'm not sure the two above statements work together well. If you want
> to use the metadata to determine which batches to read then you will
> need to read the metadata for every single batch. So it doesn't make
> sense to spread this information throughout the file.
>
> On the other hand, if you already know what subset of batches you are
> interested in, then I could maybe see some advantage in storing the
> metadata separately but only if the metadata is quite large. If the
> metadata is relatively small (KBs) then I still think you'd be better
> off storing it all in the footer in most cases (or there wouldn't be
> much difference).
>
> If you're doing streaming processing of the entire file then it
> probably doesn't matter much either way.
>
> So there might be some potential here but I wouldn't say it is a sure thing.
>
> [1] https://github.com/apache/arrow/tree/master/format
> [2] https://issues.apache.org/jira/browse/ARROW-6940
>
> On Tue, Apr 5, 2022 at 7:26 PM Yue Ni <niyue....@gmail.com> wrote:
> >
> > Hi Aldrin,
> >
> > Thanks for the pointers. I checked out the C++ source code of this part,
> > and I think currently record batch specific metadata is not written into
> > the IPC file probably due to a bug in the code. I logged a bug to track
> > this issue (https://issues.apache.org/jira/browse/ARROW-16131), thanks so
> > much for the help.
> >
> > On Wed, Apr 6, 2022 at 12:58 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
> >
> > > Hm, I didn't think it was possible, but it looks like there may be some
> > > things you can try?
> > >
> > > My understanding was that you create a writer for an IPC stream or file
> > > and you pass a schema on construction which is used as "the schema" for
> > > the IPC stream/file. So, RecordBatches written using that writer
> > > should/need to match the given schema. This doesn't check the metadata,
> > > I don't think, but it only writes an "IPC payload" if the equality check
> > > passes.
> > >
> > > That being said, I did some checking, and some things seem like it's more
> > > flexible now (but I could be wrong). I'm not sure what the dictionary
> > > deltas are (maybe it's for dictionary arrays rather than metadata), but
> > > the "emit_dictionary_deltas" IpcOption may be relevant [1]. Otherwise, the
> > > `WriteRecordBatch` function appears to take a metadata length [2] and the
> > > `WriteRecordBatchStream` function [3] seems to only check that a vector of
> > > RecordBatches have matching schemas. Also, the `WritePayload` function
> > > (from a RecordBatchWriter via MakeFileWriter) seems to be relevant for how
> > > to write metadata that can be leveraged for a seek-based interface [4].
> > >
> > > But, ultimately, I am not sure these things are exposed at a higher level
> > > (e.g. pyarrow), even though they're available for use. They're also not
> > > exposed via the feather interface, as far as I know.
> > >
> > > [1]: https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions22emit_dictionary_deltasE
> > > [2]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644
> > > [3]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665
> > > [4]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253
> > >
> > > Aldrin Montana
> > > Computer Science PhD Student
> > > UC Santa Cruz
> > >
> > >
> > > On Tue, Apr 5, 2022 at 1:55 AM Yue Ni <niyue....@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I am investigating analyzing time series data using apache arrow. I
> > > > would like to store some record batch specific metadata, for example,
> > > > some statistics/tags about data in a particular record batch. More
> > > > specifically, I may use a single record batch to store metric samples
> > > > for a certain time range, and would like to store the min/max time and
> > > > some dimensional data like `host` and `aws_region` as metadata for a
> > > > particular record batch so that when loading multiple record batches
> > > > from IPC file, the metadata may vary from batch to batch in an IPC
> > > > file, and I can filter these batches quickly simply using metadata
> > > > without looking into data in the arrays. And I would like to know if it
> > > > is possible to store such per record batch metadata in an arrow IPC
> > > > file.
> > > >
> > > > There is a similar effort I can find on the web [1], but it stores all
> > > > the record batches metadata in the IPC file footer's schema. I think
> > > > the footer will be fully loaded for every access, which will introduce
> > > > some unnecessary IO if only a few of the record batches are read each
> > > > time.
> > > >
> > > > I read some docs/source code [2] [3], and if my understanding is
> > > > correct, it is technically possible to store different metadata in
> > > > different record batches since in the streaming format, each message
> > > > has a `custom_metadata` associated with it. But I don't find any API
> > > > (at least in pyarrow) allowing me to do this. APIs like
> > > > `pyarrow.record_batch` do allow users to specify metadata when
> > > > constructing a record batch, but it doesn't seem to be used if
> > > > `RecordBatchFileWriter` has a schema provided (which of course doesn't
> > > > have such record batch specific metadata).
> > > >
> > > > I haven't looked into the lower level C++ API yet, and it seems the
> > > > assumption is that all the batches in the IPC file should share the same
> > > > schema, but do we allow them to have different metadata if the schema
> > > > (field names and their types) is the same? If we don't allow such usage
> > > > currently, do you think it is a valid use case to support this kind of
> > > > usage? Thanks.
> > > >
> > > > [1] https://github.com/heterodb/pg-strom/wiki/806%3A-Apache-Arrow-Min-Max-Statistics-Hint
> > > > [2] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
> > > > [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py