If we tried to tack this on, I think it would be worth going through the design effort to see whether something is possible without external files. The stream format also allows more flexibility around dictionaries than the file format does, so there is a possibility of an impedance mismatch.
Before we go with our own specification for external metadata, it seems that looking at integration with something like Iceberg might make sense. My understanding is that external metadata files are on the path to deprecation, or at least not recommended, in Parquet [1].

[1] https://lists.apache.org/thread.html/r9897237ce76287e66109994320d876d32e11db6acc32490b99a41842%40%3Cdev.parquet.apache.org%3E

On Wed, Jul 14, 2021 at 4:53 PM Wes McKinney wrote:

> On Wed, Jul 14, 2021 at 5:40 PM Aldrin wrote:
> >
> > Forgive me if I am misunderstanding the context, but my initial impression would be that this is solved at a higher layer than the file format. While some approaches make sense at the file format level, that approach may not be the best. I suspect that book-keeping for this type of conversion would be affected by batching granularity (can you group multiple streamed batches) and what type of process/job it is (is the job at the level of like a bash script? is the job at the level of a copy task?).
> >
> > Some questions and thoughts below:
> >
> >> One thing that occurs to me is whether we could enable the file footer metadata to live in a "sidecar" file to support this use case.
> >
> > This sounds like a good, simple approach that could serve as a default. But I feel like this is essentially the same as maintaining an independent metadata file, that could be described in a cookbook or something. Seems odd to me, personally, to include it in the format definition.
>
> The problem with this is that it is not compliant with our specification ([1]), so applications would not be able to hope for any interoperability. Parquet provides for file footer metadata living separate from the row groups (akin to our record batches), and this is formalized in the format ([2]). None of the Arrow projects have any mechanism to deal with the Footer independently; to do something with that metadata that is not in the project specification is not something we could support and provide backward/forward compatibilities for.
>
> [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> [2]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787
>
> >> 3. Doing a pass over the record batches to gather the information required to generate the footer data.
> >
> > Could you maintain footer data incrementally and always write to the same spot whenever some number of batches are written to the destination?
> >
> >> 2. Writing batches out as they appear.
> >
> > Might batches be received out of order? Is this long running job streaming over a network connection? Might the source be distributed/striped over multiple sources/locations?
> >
> >> a use case where there is a long running job producing results as it goes that may die and therefore must be restarted
> >
> > Would the long running job only be handling independent streams, concurrently? e.g. is it an asynchronous job that handles a single logical stream, or does it manage a pool of streams for concurrent requesting processes?
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> > On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney wrote:
> >>
> >> hi Sam, it's an interesting proposition. Other file formats like Parquet don't make "resuming" particularly easy, either. The magic number at the beginning of an Arrow file means that it's a lot more expensive to turn a stream file into an Arrow-file-file; if we'd thought about this use case, we might have chosen to only put the magic number at the end of the file.
> >>
> >> It's also not possible to put the file metadata "outside" the stream file. One thing that occurs to me is whether we could enable the file footer metadata to live in a "sidecar" file to support this use case. To enable this, we would have to add a new optional field to Footer in File.fbs that indicates the file path that the Footer references. This would be null when the footer is part of the same file where the data lives. A function could be implemented to produce this "sidecar index" file from a stream file.
> >>
> >> Not sure on others' thoughts about this.
> >>
> >> Thanks,
> >> Wes
> >>
> >> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm interested in a use case where there is a long running job producing results as it goes that may die and therefore must be restarted, making sure to continue from the last known-good point.
> >> >
> >> > For this use case, it seems best to use the "IPC Streaming Format" and write out the batches as they are generated.
> >> >
> >> > However, once the job is finished it would also be beneficial to have random access into the file. It seems like this is possible by:
> >> >
> >> > 1. Manually creating a file with the correct magic number/padding bytes and then seeking past them.
> >> > 2. Writing batches out as they appear.
> >> > 3. Doing a pass over the record batches to gather the information required to generate the footer data.
> >> >
> >> > Whilst this seems possible, it doesn't seem like it is a use case that has come up before. However, this does surprise me because adding index information to a "completed" file seems like a genuinely useful thing to want to do.
> >> >
> >> > Has anyone encountered something similar before?
> >> >
> >> > Is there an easier way to achieve this? i.e. does this functionality, or parts of, exist in another language that I can bind to in Python?
> >> >
> >> > Best,
> >> >
> >> > Sam
