Forgive me if I am misunderstanding the context, but my initial impression is that this is better solved at a higher layer than the file format. Some approaches make sense at the file format level, but doing so here may not be the best option. I suspect that the book-keeping for this type of conversion would depend on batching granularity (can you group multiple streamed batches?) and on what type of process/job it is (is the job at the level of a bash script, or at the level of a copy task?).
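As a rough illustration of what I mean by handling this above the file format, here is a minimal Python sketch of a job that keeps its own index file next to the data file and uses it to resume from the last known-good batch. Everything in it is an assumption for illustration: the payloads are opaque bytes standing in for serialized record batches, the 16-byte (offset, length) index entry is a made-up layout rather than anything from the Arrow IPC format, and the names `resume`, `append_batch`, `read_batch` and `ENTRY` are hypothetical.

```python
import os
import struct

# Hypothetical index entry: (offset, length) of one batch, little-endian uint64s.
ENTRY = struct.Struct("<QQ")


def load_index(index_path):
    """Return the (offset, length) entries that were fully written to the index."""
    if not os.path.exists(index_path):
        return []
    with open(index_path, "rb") as f:
        raw = f.read()
    usable = len(raw) - len(raw) % ENTRY.size  # ignore a torn trailing entry
    return [ENTRY.unpack_from(raw, pos) for pos in range(0, usable, ENTRY.size)]


def resume(data_path, index_path):
    """Cut both files back to the last indexed batch and return the entries."""
    entries = load_index(index_path)
    good_end = entries[-1][0] + entries[-1][1] if entries else 0
    with open(data_path, "ab") as f:  # "ab" creates the file if it is missing
        f.truncate(good_end)  # drop any batch that was written but not indexed
    with open(index_path, "ab") as f:
        f.truncate(len(entries) * ENTRY.size)  # drop a partially written entry
    return entries


def append_batch(data_path, index_path, entries, payload):
    """Append one batch, make it durable, and only then record it in the index."""
    with open(data_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    with open(index_path, "ab") as f:
        f.write(ENTRY.pack(offset, len(payload)))
        f.flush()
        os.fsync(f.fileno())
    entries.append((offset, len(payload)))


def read_batch(data_path, entries, i):
    """Random access: read batch i using its recorded offset and length."""
    offset, length = entries[i]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

A job would call `resume(...)` once at start-up and then `append_batch(...)` for each result. Because the data is synced before the index entry is written, the index should never point past the end of the data, so a restart only ever has to discard the one batch that was in flight.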
Some questions and thoughts below:

> One thing that occurs to me is whether we could enable the file
> footer metadata to live in a "sidecar" file to support this use case.

This sounds like a good, simple approach that could serve as a default. But I feel like this is essentially the same as maintaining an independent metadata file, which could be described in a cookbook or something. Seems odd to me, personally, to include it in the format definition.

> 3. Doing a pass over the record batches to gather the information required
> to generate the footer data.

Could you maintain footer data incrementally and always write to the same spot whenever some number of batches are written to the destination?

> 2. Writing batches out as they appear.

Might batches be received out of order? Is this long running job streaming over a network connection? Might the source be distributed/striped over multiple sources/locations?

> a use case where there is a long running job producing results as it goes
> that may die and therefore must be restarted

Would the long running job only be handling independent streams, concurrently? e.g. is it an asynchronous job that handles a single logical stream, or does it manage a pool of streams for concurrent requesting processes?

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney <[email protected]> wrote:

> hi Sam, it's an interesting proposition. Other file formats like
> Parquet don't make "resuming" particularly easy, either. The magic
> number at the beginning of an Arrow file means that it's a lot more
> expensive to turn a stream file into an Arrow-file-file; if we'd
> thought about this use case, we might have chosen to only put the
> magic number at the end of the file.
>
> It's also not possible to put the file metadata "outside" the stream
> file. One thing that occurs to me is whether we could enable the file
> footer metadata to live in a "sidecar" file to support this use case.
> To enable this, we would have to add a new optional field to Footer in
> File.fbs that indicates the file path that the Footer references. This
> would be null when the footer is part of the same file where the data
> lives. A function could be implemented to produce this "sidecar index"
> file from a stream file.
>
> Not sure on others' thoughts about this.
>
> Thanks,
> Wes
>
> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis <[email protected]>
> wrote:
> >
> > Hi,
> >
> > I'm interested in a use case where there is a long running job
> > producing results as it goes that may die and therefore must be
> > restarted, making sure to continue from the last known-good point.
> >
> > For this use case, it seems best to use the "IPC Streaming Format" and
> > write out the batches as they are generated.
> >
> > However, once the job is finished it would also be beneficial to have
> > random access into the file. It seems like this is possible by:
> >
> > 1. Manually creating a file with the correct magic number/padding
> >    bytes and then seq'ing past them.
> > 2. Writing batches out as they appear.
> > 3. Doing a pass over the record batches to gather the information
> >    required to generate the footer data.
> >
> > Whilst this seems possible, it doesn't seem like it is a use case that
> > has come up before. However, this does surprise me because adding index
> > information to a "completed" file seems like a genuinely useful thing
> > to want to do.
> >
> > Has anyone encountered something similar before?
> >
> > Is there an easier way to achieve this? i.e. does this functionality,
> > or parts of, exist in another language that I can bind to in Python?
> >
> > Best,
> >
> > Sam
