Forgive me if I am misunderstanding the context, but my initial impression
is that this is better solved at a layer above the file format. Some
approaches make sense at the file format level, but I am not sure this is
one of them. I suspect that the bookkeeping for this type of conversion
would depend on batching granularity (can you group multiple streamed
batches?) and on what type of process/job it is (is the job at the level of
a bash script? Is it at the level of a copy task?).

Some questions and thoughts below:


> One thing that occurs to me is whether we could enable the file
> footer metadata to live in a "sidecar" file to support this use case.
>

This sounds like a good, simple approach that could serve as a default. But
I feel like this is essentially the same as maintaining an independent
metadata file, which could be described in a cookbook or similar.
Personally, it seems odd to me to include it in the format definition.
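
To make the "independent metadata file" idea concrete, here is a minimal
Python sketch of what such a cookbook recipe might look like. It does not
use Arrow's actual IPC framing: the 4-byte length prefix, the JSON index
layout, and the names append_batches, build_sidecar, and read_batch are all
assumptions for illustration, and batch payloads are treated as opaque bytes.

```python
import json
import os
import struct

# Hypothetical framing (NOT the Arrow IPC encapsulation):
# each batch is a 4-byte little-endian length followed by the payload.
LEN = struct.Struct("<I")


def append_batches(data_path, payloads):
    """Append opaque batch payloads, each preceded by its length."""
    with open(data_path, "ab") as f:
        for payload in payloads:
            f.write(LEN.pack(len(payload)))
            f.write(payload)


def build_sidecar(data_path, index_path):
    """Scan the data file once and record [offset, length] per batch."""
    size = os.path.getsize(data_path)
    entries = []
    with open(data_path, "rb") as f:
        while True:
            header = f.read(LEN.size)
            if len(header) < LEN.size:
                break  # end of file, or a truncated trailing header
            (length,) = LEN.unpack(header)
            offset = f.tell()
            if offset + length > size:
                break  # truncated trailing batch: leave it out of the index
            entries.append([offset, length])
            f.seek(offset + length)
    with open(index_path, "w") as f:
        json.dump(entries, f)
    return entries


def read_batch(data_path, index_path, i):
    """Random access: look up batch i in the sidecar, then seek and read."""
    with open(index_path) as f:
        offset, length = json.load(f)[i]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

The data file is never modified here; the index can be rebuilt or thrown
away at any time, which is roughly why I think of it as living outside the
format definition.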


> 3. Doing a pass over the record batches to gather the information required
> to generate the footer data.
>

Could you maintain footer data incrementally and always write to the same
spot whenever some number of batches are written to the destination?
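
As a rough illustration of what I mean, the sketch below keeps the index in
memory and rewrites it at the same path every few batches, so a restarted job
can truncate the data file back to the last checkpoint and continue. This is
only a sketch under assumptions: the length-prefix framing, the JSON index,
the CheckpointedWriter class, and its `every` parameter are hypothetical and
not part of Arrow, and payloads are opaque bytes.

```python
import json
import os
import struct

# Hypothetical framing (NOT the Arrow IPC encapsulation):
# 4-byte little-endian length, then the payload.
LEN = struct.Struct("<I")


class CheckpointedWriter:
    """Append batches; persist the [offset, length] index every `every` batches."""

    def __init__(self, data_path, index_path, every=2):
        self.index_path = index_path
        self.every = every
        self.pending = 0
        self.entries = []
        if os.path.exists(index_path):
            with open(index_path) as f:
                self.entries = json.load(f)
        # Last known-good point: end (offset + length) of the last indexed batch.
        good_end = sum(self.entries[-1]) if self.entries else 0
        mode = "r+b" if os.path.exists(data_path) else "w+b"
        self.f = open(data_path, mode)
        self.f.truncate(good_end)  # drop bytes written after the last checkpoint
        self.f.seek(good_end)

    def write(self, payload):
        self.f.write(LEN.pack(len(payload)))
        offset = self.f.tell()
        self.f.write(payload)
        self.entries.append([offset, len(payload)])
        self.pending += 1
        if self.pending >= self.every:
            self.checkpoint()

    def checkpoint(self):
        # Make the data durable before the index claims it exists.
        self.f.flush()
        os.fsync(self.f.fileno())
        tmp = self.index_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.index_path)  # atomically swap in the new index
        self.pending = 0

    def close(self):
        self.checkpoint()
        self.f.close()
```

On restart, constructing the writer again with the same two paths discards
any batches that were written but not yet checkpointed, so the producer would
need to regenerate those. A larger `every` means fewer index rewrites but
more work to redo after a crash.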


> 2. Writing batches out as they appear.
>

Might batches be received out of order? Is this long-running job streaming
over a network connection? Might the data be distributed/striped over
multiple sources/locations?


> a use case where there is a long running job producing results as it goes
> that may die and therefore must be restarted
>

Would the long-running job only be handling independent streams
concurrently? For example, is it an asynchronous job that handles a single
logical stream, or does it manage a pool of streams for concurrently
requesting processes?

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney <[email protected]> wrote:

> hi Sam — it's an interesting proposition. Other file formats like
> Parquet don't make "resuming" particularly easy, either. The magic
> number at the beginning of an Arrow file means that it's a lot more
> expensive to turn a stream file into an Arrow-file-file — if we'd
> thought about this use case, we might have chosen to only put the
> magic number at the end of the file.
>
> It's also not possible to put the file metadata "outside" the stream
> file. One thing that occurs to me is whether we could enable the file
> footer metadata to live in a "sidecar" file to support this use case.
> To enable this, we would have to add a new optional field to Footer in
> File.fbs that indicates the file path that the Footer references. This
> would be null when the footer is part of the same file where the data
> lives. A function could be implemented to produce this "sidecar index"
> file from a stream file.
>
> Not sure on others' thoughts about this.
>
> Thanks,
> Wes
>
>
> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis <[email protected]>
> wrote:
> >
> > Hi,
> >
> > I'm interested in a use case where there is a long running job producing
> results as it goes that may die and therefore must be restarted, making
> sure to continue from the last known-good point.
> >
> > For this use case, it seems best to use the "IPC Streaming Format" and
> write out the batches as they are generated.
> >
> > However, once the job is finished it would also be beneficial to have
> random access into the file. It seems like this is possible by:
> >
> > 1. Manually creating a file with the correct magic number/padding bytes and
> then seeking past them.
> > 2. Writing batches out as they appear.
> > 3. Doing a pass over the record batches to gather the information required
> to generate the footer data.
> >
> >
> > Whilst this seems possible, it doesn't seem like it is a use case that
> has come up before. However, this does surprise me because adding index
> information to a "completed" file seems like a genuinely useful thing to
> want to do.
> >
> > Has anyone encountered something similar before?
> >
> > Is there an easier way to achieve this? i.e. does this functionality, or
> parts of, exist in another language that I can bind to in Python?
> >
> > Best,
> >
> > Sam
> >
> >
>
