Re: [Python/C++] Streaming Format to IPC File Format Conversion

Micah Kornfield Mon, 09 Aug 2021 21:47:55 -0700

>
> but I see the
> value in providing a way to do random access into a stream-file after
> writing it without having to rewrite the file into the file format
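
As a rough illustration of what random access into a stream file requires,
the sketch below walks the encapsulated-message framing of a stream and
records where each message starts. It is not an Arrow API: `scan_stream` and
`_message_fields` are hypothetical names, the flatbuffer `Message` table is
decoded by hand from the field order in Message.fbs, and the current framing
with a 0xFFFFFFFF continuation marker is assumed (older streams without the
marker are not handled). Dictionary batches are only labelled here; dealing
with dictionary deltas is out of scope.

```python
import struct

CONTINUATION = 0xFFFFFFFF
# MessageHeader union values as declared in Arrow's Message.fbs.
HEADER_NAMES = {1: "Schema", 2: "DictionaryBatch", 3: "RecordBatch"}


def _message_fields(meta: bytes):
    """Return (header_type, body_length) from a flatbuffer-encoded Message."""
    (table_pos,) = struct.unpack_from("<I", meta, 0)        # root table offset
    (soffset,) = struct.unpack_from("<i", meta, table_pos)  # distance back to vtable
    vtable_pos = table_pos - soffset
    (vtable_size,) = struct.unpack_from("<H", meta, vtable_pos)

    def field_offset(field_id):
        # vtable layout: u16 vtable size, u16 table size, then one u16 per field.
        slot = 4 + 2 * field_id
        if slot + 2 > vtable_size:
            return 0  # field not present, default value applies
        return struct.unpack_from("<H", meta, vtable_pos + slot)[0]

    off = field_offset(1)  # header_type (union tag)
    header_type = meta[table_pos + off] if off else 0
    off = field_offset(3)  # bodyLength
    body_length = struct.unpack_from("<q", meta, table_pos + off)[0] if off else 0
    return header_type, body_length


def scan_stream(data: bytes):
    """List the complete messages in an IPC stream with their byte offsets."""
    blocks = []
    pos = 0
    while pos + 8 <= len(data):
        marker, meta_size = struct.unpack_from("<Ii", data, pos)
        if marker != CONTINUATION:
            raise ValueError(f"no continuation marker at offset {pos}")
        if meta_size == 0:
            break  # end-of-stream marker
        meta_start = pos + 8
        if meta_start + meta_size > len(data):
            break  # metadata cut off, e.g. the writer died mid-message
        header_type, body_length = _message_fields(
            data[meta_start:meta_start + meta_size]
        )
        end = meta_start + meta_size + body_length
        if end > len(data):
            break  # body cut off
        blocks.append({
            "kind": HEADER_NAMES.get(header_type, "Other"),
            "offset": pos,               # start of the continuation marker
            "metadata_size": meta_size,  # as declared in the length prefix
            "body_length": body_length,
        })
        pos = end
    return blocks
```

The entries of kind "RecordBatch" hold the same three pieces of information
the file footer keeps per batch (offset, metadata length, body length), so a
list like this could be kept next to the stream file by an application
without rewriting the data.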



Another path forward seems to be what Sam initially called out as a
workflow, i.e. create a new API that takes a partially written file in the
IPC file format and allows for "finishing it".  I think the complicated part
is likely determining a resumption point (which could be an API input, so
that people can determine their own system for doing this transactionally).
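
One way an application could track such a resumption point on its own is
sketched below. This is only an illustration and not an Arrow API: the
function names, the checkpoint file, and the idea of passing in an
already-serialized batch as `payload` are all assumptions. The last
known-good byte offset is stored in a small side file that is replaced
atomically after the data has been flushed to disk.

```python
import os


def append_batch(data_path: str, ckpt_path: str, payload: bytes) -> int:
    """Append one serialized batch, then commit the new known-good offset."""
    with open(data_path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # the data must be durable before the checkpoint
        end = os.fstat(f.fileno()).st_size
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(end))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, ckpt_path)  # atomic swap of the checkpoint file
    return end


def resume(data_path: str, ckpt_path: str) -> int:
    """Drop any partially written tail and return the offset to continue from."""
    try:
        with open(ckpt_path) as f:
            offset = int(f.read())
    except FileNotFoundError:
        offset = 0  # nothing was ever committed
    with open(data_path, "ab") as f:
        f.truncate(offset)
    return offset
```

After a restart, `resume` returns the offset that a hypothetical "finish this
file" API could take as its input; everything past that offset is treated as
an incomplete write and discarded.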

-Micah

[1] https://github.com/apache/arrow/pull/4815

On Fri, Jul 16, 2021 at 12:19 PM Wes McKinney <[email protected]> wrote:

> hi Micah — makes sense. I agree that starting down the path of "table
> management" in Arrow is probably too much scope creep since the
> requirements (e.g. schema evolution) can vary so much, but I see the
> value in providing a way to do random access into a stream-file after
> writing it without having to rewrite the file into the file format
> (which may be tricky given possible issues with dictionary deltas)
>
> On Wed, Jul 14, 2021 at 10:58 PM Micah Kornfield <[email protected]>
> wrote:
> >
> > If we tried to tack this on, I think it might be worth going through
> the design effort to see if something is possible without
> external files.  The stream format also allows more flexibility around
> dictionaries than the file format does, so there is a possibility of
> impedance mismatch.
> >
> > Before we went with our own specification for external metadata it seems
> that looking at integration with something like Iceberg might make sense.
> >
> > My understanding is that external metadata files are on the path to
> deprecation, or at least not recommended, in Parquet [1].
> >
> > [1]
> https://lists.apache.org/thread.html/r9897237ce76287e66109994320d876d32e11db6acc32490b99a41842%40%3Cdev.parquet.apache.org%3E
> >
> > On Wed, Jul 14, 2021 at 4:53 PM Wes McKinney <[email protected]>
> wrote:
> >>
> >> On Wed, Jul 14, 2021 at 5:40 PM Aldrin <[email protected]> wrote:
> >> >
> >> > Forgive me if I am misunderstanding the context, but my initial
> impression would be that this is solved at a higher layer than the file
> format. While some approaches make sense at
> >> > the file format level, that approach may not be the best. I suspect
> that book-keeping for this type of conversion would be affected by batching
> granularity (can you group multiple
> >> > streamed batches) and what type of process/job it is (is the job at
> the level of like a bash script? is the job at the level of a copy task?).
> >> >
> >> > Some questions and thoughts below:
> >> >
> >> >
> >> >> One thing that occurs to me is whether we could enable the file
> >> >> footer metadata to live in a "sidecar" file to support this use case.
> >> >
> >> >
> >> > This sounds like a good, simple approach that could serve as a
> default. But I feel like this is essentially the same as maintaining an
> independent metadata file, that could be described
> >> > in a cookbook or something. Seems odd to me, personally, to include
> it in the format definition.
> >>
> >> The problem with this is that it is not compliant with our
> >> specification ([1]), so applications would not be able to hope for any
> >> interoperability. Parquet provides for file footer metadata living
> >> separate from the row groups (akin to our record batches), and this is
> >> formalized in the format ([2]). None of the Arrow projects have any
> >> mechanism to deal with the Footer independently — to do something with
> >> that metadata that is not in the project specification is not
> >> something we could support and provide backward/forward
> >> compatibilities for.
> >>
> >> [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> >> [2]:
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787
> >>
> >> >
> >> >> 3. Doing a pass over the record batches to gather the information
> required to generate the footer data.
> >> >
> >> >
> >> > Could you maintain footer data incrementally and always write to the
> same spot whenever some number of batches are written to the destination?
> >> >
> >> >
> >> >> 2. Writing batches out as they appear.
> >> >
> >> >
> >> > Might batches be received out of order? Is this long running job
> streaming over a network connection? Might the source be
> distributed/striped over multiple sources/locations?
> >> >
> >> >
> >> >> a use case where there is a long running job producing results as it
> goes that may die and therefore must be restarted
> >> >
> >> >
> >> > Would the long running job only be handling independent streams,
> concurrently? e.g. is it an asynchronous job that handles a single logical
> stream, or does it manage a pool of streams
> >> > for concurrent requesting processes?
> >> >
> >> > Aldrin Montana
> >> > Computer Science PhD Student
> >> > UC Santa Cruz
> >> >
> >> >
> >> > On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney <[email protected]>
> wrote:
> >> >>
> >> >> hi Sam — it's an interesting proposition. Other file formats like
> >> >> Parquet don't make "resuming" particularly easy, either. The magic
> >> >> number at the beginning of an Arrow file means that it's a lot more
> >> >> expensive to turn a stream file into an Arrow-file-file — if we'd
> >> >> thought about this use case, we might have chosen to only put the
> >> >> magic number at the end of the file.
> >> >>
> >> >> It's also not possible to put the file metadata "outside" the stream
> >> >> file. One thing that occurs to me is whether we could enable the file
> >> >> footer metadata to live in a "sidecar" file to support this use case.
> >> >> To enable this, we would have to add a new optional field to Footer
> in
> >> >> File.fbs that indicates the file path that the Footer references.
> This
> >> >> would be null when the footer is part of the same file where the data
> >> >> lives. A function could be implemented to produce this "sidecar
> index"
> >> >> file from a stream file.
> >> >>
> >> >> Not sure on others' thoughts about this.
> >> >>
> >> >> Thanks,
> >> >> Wes
> >> >>
> >> >>
> >> >> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis <
> [email protected]> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > I'm interested in a use case where there is a long running job
> producing results as it goes that may die and therefore must be restarted,
> making sure to continue from the last known-good point.
> >> >> >
> >> >> > For this use case, it seems best to use the "IPC Streaming Format"
> and write out the batches as they are generated.
> >> >> >
> >> >> > However, once the job is finished it would also be beneficial to
> have random access into the file. It seems like this is possible by:
> >> >> >
> >> >> > 1. Manually creating a file with the correct magic number/padding
> bytes and then seeking past them.
> >> >> > 2. Writing batches out as they appear.
> >> >> > 3. Doing a pass over the record batches to gather the information
> required to generate the footer data.
> >> >> >
> >> >> >
> >> >> > Whilst this seems possible, it doesn't seem like it is a use case
> that has come up before. However, this does surprise me because adding
> index information to a "completed" file seems like a genuinely useful thing
> to want to do.
> >> >> >
> >> >> > Has anyone encountered something similar before?
> >> >> >
> >> >> > Is there an easier way to achieve this? i.e. does this
> functionality, or parts of, exist in another language that I can bind to in
> Python?
> >> >> >
> >> >> > Best,
> >> >> >
> >> >> > Sam
> >> >> >
> >> >> >
>
