Another path forward seems to be what Sam initially called out as a workflow, i.e. create a new API that takes a partially written IPC File formatted file and allows for "finishing it". I think the complicated part is likely determining a resumption point (which is maybe an API input, so people can determine their own system for doing this transactionally).

-Micah

[1] https://github.com/apache/arrow/pull/4815

On Fri, Jul 16, 2021 at 12:19 PM Wes McKinney <[email protected]> wrote:

> hi Micah, makes sense. I agree that starting down the path of "table
> management" in Arrow is probably too much scope creep since the
> requirements (e.g. schema evolution) can vary so much, but I see the
> value in providing a way to do random access into a stream-file after
> writing it without having to rewrite the file into the file format
> (which may be tricky given possible issues with dictionary deltas)
>
> On Wed, Jul 14, 2021 at 10:58 PM Micah Kornfield <[email protected]>
> wrote:
> >
> > I think if we tried to tack this on, it might be worth trying to
> > go through the design effort to see if something is possible without
> > external files. The stream format also allows more flexibility around
> > dictionaries than the file format does, so there is a possibility of
> > impedance mismatch.
> >
> > Before we went with our own specification for external metadata it
> > seems that looking at integration with something like Iceberg might
> > make sense.
> >
> > My understanding is that external metadata files are on the path to
> > deprecation or at least not recommended in parquet [1].
> >
> > [1]
> > https://lists.apache.org/thread.html/r9897237ce76287e66109994320d876d32e11db6acc32490b99a41842%40%3Cdev.parquet.apache.org%3E
> >
> > On Wed, Jul 14, 2021 at 4:53 PM Wes McKinney <[email protected]>
> > wrote:
> >>
> >> On Wed, Jul 14, 2021 at 5:40 PM Aldrin <[email protected]> wrote:
> >> >
> >> > Forgive me if I am misunderstanding the context, but my initial
> >> > impression would be that this is solved at a higher layer than the
> >> > file format. While some approaches make sense at the file format
> >> > level, that approach may not be the best. I suspect that
> >> > book-keeping for this type of conversion would be affected by
> >> > batching granularity (can you group multiple streamed batches) and
> >> > what type of process/job it is (is the job at the level of like a
> >> > bash script? is the job at the level of a copy task?).
> >> >
> >> > Some questions and thoughts below:
> >> >
> >> >
> >> >> One thing that occurs to me is whether we could enable the file
> >> >> footer metadata to live in a "sidecar" file to support this use case.
> >> >
> >> >
> >> > This sounds like a good, simple approach that could serve as a
> >> > default. But I feel like this is essentially the same as
> >> > maintaining an independent metadata file, that could be described
> >> > in a cookbook or something. Seems odd to me, personally, to include
> >> > it in the format definition.
> >>
> >> The problem with this is that it is not compliant with our
> >> specification ([1]), so applications would not be able to hope for any
> >> interoperability. Parquet provides for file footer metadata living
> >> separate from the row groups (akin to our record batches), and this is
> >> formalized in the format ([2]). None of the Arrow projects have any
> >> mechanism to deal with the Footer independently; to do something with
> >> that metadata that is not in the project specification is not
> >> something we could support and provide backward/forward
> >> compatibilities for.
> >>
> >> [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
> >> [2]:
> >> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787
> >>
> >> >
> >> >> 3. Doing a pass over the record batches to gather the information
> >> >> required to generate the footer data.
> >> >
> >> >
> >> > Could you maintain footer data incrementally and always write to
> >> > the same spot whenever some number of batches are written to the
> >> > destination?
> >> >
> >> >
> >> >> 2. Writing batches out as they appear.
> >> >
> >> >
> >> > Might batches be received out of order? Is this long running job
> >> > streaming over a network connection? Might the source be
> >> > distributed/striped over multiple sources/locations?
> >> >
> >> >
> >> >> a use case where there is a long running job producing results as
> >> >> it goes that may die and therefore must be restarted
> >> >
> >> >
> >> > Would the long running job only be handling independent streams,
> >> > concurrently? e.g. is it an asynchronous job that handles a single
> >> > logical stream, or does it manage a pool of streams for concurrent
> >> > requesting processes?
> >> >
> >> > Aldrin Montana
> >> > Computer Science PhD Student
> >> > UC Santa Cruz
> >> >
> >> >
> >> > On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney <[email protected]>
> >> > wrote:
> >> >>
> >> >> hi Sam, it's an interesting proposition. Other file formats like
> >> >> Parquet don't make "resuming" particularly easy, either. The magic
> >> >> number at the beginning of an Arrow file means that it's a lot more
> >> >> expensive to turn a stream file into an Arrow-file-file; if we'd
> >> >> thought about this use case, we might have chosen to only put the
> >> >> magic number at the end of the file.
> >> >>
> >> >> It's also not possible to put the file metadata "outside" the stream
> >> >> file. One thing that occurs to me is whether we could enable the file
> >> >> footer metadata to live in a "sidecar" file to support this use case.
> >> >> To enable this, we would have to add a new optional field to Footer
> >> >> in File.fbs that indicates the file path that the Footer references.
> >> >> This would be null when the footer is part of the same file where
> >> >> the data lives. A function could be implemented to produce this
> >> >> "sidecar index" file from a stream file.
> >> >>
> >> >> Not sure on others' thoughts about this.
> >> >>
> >> >> Thanks,
> >> >> Wes
> >> >>
> >> >>
> >> >> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis <[email protected]> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > I'm interested in a use case where there is a long running job
> >> >> > producing results as it goes that may die and therefore must be
> >> >> > restarted, making sure to continue from the last known-good point.
> >> >> >
> >> >> > For this use case, it seems best to use the "IPC Streaming Format"
> >> >> > and write out the batches as they are generated.
> >> >> >
> >> >> > However, once the job is finished it would also be beneficial to
> >> >> > have random access into the file. It seems like this is possible by:
> >> >> >
> >> >> > 1. Manually creating a file with the correct magic number/padding
> >> >> >    bytes and then seeking past them.
> >> >> > 2. Writing batches out as they appear.
> >> >> > 3. Doing a pass over the record batches to gather the information
> >> >> >    required to generate the footer data.
> >> >> >
> >> >> >
> >> >> > Whilst this seems possible, it doesn't seem like it is a use case
> >> >> > that has come up before. However, this does surprise me because
> >> >> > adding index information to a "completed" file seems like a
> >> >> > genuinely useful thing to want to do.
> >> >> >
> >> >> > Has anyone encountered something similar before?
> >> >> >
> >> >> > Is there an easier way to achieve this? i.e. does this
> >> >> > functionality, or parts of, exist in another language that I can
> >> >> > bind to in Python?
> >> >> >
> >> >> > Best,
> >> >> >
> >> >> > Sam
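
For anyone skimming the thread, the "write batches as they appear, then do one pass to gather index information" workflow, together with the "sidecar index" and "resumption point" ideas, can be illustrated with a small Python sketch. This is only a toy under loud assumptions: the framing (a 4-byte little-endian length followed by the payload) is made up for illustration and is NOT the Arrow IPC encoding, and the helper names (append_batch, build_sidecar_index, resumption_offset, read_batch) are hypothetical, not part of pyarrow or any Arrow library.

```python
import io
import struct

# Toy framing, NOT the Arrow IPC encoding: each "batch" is a 4-byte
# little-endian length prefix followed by that many payload bytes.
HEADER = struct.Struct("<I")


def append_batch(stream, payload):
    """Append one framed batch at the end of the stream."""
    stream.seek(0, io.SEEK_END)
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def build_sidecar_index(stream):
    """Scan the stream once; return [(payload_offset, payload_length), ...].

    Only complete batches are indexed. A truncated trailing batch (for
    example, left behind by a job that died mid-write) is ignored.
    """
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    index = []
    pos = 0
    while pos + HEADER.size <= end:
        stream.seek(pos)
        (length,) = HEADER.unpack(stream.read(HEADER.size))
        body_start = pos + HEADER.size
        if body_start + length > end:
            break  # incomplete final batch: stop before it
        index.append((body_start, length))
        pos = body_start + length
    return index


def resumption_offset(index):
    """Byte offset just past the last complete batch (0 if there is none)."""
    if not index:
        return 0
    offset, length = index[-1]
    return offset + length


def read_batch(stream, index, i):
    """Random access: read batch i using the index, without scanning."""
    offset, length = index[i]
    stream.seek(offset)
    return stream.read(length)


# Usage: three complete batches, then a simulated crash mid-write.
stream = io.BytesIO()
for payload in (b"batch-0", b"batch-1", b"batch-2"):
    append_batch(stream, payload)
stream.write(HEADER.pack(100) + b"partial")  # header claims 100 bytes, only 7 written

index = build_sidecar_index(stream)
second = read_batch(stream, index, 1)  # random access to the second batch
```

A restarted job could truncate the stream at resumption_offset(index) and continue appending from there. In a real Arrow stream the scan would have to decode each message's metadata to find the body length, and dictionary batches would need their own handling, as noted earlier in the thread.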
