That said, the IPC protocol data produced now by `RecordBatchStreamWriter` should be readable in 1.0.0 and beyond. `pyarrow.serialize` is only intended for transient storage. We should add some language to the docstring for this function to explain that it is distinct from the Arrow IPC format (which has a well-defined structure and compatibility guarantees).
https://issues.apache.org/jira/browse/ARROW-6336

On Fri, Aug 23, 2019 at 3:05 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Yevgeni,
>
> I don't think we have ever promised binary stability of the
> pyarrow.serialize() protocol. Binary compatibility starting from 1.0.0
> is about the Arrow in-memory format and the Arrow IPC format (i.e. how
> Arrow arrays, tables... are laid out and how their metadata is encoded
> on the wire).
>
> So I would not recommend using pa.serialize() for storage. If you want
> to store data, you should use a well-known file format (or a combination
> thereof), such as Parquet.
>
> Regards
>
> Antoine.
>
>
> On 23/08/2019 at 07:25, Yevgeni Litvin wrote:
> > In our system we are using Arrow serialization as it showed excellent
> > deserialization speed. However, it seems that we made a mistake by
> > persisting the streams into long-term storage, as the serialized data
> > appears to be incompatible between versions. According to the release
> > notes of 0.14.0, it appears that starting with 1.0.0, binary
> > compatibility will be maintained. My question is whether
> > pyarrow.serialize is also guaranteed to maintain binary compatibility
> > starting with Arrow 1.0, and whether it would be safe to persist its
> > output then (or maybe even starting now, with 0.14)?
> >
> > (From my quick test, 0.13 is not compatible with 0.12 and before,
> > while it is compatible with 0.14.)
> >
> > Thank you,
> >
> > - Yevgeni