On Wed, Oct 7, 2020 at 9:33 PM Jonathan Yu <jonathan.i...@gmail.com> wrote:
>
> Hello there,
>
> I am using Arrow to store data on disk temporarily, so disk space is not a
> problem (I understand that Parquet is preferable for more efficient disk
> storage). It seems that Arrow's memory mapping/zero copy capabilities would
> provide better performance given this use case.
>
> Here are my questions:
>
> 1. For new applications, should we prefer the pa.ipc.new_file interface over
> write_feather? My understanding from reading [0] is that
> pa.feather.write_feather is an API provided for backward compatibility, and
> with compression disabled, it seems to produce files of the same size (the
> files appear to be identical) as the RecordBatchFileWriter.
>
You can use either; neither API is deprecated, and there are no plans to
deprecate either of them.

> 2. Does compression affect the need to make copies? I imagine that
> compressing the file means that the code to use the file cannot be zero-copy
> anymore.
>
Right, when using compression, zero copy is by definition not possible.

> 3. When using pandas to analyze the data, is there a way to load the data
> using memory mapping, and if so, would this be expected to improve
> deserialization performance and memory utilization if multiple processes are
> reading the same table data simultaneously? Assume that I'm running on a
> modern server-class SSD.
>
No, pandas doesn't support memory mapping.

> Thank you!
>
> Jonathan
>
> [0] https://arrow.apache.org/faq/#what-about-the-feather-file-format