On Wed, Oct 7, 2020 at 9:33 PM Jonathan Yu <jonathan.i...@gmail.com> wrote:
>
> Hello there,
>
> I am using Arrow to store data on disk temporarily, so disk space is not a 
> problem (I understand that Parquet is preferable for more efficient disk 
> storage). It seems that Arrow's memory mapping/zero copy capabilities would 
> provide better performance given this use case.
>
> Here are my questions:
>
> 1. For new applications, should we prefer the pa.ipc.new_file interface over 
> write_feather? My understanding from reading [0] is that 
> pa.feather.write_feather is an API provided for backward compatibility, and 
> with compression disabled, it seems to produce files of the same size (the 
> files appear to be identical) as the RecordBatchFileWriter.
>

You can use either. Neither API is deprecated, and there are no plans to deprecate either one.

> 2. Does compression affect the need to make copies? I imagine that 
> compressing the file means that the code to use the file cannot be zero-copy 
> anymore.
>

Right. With compression enabled, zero-copy reads are by definition not possible: the buffers have to be decompressed into newly allocated memory before they can be used.

> 3. When using pandas to analyze the data, is there a way to load the data 
> using memory mapping, and if so, would this be expected to improve 
> deserialization performance and memory utilization if multiple processes are 
> reading the same table data simultaneously? Assume that I'm running on a 
> modern server-class SSD.
>

No, pandas doesn't support memory mapping.

> Thank you!
>
> Jonathan
>
> [0] https://arrow.apache.org/faq/#what-about-the-feather-file-format
