cool, that's good to know. I guess for now I'll just use the older method
until support for compression_level is exposed. I do have an unrelated
question:

Is there a way to reduce the memory overhead when loading a compressed
feather file? I believe right now I decompress the file and then load the
entire thing into memory, and I'm not sure whether chunking is applicable
here. I've read this article [1] from a couple of years back. Would the
right approach be to use pyarrow.ipc.RecordBatchStreamReader to read a
file that was written with chunks and skip the chunks that contain series
I don't care about? However, would that even reduce the memory footprint
if the file was compressed in the first place? Or is the compression
applied on a per-chunk basis?

[1] https://wesmckinney.com/blog/arrow-streaming-columnar/
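To make the question concrete, here is the kind of incremental read I have
in mind (just a sketch, not tested: I'm assuming the /tmp/foo3.feather
file written by your chunked example below, and I used the file reader
rather than the stream reader since that file was written with
RecordBatchFileWriter):

```python3
import pyarrow as pa
import pyarrow.ipc

# Memory-map the file so unread batches stay on disk (compressed
# buffers would presumably still be decompressed batch by batch).
with pa.memory_map('/tmp/foo3.feather', 'r') as source:
    reader = pa.ipc.open_file(source)
    for i in range(reader.num_record_batches):
        # Read one record batch (chunk) at a time instead of the whole table.
        batch = reader.get_batch(i)
        # Pick out just the series I care about, by position;
        # batch.schema.get_field_index('a') maps a name to a position.
        series = batch.column(batch.schema.get_field_index('a'))
        # ... process `series`, then drop it before the next iteration ...
```

Would something like this keep only one decompressed chunk in memory at a
time?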
On Tue, Jul 13, 2021 at 5:26 PM Weston Pace <[email protected]> wrote:

> Ah, good catch. Looks like this is missing[1]. The default compression
> level for zstd is 1.
>
> [1] https://issues.apache.org/jira/browse/ARROW-13091
>
> On Tue, Jul 13, 2021 at 10:39 AM Arun Joseph <[email protected]> wrote:
>
>> The IPC API seems to work for the most part; however, is there a way
>> to specify the compression level with IpcWriteOptions? It doesn't seem
>> to be exposed. I'm currently using zstd, so I'm not sure what level it
>> defaults to otherwise.
>> Additionally, should I be enabling the allow_64bit bool? I have
>> nanosecond timestamps which would be truncated if this option acts the
>> way I think it does.
>>
>> ```
>> """
>> Serialization options for the IPC format.
>>
>> Parameters
>> ----------
>> metadata_version : MetadataVersion, default MetadataVersion.V5
>>     The metadata version to write. V5 is the current and latest,
>>     V4 is the pre-1.0 metadata version (with incompatible Union layout).
>> allow_64bit : bool, default False
>>     If true, allow field lengths that don't fit in a signed 32-bit int.
>> use_legacy_format : bool, default False
>>     Whether to use the pre-Arrow 0.15 IPC format.
>> compression : str or None
>>     If not None, compression codec to use for record batch buffers.
>>     May only be "lz4", "zstd" or None.
>> use_threads : bool
>>     Whether to use the global CPU thread pool to parallelize any
>>     computational tasks like compression.
>> emit_dictionary_deltas : bool
>>     Whether to emit dictionary deltas. Default is false for maximum
>>     stream compatibility.
>> """
>> ```
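>> For reference, this is roughly what I'm writing today (a sketch; the
>> path and table are stand-ins, and there is nowhere to pass a
>> compression level):
>>
>> ```python3
>> import pyarrow as pa
>> import pyarrow.ipc
>>
>> table = pa.table({'a': [1, 2, 3]})  # stand-in for my real table
>>
>> # compression may be "lz4", "zstd", or None; no compression_level here.
>> options = pa.ipc.IpcWriteOptions(compression='zstd')
>> writer = pa.ipc.RecordBatchFileWriter('/tmp/data.feather',
>>                                       schema=table.schema,
>>                                       options=options)
>> writer.write_table(table)
>> writer.close()
>> ```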
>> On Tue, Jul 13, 2021 at 2:41 PM Weston Pace <[email protected]>
>> wrote:
>>
>>> I can't speak to the intent. Adding a feather.write_table version
>>> (equivalent to feather.read_table) seems like it would be reasonable.
>>>
>>> > Is the best way around this to do the following?
>>>
>>> What you have written does not work for me. This slightly different
>>> version does:
>>>
>>> ```python3
>>> import pyarrow as pa
>>> import pyarrow._feather as _feather
>>>
>>> table = pa.Table.from_pandas(df)
>>> _feather.write_feather(table, '/tmp/foo.feather',
>>>                        compression=compression,
>>>                        compression_level=compression_level,
>>>                        chunksize=chunksize, version=version)
>>> ```
>>>
>>> I'm not sure it's a great practice to be relying on pyarrow._feather,
>>> though, as it is meant to be internal and subject to change without
>>> much consideration.
>>>
>>> You might want to consider using the newer IPC API, which should be
>>> equivalent (write_feather is indirectly using a RecordBatchFileWriter
>>> under the hood, although it is buried in the C++[1]). A complete
>>> example:
>>>
>>> ```python3
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.ipc
>>>
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>>> table = pa.Table.from_pandas(df)  # the writer needs a Table, not a DataFrame
>>> compression = None
>>>
>>> options = pyarrow.ipc.IpcWriteOptions()
>>> options.compression = compression
>>> writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo2.feather',
>>>                                            schema=table.schema,
>>>                                            options=options)
>>> writer.write_table(table)
>>> writer.close()
>>> ```
>>>
>>> If you need chunks it is slightly more work:
>>>
>>> ```python3
>>> # table and options as above; chunksize is the max rows per batch
>>> options = pyarrow.ipc.IpcWriteOptions()
>>> options.compression = compression
>>> writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo3.feather',
>>>                                            schema=table.schema,
>>>                                            options=options)
>>> batches = table.to_batches(chunksize)
>>> for batch in batches:
>>>     writer.write_batch(batch)
>>> writer.close()
>>> ```
>>>
>>> All three versions should be readable by pyarrow.feather.read_feather
>>> and should yield the exact same dataframe.
>>>
>>> [1] https://github.com/apache/arrow/blob/81ff679c47754692224f655dab32cc0936bb5f55/cpp/src/arrow/ipc/feather.cc#L796
>>>
>>> On Tue, Jul 13, 2021 at 7:06 AM Arun Joseph <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I've noticed that if I pass a pandas dataframe to write_feather
>>> > (hyperlink to relevant part of code), it will automatically drop
>>> > the index. Was this behavior intentionally chosen to only drop the
>>> > index and not to allow the user to specify? I assumed the behavior
>>> > would match the default behavior of converting from a pandas
>>> > dataframe to an arrow table as mentioned in the docs.
>>> >
>>> > Is the best way around this to do the following?
>>> >
>>> > ```python3
>>> > import pyarrow.lib as ext
>>> > from pyarrow.lib import Table
>>> >
>>> > table = Table.from_pandas(df)
>>> > ext.write_feather(table, dest,
>>> >                   compression=compression,
>>> >                   compression_level=compression_level,
>>> >                   chunksize=chunksize, version=version)
>>> > ```
>>> >
>>> > Thank You,
>>> > --
>>> > Arun Joseph

--
Arun Joseph
