Ah, good catch.  Looks like this is missing[1].  The default compression
level for zstd is 1.
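
For what it's worth, once that issue is resolved, the natural spelling
would presumably be to pass a Codec with an explicit level rather than a
plain string. A minimal sketch, assuming the Codec-based option lands
roughly as proposed (the compression_level handling here is my
assumption, not the current API):

```python3
import pyarrow as pa
import pyarrow.ipc

# Assumption: IpcWriteOptions accepts a pa.Codec once ARROW-13091 is
# resolved; pa.Codec pairs a codec name with an explicit level.
codec = pa.Codec('zstd', compression_level=9)
options = pa.ipc.IpcWriteOptions(compression=codec)
```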

[1] https://issues.apache.org/jira/browse/ARROW-13091

On Tue, Jul 13, 2021 at 10:39 AM Arun Joseph <[email protected]> wrote:

> The IPC API seems to work for the most part; however, is there a way to
> specify the compression level with IpcWriteOptions? It doesn't seem to
> be exposed. I'm currently using zstd, so I'm not sure what level it
> defaults to otherwise.
> Additionally, should I be enabling the allow_64bit bool? I have
> nanosecond timestamps which would be truncated if this option acts the
> way I think it does.
>
> ```
> """
> Serialization options for the IPC format.
>
> Parameters
> ----------
> metadata_version : MetadataVersion, default MetadataVersion.V5
>     The metadata version to write. V5 is the current and latest,
>     V4 is the pre-1.0 metadata version (with incompatible Union layout).
> allow_64bit : bool, default False
>     If true, allow field lengths that don't fit in a signed 32-bit int.
> use_legacy_format : bool, default False
>     Whether to use the pre-Arrow 0.15 IPC format.
> compression : str or None
>     If not None, compression codec to use for record batch buffers.
>     May only be "lz4", "zstd" or None.
> use_threads : bool
>     Whether to use the global CPU thread pool to parallelize any
>     computational tasks like compression.
> emit_dictionary_deltas : bool
>     Whether to emit dictionary deltas. Default is false for maximum
>     stream compatibility.
> """
> ```
>
>
> On Tue, Jul 13, 2021 at 2:41 PM Weston Pace <[email protected]> wrote:
>
>> I can't speak to the intent.  Adding a feather.write_table version
>> (equivalent to feather.read_table) seems like it would be reasonable.
>>
>> > Is the best way around this to do the following?
>>
>> What you have written does not work for me.  This slightly different
>> version does:
>>
>> ```python3
>> import pyarrow as pa
>> import pyarrow._feather as _feather
>>
>> # df, compression, compression_level, chunksize, and version are
>> # assumed to be defined as in your snippet
>> table = pa.Table.from_pandas(df)
>> _feather.write_feather(table, '/tmp/foo.feather',
>>                        compression=compression,
>>                        compression_level=compression_level,
>>                        chunksize=chunksize, version=version)
>> ```
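>>
>> If a public feather.write_table helper did exist, it could presumably
>> be a thin wrapper over that same call. A hypothetical sketch (the
>> write_table name is made up here, not an existing API):
>>
>> ```python3
>> import pyarrow._feather as _feather
>>
>> def write_table(table, dest, compression=None, compression_level=None,
>>                 chunksize=None, version=2):
>>     """Hypothetical Table counterpart to feather.read_table."""
>>     # Delegates to the same internal call as the snippet above.
>>     _feather.write_feather(table, dest, compression=compression,
>>                            compression_level=compression_level,
>>                            chunksize=chunksize, version=version)
>> ```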
>>
>> I'm not sure it's a great practice to be relying on pyarrow._feather,
>> though, as it is meant to be internal and subject to change without
>> much consideration.
>>
>> You might want to consider using the newer IPC API, which should be
>> equivalent (write_feather indirectly uses a RecordBatchFileWriter
>> under the hood, although it is buried in the C++ [1]).  A complete
>> example:
>>
>> ```python3
>> import pandas as pd
>> import pyarrow as pa
>> import pyarrow.ipc
>>
>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>> table = pa.Table.from_pandas(df)
>> compression = None
>>
>> options = pyarrow.ipc.IpcWriteOptions()
>> options.compression = compression
>> writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo2.feather',
>>                                            schema=table.schema,
>>                                            options=options)
>> writer.write_table(table)
>> writer.close()
>> ```
>>
>> If you need chunks it is slightly more work:
>>
>> ```python3
>> # table and compression as in the previous example; chunksize is the
>> # desired number of rows per record batch
>> options = pyarrow.ipc.IpcWriteOptions()
>> options.compression = compression
>> writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo3.feather',
>>                                            schema=table.schema,
>>                                            options=options)
>> batches = table.to_batches(chunksize)
>> for batch in batches:
>>     writer.write_batch(batch)
>> writer.close()
>> ```
>>
>> All three versions should be readable by pyarrow.feather.read_feather
>> and should yield the exact same dataframe.
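>>
>> For example, a quick round-trip check (assuming the df and file paths
>> from the snippets above):
>>
>> ```python3
>> import pyarrow.feather as feather
>>
>> for path in ['/tmp/foo.feather', '/tmp/foo2.feather', '/tmp/foo3.feather']:
>>     # read_feather returns a pandas DataFrame
>>     assert feather.read_feather(path).equals(df)
>> ```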
>>
>> [1]
>> https://github.com/apache/arrow/blob/81ff679c47754692224f655dab32cc0936bb5f55/cpp/src/arrow/ipc/feather.cc#L796
>>
>> On Tue, Jul 13, 2021 at 7:06 AM Arun Joseph <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I've noticed that if I pass a pandas dataframe to write_feather
>> > (hyperlink to relevant part of code), it will automatically drop the
>> > index. Was this behavior intentionally chosen, so that the index is
>> > always dropped rather than left up to the user? I assumed the
>> > behavior would match the default behavior of converting from a pandas
>> > dataframe to an arrow table as mentioned in the docs.
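>> >
>> > For reference, the index-preserving conversion I had in mind is just
>> > this (preserve_index is the documented Table.from_pandas parameter):
>> >
>> > ```python3
>> > import pyarrow as pa
>> >
>> > # preserve_index=None (the default) stores a RangeIndex as metadata
>> > # only; preserve_index=True forces the index into a real column.
>> > table = pa.Table.from_pandas(df, preserve_index=True)
>> > ```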
>> >
>> > Is the best way around this to do the following?
>> >
>> > ```python3
>> > import pyarrow.lib as ext
>> > from pyarrow.lib import Table
>> >
>> > table = Table.from_pandas(df)
>> > ext.write_feather(table, dest,
>> >                   compression=compression,
>> >                   compression_level=compression_level,
>> >                   chunksize=chunksize, version=version)
>> > ```
>> > Thank You,
>> > --
>> > Arun Joseph
>> >
>>
>
>
> --
> Arun Joseph
>
>
