I can't speak to the intent. Adding a feather.write_table version
(equivalent to feather.read_table) seems like it would be reasonable.
> Is the best way around this to do the following?
What you have written does not work for me. This slightly different
version does:
```python3
import pyarrow as pa
import pyarrow._feather as _feather
table = pa.Table.from_pandas(df)
_feather.write_feather(table, '/tmp/foo.feather',
                       compression=compression,
                       compression_level=compression_level,
                       chunksize=chunksize, version=version)
```
I'm not sure it's great practice to rely on pyarrow._feather, though,
as it is meant to be internal and is subject to change without much
consideration.
You might want to consider using the newer IPC API, which should be
equivalent (write_feather indirectly uses a RecordBatchFileWriter
under the hood, although it is buried in the C++ [1]). A complete
example:
```python3
import pandas as pd
import pyarrow as pa
import pyarrow.ipc
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
table = pa.Table.from_pandas(df)
compression = None
options = pyarrow.ipc.IpcWriteOptions()
options.compression = compression
writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo2.feather',
                                           schema=table.schema,
                                           options=options)
writer.write_table(table)
writer.close()
```
If you need chunks it is slightly more work:
```python3
options = pyarrow.ipc.IpcWriteOptions()
options.compression = compression
writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo3.feather',
                                           schema=table.schema,
                                           options=options)
batches = table.to_batches(max_chunksize=chunksize)
for batch in batches:
    writer.write_batch(batch)
writer.close()
```
All three versions should be readable by pyarrow.feather.read_feather
and should yield the exact same dataframe.
[1]
https://github.com/apache/arrow/blob/81ff679c47754692224f655dab32cc0936bb5f55/cpp/src/arrow/ipc/feather.cc#L796
On Tue, Jul 13, 2021 at 7:06 AM Arun Joseph <[email protected]> wrote:
>
> Hi,
>
> I've noticed that if I pass a pandas dataframe to write_feather (hyperlink to
> relevant part of code), it will automatically drop the index. Was this
> behavior intentionally chosen to only drop the index and not to allow the
> user to specify? I assumed the behavior would match the default behavior of
> converting from a pandas dataframe to an arrow table as mentioned in the docs.
>
> Is the best way around this to do the following?
>
> ```python3
> import pyarrow.lib as ext
> from pyarrow.lib import Table
>
> table = Table.from_pandas(df)
> ext.write_feather(table, dest,
>                   compression=compression,
>                   compression_level=compression_level,
>                   chunksize=chunksize, version=version)
> ```
> Thank You,
> --
> Arun Joseph
>