[ https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ludwik Bielczynski updated ARROW-7782:
--------------------------------------
    Description: 
One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} with the partition_cols argument. Here is a minimal example which shows the issue:
{code:python}
from pathlib import Path

import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset, read_table

path = Path('/home/user/trials')
file_name = 'local_database.parquet'

df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))
table = Table.from_pandas(df)
write_to_dataset(table,
                 str(path / file_name),
                 partition_cols=['B'])

df_read = read_table(str(path / file_name))
df_read.to_pandas()  # the 'idx' index is gone after the round trip
{code}
The issue is rather important for pandas and dask users.

  was:
One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} with the partition_cols argument. Here is a minimal example which shows the issue:
{code:python}
from pathlib import Path

import pandas as pd
from pyarrow import Table
from pyarrow.parquet import write_to_dataset

path = Path('/home/user/trials')
file_name = 'local_database.parquet'

df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))
table = Table.from_pandas(df)
write_to_dataset(table,
                 str(path / file_name),
                 partition_cols=['B'])
{code}
The issue is rather important for pandas and dask users.


> Losing index information when using write_to_dataset with partition_cols
> -------------------------------------------------------------------------
>
>                 Key: ARROW-7782
>                 URL: https://issues.apache.org/jira/browse/ARROW-7782
>             Project: Apache Arrow
>          Issue Type: Bug
>    Environment: pyarrow==0.15.1
>            Reporter: Ludwik Bielczynski
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
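One possible workaround, pending a fix in pyarrow itself, is to promote the named index to an ordinary column before writing and to restore it after reading back. Below is a minimal sketch, assuming the same placeholder path and example data as the reproduction above:
{code:python}
# Workaround sketch: keep the index through a partitioned write by making it a column.
from pathlib import Path

import pandas as pd
from pyarrow import Table
from pyarrow.parquet import read_table, write_to_dataset

path = Path('/home/user/trials')          # placeholder path, as in the report
file_name = 'local_database.parquet'

df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                  index=pd.Index(['a', 'b', 'c'], name='idx'))

# Promote the named index to a regular column so the partitioned write keeps it.
table = Table.from_pandas(df.reset_index())
write_to_dataset(table, str(path / file_name), partition_cols=['B'])

# Read back and restore the original index explicitly.
df_read = read_table(str(path / file_name)).to_pandas().set_index('idx')
print(df_read)
{code}
With this approach the round-tripped frame keeps the 'idx' labels, at the cost of storing the index as an explicit column in the dataset; row order may differ because the data is regrouped by the partition column.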