[ https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621068#comment-17621068 ]
Alenka Frim commented on ARROW-17200:
-------------------------------------

This should be possible. When a pandas DataFrame is converted into an Arrow table, the multi-index is turned into columns. These columns can then be passed as {{partition_cols}} when writing the Parquet files into partitions. Looking at the pandas codebase, the correct method, {{write_to_dataset}}, is selected when {{partition_cols}} is supplied: [https://github.com/pandas-dev/pandas/blob/56d82a9bd654e91d14596e82e4d9c82215fa5bc8/pandas/io/parquet.py#L195-L209]

A working example:

{code:python}
import pandas as pd
import numpy as np

# Creating a dataframe with a MultiIndex
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(data={'randn': np.random.randn(8)}, index=index)

# Writing to a partitioned dataset
df.to_parquet(path='dataset_name', partition_cols=["first", "second"])

# Inspecting the pieces
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dataset_name', use_legacy_dataset=False)
dataset.fragments
# [<pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=qux]>]
{code}

> [Python][Parquet] support partitioning by Pandas DataFrame index
> ----------------------------------------------------------------
>
>                 Key: ARROW-17200
>                 URL: https://issues.apache.org/jira/browse/ARROW-17200
>             Project: Apache Arrow
>          Issue Type: New Feature
>
>          Components: Parquet, Python
>            Reporter: Gregory Werbin
>            Priority: Minor
>
> In a Pandas {{DataFrame}} with a multi-index whose "outer" index level varies slowly, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible; you need to manually reset the index before writing and re-add the index after reading. It would be very useful if you could supply the name of an index level to {{partition_cols}} instead of (or, ideally, in addition to) a data column name.
>
> I originally posted this on the Pandas issue tracker ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
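For reference, the workaround described in the issue (reset the index before writing, re-add it after reading) can be sketched as below. This is a minimal sketch, not the proposed feature; the temporary dataset path is arbitrary, and note that partition columns typically come back as dictionary-encoded (categorical) values and row order is not guaranteed after a round trip.

{code:python}
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small DataFrame with a two-level MultiIndex, as in the example above.
arrays = [
    ["bar", "bar", "baz", "baz"],
    ["one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"randn": np.random.randn(4)}, index=index)

# Arbitrary temporary location for the partitioned dataset.
path = os.path.join(tempfile.mkdtemp(), "dataset_name")

# Workaround step 1: move the index levels into ordinary columns before
# writing, so they can be used as partition columns.
df.reset_index().to_parquet(path, index=False,
                            partition_cols=["first", "second"])

# Workaround step 2: after reading, rebuild the MultiIndex from the
# partition columns (which may come back as categoricals).
df2 = pd.read_parquet(path).set_index(["first", "second"])
{code}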