[ https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621068#comment-17621068 ]

Alenka Frim commented on ARROW-17200:
-------------------------------------

This should be possible.

When converting a pandas DataFrame into an Arrow table, the levels of a 
multi-index are converted into columns. These columns can then be supplied as 
{{partition_cols}} when writing Parquet files into partitions. Also, looking at 
the pandas codebase, the correct method is selected if {{partition_cols}} are 
supplied:

[https://github.com/pandas-dev/pandas/blob/56d82a9bd654e91d14596e82e4d9c82215fa5bc8/pandas/io/parquet.py#L195-L209]

which is {{write_to_dataset}}. A working example:

{code:python}
import pandas as pd
import numpy as np

# Creating a dataframe with MultiIndex
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(data={'randn': np.random.randn(8)}, index=index)

# writing to a partitioned dataset
df.to_parquet(path='dataset_name', partition_cols=["first", "second"])

# inspecting the pieces
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dataset_name', use_legacy_dataset=False)
dataset.fragments
# [<pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=bar/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=bar]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=baz/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=baz]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=foo/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=foo]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=one/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=one, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=one/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=one, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=two/91796db98d874ef2b170f22a304a7c5e-0.parquet partition=[second=two, first=qux]>,
#  <pyarrow.dataset.ParquetFileFragment path=dataset_name/first=qux/second=two/c1daafbac5334b9ea230821de383cb93-0.parquet partition=[second=two, first=qux]>]

{code}
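The index-to-columns conversion mentioned above can also be checked directly 
with {{Table.from_pandas}}. A small sketch (the frame is a shortened version of 
the example, not data from the issue):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# Shortened version of the example frame above
arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])
df = pd.DataFrame({"randn": np.random.randn(4)}, index=index)

# from_pandas stores each named index level as a regular column,
# appended after the data columns
table = pa.Table.from_pandas(df)
print(table.column_names)  # ['randn', 'first', 'second']
{code}

Since the levels end up as ordinary columns of the table, they are valid 
targets for {{partition_cols}}.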

> [Python][Parquet] support partitioning by Pandas DataFrame index
> ----------------------------------------------------------------
>
>                 Key: ARROW-17200
>                 URL: https://issues.apache.org/jira/browse/ARROW-17200
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Parquet, Python
>            Reporter: Gregory Werbin
>            Priority: Minor
>
> In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" 
> index level, one might want to partition by that index level when saving the 
> data frame to Parquet format. This is currently not possible; you need to 
> manually reset the index before writing, and re-add the index after reading. 
> It would be very useful if you could supply the name of an index level to 
> {{partition_cols}} instead of (or ideally in addition to) a data column name.
>
> I originally posted this on the Pandas issue tracker 
> ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke 
> looked at the code and figured out that the partitioning functionality was 
> implemented entirely in PyArrow, and that the change would need to happen 
> within PyArrow itself.
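
The manual workaround described in the report can be sketched as follows (the 
path {{workaround_dataset}} and the small frame are illustrative, not from the 
original issue):

{code:python}
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])
df = pd.DataFrame({"randn": np.random.randn(4)}, index=index)

# Workaround step 1: turn the index levels into columns before writing ...
df.reset_index().to_parquet("workaround_dataset",
                            partition_cols=["first"], index=False)

# ... step 2: re-add the multi-index after reading (partition values come
# back as dictionary-encoded/categorical columns, and row order may differ)
roundtrip = (pd.read_parquet("workaround_dataset")
               .set_index(["first", "second"])
               .sort_index())
{code}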



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
