[ https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829198#comment-16829198 ]

Joris Van den Bossche commented on ARROW-2628:
----------------------------------------------

There is a similar report in ARROW-2709, which had a linked PR: 
https://github.com/apache/arrow/pull/3344. That PR was closed without merging, 
but it contains some relevant discussion.

The current implementation of {{pyarrow.parquet.write_to_dataset}} converts the 
pyarrow Table to a pandas DataFrame, and then uses pandas' groupby method to 
split it into multiple DataFrames (after dropping the partition columns from 
the DataFrame, which makes yet another data copy). Each subset pandas DataFrame 
is then converted back to a pyarrow Table. 
In addition, when using this functionality from pandas' {{to_parquet}}, you 
incur an extra initial conversion of the pandas DataFrame to an Arrow Table. 
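
For concreteness, a rough sketch of that conversion chain 
({{write_to_dataset_sketch}} is a hypothetical helper, not the actual code; 
the real {{write_to_dataset}} also handles schemas, filesystems and metadata 
collection):

{code:python}
import os

import pyarrow as pa
import pyarrow.parquet as pq

def write_to_dataset_sketch(table, root_path, partition_cols):
    # Copy 1: the whole Arrow Table is converted to a pandas DataFrame
    df = table.to_pandas()
    # Copy 2: the partition columns are dropped from the data
    data_df = df.drop(partition_cols, axis='columns')
    partition_keys = [df[col] for col in partition_cols]
    # pandas' groupby splits the frame into per-partition subsets
    for keys, subgroup in data_df.groupby(partition_keys):
        if not isinstance(keys, tuple):
            keys = (keys,)
        subdir = '/'.join(
            f'{name}={value}' for name, value in zip(partition_cols, keys))
        os.makedirs(f'{root_path}/{subdir}', exist_ok=True)
        # Copy 3: each pandas subset is converted back to an Arrow Table
        subtable = pa.Table.from_pandas(subgroup, preserve_index=False)
        pq.write_table(subtable, f'{root_path}/{subdir}/part-0.parquet')
{code}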

This is clearly less than optimal. Some of the copies might be avoidable (e.g. 
it is not clear to me whether {{Table.from_pandas}} always copies data). The 
closed PR tried to circumvent this by using Arrow's dictionary encoding instead 
of pandas' groupby, and then reconstructing the subset Tables based on those 
indices. But ideally, a more Arrow-native solution would be used instead of 
those workarounds.
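
To illustrate the general idea (this is not the PR's actual code), a minimal 
sketch that splits with Arrow compute kernels instead of pandas; it relies on 
{{pyarrow.compute.unique}}, {{equal}} and {{Table.filter}}, which postdate the 
PR, and {{split_table}} is a hypothetical name:

{code:python}
import pyarrow.compute as pc

def split_table(table, col):
    # The distinct partition values (what dictionary encoding would
    # store as its dictionary)
    for value in pc.unique(table[col]):
        # Select this partition's rows with an Arrow kernel,
        # without going through pandas
        mask = pc.equal(table[col], value)
        yield value.as_py(), table.filter(mask)
{code}

Note this scans the full column once per partition value; the PR's 
dictionary-indices approach needs only a single pass.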

To quote [~wesmckinn] from the PR ([github 
comment|https://github.com/apache/arrow/pull/3344#issuecomment-462093173]):

{quote}
I want the proper C++ work completed instead. Here are the steps:

* Struct hash: ARROW-3978
* Integer argsort part of ARROW-1566
* Take function ARROW-772

These are the three things you need to do the groupby-split operation natively 
against an Arrow table. There is a slight complication which is implementing 
"take" against chunked arrays. This could be mitigated by doing the 
groupby-split at the contiguous record batch level
{quote}
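
For reference, a hedged sketch of that groupby-split at the record batch 
level, using kernels that have since landed in pyarrow: {{sort_indices}} in 
the argsort role and {{RecordBatch.take}} to gather rows ({{groupby_split}} is 
a hypothetical name; a single partition column is assumed for brevity):

{code:python}
import pyarrow.compute as pc

def groupby_split(batch, key):
    # "argsort": indices that order the batch by the partition key
    order = pc.sort_indices(batch, sort_keys=[(key, 'ascending')])
    # "take": gather the rows so that equal keys become contiguous
    sorted_batch = batch.take(order)
    keys = sorted_batch.column(key)
    # Scan the sorted key column for run boundaries; each run is one group
    start = 0
    for i in range(1, len(keys) + 1):
        if i == len(keys) or keys[i] != keys[start]:
            yield keys[start].as_py(), sorted_batch.slice(start, i - start)
            start = i
{code}

Working per record batch also sidesteps the chunked-array "take" complication 
mentioned in the quote.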


> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> ----------------------------------------------------------------------
>
>                 Key: ARROW-2628
>                 URL: https://issues.apache.org/jira/browse/ARROW-2628
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
