[ https://issues.apache.org/jira/browse/ARROW-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470794#comment-17470794 ]

Weston Pace commented on ARROW-15265:
-------------------------------------

Thanks for figuring that out [~lidavidm].  Adding an async version of 
DeleteDirContents should probably allow us to fix this.  Fortunately, 
DatasetWriter was already written expecting these calls to become async, so it 
won't be too hard to plug it in.
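
For context, here is a minimal sketch of what such an API could look like. 
The helper name and the default implementation (submitting the blocking call 
to the filesystem's I/O executor and exposing completion through a Future) 
are assumptions for illustration, not the actual Arrow code:

{code:cpp}
// Hypothetical sketch, not the actual Arrow API: wrap the blocking
// FileSystem::DeleteDirContents in a Future<> so callers can chain on
// completion instead of blocking a thread while waiting on S3.
#include <memory>
#include <string>
#include <utility>

#include <arrow/filesystem/filesystem.h>
#include <arrow/util/future.h>
#include <arrow/util/thread_pool.h>

arrow::Future<> DeleteDirContentsAsync(
    std::shared_ptr<arrow::fs::FileSystem> fs, std::string path) {
  // Run the blocking delete on the filesystem's I/O executor and surface
  // completion through a Future rather than blocking the caller.
  return arrow::DeferNotOk(fs->io_context().executor()->Submit(
      [fs, path = std::move(path)] { return fs->DeleteDirContents(path); }));
}
{code}

The point of returning a Future<> is that DatasetWriter could chain the 
delete into its existing async write pipeline instead of blocking inside an 
I/O-pool task, so deleting many partition directories presumably no longer 
ties up every I/O thread.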


> [C++][Python][Dataset] write_dataset with delete_matching hangs when the 
> number of partitions is too large
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15265
>                 URL: https://issues.apache.org/jira/browse/ARROW-15265
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Caleb Overman
>            Priority: Major
>
> I'm attempting to use the {{existing_data_behavior="delete_matching"}} 
> option with {{ds.write_dataset}} to write a Hive-partitioned Parquet 
> dataset to S3. This works fine when the table being written creates 7 or 
> fewer partitions, but as soon as the partition column in the table has an 
> 8th unique value the write hangs completely.
>  
> {code:python}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
> bucket = "my-bucket"
> s3 = fs.S3FileSystem()
> cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
> table_7 = pa.table(
>     {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
> )
> # succeeds
> ds.write_dataset(
>     data=table_7,
>     base_dir=f"{bucket}/test7.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
> table_8 = pa.table(
>     {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
> )
> # this hangs
> ds.write_dataset(
>     data=table_8,
>     base_dir=f"{bucket}/test8.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> {code}
> For the dataset with 8 partitions, the directory structure is created in 
> S3, but no actual files are written before the hang.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
