[ https://issues.apache.org/jira/browse/ARROW-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470794#comment-17470794 ]
Weston Pace commented on ARROW-15265:
-------------------------------------

Thanks for figuring that out [~lidavidm]. Adding an async version of DeleteDirContents should probably allow us to fix this. Fortunately, DatasetWriter already expected these calls to become async, so it won't be too hard to plug it in.

> [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15265
>                 URL: https://issues.apache.org/jira/browse/ARROW-15265
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Caleb Overman
>            Priority: Major
>
> I'm attempting to use the {{existing_data_behavior="delete_matching"}}
> option when using {{ds.write_dataset}} to write a hive-partitioned parquet
> dataset to S3. This works fine when the table being written creates 7 or
> fewer partitions, but as soon as the partition column in the table has an
> 8th unique value the write hangs completely.
>
> {code:java}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
>
> bucket = "my-bucket"
> s3 = fs.S3FileSystem()
>
> cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
> table_7 = pa.table(
>     {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
> )
> # succeeds
> ds.write_dataset(
>     data=table_7,
>     base_dir=f"{bucket}/test7.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
>
> cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
> table_8 = pa.table(
>     {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
> )
> # this hangs
> ds.write_dataset(
>     data=table_8,
>     base_dir=f"{bucket}/test8.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> {code}
> For the file with 8 partitions, the directory structure is created in S3 but
> no actual files are written before hanging.
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
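
To illustrate why an async DeleteDirContents could help, here is a minimal stdlib-only sketch. It is not Arrow's implementation, and the thread does not confirm the root cause. It assumes, purely for illustration, that the hang comes from blocking calls occupying every worker of a fixed-size pool of 8 threads, which would be consistent with the 7-versus-8 threshold reported above. All names here (POOL_SIZE, delete_dir_contents, prepare_blocking, prepare_async, run) are hypothetical stand-ins.

```python
import concurrent.futures as cf
import threading

POOL_SIZE = 8  # assumption: size of the shared I/O thread pool


def delete_dir_contents(path):
    # Hypothetical stand-in for the real filesystem call.
    return f"cleared {path}"


def prepare_blocking(pool, path, start, timeout):
    # Runs on a pool worker and then holds that worker while it waits
    # for a second task queued on the same pool.
    start.wait()
    inner = pool.submit(delete_dir_contents, path)
    return inner.result(timeout=timeout)


def prepare_async(pool, path, start):
    # Runs on a pool worker but returns the inner future immediately,
    # so the worker is free to pick up the queued delete.
    start.wait()
    return pool.submit(delete_dir_contents, path)


def run(n_partitions, blocking, timeout=0.5):
    # `start` holds every outer task until all of them are submitted,
    # which makes the pool occupancy deterministic.
    start = threading.Event()
    paths = [f"col1={i}" for i in range(n_partitions)]
    with cf.ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        if blocking:
            outer = [pool.submit(prepare_blocking, pool, p, start, timeout)
                     for p in paths]
        else:
            outer = [pool.submit(prepare_async, pool, p, start)
                     for p in paths]
        start.set()
        try:
            results = [f.result() for f in outer]
            if not blocking:
                # Wait for the deletes from the caller, not from a worker.
                results = [inner.result() for inner in results]
            return results
        except cf.TimeoutError:
            # Every worker was waiting on work that no worker could run.
            # The timeout only exists so the sketch terminates.
            return None
```

Under these assumptions, `run(7, blocking=True)` completes because one worker stays free to run the queued deletes, `run(8, blocking=True)` stalls until the timeout fires and returns `None`, and `run(8, blocking=False)` completes because no worker waits on another task in the same pool.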