[ https://issues.apache.org/jira/browse/ARROW-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999250#comment-16999250 ]
Joris Van den Bossche commented on ARROW-7385:
----------------------------------------------

[~mrmathematica] Thanks for the report! I can confirm this (on Linux, with pyarrow master).

> [Python] ParquetDataset deadlock with different metadata_nthreads values
> ------------------------------------------------------------------------
>
>                 Key: ARROW-7385
>                 URL: https://issues.apache.org/jira/browse/ARROW-7385
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1, 0.14.1, 0.15.1
>            Reporter: Chongkai Zhu
>            Priority: Major
>              Labels: parquet
>
> {code}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Raw string, so that "\t" in the Windows path is not read as a tab
> output_folder = r"C:\scr\tmp"
> weather_df = pd.DataFrame({
>     "a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
>     "b": [1, 1, 1, 1, 5, 1, 1, 1, 1],
>     "c": ["c1", "c1", "c1", "c10", "c20", "c30", "c1", "c1", "c1"],
>     "d": [32, 32, 32, 32, 32, 32, 32, 32, 32],
> })
> table = pa.Table.from_pandas(weather_df)
> pq.write_to_dataset(table, root_path=output_folder,
>                     partition_cols=["a", "b", "c"])
> {code}
> h1. works for 1 thread
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=1,
>                             validate_schema=False)
> {code}
> h1. stuck for 2~6 threads (but it may vary from time to time)
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=2,
>                             validate_schema=False)
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=6,
>                             validate_schema=False)
> {code}
> h1. works for 60 threads
> {code}
> dataset = pq.ParquetDataset(output_folder, metadata_nthreads=60,
>                             validate_schema=False)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
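
Because the hang is reported as intermittent ("it may vary from time to time"), one way to check each metadata_nthreads value without blocking the interactive session is to run the call in a child interpreter with a timeout. The following is only a minimal sketch under stated assumptions: the names `hangs`, `probe` and `TEMPLATE` and the 30-second timeout are illustrative and not part of pyarrow or of the report. The `metadata_nthreads` and `validate_schema` keywords are taken from the report and apply to the affected versions; other releases may not accept them.

```python
import subprocess
import sys


def hangs(snippet, timeout_s=30.0):
    """Run `snippet` in a fresh interpreter; return True if it exceeds `timeout_s`.

    A snippet that fails outright (for example, pyarrow is not installed)
    raises CalledProcessError, so a crash is not mistaken for "no hang".
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return True
    return False


# Hypothetical template: the ParquetDataset call copied from the report.
TEMPLATE = (
    "import pyarrow.parquet as pq\n"
    "pq.ParquetDataset({path!r}, metadata_nthreads={n}, validate_schema=False)\n"
)


def probe(path, thread_counts=(1, 2, 6, 60), timeout_s=30.0):
    """Map each thread count to True if the call did not finish in time."""
    return {
        n: hangs(TEMPLATE.format(path=path, n=n), timeout_s)
        for n in thread_counts
    }
```

Calling `probe(output_folder)` after the dataset has been written returns a dict keyed by thread count. Since the behaviour varies between runs, a single `False` does not show that a given value is safe; repeating the probe several times gives a better picture.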