[ https://issues.apache.org/jira/browse/ARROW-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-7385:
-----------------------------------------
    Description: 
{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

output_folder = r"C:\scr\tmp"  # raw string so the backslashes are not treated as escapes
weather_df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           "b": [1, 1, 1, 1, 5, 1, 1, 1, 1],
                           "c": ["c1", "c1", "c1", "c10", "c20", "c30", "c1", "c1", "c1"],
                           "d": [32, 32, 32, 32, 32, 32, 32, 32, 32]})
table = pa.Table.from_pandas(weather_df)
pq.write_to_dataset(table, root_path=output_folder, partition_cols=["a", "b", "c"])
{code}
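As a quick way to confirm what {{write_to_dataset}} produced on disk, the snippet below (my addition, not part of the original report) simply walks the output folder and prints every file in the {{a=.../b=.../c=...}} partition directories:

{code}
import os

output_folder = r"C:\scr\tmp"  # same folder as above
for root, _, files in os.walk(output_folder):
    for name in files:
        print(os.path.join(root, name))
{code}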

h1. Works with 1 thread

{code}
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=1, validate_schema=False)
{code}
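A minimal sanity check (my addition) that the single-threaded case produces a usable dataset, given the 9-row frame written above:

{code}
table_back = dataset.read()
print(table_back.num_rows)      # expected: 9, matching the input DataFrame
print(table_back.column_names)  # should include "d" plus the partition columns a, b, c
{code}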

h1. Hangs with 2-6 threads (the exact range varies from run to run)

{code}
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=2, validate_schema=False)
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=6, validate_schema=False)
{code}
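For the hanging cases, one low-overhead way to see where the threads are stuck is the standard-library {{faulthandler}} module; the sketch below (my suggestion, with an arbitrary 30-second limit) dumps every thread's stack trace if the constructor has not returned in time:

{code}
import faulthandler

faulthandler.dump_traceback_later(30, exit=True)  # dump all thread stacks if still blocked after 30 s
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=2, validate_schema=False)
faulthandler.cancel_dump_traceback_later()        # reached only if the call completes
{code}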

h1. Works with 60 threads

{code}
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=60, validate_schema=False)
{code}
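To reproduce across several thread counts without manually killing the interpreter each time, here is a sketch of a harness (my assumption: 30 seconds is ample for this tiny dataset) that opens the dataset in a child process and reports whether the call finished or had to be terminated:

{code}
import multiprocessing as mp

import pyarrow.parquet as pq

OUTPUT_FOLDER = r"C:\scr\tmp"  # the dataset written above

def open_dataset(nthreads):
    pq.ParquetDataset(OUTPUT_FOLDER, metadata_nthreads=nthreads, validate_schema=False)

if __name__ == "__main__":  # guard required for multiprocessing on Windows
    for nthreads in (1, 2, 6, 60):
        proc = mp.Process(target=open_dataset, args=(nthreads,))
        proc.start()
        proc.join(timeout=30)
        if proc.is_alive():
            proc.terminate()
            print(f"metadata_nthreads={nthreads}: hung, killed after 30 s")
        else:
            print(f"metadata_nthreads={nthreads}: ok")
{code}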


> [Python] ParquetDataset deadlock with different metadata_nthreads values
> ------------------------------------------------------------------------
>
>                 Key: ARROW-7385
>                 URL: https://issues.apache.org/jira/browse/ARROW-7385
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1, 0.14.1, 0.15.1
>            Reporter: Chongkai Zhu
>            Priority: Major
>              Labels: parquet
>


