Daniel Friar created ARROW-16506: ------------------------------------ Summary: Pyarrow 8.0.0 write_dataset writes data in different order with {{use_threads=True}} Key: ARROW-16506 URL: https://issues.apache.org/jira/browse/ARROW-16506 Project: Apache Arrow Issue Type: Bug Reporter: Daniel Friar
In the latest (8.0.0) release the following code snippet seems to write out data in a different order for each of the partitions when {{use_threads=True}} vs when {{{}use_threads=False{}}}. Testing the same snippet with pyarrow gives the same order regardless of whether {{use_threads}} is set to True when the data is writen. {code:java} import itertools import numpy as np import pyarrow.dataset as ds import pyarrow as pa n_rows, n_cols = 100_000, 20 def create_dataframe(color, year): arr = np.random.randn(n_rows, n_cols) df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)]) df["color"] = color df["year"] = year df["id"] = np.arange(len(df)) return df partitions = ["red", "green", "blue"] years = [2011, 2012, 2013] dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)] df = pd.concat(dataframes) table = pa.Table.from_pandas(df=df) ds.write_dataset( table, "./test", format="parquet", max_rows_per_group=1_000_000, min_rows_per_group=1_000_000, existing_data_behavior="overwrite_or_ignore", partitioning=ds.partitioning(pa.schema([ ("color", pa.string()), ("year", pa.int64()) ]), flavor="hive"), use_threads=True, ) df_read = pd.read_parquet("./test/color=blue/year=2012") df_read.head()[["id"]] {code} Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0. -- This message was sent by Atlassian Jira (v8.20.7#820007)