[ https://issues.apache.org/jira/browse/ARROW-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace closed ARROW-16506.
-------------------------------
    Resolution: Duplicate

> Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-16506
>                 URL: https://issues.apache.org/jira/browse/ARROW-16506
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Daniel Friar
>            Priority: Major
>              Labels: dataset, parquet, pyarrow
>
> In the latest (8.0.0) release the following code snippet seems to write out
> data in a different order for each of the partitions when
> {{use_threads=True}} vs when {{use_threads=False}}.
> Testing the same snippet with pyarrow 7.0.0 gives the same order regardless
> of whether {{use_threads}} is set to True when the data is written.
>
> {code:java}
> import itertools
> import numpy as np
> import pandas as pd  # needed for pd.DataFrame, pd.concat and pd.read_parquet
> import pyarrow.dataset as ds
> import pyarrow as pa
>
> n_rows, n_cols = 100_000, 20
>
> def create_dataframe(color, year):
>     arr = np.random.randn(n_rows, n_cols)
>     df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
>     df["color"] = color
>     df["year"] = year
>     # "id" records the original row order within each (color, year) partition
>     df["id"] = np.arange(len(df))
>     return df
>
> partitions = ["red", "green", "blue"]
> years = [2011, 2012, 2013]
> dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
> df = pd.concat(dataframes)
> table = pa.Table.from_pandas(df=df)
>
> ds.write_dataset(
>     table,
>     "./test",
>     format="parquet",
>     max_rows_per_group=1_000_000,
>     min_rows_per_group=1_000_000,
>     existing_data_behavior="overwrite_or_ignore",
>     partitioning=ds.partitioning(pa.schema([
>         ("color", pa.string()),
>         ("year", pa.int64())
>     ]), flavor="hive"),
>     use_threads=True,
> )
>
> # Inspect the first ids of one partition to see whether the order was kept
> df_read = pd.read_parquet("./test/color=blue/year=2012")
> df_read.head()[["id"]]
> {code}
>
> Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)