[ https://issues.apache.org/jira/browse/ARROW-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533996#comment-17533996 ]

Weston Pace commented on ARROW-16506:
-------------------------------------

This is expected behavior at the moment. I'm guessing that at some point between 
7.0.0 and 8.0.0 we switched to reading from a table in a multi-threaded fashion.

Is an ordered write important?  There has been some discussion of adding 
sequencing on the mailing list.  If we have sequencing information on batches 
then we could use that to do an ordered write.

> Pyarrow 8.0.0 write_dataset writes data in different order with 
> use_threads=True
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-16506
>                 URL: https://issues.apache.org/jira/browse/ARROW-16506
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Daniel Friar
>            Priority: Major
>              Labels: dataset, parquet, pyarrow
>
> In the latest (8.0.0) release the following code snippet writes out the data 
> in a different order within each partition when {{use_threads=True}} than 
> when {{use_threads=False}}.
> Testing the same snippet with pyarrow 7.0.0 gives the same order regardless 
> of whether {{use_threads}} is set to True when the data is written.
>  
> {code:python}
> import itertools
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> n_rows, n_cols = 100_000, 20
>
> def create_dataframe(color, year):
>     arr = np.random.randn(n_rows, n_cols)
>     df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
>     df["color"] = color
>     df["year"] = year
>     df["id"] = np.arange(len(df))
>     return df
>
> partitions = ["red", "green", "blue"]
> years = [2011, 2012, 2013]
> dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
> df = pd.concat(dataframes)
> table = pa.Table.from_pandas(df=df)
>
> ds.write_dataset(
>     table,
>     "./test",
>     format="parquet",
>     max_rows_per_group=1_000_000,
>     min_rows_per_group=1_000_000,
>     existing_data_behavior="overwrite_or_ignore",
>     partitioning=ds.partitioning(pa.schema([
>         ("color", pa.string()),
>         ("year", pa.int64())
>     ]), flavor="hive"),
>     use_threads=True,
> )
>
> df_read = pd.read_parquet("./test/color=blue/year=2012")
> df_read.head()[["id"]]
> {code}
>  
> Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
