[ https://issues.apache.org/jira/browse/ARROW-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534457#comment-17534457 ]
Weston Pace commented on ARROW-16518: ------------------------------------- I doubt anyone expects or wants their data to be scrambled but many users want maximum performance and can tolerate it. Yes, scrambling within batches will be possible. I'm pretty sure you can observe it today if you have batches with more than 128Ki rows. ARROW-10883 is related. There has been discussion of this on the ML. Maintaining ordering in a plan is definitely possible but not simple and I don't know that anyone is working on it. The way this works in {{ToTable}} is we do all the sorting on either end of the execution engine. We attach fragment & batch indices (as columns) to the batches before we put them into the plan. Then, after pulling the data out of the plan, we sort the data by these two columns and then drop the two columns before returning the table. > [Python] Ensure _exec_plan.execplan preserves order of inputs > ------------------------------------------------------------- > > Key: ARROW-16518 > URL: https://issues.apache.org/jira/browse/ARROW-16518 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python > Reporter: Alessandro Molina > Priority: Major > Fix For: 9.0.0 > > > At the moment execplan doesn't guarantee any ordered output, the batches are > consumed in a random order. This can lead to unordered rows in outputs when > {{use_threads=True}} > For example providing a column with {{b=[a, a, a, a, b, b, b, b]}} will > sometimes give back {{b=[a, b]}} and sometimes {{b=[b, a]}} > See > {code:java} > In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4}) > In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4}) > In [20]: table = pa.concat_tables([table1, table2]) > In [21]: ep._filter_table(table, pc.field('a') == 1) > Out[21]: > pyarrow.Table > a: int64 > b: string > ---- > a: [[1],[1]] > b: [["b"],["a"]] > In [22]: ep._filter_table(table, pc.field('a') == 1) > Out[22]: > pyarrow.Table > a: int64 > b: string > ---- > a: [[1],[1]] > b: [["a"],["b"]] {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)