[ 
https://issues.apache.org/jira/browse/ARROW-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534457#comment-17534457
 ] 

Weston Pace commented on ARROW-16518:
-------------------------------------

I doubt anyone expects or wants their data to be scrambled, but many users want 
maximum performance and can tolerate it.

Yes, scrambling within batches will be possible.  I'm pretty sure you can 
observe it today if you have batches with more than 128Ki rows.

ARROW-10883 is related.  There has been discussion of this on the mailing 
list.  Maintaining ordering in a plan is definitely possible but not simple, 
and I don't know that anyone is working on it.

The way this works in {{ToTable}} is that we handle ordering on either side of 
the execution engine.  We attach fragment and batch indices (as columns) to the 
batches before we put them into the plan.  Then, after pulling the data out of 
the plan, we sort the data by these two columns and drop them before returning 
the table.
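
To make the tag / run / sort / drop steps concrete, here is a rough pure-Python 
sketch.  It is not the Arrow implementation: the column names, the helper 
functions and the shuffled "plan" are all made up for illustration, and batches 
are modelled as plain lists of row dicts rather than record batches.

```python
import random

# Hypothetical names for the two bookkeeping columns (not Arrow's real names).
FRAGMENT_COL = "__fragment_index"
BATCH_COL = "__batch_index"


def attach_indices(fragments):
    """Tag every row with the fragment and batch it came from."""
    batches = []
    for frag_idx, fragment in enumerate(fragments):
        for batch_idx, batch in enumerate(fragment):
            batches.append(
                [{**row, FRAGMENT_COL: frag_idx, BATCH_COL: batch_idx} for row in batch]
            )
    return batches


def run_unordered_plan(batches, predicate, seed=None):
    """Stand-in for the exec plan: filter each batch, emit batches in arbitrary order."""
    filtered = [[row for row in batch if predicate(row)] for batch in batches]
    random.Random(seed).shuffle(filtered)
    return filtered


def restore_order(batches):
    """Sort rows by the two index columns, then drop those columns."""
    rows = [row for batch in batches for row in batch]
    # list.sort is stable, so rows keep the order they had inside each batch.
    rows.sort(key=lambda r: (r[FRAGMENT_COL], r[BATCH_COL]))
    return [
        {k: v for k, v in r.items() if k not in (FRAGMENT_COL, BATCH_COL)}
        for r in rows
    ]


# Two fragments with one batch each, mirroring the example in the issue.
fragments = [
    [[{"a": i, "b": "a"} for i in (1, 2, 3, 4)]],
    [[{"a": i, "b": "b"} for i in (1, 2, 3, 4)]],
]

tagged = attach_indices(fragments)
scrambled = run_unordered_plan(tagged, lambda row: row["a"] == 1)
result = restore_order(scrambled)  # same order no matter how the plan shuffled
```

Note that sorting by these two columns only restores the order of batches.  If 
rows are scrambled within a batch, a per-row index would be needed as well.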


> [Python] Ensure _exec_plan.execplan preserves order of inputs
> -------------------------------------------------------------
>
>                 Key: ARROW-16518
>                 URL: https://issues.apache.org/jira/browse/ARROW-16518
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Python
>            Reporter: Alessandro Molina
>            Priority: Major
>             Fix For: 9.0.0
>
>
> At the moment execplan doesn't guarantee any ordered output; the batches are 
> consumed in a random order. This can lead to unordered rows in outputs when 
> {{use_threads=True}}.
> For example, providing a column with {{b=[a, a, a, a, b, b, b, b]}} will 
> sometimes give back {{b=[a, b]}} and sometimes {{b=[b, a]}}.
> See:
> {code:python}
> In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
> In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
> In [20]: table = pa.concat_tables([table1, table2])
> In [21]: ep._filter_table(table, pc.field('a') == 1)
> Out[21]: 
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1],[1]]
> b: [["b"],["a"]]
> In [22]: ep._filter_table(table, pc.field('a') == 1)
> Out[22]: 
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1],[1]]
> b: [["a"],["b"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
