jorisvandenbossche commented on PR #13075:
URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122058756

   What I observed from trying out this branch is that it does not preserve:
   
   ```
   In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4})
   
   In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4})
   
   In [20]: table = pa.concat_tables([table1, table2])
   
   In [21]: ep._filter_table(table, pc.field('a') == 1)
   Out[21]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["b"],["a"]]
   
   In [22]: ep._filter_table(table, pc.field('a') == 1)
   Out[22]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1],[1]]
   b: [["a"],["b"]]
   ```
   
   But the current cython wrappers of the ExecPlan are not using a "to_table" 
method, it is using `Table::FromRecordBatchReader`, with a record batch 
generator created using `MakeGeneratorReader`, which explicitly says it does 
not preserve order:
   
   
https://github.com/apache/arrow/blob/5a4b3db396f88898917620de50863b6d3a477a7a/cpp/src/arrow/compute/exec/exec_plan.h#L440-L447
   
   I see that we currently use a "sink" node as final output node, but there is 
also a "table_sink" node. That might simplify the code a bit to get a Table 
(without manually creating a record batch reader and creating a table from 
that), but  don't think that would help with the ordering? (at least in the 
code I also don't see any ordering handling for this sink)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to