jorisvandenbossche commented on PR #13075: URL: https://github.com/apache/arrow/pull/13075#issuecomment-1122058756
What I observed from trying out this branch is that it does not preserve: ``` In [18]: table1 = pa.table({'a': [1, 2, 3, 4], 'b': ['a'] * 4}) In [19]: table2 = pa.table({'a': [1, 2, 3, 4], 'b': ['b'] * 4}) In [20]: table = pa.concat_tables([table1, table2]) In [21]: ep._filter_table(table, pc.field('a') == 1) Out[21]: pyarrow.Table a: int64 b: string ---- a: [[1],[1]] b: [["b"],["a"]] In [22]: ep._filter_table(table, pc.field('a') == 1) Out[22]: pyarrow.Table a: int64 b: string ---- a: [[1],[1]] b: [["a"],["b"]] ``` But the current cython wrappers of the ExecPlan are not using a "to_table" method, it is using `Table::FromRecordBatchReader`, with a record batch generator created using `MakeGeneratorReader`, which explicitly says it does not preserve order: https://github.com/apache/arrow/blob/5a4b3db396f88898917620de50863b6d3a477a7a/cpp/src/arrow/compute/exec/exec_plan.h#L440-L447 I see that we currently use a "sink" node as final output node, but there is also a "table_sink" node. That might simplify the code a bit to get a Table (without manually creating a record batch reader and creating a table from that), but don't think that would help with the ordering? (at least in the code I also don't see any ordering handling for this sink) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org