gitmodimo commented on PR #44083:
URL: https://github.com/apache/arrow/pull/44083#issuecomment-2403347394

   Let me pitch in. Disclaimer: I am working with @mroz45 on the same project 
using Arrow.
   
   
   ```
                              -bool require_sequenced_output = false)
                              +bool require_sequenced_output = true)
   ```
   > Changing this default would be a breaking change and I'm not certain it's 
warranted.
   
   Without it, the Python tests fail.
   
   > Should this only be `Ordering::Implicit` if `require_sequenced_output` is 
set?
   
   It requires deeper, breaking changes, which I think are necessary.
   `require_sequenced_output` means the source should give implicit ordering 
to the batches it produces, and therefore `require_sequenced_output` should be 
moved from `ScanNodeOptions` to `ScanOptions` so that the option can be passed 
to `ScannerBuilder`, which is what Python uses. But in fact I think there 
should be a unified way to assert implicit ordering in all source nodes. Or 
maybe the _need_ for implicit ordering should propagate from the nodes that 
need ordering (asof_join, fetch, etc.) down the line to the source nodes (and 
maybe fail if a source node cannot provide it). There are a few related issues:
   [no standardized sorting 
information](https://github.com/apache/arrow/issues/34451)
   [add ordering information to exec 
batches](https://github.com/apache/arrow/issues/32991)
   [Add AsofJoin Ordering 
Assertion](https://github.com/apache/arrow/issues/20353)
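
To make the propagation idea concrete, here is a minimal sketch of how the need for ordering could flow from consumers down to sources. It is a toy model, not Acero code: `ToyNode`, its fields, and `PropagateOrderingNeed` are all hypothetical names invented for illustration, and the sketch ignores nodes that establish or destroy ordering themselves (for example order_by).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical toy plan node. This is NOT an Acero class.
struct ToyNode {
  std::string name;
  bool needs_ordered_input = false;     // true for nodes such as asof_join or fetch
  bool can_provide_ordering = true;     // only meaningful for source nodes
  bool emit_implicit_ordering = false;  // decided by the propagation pass
  std::vector<std::shared_ptr<ToyNode>> inputs;
};

// Walks from `node` toward its sources. Every source that feeds a node
// needing ordered input is asked to emit an implicit ordering. Returns
// false if such a source cannot provide one.
bool PropagateOrderingNeed(ToyNode& node, bool needed_downstream = false) {
  const bool needed = needed_downstream || node.needs_ordered_input;
  if (node.inputs.empty()) {  // a source node
    if (needed && !node.can_provide_ordering) return false;
    // Keep an earlier "true" if the source is shared by several consumers.
    node.emit_implicit_ordering = node.emit_implicit_ordering || needed;
    return true;
  }
  bool ok = true;
  for (const auto& input : node.inputs) {
    assert(input != nullptr);
    // Visit every input even after a failure so all sources get marked.
    ok = PropagateOrderingNeed(*input, needed) && ok;
  }
  return ok;
}
```

With this shape, a plan like scan -> project -> fetch would mark the scan as needing implicit ordering without the user setting any option, while a plan with no order-sensitive node would leave the scan free to emit batches in any order.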
   
   This [issue](https://github.com/apache/arrow/issues/27651) gave me the idea 
that implicit ordering should be asserted by default, with an additional source 
node or option to assert no ordering, enabling some performance optimizations 
for "don't care" ordering cases. This would fix these issues:
   [asof_join node not working 
properly](https://github.com/apache/arrow/issues/41706)
   [order is unstable](https://github.com/apache/arrow/issues/15144)
   [Preserve order when writing 
dataset](https://github.com/apache/arrow/issues/26818)
   [ordering is weird](https://github.com/apache/arrow/issues/37542)
   [dataset not preserving 
ordering](https://github.com/apache/arrow/issues/39030)
   [scan node not asserting 
ordering](https://github.com/apache/arrow/issues/34698)
   
   We are willing to contribute to fixing the ordering issues within Acero, but 
we have next to no experience in Python/Cython. Also, the size of the issue 
seems to grow with every little change. I think ordering in Acero is a bigger 
topic that needs discussion.
   
   
   
   
   
   

