westonpace commented on issue #36709:
URL: https://github.com/apache/arrow/issues/36709#issuecomment-1642250990

   +1 for exposing `use_threads`.  However, even with `use_threads=False` there 
is generally no guarantee of stable ordering if you are reading from disk.  
Even with `use_threads=False` we still perform I/O in parallel (in an async 
fashion).  This can be disabled by setting the I/O thread pool size to 1, but 
that sometimes has unpleasant side effects.  If your data source is in-memory 
and you disable threads, then you should get stable results.
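   
   A minimal sketch of the knobs mentioned above (the dataset path and column names here are hypothetical) — shrinking the I/O thread pool to one thread and disabling threaded compute per call:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Shrink the global I/O thread pool so reads are no longer issued in
   # parallel.  Note the side effect mentioned above: this throttles ALL
   # I/O in the process, not just this one scan.
   pa.set_io_thread_count(1)

   # dataset = ds.dataset("path/to/data")          # hypothetical path
   # table = dataset.to_table(use_threads=False)   # single-threaded compute
   ```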
   
   There are two different things here and I'm not sure which we are discussing:
   
    * The order of the keys
   
   Stable ordering of the keys is tricky since we are essentially using an 
"unordered hashmap" for our grouping.  Changes in the order data arrives, or 
even just changes in the amount of data could, in theory, change the order of 
the resulting keys.  Perhaps the easiest thing to do for now is to just sort 
the results by key.  At some point we could investigate an ordered hashmap if 
desired. In classic SQL one generally needs to add an order-by clause to the 
end of the query to order by the keys (and the underlying implementation may 
just add a sort node or it may do something more clever with the groupby).
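   
   Sorting the result by key can be done today directly on the aggregated table — a minimal sketch using pyarrow's `Table.group_by` (the column names are illustrative):
   
   ```python
   import pyarrow as pa

   table = pa.table({"key": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})

   # Hash aggregation gives no ordering guarantee on the keys, so sort the
   # aggregated result by key to make the output deterministic.
   result = (
       table.group_by("key")
            .aggregate([("value", "sum")])
            .sort_by("key")
   )
   ```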
   
    * The order of the grouping values
   
   Stable ordering of the values is a bit trickier, especially in parallel.  
Only a few aggregate functions depend on this.  In PostgreSQL there is 
special syntax for dealing with this case.  Note that this is also 
something that can be expressed in Substrait.
   
   ```sql
   SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
   ```
   
   Acero doesn't yet have the components that would be needed to do this.  I 
think, at a minimum, you would want a way to force the aggregate "consume" 
operation to be serialized (with some kind of sequencing queue).  Then you 
could use a regular order-by followed by the group-by node and get predictable 
results.  Since you're paying the cost of ordering you could probably order 
first by the grouping keys, and then by the measure column.  Then you could use 
a streaming group-by operator.
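   
   The "sort first by keys, then by the measure, then stream the group-by" idea can be sketched in plain Python over a sorted pyarrow table — this is just an illustration of the concept (Acero itself does not offer this today):
   
   ```python
   import itertools
   import pyarrow as pa

   table = pa.table({
       "key": ["b", "a", "b", "a"],
       "y":   [1, 3, 2, 4],
       "x":   ["p", "q", "r", "s"],
   })

   # Order by the grouping key first, then by the measure column (y DESC
   # here, mirroring the SQL example above).
   ordered = table.sort_by([("key", "ascending"), ("y", "descending")])

   # Streaming group-by: rows for each key are now contiguous, so one
   # linear pass aggregates while preserving the per-group value order.
   rows = ordered.to_pylist()
   grouped = {
       key: [r["x"] for r in group]
       for key, group in itertools.groupby(rows, key=lambda r: r["key"])
   }
   ```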
   
   Then, of course, there is a whole different can of worms about how to 
effectively wrap this all up in pyarrow :)

