I understand that GROUP BY ought not imply any particular ordering; it's just that working with other SQL databases, I've come to expect that ordering will be consistent between multiple runs of the same statement, at least within the context of a single transaction on a single connection.
I do note that the random ordering is still present when I reduce the CSV to 2 rows, so I doubt it is #2, exclusively at least. But the ability to process and return the data in parallel is very compelling, so I don't think it matters whether the phenomenon is a result of #1 or something else. Thanks for indulging my curiosity! –Marc On 2021/02/20 14:49:59, Andy Grove <andygrov...@gmail.com> wrote: > The SQL standard in general makes no guarantee of the order of resulting > data unless there is an explicit ORDER BY clause. > > I would guess that there are two factors in play here: > > 1. The use of hash-based data structures, as you mention > 2. If you have partitioned data then it is processed on multiple threads > and that can affect ordering as well > > Andy. > > On Sat, Feb 20, 2021 at 7:31 AM Marc Prud'hommeaux <mprud...@apache.org> > wrote: > > > When I group by a column in DataFusion SQL, the order of the results is > > different every time. For example, "select country from data group by > > country" against > > https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv > > might return "Moldova" first one time, and then "Sweden" first the next > > time I execute it. > > > > It appears that this is known and acknowledged behavior (it is mentioned > > at https://issues.apache.org/jira/browse/ARROW-5680), but is there good > > reason for it (e.g., performance; simplicity; random hash seeding)? I > > understand why it makes sense to not unnecessarily impose a particular > > ordering, but is there any reason the results are not consistent between > > two identical SQL statements executed against the same > > datafusion::execution::context::ExecutionContext? > > > > >