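In general, the SQL standard makes no guarantee about the order of result
rows unless the query has an explicit ORDER BY clause. If you need a stable
order, add one, e.g. "select country from data group by country order by
country".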

I would guess that there are two factors at play here:

1. The use of hash-based data structures, as you mention
2. If the data is partitioned, the partitions are processed on multiple
threads, and that can affect the ordering as well
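
To illustrate the first factor, here is a minimal sketch in plain Rust. It is
not DataFusion's actual code: the function names are made up for this example,
and I am assuming the aggregation behaves like a randomly seeded hash table
such as the standard library's HashSet. The first function returns the
distinct values in whatever order the hash set yields them. The second sorts
them, which is roughly what adding ORDER BY buys you.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for "GROUP BY country" without ORDER BY.
/// std's HashSet is randomly seeded per instance, so the iteration
/// order is unspecified and may differ from one call to the next.
fn group_unordered(rows: &[&str]) -> Vec<String> {
    let set: HashSet<&str> = rows.iter().copied().collect();
    set.into_iter().map(String::from).collect()
}

/// Hypothetical stand-in for "GROUP BY country ORDER BY country".
/// A BTreeSet iterates in sorted order, so the result is stable.
fn group_ordered(rows: &[&str]) -> Vec<String> {
    let set: BTreeSet<&str> = rows.iter().copied().collect();
    set.into_iter().map(String::from).collect()
}
```

Calling group_unordered twice on the same input returns the same set of
values both times, but not necessarily in the same sequence. group_ordered
always returns them sorted.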

Andy.

On Sat, Feb 20, 2021 at 7:31 AM Marc Prud'hommeaux <mprud...@apache.org>
wrote:

> When I group by a column in DataFusion SQL, the order of the results is
> different every time. For example, "select country from data group by
> country" against
> https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv
> might return "Moldova" first one time, and then "Sweden" first the next
> time I execute it.
>
> It appears that this is known and acknowledged behavior (it is mentioned
> at https://issues.apache.org/jira/browse/ARROW-5680), but is there good
> reason for it (e.g., performance; simplicity; random hash seeding)? I
> understand why it makes sense to not unnecessarily impose a particular
> ordering, but is there any reason the results are not consistent between
> two identical SQL statements executed against the same
> datafusion::execution::context::ExecutionContext?
>
>
