I understand that GROUP BY ought not imply any particular ordering; it's just
that working with other SQL databases, I've come to expect that ordering will
be consistent between multiple runs of the same statement, at least within the
context of a single transaction on a single connection.
I do note that the random ordering is still present when I reduce the CSV to 2
rows, so I doubt it is #2, exclusively at least. But the ability to process and
return the data in parallel is very compelling, so I don't think it matters
whether the phenomenon is a result of #1 or something else.
Thanks for indulging my curiosity!
–Marc
On 2021/02/20 14:49:59, Andy Grove <[email protected]> wrote:
> The SQL standard in general makes no guarantee of the order of resulting
> data unless there is an explicit ORDER BY clause.
>
> I would guess that there are two factors in play here:
>
> 1. The use of hash-based data structures, as you mention
> 2. If you have partitioned data then it is processed on multiple threads
> and that can affect ordering as well
>
> Andy.
>
> On Sat, Feb 20, 2021 at 7:31 AM Marc Prud'hommeaux <[email protected]>
> wrote:
>
> > When I group by a column in DataFusion SQL, the order of the results is
> > different every time. For example, "select country from data group by
> > country" against
> > https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv
> > might return "Moldova" first one time, and then "Sweden" first the next
> > time I execute it.
> >
> > It appears that this is known and acknowledged behavior (it is mentioned
> > at https://issues.apache.org/jira/browse/ARROW-5680), but is there good
> > reason for it (e.g., performance; simplicity; random hash seeding)? I
> > understand why it makes sense to not unnecessarily impose a particular
> > ordering, but is there any reason the results are not consistent between
> > two identical SQL statements executed against the same
> > datafusion::execution::context::ExecutionContext?
> >
> >
>