Re: [Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

Marc Prud'hommeaux Sat, 20 Feb 2021 07:31:03 -0800


I understand that GROUP BY ought not imply any particular ordering; it's just 
that working with other SQL databases, I've come to expect that ordering will 
be consistent between multiple runs of the same statement, at least within the 
context of a single transaction on a single connection.


I do note that the random ordering is still present when I reduce the CSV to 2 
rows, so I doubt it is #2, exclusively at least. But the ability to process and 
return the data in parallel is very compelling, so I don't think it matters 
whether the phenomenon is a result of #1 or something else.

Thanks for indulging my curiosity!

    –Marc

On 2021/02/20 14:49:59, Andy Grove <andygrov...@gmail.com> wrote: 
> The SQL standard in general makes no guarantee of the order of resulting
> data unless there is an explicit ORDER BY clause.
> 
> I would guess that there are two factors in play here:
> 
> 1. The use of hash-based data structures, as you mention
> 2. If you have partitioned data then it is processed on multiple threads
> and that can affect ordering as well
> 
> Andy.
> 
> On Sat, Feb 20, 2021 at 7:31 AM Marc Prud'hommeaux <mprud...@apache.org>
> wrote:
> 
> > When I group by a column in DataFusion SQL, the order of the results is
> > different every time. For example, "select country from data group by
> > country" against
> > https://github.com/Teradata/kylo/blob/master/samples/sample-data/csv/userdata3.csv
> > might return "Moldova" first one time, and then "Sweden" first the next
> > time I execute it.
> >
> > It appears that this is known and acknowledged behavior (it is mentioned
> > at https://issues.apache.org/jira/browse/ARROW-5680), but is there good
> > reason for it (e.g., performance; simplicity; random hash seeding)? I
> > understand why it makes sense to not unnecessarily impose a particular
> > ordering, but is there any reason the results are not consistent between
> > two identical SQL statements executed against the same
> > datafusion::execution::context::ExecutionContext?
> >
> >
>

Re: [Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

Reply via email to