These are good points. In traditional RDBMSs, the order of SQL query results
without an explicit *ORDER BY* clause may vary with the plan the optimizer
chooses, especially when no clustered index is defined. Systems like Hive
and Spark SQL, which are built on distributed file storage, do not rely on
physical data order (co-location of data blocks) at all. Because their
storage is distributed, they use techniques like columnar storage and
predicate pushdown instead of traditional indexing.
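
To make the point about parallelism concrete, here is a rough sketch in plain
Python. It is not Spark's actual shuffle, only a toy hash partitioner that I
made up for illustration: the helper names (shuffle, first_unordered,
first_ordered), the use of crc32 as the hash, and the sample rows are all
assumptions. It shows that the "first" row after a shuffle can depend on the
number of partitions, while an explicit sort pins it down.

```python
import zlib

# Illustrative (age, name) rows; not taken from any real dataset.
rows = [(14, "Tom"), (23, "Alice"), (16, "Bob")]


def shuffle(data, num_partitions):
    """Toy shuffle: hash-partition rows by name, then read the
    partitions back in partition-index order."""
    partitions = [[] for _ in range(num_partitions)]
    for row in data:
        # crc32 is used as a stand-in hash because it is deterministic
        # across runs (Python's built-in hash() of str is not).
        key = row[1].encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return [row for part in partitions for row in part]


def first_unordered(data, num_partitions):
    """'First' row with no stated ordering: whatever the shuffle yields."""
    return shuffle(data, num_partitions)[0]


def first_ordered(data, num_partitions):
    """'First' row with an explicit ordering on age."""
    return sorted(shuffle(data, num_partitions), key=lambda r: r[0])[0]


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, first_unordered(rows, n), first_ordered(rows, n))
```

The unordered column may change as the partition count changes; the ordered
column cannot, which is the same reason limit, head or offset only have a
guaranteed result after an explicit ordering.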

HTH


On Mon, 18 Sept 2023 at 20:19, Sean Owen <sro...@gmail.com> wrote:

> I think it's the same, and always has been - yes you don't have a
> guaranteed ordering unless an operation produces a specific ordering. Could
> be the result of order by, yes; I believe you would be guaranteed that
> reading input files results in data in the order they appear in the file,
> etc. 1:1 operations like map() don't change ordering. But not the result of
> a shuffle, for example. So yeah anything like limit or head might give
> different results in the future (or simply on different cluster setups with
> different parallelism, etc). The existence of operations like offset
> doesn't contradict that. Maybe that's totally fine in some situations (ex:
> I just want to display some sample rows) but otherwise yeah you've always
> had to state your ordering for "first" or "nth" to have a guaranteed result.
>
> On Mon, Sep 18, 2023 at 10:48 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’ve always considered DataFrames to be logically equivalent to SQL
>> tables or queries.
>>
>> In SQL, the result order of any query is implementation-dependent without
>> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
>> table;` 10 times in a row and get 10 different orderings.
>>
>> I thought the same applied to DataFrames, but the docstring for the
>> recently added method DataFrame.offset
>> <https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301>
>> implies otherwise.
>>
>> This example will work fine in practice, of course. But if DataFrames are
>> technically unordered without an explicit ordering clause, then in theory a
>> future implementation change may result in “Bob” being the “first” row in
>> the DataFrame, rather than “Tom”. That would make the example incorrect.
>>
>> Is that not the case?
>>
>> Nick
>>
>>
