RDDs and DataFrames do not guarantee any specific ordering of data. Like
tables in a SQL database, they are unordered collections of rows. The only
way to get a guaranteed ordering of rows is to explicitly call orderBy()
in your query. Any ordering you see otherwise is incidental and can change
with caching, partitioning, or the physical plan.

On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
david.hagl...@husqvarnagroup.com> wrote:

> Hi,
> I found something that surprised me. I expected the order of the rows to
> be preserved, so I suspect this might be a bug. The problem is illustrated
> by the Python example below:
> In [1]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.cache()
> df.count()
> df.coalesce(2).rdd.glom().collect()
> Out[1]:
> [[Row(n=1)], [Row(n=0), Row(n=2)]]
> Note how n=1 comes before n=0 above.
> If I remove the cache line, I get the rows in the expected order. The same
> happens if I use df.rdd.count() instead of df.count(); see the examples below:
> In [2]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.count()
> df.coalesce(2).rdd.glom().collect()
> Out[2]:
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
> In [3]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.cache()
> df.rdd.count()
> df.coalesce(2).rdd.glom().collect()
> Out[3]:
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
> I am using Spark 2.1.0 with PySpark.
> Regards,
> /David
