RDDs and DataFrames do not guarantee any specific ordering of data. They
are like tables in a SQL database. The only way to get a guaranteed
ordering of rows is to explicitly specify an orderBy() clause in your
statement. Any ordering you see otherwise is incidental.
​

On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
david.hagl...@husqvarnagroup.com> wrote:

> Hi,
>
>
>
> I found something that surprised me, I expected the order of the rows to
> be preserved, so I suspect this might be a bug. The problem is illustrated
> with the Python example below:
>
>
>
> In [1]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[1]:
>
> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>
>
>
> Note how n=1 comes before n=0, above.
>
>
>
>
>
> If I remove the cache line I get the rows in the correct order and the
> same if I use df.rdd.count() instead of df.count(), see examples below:
>
>
>
> In [2]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[2]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
> In [3]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.rdd.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[3]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
>
>
> I use spark 2.1.0 and pyspark.
>
>
>
> Regards,
>
> /David
>
> The information in this email may be confidential and/or legally
> privileged. It has been sent for the sole use of the intended recipient(s).
> If you are not an intended recipient, you are strictly prohibited from
> reading, disclosing, distributing, copying or using this email or any of
> its contents, in any way whatsoever. If you have received this email in
> error, please contact the sender by reply email and destroy all copies of
> the original message. Please also be advised that emails are not a secure
> form for communication, and may contain errors.
>

Reply via email to