Hi,

I found something that surprised me: I expected the order of the rows to be 
preserved, so I suspect this might be a bug. The problem is illustrated by 
the Python example below:

In [1]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()  # materializes the cache
df.coalesce(2).rdd.glom().collect()
Out[1]:
[[Row(n=1)], [Row(n=0), Row(n=2)]]

Note how n=1 comes before n=0 above. (glom() returns one list per partition, 
so the output also shows how the rows are distributed across partitions.)


If I remove the cache line, the rows come back in the correct order. The same 
happens if I use df.rdd.count() instead of df.count(). See the examples below:

In [2]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.count()
df.coalesce(2).rdd.glom().collect()
Out[2]:
[[Row(n=0)], [Row(n=1), Row(n=2)]]

In [3]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.rdd.count()
df.coalesce(2).rdd.glom().collect()
Out[3]:
[[Row(n=0)], [Row(n=1), Row(n=2)]]


I am using Spark 2.1.0 with PySpark.
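
In case it helps with reproducing, here is the first example as a 
self-contained script (just a sketch: the local[4] master and the app name 
are arbitrary, and the exact partitioning may vary with the default 
parallelism):

from pyspark.sql import SparkSession

# Standalone version of example [1]; run with spark-submit, or with
# plain python if pyspark is on the path.
spark = SparkSession.builder \
    .master('local[4]') \
    .appName('coalesce-row-order') \
    .getOrCreate()

df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()  # materializes the cache
print(df.coalesce(2).rdd.glom().collect())

spark.stop()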

Regards,
/David
