RDDs and DataFrames do not guarantee any specific ordering of data. They are like tables in a SQL database. The only way to get a guaranteed ordering of rows is to explicitly specify an orderBy() clause in your statement. Any ordering you see otherwise is incidental.
On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <david.hagl...@husqvarnagroup.com> wrote:

> Hi,
>
> I found something that surprised me. I expected the order of the rows to
> be preserved, so I suspect this might be a bug. The problem is illustrated
> with the Python example below:
>
> In [1]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.cache()
> df.count()
> df.coalesce(2).rdd.glom().collect()
> Out[1]:
> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>
> Note how n=1 comes before n=0, above.
>
> If I remove the cache line I get the rows in the correct order, and the
> same if I use df.rdd.count() instead of df.count(); see examples below:
>
> In [2]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.count()
> df.coalesce(2).rdd.glom().collect()
> Out[2]:
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
> In [3]:
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
> df.cache()
> df.rdd.count()
> df.coalesce(2).rdd.glom().collect()
> Out[3]:
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
> I use Spark 2.1.0 and pyspark.
>
> Regards,
> /David