Spark has a zipWithIndex function for RDDs (http://stackoverflow.com/a/26081548) that pairs each element with its index. If you apply it right after you create the RDD, I believe the index reflects the original order. Then you can sort by the index after the cache step.
I haven't tried this with a Dataframe but this answer seems promising:
http://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex

On Mon, Feb 13, 2017 at 8:34 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> RDDs and DataFrames do not guarantee any specific ordering of data. They
> are like tables in a SQL database. The only way to get a guaranteed
> ordering of rows is to explicitly specify an orderBy() clause in your
> statement. Any ordering you see otherwise is incidental.
>
> On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
> david.hagl...@husqvarnagroup.com> wrote:
>
>> Hi,
>>
>> I found something that surprised me, I expected the order of the rows to
>> be preserved, so I suspect this might be a bug. The problem is illustrated
>> with the Python example below:
>>
>> In [1]:
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>> df.cache()
>> df.count()
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[1]:
>> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>>
>> Note how n=1 comes before n=0, above.
>>
>> If I remove the cache line I get the rows in the correct order and the
>> same if I use df.rdd.count() instead of df.count(), see examples below:
>>
>> In [2]:
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>> df.count()
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[2]:
>> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>>
>> In [3]:
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>> df.cache()
>> df.rdd.count()
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[3]:
>> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>>
>> I use spark 2.1.0 and pyspark.
>>
>> Regards,
>> /David