Spark has a zipWithIndex function for RDDs
(http://stackoverflow.com/a/26081548) that pairs each element with its
index. If you call it right after you create the RDD, I believe the index
reflects the original order, so you can sort by it after the cache step.

I haven't tried this with a DataFrame, but this answer seems promising:
http://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex



On Mon, Feb 13, 2017 at 8:34 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> RDDs and DataFrames do not guarantee any specific ordering of data. They
> are like tables in a SQL database. The only way to get a guaranteed
> ordering of rows is to explicitly specify an orderBy() clause in your
> statement. Any ordering you see otherwise is incidental.
>
> On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
> david.hagl...@husqvarnagroup.com> wrote:
>
>> Hi,
>>
>>
>>
>> I found something that surprised me: I expected the order of the rows to
>> be preserved, so I suspect this might be a bug. The problem is illustrated
>> by the Python example below:
>>
>>
>>
>> In [1]:
>>
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>>
>> df.cache()
>>
>> df.count()
>>
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[1]:
>>
>> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>>
>>
>>
>> Note how n=1 comes before n=0, above.
>>
>>
>>
>>
>>
>> If I remove the cache line, I get the rows in the correct order. The same
>> happens if I use df.rdd.count() instead of df.count(). See the examples
>> below:
>>
>>
>>
>> In [2]:
>>
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>>
>> df.count()
>>
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[2]:
>>
>> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>>
>>
>>
>> In [3]:
>>
>> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>>
>> df.cache()
>>
>> df.rdd.count()
>>
>> df.coalesce(2).rdd.glom().collect()
>>
>> Out[3]:
>>
>> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>>
>>
>>
>>
>>
>> I use Spark 2.1.0 and PySpark.
>>
>>
>>
>> Regards,
>>
>> /David
>>
>