Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Ofer Mendelevitch
Thanks Sean. Sorting definitely solves it, but I was hoping it could be avoided :) In the documentation for classification in MLlib, for example, zip() is used to create labelsAndPredictions:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and pa…

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think this is a bad example, since testData is not deterministic at all. I thought we had fixed this or similar examples in the past? As in https://github.com/apache/spark/pull/1250/files. Hm, anyone see a reason that shouldn't be changed too?

On Mon, Mar 23, 2015 at 7:00 PM, Ofer Mendelevitch w…

Re: Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Sean Owen
I think the explanation is that the join does not guarantee any order, since it causes a shuffle in general, and it is computed twice in the first example, resulting in a difference for d1 and d2. You can persist() the result of the join, and in practice I believe you'd find it behaves as expected, …

Strange behavior with PySpark when using Join() and zip()

2015-03-23 Thread Ofer Mendelevitch
Hi, I am running into a strange issue when doing a JOIN of two RDDs followed by ZIP from PySpark. It’s part of a more complex application, but I was able to narrow it down to a simplified example that’s easy to replicate and causes the same problem to appear:

raw = sc.parallelize([('k'+str(x),'…