Thanks Sean,
Sorting definitely solves it, but I was hoping it could be avoided :)
In the documentation for Classification in MLlib, for example, zip() is used to
create labelsAndPredictions:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data ...
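For context, the docs pattern relies on zip() pairing elements purely by position, which is only correct if both sides are produced in the same order. A plain-Python sketch of that assumption (the labels, features, and toy predict function are illustrative, not from the thread):

```python
# Positional pairing, as zip() does in Spark: correct only while
# both sequences stay in the same order.
labels = [1.0, 0.0, 1.0]
features = [[0.5], [0.1], [0.9]]

def predict(x):
    # Toy "model": predicts 1.0 when the first feature exceeds 0.3.
    return 1.0 if x[0] > 0.3 else 0.0

# Predictions computed in the same order as the labels: zip pairs correctly.
predictions = [predict(x) for x in features]
labels_and_predictions = list(zip(labels, predictions))

# If one side is reordered (e.g. by a shuffle), zip silently mispairs.
shuffled_predictions = [predictions[2], predictions[0], predictions[1]]
broken = list(zip(labels, shuffled_predictions))
```

Nothing fails loudly in the reordered case, which is exactly what makes this pattern fragile after a shuffle.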
I think this is a bad example since testData is not deterministic at
all. I thought we had fixed this or similar examples in the past, as
in https://github.com/apache/spark/pull/1250/files.
Hm, does anyone see a reason that shouldn't be changed too?
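One order-independent alternative (a sketch of the general technique, not necessarily what that PR did) is to key both sides by a stable index and join on it, instead of relying on positional zip. Outside Spark, the same logic can be shown with enumerate standing in for RDD.zipWithIndex() and a dict lookup standing in for join():

```python
# Pair by explicit index instead of by position, so that
# reordering one side cannot mispair elements.
labels = ['a', 'b', 'c']

# Predictions arrive tagged with their index, in arbitrary order:
predictions_in_arbitrary_order = [(2, 'P2'), (0, 'P0'), (1, 'P1')]

# enumerate() plays the role of zipWithIndex() here.
indexed_labels = list(enumerate(labels))  # [(0, 'a'), (1, 'b'), (2, 'c')]

# A dict-based "join" on the index key:
pred_by_index = dict(predictions_in_arbitrary_order)
labels_and_predictions = [(lab, pred_by_index[i]) for i, lab in indexed_labels]
```

Because the pairing key travels with each element, the result is the same no matter how either side gets reordered.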
On Mon, Mar 23, 2015 at 7:00 PM, Ofer Mendelevitch wrote:
I think the explanation is that the join does not guarantee any order,
since it causes a shuffle in general, and it is computed twice in the
first example, resulting in a difference for d1 and d2.
You can persist() the result of the join and in practice I believe
you'd find it behaves as expected.
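The "computed twice" point can be mimicked outside Spark: if each evaluation of a lazy pipeline may return the same rows in a different order, zipping two independent evaluations can mispair them, while materializing once (the analogue of persist()) and reusing that single result cannot. A toy illustration, with a seeded shuffle standing in for the nondeterministic join order:

```python
import random

rows = [('k0', 0), ('k1', 1), ('k2', 2), ('k3', 3), ('k4', 4)]

def recompute(seed):
    # Stand-in for re-evaluating a lazy join: each evaluation may
    # yield the same rows in a different order.
    out = rows[:]
    random.Random(seed).shuffle(out)
    return out

# Two independent recomputations: same rows, possibly different order,
# so zip(d1, d2) may pair rows that do not belong together.
d1 = recompute(seed=1)
d2 = recompute(seed=2)
possibly_mispaired = list(zip(d1, d2))

# The persist() analogue: materialize once, reuse the single result.
cached = recompute(seed=1)
aligned = all(a == b for a, b in zip(cached, cached))
```

Zipping a dataset with itself is trivially aligned once it is materialized a single time; it is the double evaluation that opens the door to mismatches.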
Hi,
I am running into a strange issue when doing a JOIN of two RDDs followed by a ZIP
in PySpark.
It’s part of a more complex application, but I was able to narrow it down to a
simplified example that’s easy to replicate and causes the same problem to
appear:
raw = sc.parallelize([('k'+str(x),'