I discovered this in an IPython notebook (ipynb), and I haven't yet checked whether it happens elsewhere.
Here's a simple example:

This produces the output:

That is not what I wanted. Alarmingly, if I call .cache() on these RDDs, it changes the result and I get what I wanted.

That produces:

It's very unexpected for .cache() to actually change the results here. There is also additional weirdness when doing more interesting things, which .cache() still doesn't fix, but I don't yet have a simple example for that.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-references-to-different-rdds-being-overwritten-to-point-to-the-same-rdd-different-results-wh-tp9248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.