That looks very odd indeed. Things like this work as expected:

```
rdd = spark.sparkContext.parallelize([0, 1, 2])

def my_filter(data, i):
    return data.filter(lambda x: x != i)

for i in range(3):
    rdd = my_filter(rdd, i)

rdd.collect()
```

... as does unrolling the loop. But your example behaves as if only the final filter is applied. Is this some really obscure Python scoping thing with lambdas that I don't understand, like the lambda only binding i once? But then you'd expect only the first number to be filtered. I also keep looking in the code to figure out whether these lambdas are somehow being erroneously 'collapsed' into the same function, but the RDD APIs don't do that kind of thing. They do get put into a chain of pipeline_funcs, but that still shouldn't be an issue. I wonder if this is some strange interaction between serialization of the lambda and scoping? Really strange! Python people?

On Wed, Jan 20, 2021 at 7:14 AM Marco Wong <mck...@gmail.com> wrote:
> Dear Spark users,
>
> I ran the Python code below on a simple RDD, but it gave strange results.
> The filtered RDD contains non-existent elements which were filtered away
> earlier. Any idea why this happened?
> ```
> rdd = spark.sparkContext.parallelize([0,1,2])
> for i in range(3):
>     print("RDD is ", rdd.collect())
>     print("Filtered RDD is ", rdd.filter(lambda x: x != i).collect())
>     rdd = rdd.filter(lambda x: x != i)
>     print("Result is ", rdd.collect())
>     print()
> ```
> which gave
> ```
> RDD is [0, 1, 2]
> Filtered RDD is [1, 2]
> Result is [1, 2]
>
> RDD is [1, 2]
> Filtered RDD is [0, 2]
> Result is [0, 2]
>
> RDD is [0, 2]
> Filtered RDD is [0, 1]
> Result is [0, 1]
> ```
>
> Thanks,
>
> Marco
>
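P.S. For what it's worth, the "lambda only binds i once" part of my guess can be shown in plain Python with no Spark at all. This is just a sketch of the closure behavior, not of anything PySpark does internally — the names below are mine, and the actual interaction with pickling/pipelining in Spark may be more involved. Still, note that chaining three late-bound filters over [0, 1, 2] leaves [0, 1], which matches your final "Result is [0, 1]":

```python
def build_filters_late():
    # Each lambda closes over the loop variable `i` by reference,
    # so by the time any of them runs, they all see i's final value (2).
    filters = []
    for i in range(3):
        filters.append(lambda x: x != i)
    return filters

def build_filters_early():
    # Binding i as a default argument (or routing it through a helper
    # function, as in my_filter above) freezes its value per iteration.
    filters = []
    for i in range(3):
        filters.append(lambda x, i=i: x != i)
    return filters

data = [0, 1, 2]

# Late binding: every filter tests x != 2, so 0 and 1 survive all three.
late = data
for f in build_filters_late():
    late = [x for x in late if f(x)]
print(late)   # [0, 1]

# Early binding: filters test x != 0, x != 1, x != 2 in turn.
early = data
for f in build_filters_early():
    early = [x for x in early if f(x)]
print(early)  # []
```

If that's the mechanism, the usual fixes (a helper function per iteration, or `lambda x, i=i: x != i`) should make the loop behave as expected — though it still doesn't fully explain why the earlier collect() calls in your loop printed what they did, which is why I suspect serialization timing is also involved.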