That looks very odd indeed. Things like this work as expected:

rdd = spark.sparkContext.parallelize([0,1,2])

def my_filter(data, i):
  # i is a fresh local binding on each call, so each lambda captures its own value
  return data.filter(lambda x: x != i)

for i in range(3):
  rdd = my_filter(rdd, i)
rdd.collect()

... as does unrolling the loop.
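
(By "unrolling" I just mean writing the three filters out by hand with
constants, same spark session as above, something like:)

rdd = spark.sparkContext.parallelize([0, 1, 2])
rdd = rdd.filter(lambda x: x != 0)
rdd = rdd.filter(lambda x: x != 1)
rdd = rdd.filter(lambda x: x != 2)
rdd.collect()  # [] - each value gets filtered out in turn, as expected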

But your example behaves as if only the final filter is applied. Is this
some really obscure Python scoping thing with lambdas that I don't
understand, like the lambda only binding i once? But then you'd expect it
to only filter out the first number.
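
For reference, the plain-Python gotcha I'm thinking of (no Spark involved)
is that lambdas close over the variable i, not its value at definition
time, so they all read whatever i is when they're eventually called:

funcs = []
for i in range(3):
  funcs.append(lambda x: x != i)
# After the loop, i is 2, and all three lambdas see that same i:
[f(0) for f in funcs]  # [True, True, True]  - none of them filter 0
[f(2) for f in funcs]  # [False, False, False] - all of them filter 2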

I also keep looking in the code to figure out whether these are somehow
being erroneously 'collapsed' into the same function, but the RDD APIs
don't do that kind of thing. They do get put into a chain of
pipeline_funcs, but that still shouldn't be an issue. I wonder if this is
some strange interaction with serialization of the lambda and/or its
scoping?
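
If it does turn out to be the closure capturing i by reference combined
with lazy evaluation, then forcing an early bind ought to make the original
loop behave. Untested, just the usual default-argument trick as a sketch:

rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
  # i=i snapshots the current value as a default argument at definition time,
  # so the lambda no longer depends on the loop variable when it's serialized
  rdd = rdd.filter(lambda x, i=i: x != i)
rdd.collect()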

Really strange! Python people?

On Wed, Jan 20, 2021 at 7:14 AM Marco Wong <mck...@gmail.com> wrote:

> Dear Spark users,
>
> I ran the Python code below on a simple RDD, but it gave strange results.
> The filtered RDD contains non-existent elements which were filtered away
> earlier. Any idea why this happened?
> ```
> rdd = spark.sparkContext.parallelize([0,1,2])
> for i in range(3):
>     print("RDD is ", rdd.collect())
>     print("Filtered RDD is ", rdd.filter(lambda x:x!=i).collect())
>     rdd = rdd.filter(lambda x:x!=i)
>     print("Result is ", rdd.collect())
>     print()
> ```
> which gave
> ```
> RDD is  [0, 1, 2]
> Filtered RDD is  [1, 2]
> Result is  [1, 2]
>
> RDD is  [1, 2]
> Filtered RDD is  [0, 2]
> Result is  [0, 2]
>
> RDD is  [0, 2]
> Filtered RDD is  [0, 1]
> Result is  [0, 1]
> ```
>
> Thanks,
>
> Marco
>
