Hi dev list,

I'm running into an issue where I'm seeing different results from Spark
when I run with spark.shuffle.spill=false vs leaving it at the default
(true).

It's on internal data so I can't share my exact repro, but here's roughly
what I'm doing:

val rdd = sc.textFile(...)
  .map(l => ... (col1, col2))  // parse CSV into Tuple2[String,String]
  .distinct
  .join(
    sc.textFile(...)
       .map(l => ... (col1, col2))  // parse CSV into Tuple2[String,String]
       .distinct
  )
  .map{ case (k,(v1,v2)) => Seq(v1,k,v2).mkString("|") }

Then I output:
(rdd.count, rdd.distinct.count)
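
In case it helps anyone poke at this without my data, here's a self-contained
stand-in for the same job shape, with made-up rows in place of the internal
CSVs (the real parsing is just a split into two string columns; sample values
here are placeholders). The counts below are from the real data, not this toy
version:

// Stand-in for the internal CSVs: same structure, made-up rows.
// Assumes the sc from the spark-shell, same as the snippet above.
val left = sc.parallelize(Seq("a,1", "b,2", "c,3"))
  .map { l => val cols = l.split(","); (cols(0), cols(1)) }
  .distinct

val right = sc.parallelize(Seq("a,x", "b,y", "d,z"))
  .map { l => val cols = l.split(","); (cols(0), cols(1)) }
  .distinct

val joined = left.join(right)
  .map { case (k, (v1, v2)) => Seq(v1, k, v2).mkString("|") }

// Same sanity check as above: total rows vs distinct rows.
println((joined.count, joined.distinct.count))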

When I run with spark.shuffle.spill=false I get this:
(3192729,3192729)

And with spark.shuffle.spill=true I get this:
(3192931,3192726)
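
For completeness, the only thing that changes between the two runs is that
flag, set on the SparkConf before the context comes up. Roughly like this
(master and app name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder master/app name; only the spill flag differs between the runs.
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("spill-repro")
  .set("spark.shuffle.spill", "false")  // or "true" / unset for the default
val sc = new SparkContext(conf)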

Has anyone else seen any bugs in join-heavy operations while using
spark.shuffle.spill=true?

My current theory is that I have a hashCode collision between rows (unusual,
I know) and that AppendOnlyMap determines key equality using hashCode() plus
equals(), while ExternalAppendOnlyMap does it based on hashCode() alone.
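
To make the collision part concrete: it's easy to get distinct strings on the
JVM with identical hashCodes, so if key equality ever came down to hashCode()
alone, rows like these would get conflated. This is just an illustration of
the theory, not something I've confirmed in the map code yet:

// Classic example: different strings, same hashCode (both are 2112).
val a = "Aa"
val b = "BB"
println(a.hashCode == b.hashCode)  // true
println(a == b)                    // false
// A map that keyed purely on hashCode would merge values for a and b.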

I'd definitely appreciate some additional eyes on this problem.

Right now I'm looking through the source and tests for AppendOnlyMap and
ExternalAppendOnlyMap to see if anything jumps out at me.
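
If anyone wants to attack it from the RDD side instead, this is the kind of
check I'm planning to try next: keys that collide on hashCode, with the
shuffle memory turned way down so the external map actually spills. I believe
spark.shuffle.memoryFraction is the knob for that, but the exact settings
needed to force a spill on a toy dataset are a guess on my part:

import org.apache.spark.{SparkConf, SparkContext}

// "Aa" and "BB" share a hashCode, so if equality ever falls back to hashCode
// alone during spill/merge, these two keys could be conflated.
// The memoryFraction value is a guess at what forces spilling on tiny data.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("collision-spill-check")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.0001")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(1 to 100000).flatMap { i =>
  Seq(("Aa", i), ("BB", i))  // colliding keys, distinct values
}

// Expect exactly two keys with 100000 values each; anything else would mean
// the colliding keys were merged somewhere along the way.
val sizes = pairs.groupByKey().mapValues(_.size).collect().toMap
println(sizes)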

Thanks!
Andrew
