sam created SPARK-26534:
---------------------------

             Summary: Closure Cleaner Bug
                 Key: SPARK-26534
                 URL: https://issues.apache.org/jira/browse/SPARK-26534
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: sam
I've found a strange combination of closures where the closure cleaner doesn't seem to be smart enough to remove a reference that is not used. That is, we get an `org.apache.spark.SparkException: Task not serializable` for a task that is perfectly serializable. In the example below, the only `val` actually needed by the closure of the `map` is `foo`, but Spark tries to serialise `thingy` as well. What is odd is that changing this code in a number of subtle ways eliminates the error, which I've tried to highlight with the inline comments.

{code:java}
import org.apache.spark.sql._

object Test {

  val sparkSession: SparkSession =
    SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
    import sparkSession.implicits._

    val landedData: Dataset[String] =
      sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()

    // thingy has to be in this outer scope to reproduce; if declared inside someFunc, the error does not occur
    val thingy: Thingy = new Thingy

    // If not wrapped in someFunc, the error does not occur
    val someFunc = () => {
      // If foo is not referenced inside the closure (e.g. an identity function is used instead), the error does not occur
      val foo: String = "foo"

      thingy.run(block = () => {
        landedData.map(r => {
          r + foo
        })
        .count()
      })
    }

    someFunc()
  }
}

class Thingy {
  def run[R](block: () => R): R = {
    block()
  }
}
{code}
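For comparison, one of the variations the inline comments describe as not reproducing the error is declaring `thingy` inside `someFunc`. The sketch below (a hypothetical rearrangement of the reproducer above, not independently verified) shows that shape, with everything else left the same:

{code:java}
import org.apache.spark.sql._

object TestVariation {

  val sparkSession: SparkSession =
    SparkSession.builder.master("local").appName("app").getOrCreate()

  def apply(): Unit = {
    import sparkSession.implicits._

    val landedData: Dataset[String] =
      sparkSession.sparkContext.makeRDD(Seq("foo", "bar")).toDS()

    val someFunc = () => {
      // Declaring thingy here, inside someFunc, is one of the subtle changes
      // that (per the comments above) makes the "Task not serializable" error go away.
      val thingy: Thingy = new Thingy
      val foo: String = "foo"

      thingy.run(block = () => {
        landedData.map(r => r + foo).count()
      })
    }

    someFunc()
  }
}
{code}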