Rajesh Balamohan created SPARK-14113:
----------------------------------------
Summary: Consider marking JobConf closure-cleaning in HadoopRDD as optional
Key: SPARK-14113
URL: https://issues.apache.org/jira/browse/SPARK-14113
Project: Spark
Issue Type: Bug
Components: Spark Core
Reporter: Rajesh Balamohan

In HadoopRDD, the following code was introduced as part of SPARK-6943.

{noformat}
if (initLocalJobConfFuncOpt.isDefined) {
  sparkContext.clean(initLocalJobConfFuncOpt.get)
}
{noformat}

While working on a change in OrcRelation, I tried passing initLocalJobConfFuncOpt to HadoopRDD and incurred a significant performance penalty (due to closure cleaning) with large RDDs. The cleaning is invoked for every HadoopRDD initialization, which makes it a bottleneck. An example thread stack is given below:

{noformat}
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
{noformat}

Creating this JIRA to explore the possibility of removing the cleaning or marking it optional.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
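The "mark it optional" proposal could take the shape of an opt-out flag alongside the init function. The following is a minimal, self-contained sketch of that pattern only; the names (`ClosureCleaningSketch`, `initRDD`, `cleanInitFuncOpt`) are hypothetical stand-ins, not Spark's actual API or the eventual patch:

```scala
object ClosureCleaningSketch {
  // Stand-in for SparkContext.clean: counts invocations so the effect
  // of the opt-out flag is observable (real cleaning walks bytecode via ASM).
  var cleanCalls: Int = 0
  def clean[T](f: T): T = { cleanCalls += 1; f }

  // Stand-in for HadoopRDD's constructor logic: the closure is cleaned only
  // when the caller has not opted out (e.g. a caller like OrcRelation that
  // already knows its function is serializable).
  def initRDD(initLocalJobConfFuncOpt: Option[String => Unit],
              cleanInitFuncOpt: Boolean = true): Unit = {
    if (cleanInitFuncOpt) {
      initLocalJobConfFuncOpt.foreach(clean(_))
    }
  }
}
```

With a flag like this, a caller constructing many HadoopRDDs with a known-serializable function could pass `cleanInitFuncOpt = false` and skip the ASM class-reading walk shown in the stack trace above entirely.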