[ https://issues.apache.org/jira/browse/SPARK-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210038#comment-15210038 ]

Rajesh Balamohan commented on SPARK-14113:
------------------------------------------

[~srowen] - In some cases, queries have 5000+ RDDs, and this cleanup is invoked 
every time a HadoopRDD is initialized, which becomes the bottleneck. The overall 
runtime increases by a couple of seconds even though the entire job runtime is 
itself small. For instance, SQLHadoopRDD does not perform this kind of check.
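
For a rough sense of the scaling, a micro-benchmark along the following lines can 
time the per-closure cleaning cost that gets multiplied by the RDD count. This is 
an illustrative sketch only, not actual Spark code: SparkContext.clean is 
private[spark], so the file has to be compiled under the org.apache.spark package.

{noformat}
// Illustrative micro-benchmark sketch, not a real reproduction of the
// reported query. SparkContext.clean is private[spark], which is why this
// lives under the org.apache.spark package.
package org.apache.spark

import org.apache.hadoop.mapred.JobConf

object CleanBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("clean-bench"))
    // A closure of the same shape as the one held in initLocalJobConfFuncOpt.
    val initFunc: JobConf => Unit = conf => conf.set("some.key", "some.value")
    val t0 = System.nanoTime()
    (1 to 5000).foreach(_ => sc.clean(initFunc)) // one clean per HadoopRDD, as in the report
    println(s"5000 closure cleans: ${(System.nanoTime() - t0) / 1e6} ms")
    sc.stop()
  }
}
{noformat}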

> Consider marking JobConf closure-cleaning in HadoopRDD as optional
> ------------------------------------------------------------------
>
>                 Key: SPARK-14113
>                 URL: https://issues.apache.org/jira/browse/SPARK-14113
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>
> In HadoopRDD, the following code was introduced as part of SPARK-6943.
> {noformat}
>   if (initLocalJobConfFuncOpt.isDefined) {
>     sparkContext.clean(initLocalJobConfFuncOpt.get)
>   }
> {noformat}
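> For context, the closure being cleaned here is the optional JobConf initializer 
> that a caller may supply. A minimal sketch of such a closure follows; it is 
> hypothetical, with the config key chosen only to illustrate the ORC 
> column-pruning use case mentioned below:
> {noformat}
> // Hypothetical caller-side sketch: the kind of closure that ends up in
> // initLocalJobConfFuncOpt and is scanned by ClosureCleaner on every
> // HadoopRDD construction.
> val initLocalJobConfFunc: org.apache.hadoop.mapred.JobConf => Unit = { jobConf =>
>   jobConf.set("hive.io.file.readcolumn.ids", "0,2,3")
> }
> {noformat}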
> When working on one of the changes in OrcRelation, I tried passing 
> initLocalJobConfFuncOpt to HadoopRDD, and that incurred a significant 
> performance penalty (due to closure cleaning) with large numbers of RDDs. The 
> cleaning is invoked on every HadoopRDD initialization, which becomes the 
> bottleneck. An example thread stack is given below:
> {noformat}
>         at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>         at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
>         at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
>         at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>         at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
>         at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>         at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
>         at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
>         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>         at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
>         at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
> {noformat}
> Creating this JIRA to explore the possibility of removing this call or marking 
> it optional.
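> A minimal sketch of the "optional" variant follows; the guard flag is an 
> assumed constructor parameter, not actual Spark code:
> {noformat}
> // Hypothetical: guard the cleaning with a flag so code paths that create
> // thousands of HadoopRDDs from a trusted, already-serializable closure
> // can skip the per-RDD bytecode scan.
> if (initLocalJobConfFuncOpt.isDefined && cleanInitLocalJobConfFunc) {
>   sparkContext.clean(initLocalJobConfFuncOpt.get)
> }
> {noformat}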


