[ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
-------------------------------
    Description: 
The closure cleaner slows down the execution of Spark SQL queries fired on a 
union of RDDs. The time spent on the driver side increases linearly with the 
number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is 
consumed in the "getClassReader" method of ClosureCleaner and the rest in 
"ensureSerializable" (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls 
ClosureCleaner's clean method (see PR - 
https://github.com/apache/spark/pull/6256).

2. Fix at Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes
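
Option 2(ii) could be sketched roughly as below. This is only an illustration, not Spark code: the class name ClassBytesCache and its methods are hypothetical, and it caches the raw class-file bytes (the input from which a class reader is built) instead of the reader itself, so that it runs without the ASM library. It assumes a small least-recently-used cache is acceptable and that class files do not change while the driver is running.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache of class-file bytes for the last 'n' classes; not part of Spark. */
class ClassBytesCache {
    private final Map<String, byte[]> cache;
    private int loads = 0; // how many times a class file was actually read

    ClassBytesCache(final int maxEntries) {
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the entry that was touched longest ago.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the class-file bytes for cls, reading them only on a cache miss. */
    synchronized byte[] get(Class<?> cls) throws IOException {
        byte[] bytes = cache.get(cls.getName());
        if (bytes == null) {
            bytes = readClassBytes(cls);
            loads++;
            cache.put(cls.getName(), bytes);
        }
        return bytes;
    }

    synchronized int loads() { return loads; }

    synchronized int size() { return cache.size(); }

    /** Reads the .class resource that sits next to the class itself. */
    private static byte[] readClassBytes(Class<?> cls) throws IOException {
        // Strip the package prefix: "a.b.Outer$Inner" becomes "Outer$Inner.class".
        String resource = cls.getName().replaceFirst("^.*\\.", "") + ".class";
        try (InputStream in = cls.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No class file found for " + cls.getName());
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream(128);
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

With such a cache, cleaning the same closure class for every RDD in the union would read the class file once instead of once per RDD. A class reader could then be built from the cached bytes on each call, or the reader itself could be cached if sharing it is known to be safe.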

  was:
The closure cleaner slows down the execution of Spark SQL queries fired on a 
union of RDDs. The time spent on the driver side increases linearly with the 
number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is 
consumed in the "getClassReader" method of ClosureCleaner and the rest in 
"ensureSerializable" (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls 
ClosureCleaner's clean method.

2. Fix at Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Somehow cache the class reader for the last 'n' classes


> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-7970
>                 URL: https://issues.apache.org/jira/browse/SPARK-7970
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Nitin Goyal
>         Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a 
> union of RDDs. The time spent on the driver side increases linearly with the 
> number of RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls 
> ClosureCleaner's clean method (see PR - 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes



