[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...

rajeshbalamohan Tue, 22 Mar 2016 22:54:07 -0700

GitHub user rajeshbalamohan opened a pull request:

    https://github.com/apache/spark/pull/11911


    SPARK-14091 [core] Consider improving performance of SparkContext.get…

    ## What changes were proposed in this pull request?
    Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().
    
    private[spark] def getCallSite(): CallSite = {
      val callSite = Utils.getCallSite()
      CallSite(
        Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
        Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
      )
    }
    
    However, in some places Utils.withDummyCallSite(sc) is invoked to avoid
    the expensive thread dumps within getCallSite(). But Utils.getCallSite() is
    evaluated first, so the thread dumps are computed anyway. This has an impact
    when lots of RDDs are created (e.g. it takes close to 3-7 seconds when 1000+
    RDDs are present, which can be significant when the entire query runtime is
    on the order of 10-20 seconds).
    Creating this JIRA to consider evaluating getCallSite only when needed.
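
The idea of evaluating the call site only when needed can be sketched in isolation as follows. This is a minimal illustration, not the change made in this pull request: `CallSite` here is a local stand-in class, `expensiveCallSite` is a hypothetical placeholder for Utils.getCallSite(), and a plain `Map` stands in for the local properties. It relies only on the fact that the default argument of `getOrElse` is passed by name.

```scala
// Self-contained sketch. CallSite, expensiveCallSite and the props map are
// stand-ins for illustration, not Spark's actual classes or methods.
object LazyCallSiteSketch {
  final case class CallSite(shortForm: String, longForm: String)

  // Counts how often the costly stack walk would have run.
  var stackWalks = 0

  // Hypothetical placeholder for Utils.getCallSite().
  def expensiveCallSite(): CallSite = {
    stackWalks += 1
    CallSite("short from stack", "long from stack")
  }

  // Eager variant, shaped like the snippet above: the walk always runs,
  // even when both properties are already set.
  def eagerCallSite(props: Map[String, String]): CallSite = {
    val callSite = expensiveCallSite()
    CallSite(
      props.getOrElse("shortForm", callSite.shortForm),
      props.getOrElse("longForm", callSite.longForm))
  }

  // Deferred variant: `lazy val` plus the by-name default of getOrElse
  // means the walk runs at most once, and only if a property is missing.
  def lazyCallSite(props: Map[String, String]): CallSite = {
    lazy val callSite = expensiveCallSite()
    CallSite(
      props.getOrElse("shortForm", callSite.shortForm),
      props.getOrElse("longForm", callSite.longForm))
  }
}
```

With both properties set (as a dummy call site would do), the deferred variant never triggers the stack walk, while the eager variant pays for it on every call.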
    
    
    ## How was this patch tested?
    No new test cases were added. The following standalone test was run
    manually. Also built the entire Spark binary and tried a few SQL queries
    from TPC-DS and TPC-H on a multi-node cluster.
    
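    // Utils and SerializableConfiguration are private[spark], so this must be
    // compiled inside the org.apache.spark package.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{NullWritable, Writable}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.HadoopRDD
    import org.apache.spark.util.{SerializableConfiguration, Utils}

    def run(): Unit = {
      val conf = new SparkConf()
      val sc = new SparkContext("local[1]", "test-context", conf)
      val start: Long = System.currentTimeMillis()
      val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
      Utils.withDummyCallSite(sc) {
        // Large tables end up creating 5500 RDDs
        for (i <- 1 to 5000) {
          // Only RDD construction is timed; the RDD is never evaluated.
          val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
            classOf[NullWritable], classOf[Writable], 10)
        }
      }
      val end: Long = System.currentTimeMillis()
      println("Time taken : " + (end - start))
    }

    def main(args: Array[String]): Unit = {
      run()
    }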
    
    
    
    …CallSite() (rbalamohan)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajeshbalamohan/spark SPARK-14091

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11911
    
----
commit dba630b854d6fdb298f8ef7ed25acf497f0eeebe
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date:   2016-03-23T04:57:01Z

    SPARK-14091 [core] Consider improving performance of 
SparkContext.getCallSite() (rbalamohan)

----

