GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/21018

    [SPARK-23880][SQL] Do not trigger any jobs for caching data

    ## What changes were proposed in this pull request?
    This pr fixes `cache` so that it no longer triggers any jobs.
    For example, in the current master, the operation below triggers an 
actual job:
    ```
    val df = spark.range(10000000000L)
      .filter('id > 1000)
      .orderBy('id.desc)
      .cache()
    ```
    This triggers a job even though the cache should be lazy. The problem is 
that, when creating `InMemoryRelation`, we build the RDD, which calls 
`SparkPlan.execute` and may trigger jobs, e.g., a sampling job for a range 
partitioner or a broadcast job.
    
    With this fix, we do not build the `RDD` in the constructor of 
`InMemoryRelation`. Instead, `InMemoryTableScanExec` materializes the cache 
and updates the entry in `CacheManager`. 
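    The deferred materialization can be sketched in plain Scala with a 
`lazy val` (a minimal sketch using a hypothetical `Relation` class, not 
Spark's actual `InMemoryRelation` internals):

    ```
    object LazyCacheSketch {
      // Stand-in for InMemoryRelation: holds a recipe for building the
      // cached buffer instead of building it in the constructor.
      final class Relation(build: () => Seq[Long]) {
        var buildCount = 0  // how many times the buffer was materialized
        // `lazy val` defers the (potentially job-triggering) build until
        // the first scan touches the buffer.
        lazy val buffer: Seq[Long] = { buildCount += 1; build() }
      }

      def main(args: Array[String]): Unit = {
        val rel = new Relation(() => (1L to 5L).filter(_ > 2))
        assert(rel.buildCount == 0)             // cache() analogue: no job yet
        assert(rel.buffer == Seq(3L, 4L, 5L))   // first scan materializes
        rel.buffer                              // later scans reuse the entry
        assert(rel.buildCount == 1)
      }
    }
    ```

    In the actual patch the role of the first scan is played by 
`InMemoryTableScanExec`, which also updates the entry in `CacheManager`.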
    
    ## How was this patch tested?
    Added tests in `CachedTableSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-23880

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21018
    
----
commit 01d75d789c45f73bd999106dfc6f29cdc3050ce9
Author: Takeshi Yamamuro <yamamuro@...>
Date:   2018-04-09T09:30:10Z

    Fix

----


---
