GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/20415

    [SPARK-23247][SQL]combines Unsafe operations and statistics operations in 
Scan Data Source

    ## What changes were proposed in this pull request?
    
    Currently, we scan the execution plan of the data source, first the unsafe 
operation of each row of data, and then re traverse the data for the count of 
rows. In terms of performance, this is not necessary. this PR combines the two 
operations and makes statistics on the number of rows while performing the 
unsafe operation.
    
    Before modified,
    
    ```
    val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
    val proj = UnsafeProjection.create(schema)
     proj.initialize(index)
    iter.map(proj)
    }
    
    val numOutputRows = longMetric("numOutputRows")
    unsafeRow.map { r =>
    numOutputRows += 1
     r
    }
    
    
    ```
    After modified,
    
        val numOutputRows = longMetric("numOutputRows")
    
        rdd.mapPartitionsWithIndexInternal { (index, iter) =>
          val proj = UnsafeProjection.create(schema)
          proj.initialize(index)
          iter.map( r => {
            numOutputRows += 1
            proj(r)
          })
        }
    
    ## How was this patch tested?
    
    the existed test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark DataSourceScanExec

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20415
    
----
commit e3e09d98072bd39328a4e7d4de1ddd38594c6232
Author: caoxuewen <cao.xuewen@...>
Date:   2018-01-27T06:27:37Z

    combines Unsafe operations and statistics operations in Scan Data Source

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to