GitHub user dujunling opened a pull request:

    https://github.com/apache/spark/pull/22232

    [SPARK-25237][SQL] Remove updateBytesReadWithFileSize because we use Hadoop FileSystem statistics to update the inputMetrics
    
    ## What changes were proposed in this pull request?
    
    In FileScanRDD, we update inputMetrics's bytesRead by calling updateBytesRead every 1000 rows and when the iterator is closed.
    
    However, when the iterator is closed we also invoke updateBytesReadWithFileSize, which increases inputMetrics's bytesRead by the file's length.
    
    As a result, inputMetrics's bytesRead is wrong for queries that stop reading early, such as `select * from table limit 1`.
    
    Because Hadoop 2.5 and earlier are no longer supported, we always obtain bytesRead from Hadoop FileSystem statistics rather than from the file's length, so updateBytesReadWithFileSize can be removed.
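    
    The double counting can be illustrated with a small, hypothetical Scala sketch (the object and method names below are illustrative, not the actual FileScanRDD code): for a `limit 1` query that reads only a few bytes of a large file, the old close path reports almost the whole file length.
    
    ```scala
    // Hypothetical sketch (not Spark source): models how bytesRead is
    // computed on iterator close, before and after removing
    // updateBytesReadWithFileSize.
    object BytesReadSketch {
      // statsBytes:   bytes actually read, per Hadoop FileSystem statistics
      // fileLength:   total length of the file being scanned
      // withFileSize: true models the old close path, which also adds the
      //               file's length on top of the statistics-based value
      def simulate(statsBytes: Long, fileLength: Long, withFileSize: Boolean): Long = {
        var bytesRead = statsBytes                  // updateBytesRead on close
        if (withFileSize) bytesRead += fileLength   // old updateBytesReadWithFileSize
        bytesRead
      }
    }
    ```
    
    For example, with statsBytes = 10 and fileLength = 1000, the old path reports 1010 bytes read even though only 10 bytes were actually consumed; the new path reports 10.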
    
    ## How was this patch tested?
    
    Manual test.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dujunling/spark fileScanRddInput

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22232
    
----
commit 0f75257b50a611e069d406da8d72225bb4e73b51
Author: dujunling <dujunling@...>
Date:   2018-08-25T06:20:35Z

    remove updateBytesReadWithFileSize because we use Hadoop FileSystem statistics to update the inputMetrics

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
