GitHub user dujunling opened a pull request: https://github.com/apache/spark/pull/22232
[SPARK-25237][SQL] Remove updateBytesReadWithFileSize because we use Hadoop FileSystem statistics to update the inputMetrics

## What changes were proposed in this pull request?

In FileScanRDD, we update inputMetrics's bytesRead using updateBytesRead every 1000 rows and again when the iterator is closed. However, on close we also invoke updateBytesReadWithFileSize, which increases inputMetrics's bytesRead by the file's full length. As a result, bytesRead is wrong for queries with a limit, such as `select * from table limit 1`: the whole file size is added even though only a few rows were actually read. Since Hadoop 2.5 and earlier are no longer supported, we always obtain bytesRead from the Hadoop FileSystem statistics rather than from the file's length, so updateBytesReadWithFileSize can be removed.

## How was this patch tested?

Manual test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dujunling/spark fileScanRddInput

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22232.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22232

----

commit 0f75257b50a611e069d406da8d72225bb4e73b51
Author: dujunling <dujunling@...>
Date: 2018-08-25T06:20:35Z

    remove updateBytesReadWithFileSize because we use Hadoop FileSystem statistics to update the inputMetrics

----
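The double-counting the PR describes can be sketched in a few lines. This is a hedged illustration, not the actual Spark source: `InputMetrics`, `incBytesRead`, `beforeFix`, and `afterFix` are hypothetical stand-ins that model the metric-update pattern, with the two update paths marked by comments naming the real methods they correspond to.

```java
// Hedged sketch (not actual Spark code): why adding the file's full length on
// iterator close overcounts bytesRead for a query like `select * from t limit 1`.
public class BytesReadSketch {
    // Hypothetical stand-in for Spark's InputMetrics counter.
    static final class InputMetrics {
        long bytesRead = 0L;
        void incBytesRead(long v) { bytesRead += v; }
    }

    // Before the fix: the statistics-based update runs, and on close the
    // file-size-based update adds the whole file length on top of it.
    static long beforeFix(long statsBytes, long fileLength) {
        InputMetrics m = new InputMetrics();
        m.incBytesRead(statsBytes);  // models updateBytesRead (FS statistics)
        m.incBytesRead(fileLength);  // models updateBytesReadWithFileSize (removed by this PR)
        return m.bytesRead;
    }

    // After the fix: only the Hadoop FileSystem statistics are counted.
    static long afterFix(long statsBytes) {
        InputMetrics m = new InputMetrics();
        m.incBytesRead(statsBytes);
        return m.bytesRead;
    }

    public static void main(String[] args) {
        long fileLength = 1_000_000L; // full file size on disk
        long statsBytes = 4_096L;     // bytes the FS statistics report after `limit 1`
        System.out.println("before fix: " + beforeFix(statsBytes, fileLength));
        System.out.println("after fix:  " + afterFix(statsBytes));
    }
}
```

With a 1 MB file and a limit query that touches only 4096 bytes, the pre-fix path reports roughly the whole file size while the post-fix path reports only what was actually read.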