[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22594#discussion_r222884751

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -104,12 +104,14 @@ class FileScanRDD(
             val nextElement = currentIterator.next()
             // TODO: we should have a better separation of row based and batch based scan, so that we
             // don't need to run this `if` for every record.
    +        val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
               inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
             } else {
               inputMetrics.incRecordsRead(1)
             }
    -        if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    --- End diff --

    I think the issue is that at line 108 this value can be incremented by more than 1, so it may skip over the count that is an exact multiple of `UPDATE_INPUT_METRICS_INTERVAL_RECORDS`. If that code path is common, the metric might rarely, if ever, get updated. The change instead checks whether the increment carried the value past a higher multiple of `UPDATE_INPUT_METRICS_INTERVAL_RECORDS`, which sounds more correct. But yes, this needs a PR description and ideally a small test.
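[Editor's note: a minimal Scala sketch of the interval-crossing check srowen describes, not the actual patch. It assumes the surrounding FileScanRDD context -- `preNumRecordsRead` captured just before the increment -- and a hypothetical `updateBytesRead()` refresh standing in for whatever the real body runs when the guard fires.]

    // Compare which interval-sized "bucket" the counter was in before and
    // after the increment, instead of testing for an exact multiple.
    val interval = SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS
    if (preNumRecordsRead / interval != inputMetrics.recordsRead / interval) {
      // The counter passed at least one multiple of the interval since the
      // last check, even if it never landed on one exactly.
      updateBytesRead() // hypothetical refresh; the real body may differ
    }

Integer division puts each count into an interval-sized bucket, so any increment -- batch-sized or not -- that moves the counter into a new bucket triggers exactly one refresh.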
[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22594#discussion_r222875551

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -104,12 +104,14 @@ class FileScanRDD(
             val nextElement = currentIterator.next()
             // TODO: we should have a better separation of row based and batch based scan, so that we
             // don't need to run this `if` for every record.
    +        val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
               inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
             } else {
               inputMetrics.incRecordsRead(1)
             }
    -        if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    --- End diff --

    The original goal here was to avoid updating the metric on every record, because doing so is too expensive. I am not sure what the goal of your change is. Could you write a test case in SQLMetricsSuite?
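[Editor's note: a rough standalone sketch of the kind of check such a test could make -- not an actual SQLMetricsSuite case, which would extend the suite's shared-session helpers. It writes a small Parquet table, reads it back through the vectorized (ColumnarBatch) path, and asserts via a SparkListener that bytesRead was reported. The object name and sleep-based synchronization are illustrative only.]

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    object BatchBytesReadCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]").appName("BatchBytesReadCheck").getOrCreate()
        val path = java.nio.file.Files
          .createTempDirectory("metrics").resolve("t").toString
        spark.range(0, 100000).write.parquet(path)

        // Accumulate input metrics from finished tasks.
        var bytesRead = 0L
        spark.sparkContext.addSparkListener(new SparkListener {
          override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
            Option(taskEnd.taskMetrics).foreach { m =>
              bytesRead += m.inputMetrics.bytesRead
            }
          }
        })

        // With default settings, Parquet is read through the vectorized
        // (batch) reader, i.e. the code path this PR touches.
        spark.read.parquet(path).collect()

        // Listener delivery is asynchronous; a real suite would use its
        // wait helpers instead of sleeping.
        Thread.sleep(2000)
        assert(bytesRead > 0, s"expected bytesRead > 0, got $bytesRead")
        spark.stop()
      }
    }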
[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.
GitHub user 10110346 opened a pull request:

    https://github.com/apache/spark/pull/22594

    [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.

    ## What changes were proposed in this pull request?

    When reading in batches, the number of bytes read cannot be updated as expected: the record count is incremented by a whole ColumnarBatch at a time, so the check that the count equals an exact multiple of the update interval can be skipped indefinitely.

    ## How was this patch tested?

    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/10110346/spark inputMetrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22594.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22594

commit e589e1ef83418a485c9d55a72209c0c86cf7b044
Author: liuxian
Date: 2018-09-30T09:14:20Z

    fix
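[Editor's note: a self-contained Scala sketch of the symptom, assuming the update interval is 1000 records (the value of `SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS`) and an illustrative batch size of 4096 rows. When a ColumnarBatch bumps the counter by its row count, the running total steps over almost every exact multiple of the interval, so the pre-patch `% interval == 0` guard almost never fires.]

    object MetricsSkipDemo {
      def main(args: Array[String]): Unit = {
        val interval = 1000L  // assumed UPDATE_INPUT_METRICS_INTERVAL_RECORDS
        val batchSize = 4096L // rows per ColumnarBatch; illustrative
        var recordsRead = 0L
        var updates = 0
        for (_ <- 1 to 1000) {
          recordsRead += batchSize
          // The pre-patch guard: fires only when the running total lands
          // exactly on a multiple of the interval.
          if (recordsRead % interval == 0) updates += 1
        }
        // Prints: records = 4096000, metric updates = 8 -- the guard fired
        // only when 4096 * k happened to be divisible by 1000 (k a
        // multiple of 125), despite reading over four million records.
        println(s"records = $recordsRead, metric updates = $updates")
      }
    }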