[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.

2018-10-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22594#discussion_r222884751
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -104,12 +104,14 @@ class FileScanRDD(
     val nextElement = currentIterator.next()
     // TODO: we should have a better separation of row based and batch based scan, so that we
     // don't need to run this `if` for every record.
+    val preNumRecordsRead = inputMetrics.recordsRead
     if (nextElement.isInstanceOf[ColumnarBatch]) {
       inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
     } else {
       inputMetrics.incRecordsRead(1)
     }
-    if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
--- End diff --

I think the issue is that, at line 108, this value can be incremented by more than 1 at a time, so it can skip right over the count that is an exact multiple of `UPDATE_INPUT_METRICS_INTERVAL_RECORDS`. If that code path is common, the metric might rarely, if ever, get updated. The change instead checks whether the increment carried the value past a higher multiple of `UPDATE_INPUT_METRICS_INTERVAL_RECORDS`, which sounds more correct. But yeah, this needs a description and ideally a little test.
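To make the skipping concrete, here is a minimal, self-contained Scala sketch; the interval and batch size below are illustrative stand-ins, not Spark's actual constants:

```scala
object MetricsIntervalSketch {
  // Stand-in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
  val INTERVAL = 1000L

  def main(args: Array[String]): Unit = {
    var recordsRead = 0L
    val batchSize = 4096L // a plausible ColumnarBatch row count
    var moduloHits = 0
    var crossingHits = 0

    while (recordsRead < 100000L) {
      val pre = recordsRead
      recordsRead += batchSize
      // Old guard: fires only if the running total lands exactly on a multiple.
      if (recordsRead % INTERVAL == 0) moduloHits += 1
      // Crossing guard: fires whenever the increment passes one or more multiples.
      if (pre / INTERVAL != recordsRead / INTERVAL) crossingHits += 1
    }
    // With these numbers no multiple of 4096 reached here is also a multiple
    // of 1000, so the modulo guard never fires, while the crossing guard
    // fires on every batch.
    println(s"modulo guard fired $moduloHits times")     // 0
    println(s"crossing guard fired $crossingHits times") // 25
  }
}
```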


---


[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.

2018-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22594#discussion_r222875551
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -104,12 +104,14 @@ class FileScanRDD(
     val nextElement = currentIterator.next()
     // TODO: we should have a better separation of row based and batch based scan, so that we
     // don't need to run this `if` for every record.
+    val preNumRecordsRead = inputMetrics.recordsRead
    if (nextElement.isInstanceOf[ColumnarBatch]) {
       inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
     } else {
       inputMetrics.incRecordsRead(1)
     }
-    if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
--- End diff --

The original goal here was to avoid updating the metric on every record, because that is too expensive. I am not sure what the goal of your change is. Could you write a test case in SQLMetricsSuite?
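For what it's worth, a rough sketch of the kind of regression test that could go into SQLMetricsSuite is below. It assumes the `test`/`withTempPath`/`spark` helpers from Spark's shared test traits; also note that task-end totals may already be reconciled when a task finishes, so the real test might need to sample metrics mid-task (e.g. from executor heartbeat updates) rather than only at task end:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

test("bytesRead is updated when scanning columnar batches") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    spark.range(0, 100000).write.parquet(path)

    var bytesRead = 0L
    val listener = new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        bytesRead += taskEnd.taskMetrics.inputMetrics.bytesRead
      }
    }
    spark.sparkContext.addSparkListener(listener)
    try {
      // Parquet goes through the vectorized (ColumnarBatch) read path here.
      spark.read.parquet(path).collect()
      // listenerBus is package-private but accessible from Spark's own tests.
      spark.sparkContext.listenerBus.waitUntilEmpty(10000L)
      assert(bytesRead > 0)
    } finally {
      spark.sparkContext.removeSparkListener(listener)
    }
  }
}
```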


---


[GitHub] spark pull request #22594: [MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.

2018-09-30 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22594

[MINOR][SQL] When batch reading, the number of bytes can not be updated as expected.

## What changes were proposed in this pull request?

When reading in batch mode, `inputMetrics.recordsRead` is incremented by a whole `ColumnarBatch` of rows at a time, so the `recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0` guard can be stepped over without ever being true, and the number of bytes read is not updated as expected.
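The `+` line that replaces the removed guard is truncated from the diffs quoted above, so the following is only a plausible reconstruction of the fix, based on the added `preNumRecordsRead` variable and the reviewers' description; treating `updateBytesRead` as the guarded body is likewise an assumption:

```scala
// Plausible reconstruction (NOT necessarily the merged code): `interval`
// stands in for SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS and
// `updateBytesRead` for FileScanRDD's private bytes-read refresh.
def recordBatchRead(
    recordsReadBefore: Long,
    rowsInBatch: Int,
    interval: Long,
    updateBytesRead: () => Unit): Long = {
  val recordsReadAfter = recordsReadBefore + rowsInBatch
  // Unlike `recordsReadAfter % interval == 0`, this fires whenever the
  // counter crosses one or more multiples of the interval, even when a
  // batch-sized jump skips past the exact multiple.
  if (recordsReadBefore / interval != recordsReadAfter / interval) {
    updateBytesRead()
  }
  recordsReadAfter
}
```

With per-row reads (`rowsInBatch = 1`) the crossing check fires exactly at each multiple, matching the old behavior.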

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark inputMetrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22594


commit e589e1ef83418a485c9d55a72209c0c86cf7b044
Author: liuxian 
Date:   2018-09-30T09:14:20Z

fix




---
