[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22731


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread 10110346
Github user 10110346 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225362470
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

Considering that the default value of `spark.sql.parquet.columnarReaderBatchSize` is 4096, this change is better.


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225337806
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

Makes sense. In this case the behavior should be the same before and after this change, so it's fine either way.


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225333576
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

4096 is the default batch size of the batch reader in both ORC and Parquet. If users set the conf to a much smaller number, they will face a perf regression due to the extra overhead in many places. I do not think end users will do this.
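
For reference, a minimal sketch of how a user would shrink that batch size (assuming a `spark` SparkSession in scope; both keys default to 4096):

```scala
// Lowering the columnar batch size makes per-batch work, including the
// stats update discussed in this thread, run proportionally more often.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 1024L)
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", 1024L)
```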


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225325506
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

... I guess the only possible drawback is that if the number of records in a ColumnarBatch is pretty small, then this could cause it to update bytes read a lot more frequently than before. But if the number of records is large (>100), then this won't matter.
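
To make that frequency argument concrete, here is a small self-contained simulation; `updateBytesRead`, the 1000-record interval, and the 4096-row batch are stand-ins for Spark's internals, not its actual API:

```scala
object MetricsUpdateSketch {
  private val UpdateIntervalRecords = 1000L // stand-in for Spark's every-N-records interval
  private var recordsRead = 0L
  private var bytesReadUpdates = 0L

  // Stand-in for the (comparatively expensive) bytes-read refresh.
  private def updateBytesRead(): Unit = bytesReadUpdates += 1

  // Row-based scan: refresh only every N records.
  def onRow(): Unit = {
    if (recordsRead % UpdateIntervalRecords == 0) updateBytesRead()
    recordsRead += 1
  }

  // Columnar scan: refresh once per batch.
  def onBatch(numRows: Int): Unit = {
    updateBytesRead()
    recordsRead += numRows
  }

  def main(args: Array[String]): Unit = {
    (1 to 1000000).foreach(_ => onRow())
    println(s"row-based: $bytesReadUpdates updates") // 1000
    bytesReadUpdates = 0; recordsRead = 0
    (1 to (1000000 / 4096)).foreach(_ => onBatch(4096))
    println(s"columnar (4096 rows/batch): $bytesReadUpdates updates") // 244
  }
}
```

With the default batch size, the per-batch refresh fires less often than the old periodic one, so the drawback only appears with very small batches.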


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22731#discussion_r225262559
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
 // don't need to run this `if` for every record.
 val preNumRecordsRead = inputMetrics.recordsRead
 if (nextElement.isInstanceOf[ColumnarBatch]) {
+  incTaskInputMetricsBytesRead()
--- End diff --

I see, so always update when processing `ColumnarBatch`, but use the 
previous logic otherwise. That seems OK. It should still address the original 
problem.


---




[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

2018-10-15 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/22731

[SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch

## What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/22594 . This alternative avoids unneeded computation in the hot code path (sketched below).

- For the row-based scan, we keep the original logic.
- For the columnar scan, we just need to update the stats after each batch.
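
A condensed sketch of that control flow, pieced together from the diff under review; `incTaskInputMetricsBytesRead()` is the helper this PR adds per batch, while `currentIterator` and the every-N-records constant are assumed from the surrounding FileScanRDD code, so this is an illustration, not the verbatim patch:

```scala
val nextElement = currentIterator.next()
if (nextElement.isInstanceOf[ColumnarBatch]) {
  // Columnar scan: refresh the task's bytes-read stats once per batch.
  incTaskInputMetricsBytesRead()
  inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
} else {
  // Row-based scan: keep the original periodic refresh.
  if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
    incTaskInputMetricsBytesRead()
  }
  inputMetrics.incRecordsRead(1)
}
```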

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark udpateStatsFileScanRDD

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22731.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22731


commit 8731588d28f302e51095f7ed1a4331edc5233958
Author: gatorsmile 
Date:   2018-10-15T17:44:39Z

fix




---
