[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22731

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225362470

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -106,15 +106,16 @@ class FileScanRDD(
           // don't need to run this `if` for every record.
           val preNumRecordsRead = inputMetrics.recordsRead
           if (nextElement.isInstanceOf[ColumnarBatch]) {
+            incTaskInputMetricsBytesRead()
--- End diff --

Considering that the default value of `spark.sql.parquet.columnarReaderBatchSize` is 4096, this change is better.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225337806

Makes sense. In this case the behavior should be the same before and after this change, and it's therefore fine, too.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225333576

4096 is the default batch size of the batch readers in both ORC and Parquet. If users set the conf to a much smaller number, they will face a perf regression due to the extra overhead in many places. I do not think end users will do this.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225325506

... I guess the only possible drawback is that if the number of records in a ColumnarBatch is pretty small, then this could cause it to update bytes read a lot more frequently than before. But if the number of records is large (>100) then this won't matter.
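The trade-off described above can be made concrete with a little arithmetic. This is a hypothetical illustration, not Spark code; the record count and batch sizes are chosen only for the sake of the comparison with the old once-per-1000-records refresh:

```scala
// Hypothetical arithmetic (not Spark source): how many times bytesRead is
// refreshed while scanning 1,000,000 records under each strategy.
object RefreshCounts extends App {
  val totalRecords = 1000000L

  // Old row-based behavior: refresh the byte counter every 1000 records.
  val perThousand = totalRecords / 1000            // 1000 refreshes

  // New columnar behavior: refresh once per ColumnarBatch.
  def perBatch(batchSize: Long): Long = totalRecords / batchSize

  println(perThousand)      // 1000
  println(perBatch(4096))   // 244  -> fewer refreshes than before at the default
  println(perBatch(100))    // 10000 -> far more frequent, the drawback noted above
}
```

So with the default batch size of 4096 the per-batch update is actually rarer than the old periodic refresh; only an unusually small batch size inverts that.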
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22731#discussion_r225262559

I see, so always update when processing `ColumnarBatch`, but use the previous logic otherwise. That seems OK. It should still address the original problem.
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/22731

[SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch

## What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/22594. This alternative avoids unneeded computation in the hot code path:
- For the row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark udpateStatsFileScanRDD

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22731.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22731

commit 8731588d28f302e51095f7ed1a4331edc5233958
Author: gatorsmile
Date: 2018-10-15T17:44:39Z

    fix
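The two-path approach described in the PR can be sketched as follows. This is a minimal, self-contained model of the update strategy, not the actual FileScanRDD source: `ColumnarBatch`, `InputMetrics`, the `currentBytes` callback, and the 1000-record interval are all stand-ins modeled on the discussion in this thread.

```scala
// Self-contained sketch (assumed shape, not Spark source) of the stats-update
// strategy: columnar scans refresh the byte counter once per batch, while the
// row-based path keeps the original periodic refresh.
object StatsSketch {
  final case class ColumnarBatch(numRows: Int)   // stand-in for the real class

  class InputMetrics {
    var recordsRead: Long = 0L
    var bytesRead: Long = 0L
    var bytesReadUpdates: Int = 0   // counts how often bytesRead was refreshed
  }

  val UpdateInterval = 1000   // row-based refresh interval, as in the old code

  def onNextElement(next: Any, m: InputMetrics, currentBytes: () => Long): Unit =
    next match {
      case batch: ColumnarBatch =>
        // Columnar path: refresh once per batch. With the default batch size
        // of 4096 rows this is cheap and avoids per-record checks.
        m.bytesRead = currentBytes(); m.bytesReadUpdates += 1
        m.recordsRead += batch.numRows
      case _ =>
        // Row-based path: keep the original way, refreshing bytesRead only
        // every UpdateInterval records.
        m.recordsRead += 1
        if (m.recordsRead % UpdateInterval == 0) {
          m.bytesRead = currentBytes(); m.bytesReadUpdates += 1
        }
    }
}
```

For example, feeding this sketch ten batches of 4096 rows yields 40960 records read and exactly ten byte-counter refreshes, whereas 2500 individual rows trigger a refresh only at records 1000 and 2000.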