Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22621#discussion_r222337687 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -517,4 +517,93 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared test("writing data out metrics with dynamic partition: parquet") { testMetricsDynamicPartition("parquet", "parquet", "t1") } + + test("SPARK-25602: range metrics can be wrong if the result rows are not fully consumed") { + val df = spark.range(0, 30, 1, 2).toDF().filter('id % 3 === 0) + + withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true") { + df.collect() + df.queryExecution.executedPlan.foreach { + case w: WholeStageCodegenExec => + w.child.foreach { + case f: FilterExec => assert(f.metrics("numOutputRows").value == 10L) --- End diff -- not a big issue, but if later we change things and these are not anymore here, we would not run the assert here. I would suggest to collect the `FilterExec` and the `RangeExec` and enforce that we collected 1 of both and then assert on them. What do you think? Moreover, nit: would it be possible to dedup the code here? The tests are very similar with codegen on and off, only collecting the two exec nodes differs...
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org