Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22621#discussion_r222337687
  
    --- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
    @@ -517,4 +517,93 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
       test("writing data out metrics with dynamic partition: parquet") {
         testMetricsDynamicPartition("parquet", "parquet", "t1")
       }
    +
    +  test("SPARK-25602: range metrics can be wrong if the result rows are not 
fully consumed") {
    +    val df = spark.range(0, 30, 1, 2).toDF().filter('id % 3 === 0)
    +
    +    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true") {
    +      df.collect()
    +      df.queryExecution.executedPlan.foreach {
    +        case w: WholeStageCodegenExec =>
    +          w.child.foreach {
    +            case f: FilterExec => assert(f.metrics("numOutputRows").value 
== 10L)
    --- End diff --
    
    not a big issue, but if later we change things and these are not anymore 
here, we would not run the assert here. I would suggest to collect the 
`FilterExec` and the `RangeExec` and enforce that we collected 1 of both and 
then assert on them. What do you think?
    
    Moreover, nit: would it be possible to dedup the code here? The tests are 
very similar with codegen on and off, only collecting the two exec nodes 
differs...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to