Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19783#discussion_r154254145
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
    @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
       def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt)
         : AttributeMap[ColumnStat] = {
         val newColumnStats = originalMap.map { case (attr, oriColStat) =>
    -      // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows
    -      // decreases; otherwise keep it unchanged.
    -      val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter,
    -        newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount)
           val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat)
    +      val newNdv = if (colStat.distinctCount > 1) {
    --- End diff ---
    
    The old code does not work well for a couple of the new skewed-distribution 
tests. For example, test("cintHgm < 3") would fail, because it still calls the 
updateNdv() method to recompute newNdv from the row-count ratio even though, in 
reality, the predicate evaluation has already scaled the distinct count down to 1.
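
    To make the failure mode concrete, below is a minimal, self-contained Scala 
sketch. It is not the actual Spark source: the proportional-scaling formula 
stands in for EstimationUtils.updateNdv, and the names NdvUpdateSketch and 
ndvFromPredicate are made up for illustration.

        // Hypothetical stand-in for EstimationUtils.updateNdv: scale the
        // distinct count down in proportion to the row-count reduction.
        object NdvUpdateSketch {
          def updateNdv(oldNumRows: BigInt, newNumRows: BigInt, oldNdv: BigInt): BigInt =
            if (newNumRows < oldNumRows && oldNdv > 1) {
              // ceiling of oldNdv * newNumRows / oldNumRows, floored at 1
              ((oldNdv * newNumRows + oldNumRows - 1) / oldNumRows).max(BigInt(1))
            } else {
              oldNdv
            }

          def main(args: Array[String]): Unit = {
            // A predicate such as cintHgm < 3 on a skewed histogram can leave
            // only one distinct value, so the per-predicate estimate already
            // records ndv = 1.
            val ndvFromPredicate = BigInt(1)

            // Guard as in the new code (colStat.distinctCount > 1): only
            // rescale while ndv is above 1; otherwise keep the value the
            // predicate evaluation produced.
            val newNdv =
              if (ndvFromPredicate > 1) {
                updateNdv(oldNumRows = BigInt(1000), newNumRows = BigInt(100),
                  oldNdv = ndvFromPredicate)
              } else {
                ndvFromPredicate
              }
            println(newNdv) // prints 1: kept, not recomputed from the row ratio
          }
        }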


---
