[ https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-44448: ------------------------------- Labels: correctness (was: ) > Wrong results for dense_rank() <= k from InferWindowGroupLimit and > DenseRankLimitIterator > ----------------------------------------------------------------------------------------- > > Key: SPARK-44448 > URL: https://issues.apache.org/jira/browse/SPARK-44448 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.5.0 > Reporter: Jack Chen > Assignee: Jack Chen > Priority: Major > Labels: correctness > Fix For: 3.5.0 > > > Top-k filters on a dense_rank() window function return wrong results, due to > a bug in optimization InferWindowGroupLimit, specifically in the code for > DenseRankLimitIterator, introduced in > https://issues.apache.org/jira/browse/SPARK-37099. > Repro: > {code:java} > create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2, > 1), (2, 1), (2, 2); > select * from (select *, dense_rank() over (partition by p order by o) as rnk > from t1) where rnk = 1;{code} > Spark result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1]{code} > Correct result: > {code:java} > [1,1,1] > [1,1,1] > [2,1,1] > [2,1,1]{code} > > The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state > properly when transitioning from one window partition to the next. {{reset}} > only resets {{{}rank = 0{}}}, what it is missing is to reset > {{{}currentRankRow = null{}}}. This means that when processing the second and > later window partitions, the rank incorrectly gets incremented based on > comparing the ordering of the last row of the previous partition to the first > row of the new partition. > This means that a dense_rank window func that has more than one window > partition and more than one row with dense_rank = 1 in the second or later > partitions can give wrong results when optimized. > ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the > first row in the new partition will try to increment rank, but increment it > by the value of count which is 0, so it happens to work by accident). > Unfortunately, tests for the optimization only had a single row per rank, so > did not catch the bug as the bug requires multiple rows per rank. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org