[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-37099: ----------------------------------- Assignee: jiaan.geng > Introduce a rank-based filter to optimize top-k computation > ----------------------------------------------------------- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.4.0 > Reporter: Ruifeng Zheng > Assignee: jiaan.geng > Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > in JD, we found that more than 90% usage of window function follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, existing physical plan is not optimum: > > 1, we should select local top-k records within each partitions, and then > compute the global top-k. this can help reduce the shuffle amount; > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on partitial dataset is always <= its final rank computed on > the whole dataset. so we can safely discard rows with partitial rank > k, > anywhere. > > > 2, skewed-window: some partition is skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org