[ https://issues.apache.org/jira/browse/SPARK-35563 ]
Sean R. Owen deleted comment on SPARK-35563:
--------------------------------------

    was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective post. [Rails Course|https://www.igmguru.com/digital-marketing-programming/ruby-on-rails-certification-training/]

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-35563
>                 URL: https://issues.apache.org/jira/browse/SPARK-35563
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>            Reporter: Robert Joseph Evans
>            Priority: Major
>              Labels: data-loss
>
> I think this impacts a lot more versions of Spark, but I don't know for sure
> because it takes a long time to test. As a part of doing corner case
> validation testing for spark rapids I found that if a window function has
> more than {{Int.MaxValue + 1}} rows the result is silently truncated to that
> many rows. I have only tested this on 3.0.2 with {{row_number}}, but I
> suspect it will impact others as well. This is a really rare corner case, but
> because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as b")
> spark.time(
>   df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn"))
>     .orderBy(desc("a"), desc("b"))
>     .select((col("rn") < 0).alias("dir"))
>     .groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64GiB of heap for the
> executor (I did it in local mode and it worked, but took forever to run)
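
The numbers in the report can be checked in plain Scala, without Spark or a 64GiB heap. The sketch below is only an illustration: the object name `RowCountBoundary` and the helper `wrappedRowNumber` are hypothetical, and reading the single `true` row as 32-bit wraparound of `rn` is an assumption that is consistent with the output table, not something the report states.

```scala
// Boundary arithmetic for the reproduction above, in plain Scala (no Spark).
object RowCountBoundary {
  // Rows generated by spark.range(Int.MaxValue.toLong + 100) in the report.
  val inputRows: Long = Int.MaxValue.toLong + 100

  // Row counts copied from the reported output table.
  val falseRows: Long = 2147483647L // rows where rn < 0 is false
  val trueRows: Long = 1L           // rows where rn < 0 is true
  val outputRows: Long = falseRows + trueRows

  // Rows present in the input but missing from the result.
  val droppedRows: Long = inputRows - outputRows

  // ASSUMPTION (not from the report): rn behaves like a 32-bit Int, so a
  // 1-based position is reduced to its low 32 bits, as Long.toInt does.
  def wrappedRowNumber(position: Long): Int = position.toInt
}
```

With these figures, `outputRows` equals `Int.MaxValue.toLong + 1` and `droppedRows` is the difference between the 2147483747 input rows and that total. Under the stated assumption, `wrappedRowNumber(Int.MaxValue.toLong + 1)` is negative, which would match the one row reported with `rn < 0`.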