[jira] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

2023-02-22 Thread Sean R. Owen (Jira)


[ https://issues.apache.org/jira/browse/SPARK-35563 ]


Sean R. Owen deleted comment on SPARK-35563:
--

was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective post.

[Rails Course|https://www.igmguru.com/digital-marketing-programming/ruby-on-rails-certification-training/]

> [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
> -------------------------------------------------------------------------------
>
> Key: SPARK-35563
> URL: https://issues.apache.org/jira/browse/SPARK-35563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
>  Labels: data-loss
>
> I think this impacts many more versions of Spark, but I don't know for sure
> because it takes a long time to test. While doing corner-case validation
> testing for spark-rapids, I found that if a window function operates over more
> than {{Int.MaxValue + 1}} rows, the result is silently truncated to that many
> rows. I have only tested this on 3.0.2 with {{row_number}}, but I suspect it
> will impact other window functions as well. This is a really rare corner case,
> but because it is silent data corruption I personally think it is quite serious.
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> val windowSpec = Window.partitionBy("a").orderBy("b")
> val df = spark.range(Int.MaxValue.toLong + 100).selectExpr("1 as a", "id as b")
> spark.time(df.select(col("a"), col("b"), row_number().over(windowSpec).alias("rn"))
>   .orderBy(desc("a"), desc("b"))
>   .select((col("rn") < 0).alias("dir"))
>   .groupBy("dir").count.show(20))
> +-----+----------+
> |  dir|     count|
> +-----+----------+
> |false|2147483647|
> | true|         1|
> +-----+----------+
> Time taken: 1139089 ms
>
> Int.MaxValue.toLong + 100
> res15: Long = 2147483747
> 2147483647L + 1
> res16: Long = 2147483648
> {code}
> I had to make sure that I ran the above with at least 64GiB of heap for the
> executor. (I did it in local mode and it worked, but it took forever to run.)
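
The single {{dir = true}} row above is consistent with plain 32-bit wraparound:
once a row number stored in an Int passes {{Int.MaxValue}}, the next increment
wraps to {{Int.MinValue}}. A minimal sketch of that JVM arithmetic (illustrative
only, not Spark's internal window code):

{code:scala}
// Illustrative only: an Int row counter silently wraps instead of failing.
var rn: Int = Int.MaxValue   // 2147483647, the largest representable row number
rn += 1                      // overflows silently
println(rn)                  // -2147483648 (Int.MinValue): a negative row number
println(rn < 0)              // true -- matching the single dir = true row above
{code}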



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows

2022-11-01 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-35563 ]


Jungtaek Lim deleted comment on SPARK-35563:
--

was (Author: JIRAUSER294516):
Alternately, simply perform the int overflow check. Personally, I don't think
it's a problem that Spark doesn't support really huge windows; I simply want to
prevent data loss or corruption.
Thank you.
[Salesforce Marketing Cloud Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]
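
A minimal sketch of the kind of overflow check that comment proposes, assuming a
hypothetical per-partition counter (the names here are illustrative and do not
reflect Spark's actual window-exec internals):

{code:scala}
// Hypothetical sketch, not Spark's implementation: fail loudly on Int overflow
// instead of letting the row number wrap to a negative value.
var rowNumber: Int = 0
def nextRowNumber(): Int = {
  // Math.addExact throws ArithmeticException when the sum overflows an Int
  rowNumber = Math.addExact(rowNumber, 1)
  rowNumber
}
{code}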
