[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-46992:
-------------------------------------
Labels: correctness (was: )
> Inconsistent results with 'sort', 'cache', and AQE.
> ---------------------------------------------------
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness
>
>
> With AQE enabled, having 'sort' in the plan changes the result of 'sample'
> once the DataFrame is cached.
> Moreover, after caching, 'collect' still returns the same records as the
> non-cached run, which is inconsistent with 'count' and 'show'.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
> count: 2
> collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
> count: 3
> collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
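>
> A possible way to narrow this down, sketched here as an assumption rather
> than a confirmed diagnosis: run the same steps once with AQE enabled and once
> with it disabled via the spark.sql.adaptive.enabled setting, and compare the
> printed lines. The helper names sampled and report are hypothetical, and no
> particular output is claimed.
> {code:scala}
> // Assumes a spark-shell session where `spark` is already defined,
> // as in the reproduction script.
> import spark.implicits._
>
> // Hypothetical helper: builds a fresh DataFrame on every call so that
> // no previously planned query is reused between runs.
> def sampled() = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
>
> // Hypothetical helper: prints results before and after caching.
> def report(label: String): Unit = {
>   val df = sampled()
>   val before = df.collect().mkString(" ")
>   df.cache().count() // materialize the cache
>   val cachedCount = df.count()
>   val cachedCollect = df.collect().mkString(" ")
>   println(s"$label: non-cached collect = $before; " +
>     s"cached count = $cachedCount; cached collect = $cachedCollect")
>   df.unpersist()
> }
>
> val aqeKey = "spark.sql.adaptive.enabled"
> // Remember the current setting so it can be restored afterwards.
> val original = spark.conf.get(aqeKey, "true")
> Seq("true", "false").foreach { value =>
>   spark.conf.set(aqeKey, value)
>   report(s"AQE=$value")
> }
> spark.conf.set(aqeKey, original)
> {code}
> If the cached count and the cached collect agree only in the AQE=false run,
> that would point at the interaction with AQE described above.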