[
https://issues.apache.org/jira/browse/SPARK-52873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bruce Robbins updated SPARK-52873:
----------------------------------
Labels: correctness (was: )
> Hint causes semi join results to vary
> -------------------------------------
>
> Key: SPARK-52873
> URL: https://issues.apache.org/jira/browse/SPARK-52873
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 3.4.1, 3.5.3, 3.5.5
> Reporter: Peter Connolly
> Priority: Major
> Labels: correctness
>
> I was running a query in production that was not getting the expected
> results. I suspect that this query is using an improper join type for a semi
> join with a non-equal predicate.
> While my production code has no query hint I am able to reproduce the issue
> consistently by doing a semi join with a hint in some sample code. In this
> sample code I create two dataframes, one of which has strings and another
> that I want to match on the first x characters using a startswith clause. I
> realize that the substring equality join is logically redundant and could be
> theoretically removed, but it helps greatly with performance in production. I
> then run two queries that are nearly identical to find records that match.
> The only difference between the queries is that the second query has a join
> hint. Despite being the same logical query they produce different output.
> {code:java}
> import spark.implicits._
> val geohashes = Seq(
> ("9ykchgz95z"),
> ("abckd3kdf1"),
> ).toDF("geohash")
> .repartition(2)
> geohashes.createOrReplaceTempView("geohashes")
> val geohashesToInclude = Seq(
> ("9ykchgz91"),
> ("9ykchgz92"),
> ("9ykchgz93"),
> ("9ykchgz94"),
> ("9ykchgz95"),
> ("9ykchgz96"),
> ("9ykchgz97"),
> ("9ykchgz98"),
> ("9ykchgz99"),
> ("9ykchgz90"),
> ).toDF("geohash_prefix")
> .repartition(10)
> geohashesToInclude.createOrReplaceTempView("geohashes_to_include")
> spark.sql("SELECT * FROM geohashes g LEFT SEMI JOIN geohashes_to_include i ON
> SUBSTRING(g.geohash, 1, 7) = SUBSTRING(i.geohash_prefix, 1, 7) AND
> STARTSWITH(g.geohash, i.geohash_prefix)").show()
> +----------+
> | geohash|
> +----------+
> |9ykchgz95z|
> +----------+
> spark.sql("SELECT /*+ SHUFFLE_HASH(i) */ * FROM geohashes g LEFT SEMI JOIN
> geohashes_to_include i ON SUBSTRING(g.geohash, 1, 7) =
> SUBSTRING(i.geohash_prefix, 1, 7) AND STARTSWITH(g.geohash,
> i.geohash_prefix)").show()
> +-------+
> |geohash|
> +-------+
> +-------+ {code}
> I produced the above sample on spark 3.5.3, but I think I have observed
> similar issues in production code in spark 3.4.1 and 3.5.5 (the latest
> version on EMR).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]