[ https://issues.apache.org/jira/browse/SPARK-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-17913:
----------------------------
    Assignee: Wenchen Fan

> Filter/join expressions can return incorrect results when comparing strings to longs
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-17913
>                 URL: https://issues.apache.org/jira/browse/SPARK-17913
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Ming Beckwith
>            Assignee: Wenchen Fan
>              Labels: release_notes
>             Fix For: 2.2.0
>
>
> Reproducer:
> {code}
>   case class E(subject: Long, predicate: String, objectNode: String)
>   def test(sc: SparkContext) = {
>     val sqlContext: SQLContext = new SQLContext(sc)
>     import sqlContext.implicits._
>     val broken = List(
>       (19157170390056969L, "right", 19157170390056969L),
>       (19157170390056973L, "wrong", 19157170390056971L),
>       (19157190254313477L, "wrong", 19157190254313475L),
>       (19157180859056133L, "wrong", 19157180859056131L),
>       (19157170390056969L, "number", 161),
>       (19157170390056971L, "string", "a string"),
>       (19157190254313475L, "string", "another string"),
>       (19157180859056131L, "number", 191)
>     )
>     val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, b._3.toString)).toDF()
>     val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
>     val fixed = brokenDF.filter(brokenDF("subject").cast("string") === brokenDF("objectNode"))
>     println("***** incorrect filter results *****")
>     println(brokenFilter.show())
>     println("***** correct filter results *****")
>     println(fixed.show())
>     println("***** both sides cast to double *****")
>     println(brokenFilter.explain())
>   }
> Broken filter returns:
> +-----------------+---------+-----------------+
> |          subject|predicate|       objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969|    right|19157170390056969|
> |19157170390056973|    wrong|19157170390056971|
> |19157190254313477|    wrong|19157190254313475|
> |19157180859056133|    wrong|19157180859056131|
> +-----------------+---------+-----------------+
> {code}
> The physical plan shows both sides of the expression are being cast to Double
> before evaluation. So while comparing numbers to a string number appears to
> work in many cases, when the numbers are sufficiently large and close
> together there is enough loss of precision to cause incorrect results.
> {code}
> == Physical Plan ==
> Filter (cast(subject#0L as double) = cast(objectNode#2 as double))
> After casting the left side into strings, the filter returns the expected
> result:
> +-----------------+---------+-----------------+
> |          subject|predicate|       objectNode|
> +-----------------+---------+-----------------+
> |19157170390056969|    right|19157170390056969|
> +-----------------+---------+-----------------+
> {code}
> Expected behavior in this case is probably to choose one side and cast the
> other (compare string to string or long to long) instead of using a data type
> with less precision.
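
The precision loss described in the issue can be reproduced without Spark. The following is a minimal sketch in plain Scala, not Spark's own cast implementation: the object name `LongStringComparison` and its three helper methods are hypothetical and exist only for illustration. It assumes that casting a long or a numeric string to double behaves like Scala's `toDouble`, which rounds to the nearest representable double. Between 2^54 and 2^55 adjacent doubles are 4 apart, so the reproducer's values ending in ...973 and ...971 both round to the same double.

```scala
import scala.util.Try

// Hypothetical helper object, for illustration only (not part of Spark).
object LongStringComparison {

  // Mimics the plan in the issue: both sides are converted to Double first.
  // Non-numeric strings yield no match instead of throwing.
  def equalAsDouble(subject: Long, objectNode: String): Boolean =
    Try(objectNode.toDouble).toOption.contains(subject.toDouble)

  // The workaround from the issue: cast the long side to a string.
  def equalAsString(subject: Long, objectNode: String): Boolean =
    subject.toString == objectNode

  // The other option the reporter suggests: compare long to long,
  // treating strings that do not parse as a Long as non-matching.
  def equalAsLong(subject: Long, objectNode: String): Boolean =
    Try(objectNode.trim.toLong).toOption.contains(subject)

  def main(args: Array[String]): Unit = {
    // One of the "wrong" rows from the reproducer.
    val subject = 19157170390056973L
    val objectNode = "19157170390056971"

    println(s"as double: ${equalAsDouble(subject, objectNode)}") // true (false positive)
    println(s"as string: ${equalAsString(subject, objectNode)}") // false
    println(s"as long:   ${equalAsLong(subject, objectNode)}")   // false
  }
}
```

Both the string comparison and the long comparison keep all the digits of the values, which is why either would avoid the false matches seen in the broken filter.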