[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jianshi Huang updated SPARK-6201:
---------------------------------
Description:

Suppose we have the following table:
{code}
sqlc.jsonRDD(sc.parallelize(Seq("{\"a\": \"1\"}", "{\"a\": \"2\"}", "{\"a\": \"3\"}"))).registerTempTable("d")
{code}
The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}
Then,
{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
=>
Array([1], [2])
{code}
Here d.a and the constants 1, 2 are first cast to Double and then compared, as you can see in the plan:
{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
However, if I use
{code}
sql("select * from d where d.a in (1,2)").collect
{code}
the result is empty. The physical plan shows it is using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
It seems the INSET implementation in Spark SQL doesn't coerce types implicitly, whereas Hive does.

Jianshi
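For context, a minimal Scala sketch of why the INSET path misses while the OR-of-equals path matches. This is a hypothetical simplification, not Spark's actual InSet code: it only illustrates that set membership over uncoerced literals never matches a string-typed column value, while comparing after a common cast does.

```scala
// Hypothetical simplification of the two plans above.

// INSET-style check: the literal set keeps its Int elements,
// and membership is tested without any cast.
val inset: Set[Any] = Set(1, 2)
val columnValue: Any = "1"          // d.a is a string column
val insetHit = inset.contains(columnValue)   // false: Int vs String, no coercion

// OR-of-equals-style check: both sides are cast to Double first,
// mirroring CAST(a#155, DoubleType) = CAST(1, DoubleType).
val coerced = Set(1.0, 2.0)
val equalsHit = coerced.contains("1".toDouble)  // true after coercion

println(s"INSET: $insetHit, OR-of-equals: $equalsHit")
```

As a workaround until this is fixed, writing the literals with the column's type (e.g. {{d.a in ('1','2')}}) or casting the column explicitly should sidestep the problem.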
> INSET should coerce types
> -------------------------
>
>                 Key: SPARK-6201
>                 URL: https://issues.apache.org/jira/browse/SPARK-6201
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Jianshi Huang