[ https://issues.apache.org/jira/browse/SPARK-12940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Wheeler updated SPARK-12940:
----------------------------------
    Attachment: spark-12940.txt

These commands can be run in the spark shell to duplicate this bug. I assume in this example that /tmp is your temp directory.

> Partition field in Spark SQL WHERE clause causing Exception
> ------------------------------------------------------------
>
>                 Key: SPARK-12940
>                 URL: https://issues.apache.org/jira/browse/SPARK-12940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>         Environment: AWS EMR 4.2, OSX
>            Reporter: Brian Wheeler
>         Attachments: spark-12940.txt
>
>
> I have partitioned Parquet that I am trying to query with Spark SQL. When I involve a partition column in the {{WHERE}} clause together with {{OR}}, I get an exception.
> I have hit this issue when using spark-submit on a cluster where the Parquet was created externally and registered externally with a JDBC-backed Hive metastore. I can also duplicate the behavior with a simplified example in the spark-shell, which I include below. Note that I launch the spark-shell with my hive-site.xml, so the metastore is set up the same way.
> I also tried this locally on a Mac laptop with 1.6.0, with the same results.
> Create some partitioned Parquet:
> {code}
> case class Hit(meta_ts_unix_ms: Long, username: String, srclatitude: Double, srclongitude: Double, srccity: String, srcregion: String, srccountrycode: String, metaclass: String)
> val rdd = sc.parallelize(Array(Hit(34L, "user1", 45.2, 23.2, "city1", "state1", "US", "blah, other"), Hit(35L, "user1", 53.2, 11.2, "city2", "state2", "US", "blah")))
> sqlContext.createDataFrame(rdd).registerTempTable("test_table")
> sqlContext.sql("select * from test_table where meta_ts_unix_ms = 35").write.parquet("file:///tmp/year=2015/month=12/day=4/hour=1/")
> sqlContext.sql("select * from test_table where meta_ts_unix_ms = 34").write.parquet("file:///tmp/year=2015/month=12/day=3/hour=23/")
> {code}
> Create an external table from the Parquet:
> {code}
> sqlContext.createExternalTable("test_table2", "file:///tmp/year=2015/", "parquet")
> {code}
> If I understand correctly, the partitions were discovered automatically, because they show up in the describe output even though they were not part of the schema generated from the case class:
> {code}
> +---------------+---------+-------+
> |       col_name|data_type|comment|
> +---------------+---------+-------+
> |meta_ts_unix_ms|   bigint|       |
> |       username|   string|       |
> |    srclatitude|   double|       |
> |   srclongitude|   double|       |
> |        srccity|   string|       |
> |      srcregion|   string|       |
> | srccountrycode|   string|       |
> |      metaclass|   string|       |
> |           year|      int|       |
> |          month|      int|       |
> |            day|      int|       |
> |           hour|      int|       |
> +---------------+---------+-------+
> {code}
> This query:
> {code}
> sqlContext.sql("SELECT meta_ts_unix_ms,username,srclatitude,srclongitude,srccity,srcregion,srccountrycode FROM test_table2 WHERE meta_ts_unix_ms IS NOT NULL AND username IS NOT NULL AND metaclass like '%blah%' OR hour = 1").show()
> {code}
> throws this exception:
> {noformat}
> 16/01/20 21:36:46 WARN TaskSetManager: Lost task 0.0 in stage 13.0 (TID 84, ip-192-168-111-222.ec2.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: metaclass#53
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> 	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
> 	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> 	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> 	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> 	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> 	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> 	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> 	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> 	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> 	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> 	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:232)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:217)
> 	at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85)
> 	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$.create(predicates.scala:31)
> 	at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:281)
> 	at org.apache.spark.sql.execution.Filter$$anonfun$4.apply(basicOperators.scala:114)
> 	at org.apache.spark.sql.execution.Filter$$anonfun$4.apply(basicOperators.scala:113)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:88)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Couldn't find metaclass#53 in [meta_ts_unix_ms#45L,username#46,srclatitude#47,srclongitude#48,srccity#49,srcregion#50,srccountrycode#51]
> 	at scala.sys.package$.error(package.scala:27)
> 	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:92)
> 	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:86)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
> 	... 77 more
> {noformat}
> This query works fine and returns the expected results, but it does not involve any of the partition columns in the {{OR}} portion of the {{WHERE}} clause:
> {code}
> sqlContext.sql("SELECT meta_ts_unix_ms,username,srclatitude,srclongitude,srccity,srcregion,srccountrycode FROM test_table2 WHERE meta_ts_unix_ms IS NOT NULL AND username IS NOT NULL AND metaclass like '%other%' OR metaclass = 'blah'").show()
> {code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
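
A note for anyone reproducing from the description above: the partition listing with the discovered year/month/day/hour columns shows only the output of the describe step, not the command that produced it. In the spark-shell it can presumably be obtained with something like the following sketch (the exact invocation is not given in the report, so this is an assumption):

{code}
// Hypothetical: reproduce the describe step from the report above.
// Assumes the external table "test_table2" was registered as shown earlier.
// DESCRIBE should list the Parquet columns plus the automatically
// discovered partition columns (year, month, day, hour).
sqlContext.sql("DESCRIBE test_table2").show()
{code}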