[jira] [Created] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
Dilip Biswal created SPARK-23275:

Summary: hive/tests have been failing when run locally on the laptop (Mac) with OOM
Key: SPARK-23275
URL: https://issues.apache.org/jira/browse/SPARK-23275
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1
Reporter: Dilip Biswal

Hive tests have been failing when they are run locally (macOS) after a recent change on trunk. After the tests run for a while, they fail with an OOM: "unable to create new native thread". I noticed the thread count climbs to 2000+, after which we start getting these OOM errors. Most of the threads appear to belong to the connection pool in the Hive metastore (BoneCP-x- ). This behaviour change started after we made the following change to HiveClientImpl.reset():

{code}
def reset(): Unit = withHiveState {
  try {
    // code
  } finally {
    runSqlHive("USE default")  // ===> this is causing the issue
  }
}
{code}

I am proposing to temporarily back out part of the fix made for SPARK-23000 to resolve this issue while we work out the exact reason for this sudden increase in thread count.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
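A quick way to watch the leak locally is to count live BoneCP threads between test runs. This is only an illustrative diagnostic sketch; the "BoneCP" thread-name prefix is taken from the observation above and is an assumption, not an API contract.

{code}
// Diagnostic sketch (Scala): count metastore connection-pool threads by name.
// Assumes the pool threads are named with a "BoneCP" prefix, as observed above.
import scala.collection.JavaConverters._

def boneCpThreadCount(): Int =
  Thread.getAllStackTraces.keySet.asScala.count(_.getName.startsWith("BoneCP"))

// Call before and after HiveClientImpl.reset() (or between test suites); a
// steadily growing count points at leaked metastore connections/threads.
println(s"BoneCP threads: ${boneCpThreadCount()}")
{code}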
[jira] [Updated] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
[ https://issues.apache.org/jira/browse/SPARK-23095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-23095: - Description: The following SQL involving scalar correlated query returns a map exception. {code:java} SELECT t1a FROM t1 WHERE t1a = (SELECT count FROM t2 WHERE t2c = t1c HAVING count >= 1) {code} {code:java} key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426) {code} In this case, after evaluating the HAVING clause "count(*) > 1" statically against the binding of aggregtation result on empty input, we determine that this query will not have a the count bug. We should simply return the evalSubqueryOnZeroTups with empty value. was: The following SQL involving scalar correlated query returns a map exception. {code:java} SELECT t1a FROM t1 WHERE t1a = (SELECT count FROM t2 WHERE t2c = t1c HAVING count >= 1) {code} {code:java} key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426) {code} In this case, after evaluating the HAVING clause "count(*) > 1" statically against the binding of aggregtation result on empty input, we determine that this query will not have a the count bug. We should simply return the evalSubqueryOnZeroTups with empty value. > Decorrelation of scalar subquery fails with java.util.NoSuchElementException. > - > > Key: SPARK-23095 > URL: https://issues.apache.org/jira/browse/SPARK-23095 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Major > > The following SQL involving scalar correlated query returns a map exception. 
> {code:java} > SELECT t1a > FROM t1 > WHERE t1a = (SELECT count > FROM t2 > WHERE t2c = t1c > HAVING count >= 1) > {code} > {code:java} > > key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) > java.util.NoSuchElementException: key not found: > ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at > scala.collection.MapLike$class.default(MapLike.scala:228) at > scala.collection.AbstractMap.default(Map.scala:59) at > scala.collection.MapLike$class.apply(MapLike.scala:141) at > scala.collection.AbstractMap.apply(Map.scala:59) at > org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) > at > org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun
[jira] [Updated] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
[ https://issues.apache.org/jira/browse/SPARK-23095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-23095: - Description: The following SQL involving scalar correlated query returns a map exception. {code:java} SELECT t1a FROM t1 WHERE t1a = (SELECT count FROM t2 WHERE t2c = t1c HAVING count >= 1) {code} {code:java} key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426) {code} In this case, after evaluating the HAVING clause "count(*) > 1" statically against the binding of aggregtation result on empty input, we determine that this query will not have a the count bug. We should simply return the evalSubqueryOnZeroTups with empty value. was: The following SQL involving scalar correlated query returns a map exception. SELECT t1a FROM t1 WHERE t1a = (SELECT count(*) FROM t2 WHERE t2c = t1c HAVING count(*) >= 1) key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426) In this case, after evaluating the HAVING clause "count(*) > 1" statically against the binding of aggregtation result on empty input, we determine that this query will not have a the count bug. We should simply return the evalSubqueryOnZeroTups with empty value. > Decorrelation of scalar subquery fails with java.util.NoSuchElementException. > - > > Key: SPARK-23095 > URL: https://issues.apache.org/jira/browse/SPARK-23095 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Major > > The following SQL involving scalar correlated query returns a map exception. 
> {code:java} > SELECT t1a > FROM t1 > WHERE t1a = (SELECT count > FROM t2 > WHERE t2c = t1c > HAVING count >= 1) > {code} > {code:java} > > key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) > java.util.NoSuchElementException: key not found: > ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at > scala.collection.MapLike$class.default(MapLike.scala:228) at > scala.collection.AbstractMap.default(Map.scala:59) at > scala.collection.MapLike$class.apply(MapLike.scala:141) at > scala.collection.AbstractMap.apply(Map.scala:59) at > org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) > at > org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun
[jira] [Created] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
Dilip Biswal created SPARK-23095:

Summary: Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
Key: SPARK-23095
URL: https://issues.apache.org/jira/browse/SPARK-23095
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Dilip Biswal

The following SQL involving a scalar correlated subquery fails with a map lookup exception.

{code}
SELECT t1a
FROM t1
WHERE t1a = (SELECT count(*)
             FROM t2
             WHERE t2c = t1c
             HAVING count(*) >= 1)
{code}

{code}
key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:59)
 at scala.collection.MapLike$class.apply(MapLike.scala:141)
 at scala.collection.AbstractMap.apply(Map.scala:59)
 at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
 at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
 at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
{code}

In this case, after evaluating the HAVING clause "count(*) >= 1" statically against the binding of the aggregation result on empty input, we determine that this query will not have the count bug. We should simply have evalSubqueryOnZeroTups return an empty value.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
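For reference, a minimal spark-shell sketch that exercises the failing path. Only the column names (t1a, t1c, t2c) follow the description above; the temporary views and their contents are made up for illustration.

{code}
// Hedged repro sketch (spark-shell); table contents are illustrative only.
import spark.implicits._

Seq((1, 10), (2, 20)).toDF("t1a", "t1c").createOrReplaceTempView("t1")
Seq((5, 10), (6, 30)).toDF("t2b", "t2c").createOrReplaceTempView("t2")

spark.sql("""
  SELECT t1a
  FROM t1
  WHERE t1a = (SELECT count(*)
               FROM t2
               WHERE t2c = t1c
               HAVING count(*) >= 1)
""").show()
{code}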
Re: Welcoming Tejas Patil as a Spark committer
Congratulations, Tejas!

-- Dilip

----- Original message -----
From: Suresh Thalamati
To: "dev@spark.apache.org"
Subject: Re: Welcoming Tejas Patil as a Spark committer
Date: Tue, Oct 3, 2017 12:01 PM

Congratulations, Tejas!
-suresh

> On Sep 29, 2017, at 12:58 PM, Matei Zaharia wrote:
>
> Hi all,
>
> The Spark PMC recently added Tejas Patil as a committer on the
> project. Tejas has been contributing across several areas of Spark for
> a while, focusing especially on scalability issues and SQL. Please
> join me in welcoming Tejas!
>
> Matei

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Welcoming Saisai (Jerry) Shao as a committer
Congratulations, Jerry!

Regards,
Dilip Biswal

----- Original message -----
From: Takuya UESHIN
To: Saisai Shao
Cc: dev
Subject: Re: Welcoming Saisai (Jerry) Shao as a committer
Date: Mon, Aug 28, 2017 10:22 PM

Congratulations, Jerry!

On Tue, Aug 29, 2017 at 2:14 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
> Congratulations, Jerry
>
>> On Aug 28, 2017, at 6:28 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has
>> been contributing to many areas of the project for a long time, so it’s great
>> to see him join. Join me in thanking and congratulating him!
>>
>> Matei

--
Takuya UESHIN
Tokyo, Japan
http://twitter.com/ueshin

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
Congratulations Hyukjin and Sameer !!

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com

----- Original message -----
From: Liang-Chi Hsieh
To: dev@spark.apache.org
Subject: Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
Date: Tue, Aug 8, 2017 12:29 AM

Congrats to Hyukjin and Sameer!

Xiao Li wrote:
> Congrats!

On Mon, 7 Aug 2017 at 10:21 PM Takuya UESHIN <ueshin@...> wrote:
> Congrats!

On Tue, Aug 8, 2017 at 11:38 AM, Felix Cheung <felixcheung_m@...> wrote:
> Congrats!!

*From:* Kevin Kim (Sangwoo) <kevin@...>
*Sent:* Monday, August 7, 2017 7:30:01 PM
*To:* Hyukjin Kwon; dev
*Cc:* Bryan Cutler; Mridul Muralidharan; Matei Zaharia; Holden Karau
*Subject:* Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
> Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!

On Aug 8, 2017 (Tue) at 9:44 AM, Hyukjin Kwon <gurwls223@...> wrote:
> Thank you all. Will do my best!

2017-08-08 8:53 GMT+09:00 Holden Karau <holden@...>:
> Congrats!

On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler <cutlerb@...> wrote:
> Great work Hyukjin and Sameer!

On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan <mridul@...> wrote:
> Congratulations Hyukjin, Sameer !
>
> Regards,
> Mridul

On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia <matei.zaharia@...> wrote:
> Hi everyone,
>
> The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal
> as committers. Join me in congratulating both of them and thanking
> them for their contributions to the project!
>
> Matei

--
Cell: 425-233-8271
Twitter: https://twitter.com/holdenkarau

--
Takuya UESHIN
Tokyo, Japan
http://twitter.com/ueshin

-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Welcoming-Hyukjin-Kwon-and-Sameer-Agarwal-as-committers-tp22092p22109.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-21599) Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
Dilip Biswal created SPARK-21599: Summary: Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException Key: SPARK-21599 URL: https://issues.apache.org/jira/browse/SPARK-21599 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Dilip Biswal Collecting column level statistics for non compatible hive tables using {code} ANALYZE TABLE FOR COLUMNS {code} may fail with the following exception. {code} key not found: a java.util.NoSuchElementException: key not found: a at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:657) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:656) at scala.collection.immutable.Map$Map2.foreach(Map.scala:137) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:656) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375) at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
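A hedged repro sketch of the command shape involved; the table name, columns, and data are illustrative, and the full command syntax is ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS ...:

{code}
// Hedged sketch (spark-shell): collect column-level statistics for a data
// source table. Names are illustrative; the exception above surfaces in
// HiveExternalCatalog.alterTableStats while persisting the per-column stats.
spark.sql("CREATE TABLE t (a INT, b STRING) USING parquet")
spark.sql("INSERT INTO t VALUES (1, 'x'), (2, 'y')")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS a, b")
{code}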
[jira] [Commented] (SPARK-20417) Move error reporting for subquery from Analyzer to CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-20417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977286#comment-15977286 ] Dilip Biswal commented on SPARK-20417: -- Currently waiting on [pr 17636|https://github.com/apache/spark/pull/17636] to be merged. After that i will rebase and open a PR for this. > Move error reporting for subquery from Analyzer to CheckAnalysis > > > Key: SPARK-20417 > URL: https://issues.apache.org/jira/browse/SPARK-20417 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dilip Biswal > > Currently we do a lot of validations for subquery in the Analyzer. We should > move them to CheckAnalysis which is the framework to catch and report > Analysis errors. This was mentioned as a review comment in SPARK-18874. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20417) Move error reporting for subquery from Analyzer to CheckAnalysis
Dilip Biswal created SPARK-20417: Summary: Move error reporting for subquery from Analyzer to CheckAnalysis Key: SPARK-20417 URL: https://issues.apache.org/jira/browse/SPARK-20417 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Dilip Biswal Currently we do a lot of validations for subquery in the Analyzer. We should move them to CheckAnalysis which is the framework to catch and report Analysis errors. This was mentioned as a review comment in SPARK-18874. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations
[ https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973929#comment-15973929 ] Dilip Biswal commented on SPARK-20356: -- [~viirya] Did you try from spark-shell or from one of our query suites ? I could reproduce it from spark-shell fine. From our query suites i had to force the number of shuffle partitions to reproduce it. {code} test("cache defect") { withSQLConf("spark.sql.shuffle.partitions" -> "200") { val df1 = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("item", "group") val df2 = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("item", "id") val df3 = df1.join(df2, Seq("item")).select($"id", $"group".as("item")).distinct() df3.explain(true) df3.unpersist() val agg_without_cache = df3.groupBy($"item").count() agg_without_cache.show() df3.cache() val agg_with_cache = df3.groupBy($"item").count() agg_with_cache.explain(true) agg_with_cache.show() } } {code} > Spark sql group by returns incorrect results after join + distinct > transformations > -- > > Key: SPARK-20356 > URL: https://issues.apache.org/jira/browse/SPARK-20356 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Linux mint 18 > Python 3.5 >Reporter: Chris Kipers > > I'm experiencing a bug with the head version of spark as of 4/17/2017. After > joining to dataframes, renaming a column and invoking distinct, the results > of the aggregation is incorrect after caching the dataframe. The following > code snippet consistently reproduces the error. > from pyspark.sql import SparkSession > import pyspark.sql.functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > mapping_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "GROUP": 1}, > {"ITEM": "b", "GROUP": 1}, > {"ITEM": "c", "GROUP": 2} > ])) > items_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "ID": 1}, > {"ITEM": "b", "ID": 2}, > {"ITEM": "c", "ID": 3} > ])) > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')).distinct() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 3, incorrect > The next code snippet is almost the same after the first except I don't call > distinct on the dataframe. This snippet performs as expected: > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')) > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > I don't experience this bug with spark 2.1 or event earlier versions for 2.2 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations
[ https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455 ] Dilip Biswal edited comment on SPARK-20356 at 4/18/17 8:47 PM: --- [~viirya] [~hvanhovell] [~cloud_fan] [~smilegator] I took a quick look and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information of InMemoryTableScanExec as part of the fix ( id, item -> item, item) causing a missing shuffle in the operators above InMemoryTableScan. Changing to use the child's output partitioning like before fixes the issue. I am a little new to this code :-) And this is what i have found so far. Hope this helps. was (Author: dkbiswal): [~viirya] [~hvanhovell] [~cloud_fan] [~smilegator] I took a quick look and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix ( id, item -> item, item) causing a missing shuffle in the operators above InMemoryTableScan. Changing to use the child's output partitioning like before fixes the issue. I am a little new to this code :-) And this is what i have found so far. Hope this helps. > Spark sql group by returns incorrect results after join + distinct > transformations > -- > > Key: SPARK-20356 > URL: https://issues.apache.org/jira/browse/SPARK-20356 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Linux mint 18 > Python 3.5 >Reporter: Chris Kipers > > I'm experiencing a bug with the head version of spark as of 4/17/2017. After > joining to dataframes, renaming a column and invoking distinct, the results > of the aggregation is incorrect after caching the dataframe. The following > code snippet consistently reproduces the error. > from pyspark.sql import SparkSession > import pyspark.sql.functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > mapping_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "GROUP": 1}, > {"ITEM": "b", "GROUP": 1}, > {"ITEM": "c", "GROUP": 2} > ])) > items_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "ID": 1}, > {"ITEM": "b", "ID": 2}, > {"ITEM": "c", "ID": 3} > ])) > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')).distinct() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 3, incorrect > The next code snippet is almost the same after the first except I don't call > distinct on the dataframe. This snippet performs as expected: > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')) > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > I don't experience this bug with spark 2.1 or event earlier versions for 2.2 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations
[ https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455 ] Dilip Biswal edited comment on SPARK-20356 at 4/18/17 8:46 PM: --- [~viirya] [~hvanhovell] [~cloud_fan] [~smilegator] I took a quick look and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix ( id, item -> item, item) causing a missing shuffle in the operators above InMemoryTableScan. Changing to use the child's output partitioning like before fixes the issue. I am a little new to this code :-) And this is what i have found so far. Hope this helps. was (Author: dkbiswal): [~viirya] [~hvanhovell] [~cloud_fan] [~smilegator] I took a quick look and it seems the issue started happening after [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix ( id, item -> item, item) causing a missing shuffle in the operators above InMemoryTableScan. Changing to use the child's output partitioning like before fixes the issue. I am a little new to this code :-) And this is what i have found so far. Hope this helps. > Spark sql group by returns incorrect results after join + distinct > transformations > -- > > Key: SPARK-20356 > URL: https://issues.apache.org/jira/browse/SPARK-20356 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Linux mint 18 > Python 3.5 >Reporter: Chris Kipers > > I'm experiencing a bug with the head version of spark as of 4/17/2017. After > joining to dataframes, renaming a column and invoking distinct, the results > of the aggregation is incorrect after caching the dataframe. The following > code snippet consistently reproduces the error. > from pyspark.sql import SparkSession > import pyspark.sql.functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > mapping_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "GROUP": 1}, > {"ITEM": "b", "GROUP": 1}, > {"ITEM": "c", "GROUP": 2} > ])) > items_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "ID": 1}, > {"ITEM": "b", "ID": 2}, > {"ITEM": "c", "ID": 3} > ])) > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')).distinct() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 3, incorrect > The next code snippet is almost the same after the first except I don't call > distinct on the dataframe. This snippet performs as expected: > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')) > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > I don't experience this bug with spark 2.1 or event earlier versions for 2.2 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations
[ https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455 ] Dilip Biswal commented on SPARK-20356: -- [~viirya] [~hvanhovell] [~cloud_fan] [~smilegator] I took a quick look and it seems the issue started happening after [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix ( id, item -> item, item) causing a missing shuffle in the operators above InMemoryTableScan. Changing to use the child's output partitioning like before fixes the issue. I am a little new to this code :-) And this is what i have found so far. Hope this helps. > Spark sql group by returns incorrect results after join + distinct > transformations > -- > > Key: SPARK-20356 > URL: https://issues.apache.org/jira/browse/SPARK-20356 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Linux mint 18 > Python 3.5 >Reporter: Chris Kipers > > I'm experiencing a bug with the head version of spark as of 4/17/2017. After > joining to dataframes, renaming a column and invoking distinct, the results > of the aggregation is incorrect after caching the dataframe. The following > code snippet consistently reproduces the error. > from pyspark.sql import SparkSession > import pyspark.sql.functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > mapping_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "GROUP": 1}, > {"ITEM": "b", "GROUP": 1}, > {"ITEM": "c", "GROUP": 2} > ])) > items_sdf = spark.createDataFrame(pd.DataFrame([ > {"ITEM": "a", "ID": 1}, > {"ITEM": "b", "ID": 2}, > {"ITEM": "c", "ID": 3} > ])) > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')).distinct() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 3, incorrect > The next code snippet is almost the same after the first except I don't call > distinct on the dataframe. This snippet performs as expected: > mapped_sdf = \ > items_sdf.join(mapping_sdf, on='ITEM').select("ID", > sf.col("GROUP").alias('ITEM')) > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > mapped_sdf.cache() > print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct > I don't experience this bug with spark 2.1 or event earlier versions for 2.2 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references
Dilip Biswal created SPARK-20334: Summary: Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references Key: SPARK-20334 URL: https://issues.apache.org/jira/browse/SPARK-20334 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Dilip Biswal Priority: Minor Currently subqueries with correlated predicates containing aggregate expression having mixture of outer references and local references generate a code gen error like following : {code:java} Cannot evaluate expression: min((input[0, int, false] + input[4, int, false])) at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443) at org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) {code} We should catch this situation and return a better error message to the user. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
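For illustration, this is the kind of query that reaches the code-generation path above: an aggregate whose argument mixes a local reference with an outer reference. The tables, columns, and data are made-up names, so treat this only as a sketch of the pattern.

{code}
// Hedged sketch (spark-shell). min(t2a + t1b) mixes the local column t2a with
// the outer column t1b inside the aggregate, matching the
// min((input[...] + input[...])) expression in the stack trace above.
import spark.implicits._

Seq((1, 2)).toDF("t1a", "t1b").createOrReplaceTempView("t1")
Seq((3, 4)).toDF("t2a", "t2b").createOrReplaceTempView("t2")

spark.sql("""
  SELECT t1a
  FROM t1
  WHERE t1a IN (SELECT t2a
                FROM t2
                GROUP BY t2a
                HAVING min(t2a + t1b) > 0)
""").show()
{code}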
[jira] [Created] (SPARK-19993) Caching logical plans containing subquery expressions does not work.
Dilip Biswal created SPARK-19993: Summary: Caching logical plans containing subquery expressions does not work. Key: SPARK-19993 URL: https://issues.apache.org/jira/browse/SPARK-19993 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Dilip Biswal Here is a simple repro that depicts the problem. In this case the second invocation of the sql should have been from the cache. However the lookup fails currently. {code} scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)") ds: org.apache.spark.sql.DataFrame = [c1: int] scala> ds.cache res13: ds.type = [c1: int] scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true) == Analyzed Logical Plan == c1: int Project [c1#86] +- Filter c1#86 IN (list#78 [c1#86]) : +- Project [c1#87] : +- Filter (outer(c1#86) = c1#87) :+- SubqueryAlias s2 : +- Relation[c1#87] parquet +- SubqueryAlias s1 +- Relation[c1#86] parquet == Optimized Logical Plan == Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87)) :- Relation[c1#86] parquet +- Relation[c1#87] parquet {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
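One way to observe the miss is to check the executed plan of the re-issued query for an in-memory scan. A minimal sketch, assuming the same s1/s2 relations and cached DataFrame as above:

{code}
// Hedged sketch (spark-shell): if the cached plan were matched, the executed
// plan of the second query would contain an InMemoryTableScanExec node.
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

val again = spark.sql(
  "select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
val servedFromCache = again.queryExecution.executedPlan.collect {
  case scan: InMemoryTableScanExec => scan
}.nonEmpty
println(s"served from cache: $servedFromCache")  // currently false, illustrating the miss
{code}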
Re: welcoming Takuya Ueshin as a new Apache Spark committer
Congratulations, Takuya!

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com

----- Original message -----
From: Takeshi Yamamuro
To: dev
Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer
Date: Mon, Feb 13, 2017 2:14 PM

congrats!

On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin wrote:
> Congrats Takuya-san! Clearly well deserved! Well done :)
>
> On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz wrote:
>> Congratulations!
>>
>> On 02/13/2017 08:16 PM, Reynold Xin wrote:
>>> Hi all,
>>>
>>> Takuya-san has recently been elected an Apache Spark committer. He's
>>> been active in the SQL area and writes very small, surgical patches
>>> that are high quality. Please join me in congratulating Takuya-san!

--
---
Takeshi Yamamuro

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Created] (SPARK-18533) Raise correct error upon specification of schema for datasource tables created through CTAS
Dilip Biswal created SPARK-18533:

Summary: Raise correct error upon specification of schema for datasource tables created through CTAS
Key: SPARK-18533
URL: https://issues.apache.org/jira/browse/SPARK-18533
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2
Reporter: Dilip Biswal
Priority: Minor

Hive serde tables created through CTAS do not allow an explicit schema specification, since the schema is inferred from the select clause; a semantic error is raised for this case. For data source tables, however, we currently raise a parser error, which is not as informative. We should raise a consistent error for both forms of tables.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
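For illustration, the two CTAS shapes being compared; the table names are arbitrary, and per the description both statements should be rejected with the same kind of informative error.

{code}
// Hedged sketch (spark-shell): both statements pair an explicit schema with
// AS SELECT. Today the Hive serde form gets a semantic error and the data
// source form a parser error; the proposal is to make the errors consistent.
spark.sql("CREATE TABLE hive_ctas (a INT) STORED AS parquet AS SELECT 1 AS a")
spark.sql("CREATE TABLE ds_ctas (a INT) USING parquet AS SELECT 1 AS a")
{code}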
[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609295#comment-15609295 ] Dilip Biswal commented on SPARK-18009: -- [~martha.solarte] Not sure [~smilegator] Sean, do we back port to 2.0.0 any more ? > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at &g
[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error
[ https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607343#comment-15607343 ] Dilip Biswal commented on SPARK-18009: -- [~smilegator][~jerryjung] [~martha.solarte] Thanks. I am testing a fix and should submit a PR for this soon. > Spark 2.0.1 SQL Thrift Error > > > Key: SPARK-18009 > URL: https://issues.apache.org/jira/browse/SPARK-18009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: apache hadoop 2.6.2 > spark 2.0.1 >Reporter: Jerryjung >Priority: Critical > Labels: thrift > > After deploy spark thrift server on YARN, then I tried to execute from the > beeline following command. > > show databases; > I've got this error message. > {quote} > beeline> !connect jdbc:hive2://localhost:1 a a > Connecting to jdbc:hive2://localhost:1 > 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1 > 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1 > 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:1 > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:1> show databases; > java.lang.IllegalStateException: Can't overwrite cause with > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at java.lang.Throwable.initCause(Throwable.java:456) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236) > at > org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197) > at > org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108) > at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256) > at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242) > at > org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794) > at org.apache.hive.beeline.Commands.execute(Commands.java:860) > at org.apache.hive.beeline.Commands.sql(Commands.java:713) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771) > at > org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorA
[jira] [Created] (SPARK-17860) SHOW COLUMN's database conflict check should use case sensitive compare.
Dilip Biswal created SPARK-17860:

Summary: SHOW COLUMN's database conflict check should use case sensitive compare.
Key: SPARK-17860
URL: https://issues.apache.org/jira/browse/SPARK-17860
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Dilip Biswal
Priority: Minor

The SHOW COLUMNS command validates the user-supplied database name against the database name from the qualified table name to make sure the two are consistent. This comparison should respect case sensitivity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-17860: - Summary: SHOW COLUMN's database conflict check should respect case sensitivity setting (was: SHOW COLUMN's database conflict check should use case sensitive compare.) > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with database name from qualified table name name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
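For illustration, a hedged sketch of the scenario; the database and table names are arbitrary. With spark.sql.caseSensitive=false, the mixed-case database qualifier should be accepted rather than reported as a conflicting database name.

{code}
// Hedged sketch (spark-shell). The database qualifier in the table name and
// the trailing database argument differ only in case; whether that counts as
// a conflict should follow the spark.sql.caseSensitive setting.
spark.sql("CREATE DATABASE showdb")
spark.sql("CREATE TABLE showdb.tbl (col1 INT) USING parquet")
spark.sql("SET spark.sql.caseSensitive=false")
spark.sql("SHOW COLUMNS IN showdb.tbl IN SHOWDB").show()
{code}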
Re: welcoming Xiao Li as a committer
Hi Xiao,

Congratulations Xiao !! This is indeed very well deserved !!

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com

From: Reynold Xin
To: "dev@spark.apache.org", Xiao Li
Date: 10/03/2016 10:47 PM
Subject: welcoming Xiao Li as a committer

Hi all,

Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark committer. Xiao has been a super active contributor to Spark SQL. Congrats and welcome, Xiao!

- Reynold
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141 ] Dilip Biswal commented on SPARK-17709: -- @ashrowty Hi Ashish, in your example, the column loyalitycardnumber is not in the outputset and that is why we see the exception. I tried using productid instead and got the correct result. {code} scala> df1.join(df2, Seq("companyid","loyaltycardnumber")); org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [productid, companyid, avgprice, avgitemcount, companyid, productid] ; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651) at org.apache.spark.sql.Dataset.join(Dataset.scala:679) at org.apache.spark.sql.Dataset.join(Dataset.scala:652) ... 48 elided scala> df1.join(df2, Seq("companyid","productid")); res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 more fields] scala> df1.join(df2, Seq("companyid","productid")).show +-+-+++ |companyid|productid|avgprice|avgitemcount| +-+-+++ | 101|3|13.0|12.0| | 100|1|10.0|10.0| +-+-+++ {code} > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543046#comment-15543046 ] Dilip Biswal commented on SPARK-17709: -- Hi Ashish, Thanks a lot.. will try and get back. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537336#comment-15537336 ] Dilip Biswal commented on SPARK-17709: -- [~ashrowty] Hmmn.. and your join keys are companyid or loyalitycardnumber or both ? If so, i have the exact same scenario but not seeing the error you are seeing. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537205#comment-15537205 ] Dilip Biswal edited comment on SPARK-17709 at 9/30/16 10:07 PM: @ashrowty Hi Ashish, is it possible for you to post explain output for both the legs of the join. So if we are joining two dataframes df1 and df2 , can we get the output of df1.explain(true) df2.explain(true) >From the error, it seems like key1 and key2 are not present in one leg of join >output attribute set. So if i were to change your test program to the following : {code} val df1 = d1.groupBy("key1", "key2") .agg(avg("totalprice").as("avgtotalprice")) df1.explain(true) val df2 = d1.agg(avg("itemcount").as("avgqty")) df2.explain(true) df1.join(df2, Seq("key1", "key2")) {code} I am able to see the same error you are seeing. was (Author: dkbiswal): @ashrowty Hi Ashish, is it possible for you to post explain output for both the legs of the join. So if we are joining two dataframes df1 and df2 , can we get the output of df1.explain(true) df2.explain(true) >From the error, it seems like key1 and key2 are not present in one leg of join >output attribute set. So if i were to change your test program to the following : val df1 = d1.groupBy("key1", "key2") .agg(avg("totalprice").as("avgtotalprice")) df1.explain(true) val df2 = d1.agg(avg("itemcount").as("avgqty")) df2.explain(true) df1.join(df2, Seq("key1", "key2")) I am able to see the same error you are seeing. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537205#comment-15537205 ] Dilip Biswal commented on SPARK-17709: -- @ashrowty Hi Ashish, is it possible for you to post explain output for both the legs of the join. So if we are joining two dataframes df1 and df2 , can we get the output of df1.explain(true) df2.explain(true) >From the error, it seems like key1 and key2 are not present in one leg of join >output attribute set. So if i were to change your test program to the following : val df1 = d1.groupBy("key1", "key2") .agg(avg("totalprice").as("avgtotalprice")) df1.explain(true) val df2 = d1.agg(avg("itemcount").as("avgqty")) df2.explain(true) df1.join(df2, Seq("key1", "key2")) I am able to see the same error you are seeing. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536464#comment-15536464 ] Dilip Biswal commented on SPARK-17709: -- [~ashrowty] Ashish, do you have the same column name as both a regular and a partitioning column? I thought Hive didn't allow that? > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534434#comment-15534434 ] Dilip Biswal commented on SPARK-17709: -- [~smilegator] Sure. > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534417#comment-15534417 ] Dilip Biswal commented on SPARK-17709: -- [~smilegator] Hi Sean, I tried it on my master branch and don't see the exception. {code} test("join issue") { withTable("tbl") { sql("CREATE TABLE tbl(key1 int, key2 int, totalprice int, itemcount int)") sql("insert into tbl values (1, 1, 1, 1)") val d1 = sql("select * from tbl") val df1 = d1.groupBy("key1","key2") .agg(avg("totalprice").as("avgtotalprice")) val df2 = d1.groupBy("key1","key2") .agg(avg("itemcount").as("avgqty")) df1.join(df2, Seq("key1","key2")).show() } } Output +++-+--+ |key1|key2|avgtotalprice|avgqty| +++-+--+ | 1| 1| 1.0| 1.0| +++-+--+ {code} > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde
[ https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511267#comment-15511267 ] Dilip Biswal commented on SPARK-17620: -- fix it now. Thanks! > hive.default.fileformat=orc does not set OrcSerde > - > > Key: SPARK-17620 > URL: https://issues.apache.org/jira/browse/SPARK-17620 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Brian Cho >Priority: Minor > > Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior > is inconsistent with {{STORED AS ORC}}. This means we cannot set a default > behavior for creating tables using orc. > The behavior using stored as: > {noformat} > scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println) > ... > [# Storage Information,,] > [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,] > [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > ... > {noformat} > Behavior setting default conf (SerDe Library is not set properly): > {noformat} > scala> spark.sql("SET hive.default.fileformat=orc") > res2: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> spark.sql("CREATE TABLE tmp_default(id INT)") > res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println) > ... > [# Storage Information,,] > [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,] > [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,] > [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,] > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369006#comment-15369006 ] Dilip Biswal commented on SPARK-16423: -- [~yhuai] [~smilegator] Just wanted to quickly share the information i have found so far. - WeekOfYear - Checked the mysql, hive behaviour. Spark seems to be consistent with the mysql and hive. They both assume the first day of week to be Monday and more than 3 days per week. http://www.techonthenet.com/mysql/functions/weekofyear.php - Postgres - week The number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. In other words, the first Thursday of a year is in week 1 of that year. https://www.postgresql.org/docs/current/static/functions-datetime.html (function : week) - SQL Server They have a way to set a registry variable to influence the first day of the week. SET DATEFIRST { number | @number_var } When this is set, the DATEPART function considers the setting while calculating day of week. When this is not set, they also seem to follow ISO which again assumes Monday to be start of the week. https://msdn.microsoft.com/en-us/library/ms186724.aspx - Oracle In case of oracle, the day of the week is controlled by session specific NLS_TERRITORY setting. https://community.oracle.com/thread/2207756?tstart=0 - DB2 Have two flavors of WEEK function. One for ISO (Monday start) and other one for non ISO (Sunday start). http://www.ibm.com/developerworks/data/library/techarticle/0211yip/0211yip3.html Given this, it seems like more systems follow Monday to be first day of week semantics and i am wondering if we should change this ? Also, is there a co-relation between fromUnixTime and WeekOfYear. fromUnixTime returns the user supplied time in seconds in string after applying the date format. In my understanding it respects the system locale settings. > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
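For illustration, a minimal standalone Scala sketch of the behaviour the survey above is about (the object name and the sample locales are illustrative, not Spark code): java.util.Calendar picks the first day of the week from the locale unless it is pinned explicitly, which is exactly the difference between FromUnixTime (locale default) and WeekOfYear (explicit Monday, ISO-style minimal days).
{code}
import java.util.{Calendar, Locale}

object FirstDayOfWeekSketch {
  def main(args: Array[String]): Unit = {
    // Locale-driven defaults: what a Calendar created without any explicit
    // setting (as in FromUnixTime) ends up using.
    val us = Calendar.getInstance(Locale.US)
    val fr = Calendar.getInstance(Locale.FRANCE)
    println(s"US default first day of week:     ${us.getFirstDayOfWeek}") // 1 = Sunday
    println(s"France default first day of week: ${fr.getFirstDayOfWeek}") // 2 = Monday

    // Explicitly pinned, ISO-8601 style, as WeekOfYear does.
    val pinned = Calendar.getInstance()
    pinned.setFirstDayOfWeek(Calendar.MONDAY)
    pinned.setMinimalDaysInFirstWeek(4)
    println(s"Pinned first day of week:          ${pinned.getFirstDayOfWeek}") // always 2
  }
}
{code}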
[jira] [Updated] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API
[ https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-16195: - Description: In SQL, its allowed to specify an empty OVER clause in the window expression. {code} select area, sum(product) over () as c from windowData where product > 3 group by area, product having avg(month) > 0 order by avg(month), product {code} In this case the analytic function sum is presented based on all the rows of the result set Currently its not allowed through dataset API. was: In SQL, its allowed to specify an empty OVER clause in the window expression. select area, sum(product) over () as c from windowData where product > 3 group by area, product having avg(month) > 0 order by avg(month), product In this case the analytic function sum is presented based on all the rows of the result set Currently its not allowed through dataset API. > Allow users to specify empty over clause in window expressions through > dataset API > -- > > Key: SPARK-16195 > URL: https://issues.apache.org/jira/browse/SPARK-16195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dilip Biswal >Priority: Minor > > In SQL, its allowed to specify an empty OVER clause in the window expression. > {code} > select area, sum(product) over () as c from windowData > where product > 3 group by area, product > having avg(month) > 0 order by avg(month), product > {code} > In this case the analytic function sum is presented based on all the rows of > the result set > Currently its not allowed through dataset API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API
Dilip Biswal created SPARK-16195: Summary: Allow users to specify empty over clause in window expressions through dataset API Key: SPARK-16195 URL: https://issues.apache.org/jira/browse/SPARK-16195 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Dilip Biswal Priority: Minor In SQL, its allowed to specify an empty OVER clause in the window expression. select area, sum(product) over () as c from windowData where product > 3 group by area, product having avg(month) > 0 order by avg(month), product In this case the analytic function sum is presented based on all the rows of the result set Currently its not allowed through dataset API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
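For reference, a small Scala sketch of the two flavours discussed in this issue; the sample data, local session setup and the Window.partitionBy() workaround are illustrative assumptions, not the proposed API change itself.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object EmptyOverClauseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-over").master("local[*]").getOrCreate()
    import spark.implicits._

    val windowData = Seq(("a", 1, 5), ("a", 2, 6), ("b", 3, 7)).toDF("area", "month", "product")
    windowData.createOrReplaceTempView("windowData")

    // SQL accepts the empty OVER clause: sum(product) is computed over the
    // whole result set.
    spark.sql("select area, sum(product) over () as c from windowData").show()

    // One way to express the same thing from the DataFrame/Dataset side is an
    // empty partition spec, so every row lands in a single unordered window.
    windowData.select($"area", sum($"product").over(Window.partitionBy()).as("c")).show()

    spark.stop()
  }
}
{code}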
[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar
[ https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305143#comment-15305143 ] Dilip Biswal commented on SPARK-15634: -- I would like to work on this issue. > SQL repl is bricked if a function is registered with a non-existent jar > --- > > Key: SPARK-15634 > URL: https://issues.apache.org/jira/browse/SPARK-15634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Liang > > After attempting to register a function using a non-existent jar, no further > SQL commands succeed (and you also cannot un-register the function). > {code} > build/sbt -Phive sparkShell > {code} > {code} > scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" > USING JAR "file:///path/to/example.jar"""") > 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not > exist > java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist > at > org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102) > at > org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191) > at > org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) > at > org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523) > at > org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668) > at > org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109) > at > org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734) > at > org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > 
at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryEx
[jira] [Comment Edited] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304782#comment-15304782 ] Dilip Biswal edited comment on SPARK-15557 at 5/27/16 9:09 PM: --- I am looking into this issue. I am testing a fix currently. was (Author: dkbiswal): I am looking into this issue. > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null
[ https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304782#comment-15304782 ] Dilip Biswal commented on SPARK-15557: -- I am looking into this issue. > expression ((cast(99 as decimal) + '3') * '2.3' ) return null > - > > Key: SPARK-15557 > URL: https://issues.apache.org/jira/browse/SPARK-15557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 > Environment: spark1.6.1 >Reporter: cen yuhai > > expression "select (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null > expression "select (cast(40 as decimal(19,6))+ '3')*'2.3' " is OK > I find that maybe it will be null if the result is more than 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
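To make the report easy to replay, a minimal Scala sketch (session setup is illustrative) that reproduces the null result and, as a hedged workaround, casts the string literals explicitly so the arithmetic is done in double rather than an implicitly promoted decimal.
{code}
import org.apache.spark.sql.SparkSession

object DecimalStringArithmeticSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decimal-null").master("local[*]").getOrCreate()

    // Reported expression: returns NULL on the affected versions.
    spark.sql("select (cast(99 as decimal(19,6)) + '3') * '2.3'").show()

    // Workaround sketch: explicit casts keep the computation in double and
    // return a non-null result (approximately 234.6).
    spark.sql("select (cast(99 as decimal(19,6)) + cast('3' as double)) * cast('2.3' as double)").show()

    spark.stop()
  }
}
{code}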
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279229#comment-15279229 ] Dilip Biswal commented on SPARK-15114: -- Going to submit a PR for this tonight. > Column name generated by typed aggregate is super verbose > - > > Key: SPARK-15114 > URL: https://issues.apache.org/jira/browse/SPARK-15114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > {code} > case class Person(name: String, email: String, age: Long) > val ds = spark.read.json("/tmp/person.json").as[Person] > import org.apache.spark.sql.expressions.scala.typed._ > ds.groupByKey(_ => 0).agg(sum(_.age)) > // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, > typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, > email#1, name#2), upcast(value)): double] > ds.groupByKey(_ => 0).agg(sum(_.age)).explain > == Physical Plan == > WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)], > output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class > $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91]) > : +- INPUT > +- Exchange hashpartitioning(value#84, 200), None >+- WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)], > output=[value#84,value#97]) > : +- INPUT > +- AppendColumns , newInstance(class > $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84] > +- WholeStageCodegen > : +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271972#comment-15271972 ] Dilip Biswal commented on SPARK-15114: -- [~yhuai] Sure Yin. I will give it a try. > Column name generated by typed aggregate is super verbose > - > > Key: SPARK-15114 > URL: https://issues.apache.org/jira/browse/SPARK-15114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > {code} > case class Person(name: String, email: String, age: Long) > val ds = spark.read.json("/tmp/person.json").as[Person] > import org.apache.spark.sql.expressions.scala.typed._ > ds.groupByKey(_ => 0).agg(sum(_.age)) > // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, > typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, > email#1, name#2), upcast(value)): double] > ds.groupByKey(_ => 0).agg(sum(_.age)).explain > == Physical Plan == > WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)], > output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class > $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91]) > : +- INPUT > +- Exchange hashpartitioning(value#84, 200), None >+- WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)], > output=[value#84,value#97]) > : +- INPUT > +- AppendColumns , newInstance(class > $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84] > +- WholeStageCodegen > : +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271351#comment-15271351 ] Dilip Biswal commented on SPARK-15114: -- [~yhuai] Currently we use the sql representation of the expression as the system generated alias name. like following. case expr: Expression => Alias(expr, usePrettyExpression(expr).sql)() Do we add a additional case for AggregateExpression and use toString() instead of sql() for a shorter name ? Like, case aggExpr: AggregateExpression => Alias(aggExpr, usePrettyExpression(aggExpr).toString)() > Column name generated by typed aggregate is super verbose > - > > Key: SPARK-15114 > URL: https://issues.apache.org/jira/browse/SPARK-15114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > {code} > case class Person(name: String, email: String, age: Long) > val ds = spark.read.json("/tmp/person.json").as[Person] > import org.apache.spark.sql.expressions.scala.typed._ > ds.groupByKey(_ => 0).agg(sum(_.age)) > // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, > typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, > email#1, name#2), upcast(value)): double] > ds.groupByKey(_ => 0).agg(sum(_.age)).explain > == Physical Plan == > WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)], > output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class > $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91]) > : +- INPUT > +- Exchange hashpartitioning(value#84, 200), None >+- WholeStageCodegen > : +- TungstenAggregate(key=[value#84], > functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)], > output=[value#84,value#97]) > : +- INPUT > +- AppendColumns , newInstance(class > $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84] > +- WholeStageCodegen > : +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
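For completeness, a small end-to-end Scala sketch of the symptom plus a user-level rename that keeps downstream column references readable until shorter aliases are generated; the object name, the sample data and the typed-aggregate import path (which moved around during 2.0 development) are assumptions.
{code}
import org.apache.spark.sql.SparkSession

object TypedAggAliasSketch {
  case class Person(name: String, email: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-agg-alias").master("local[*]").getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.expressions.scalalang.typed

    val ds = Seq(Person("a", "a@example.com", 30L), Person("b", "b@example.com", 40L)).toDS()

    // The auto-generated column name embeds the whole deserializer expression.
    val agg = ds.groupByKey(_ => 0).agg(typed.sum[Person](_.age))
    agg.printSchema()

    // Renaming after the fact gives short, stable column names.
    agg.toDF("key", "sum_age").printSchema()

    spark.stop()
  }
}
{code}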
[jira] [Commented] (SPARK-14947) Showtable Command - Can't List Tables Using JDBC Connector
[ https://issues.apache.org/jira/browse/SPARK-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259692#comment-15259692 ] Dilip Biswal commented on SPARK-14947: -- [~raymond.honderd...@sizmek.com] Hi Raymond, could you please share a little more details on how to reproduce this ? Do we see any error or we see an empty output ? Also, what is the SQL microstrategy issues to Spark ? > Showtable Command - Can't List Tables Using JDBC Connector > -- > > Key: SPARK-14947 > URL: https://issues.apache.org/jira/browse/SPARK-14947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Raymond Honderdors >Priority: Minor > Fix For: 2.0.0 > > > Showtable Command does not list tables using external tool like > (microstrategy) > it does work in beeline > between the master and 1.6 branch there is a difference in the command file, > 1 it was relocated 2 the content changed. > when i compiled the master branch with the "old" version of the code JDBC > functionality was restored -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate
[ https://issues.apache.org/jira/browse/SPARK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246796#comment-15246796 ] Dilip Biswal commented on SPARK-13698: -- [~cloud_fan] Hi Wenchen, Can you please help to fix the assignee field for this JIRA ? Thanks in advance !! > Fix Analysis Exceptions when Using Backticks in Generate > > > Key: SPARK-13698 > URL: https://issues.apache.org/jira/browse/SPARK-13698 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Dilip Biswal > > Analysis exception occurs while running the following query. > {code} > SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints` > {code} > {code} > Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot > resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7 > 'Project ['ints] > +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8] >+- SubqueryAlias nestedarray > +- LocalRelation [a#0], 1,2,3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14445) Support native execution of SHOW COLUMNS and SHOW PARTITIONS command
Dilip Biswal created SPARK-14445: Summary: Support native execution of SHOW COLUMNS and SHOW PARTITIONS command Key: SPARK-14445 URL: https://issues.apache.org/jira/browse/SPARK-14445 Project: Spark Issue Type: Improvement Reporter: Dilip Biswal 1. Support native execution of SHOW COLUMNS 2. Support native execution of SHOW PARTITIONS The syntax of SHOW commands are described in following link. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Show -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
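A short Scala sketch of what the two commands look like, following the Hive syntax in the link above; the table name, partition column and the Hive-enabled local session are illustrative assumptions.
{code}
import org.apache.spark.sql.SparkSession

object ShowColumnsAndPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("show-columns-partitions")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS sales(id INT, amount DOUBLE) PARTITIONED BY (dt STRING)")

    // SHOW COLUMNS lists all columns of a table, optionally qualified by database.
    spark.sql("SHOW COLUMNS IN sales").show()
    spark.sql("SHOW COLUMNS IN sales IN default").show()

    // SHOW PARTITIONS lists the table's partitions, optionally narrowed by a
    // partial partition spec.
    spark.sql("SHOW PARTITIONS sales").show()
    spark.sql("SHOW PARTITIONS sales PARTITION(dt='2016-04-06')").show()

    spark.stop()
  }
}
{code}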
[jira] [Commented] (SPARK-14121) Show commands (Native)
[ https://issues.apache.org/jira/browse/SPARK-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228932#comment-15228932 ] Dilip Biswal commented on SPARK-14121: -- I will submit a pull request for SHOW COLUMNS and SHOW PARTITIONS today. Thanks !! > Show commands (Native) > -- > > Key: SPARK-14121 > URL: https://issues.apache.org/jira/browse/SPARK-14121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > For the following tokens, we should have native implementations. > -TOK_SHOWDATABASES (Native)- > -TOK_SHOWTABLES (Native)- > -TOK_SHOW_TBLPROPERTIES (Native)- > TOK_SHOWCOLUMNS (Native) > TOK_SHOWPARTITIONS (Native) > TOK_SHOW_TABLESTATUS (Native) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command
[ https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-14348: - Summary: Support native execution of SHOW TBLPROPERTIES command (was: Support native execution of SHOW DATABASE command) > Support native execution of SHOW TBLPROPERTIES command > -- > > Key: SPARK-14348 > URL: https://issues.apache.org/jira/browse/SPARK-14348 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Dilip Biswal > > 1. Support parsing of SHOW TBLPROPERTIES command > 2. Support the native execution of SHOW TBLPROPERTIES command > The syntax for SHOW commands are described in following link: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14348) Support native execution of SHOW DATABASE command
Dilip Biswal created SPARK-14348: Summary: Support native execution of SHOW DATABASE command Key: SPARK-14348 URL: https://issues.apache.org/jira/browse/SPARK-14348 Project: Spark Issue Type: Improvement Components: SQL Reporter: Dilip Biswal 1. Support parsing of SHOW TBLPROPERTIES command 2. Support the native execution of SHOW TBLPROPERTIES command The syntax for SHOW commands are described in following link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
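A brief Scala sketch of the command being proposed, with illustrative table and property names; it assumes a Hive-enabled local session.
{code}
import org.apache.spark.sql.SparkSession

object ShowTblPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("show-tblproperties")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS t(id INT) TBLPROPERTIES('created.by'='dbiswal', 'purpose'='demo')")

    // All properties of the table, one (key, value) row per property.
    spark.sql("SHOW TBLPROPERTIES t").show(truncate = false)

    // A single property looked up by key.
    spark.sql("SHOW TBLPROPERTIES t('created.by')").show(truncate = false)

    spark.stop()
  }
}
{code}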
[jira] [Commented] (SPARK-14121) Show commands (Native)
[ https://issues.apache.org/jira/browse/SPARK-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213743#comment-15213743 ] Dilip Biswal commented on SPARK-14121: -- Just submitted a PR for SHOW TABLES and SHOW DATABASES. Will look into the rest of commands. Regards, -- Dilip > Show commands (Native) > -- > > Key: SPARK-14121 > URL: https://issues.apache.org/jira/browse/SPARK-14121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > For the following tokens, we should have native implementations. > TOK_SHOWDATABASES (Native) > TOK_SHOWTABLES (Native) > TOK_SHOW_TBLPROPERTIES (Native) > TOK_SHOWCOLUMNS (Native) > TOK_SHOWPARTITIONS (Native) > TOK_SHOW_TABLESTATUS (Native) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14184) Support native execution of SHOW DATABASE command and fix SHOW TABLE to use table identifier pattern
Dilip Biswal created SPARK-14184: Summary: Support native execution of SHOW DATABASE command and fix SHOW TABLE to use table identifier pattern Key: SPARK-14184 URL: https://issues.apache.org/jira/browse/SPARK-14184 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Dilip Biswal Need to address the following two scenarios. 1. Support native execution of SHOW DATABASES 2. Currently native execution of SHOW TABLES is supported with the exception that identifier_with_wildcards is not passed to the plan. So SHOW TABLES 'pattern' fails. The syntax for SHOW commands are described in following link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
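To make the two scenarios concrete, a Scala sketch with illustrative database and table names; the wildcard form follows the Hive identifier_with_wildcards syntax referenced above.
{code}
import org.apache.spark.sql.SparkSession

object ShowDatabasesAndTablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("show-commands").master("local[*]").getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
    spark.sql("CREATE TABLE IF NOT EXISTS sales_db.page_view(id INT) USING parquet")
    spark.sql("CREATE TABLE IF NOT EXISTS sales_db.page_click(id INT) USING parquet")

    // Scenario 1: native SHOW DATABASES, with and without a pattern.
    spark.sql("SHOW DATABASES").show()
    spark.sql("SHOW DATABASES LIKE 'sales*'").show()

    // Scenario 2: SHOW TABLES with an identifier_with_wildcards pattern, the
    // case that currently fails because the pattern is not passed to the plan.
    spark.sql("SHOW TABLES IN sales_db 'page*'").show()

    spark.stop()
  }
}
{code}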
[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199097#comment-15199097 ] Dilip Biswal commented on SPARK-13859: -- I have looked into this issue. After changing the query to use null safe equal operators in join condition, I can get the expected count of rows. In this case , the data set has lots of NULL values for columns which are part of join condition. for example : {color:green} ON (tmp1.c_last_name = tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and (tmp1.d_date = tmp2.d_date) {color} We need to use the null safe equal operator to select these rows. Below is the query output : {code} spark-sql> > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk <=> customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk <=> customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name <=> tmp2.c_last_name) and (tmp1.c_first_name <=> tmp2.c_first_name) and (tmp1.d_date <=> tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk <=> customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name <=> tmp3.c_last_name) and (tmp1.c_first_name <=> tmp3.c_first_name) and (tmp1.d_date <=> tmp3.d_date) > limit 100 > ; 107 {code} [~jfc...@us.ibm.com] Jesse, lets please try the modified query in your environment. Note : I have tried this on my 2.0 dev environment. > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. 
> Actual results: > {noformat} > [0] > {noformat} > Expected: > {noformat} > +-+ > | 1 | > +-+ > | 107 | > +-+ > {noformat} > query used: > {noformat} > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
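The heart of the mismatch above is the difference between = and the null-safe <=> operator on NULL join keys; here is a compact Scala sketch (toy data and view names are illustrative) showing the two behaviours side by side.
{code}
import org.apache.spark.sql.SparkSession

object NullSafeEqualitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("null-safe-join").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq((Some("smith"), 1), (None, 2)).toDF("c_last_name", "id").createOrReplaceTempView("t1")
    Seq((Some("smith"), 10), (None, 20)).toDF("c_last_name", "id").createOrReplaceTempView("t2")

    // Plain equality: NULL = NULL evaluates to NULL, so the rows with a NULL
    // last name never match and only the 'smith' row joins.
    spark.sql("select * from t1 join t2 on t1.c_last_name = t2.c_last_name").show()

    // Null-safe equality: NULL <=> NULL is true, so the NULL rows join as
    // well, which is why the null-safe version of the query recovers the
    // expected count.
    spark.sql("select * from t1 join t2 on t1.c_last_name <=> t2.c_last_name").show()

    spark.stop()
  }
}
{code}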
[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199136#comment-15199136 ] Dilip Biswal commented on SPARK-13865: -- [~smilegator] Quick update on this .. This also seems related to to null safe equal issue. I just put a comment [spark-13859|https://issues.apache.org/jira/browse/SPARK-13859] Here is the output of the query with expected count after doing similar modification. {code} spark-sql> select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk <=> customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk <=> customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 <=> tmp2.cln2) > and (tmp1.cfn1 <=> tmp2.cfn2) > and (tmp1.ddate1<=> tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk <=> date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk <=> customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp3 > on (tmp1.cln1 <=> tmp3.cln3) > and (tmp1.cfn1 <=> tmp3.cfn3) > and (tmp1.ddate1<=> tmp3.ddate3) > where > notnull2 is null and notnull3 is null; 47298 Time taken: 13.561 seconds, Fetched 1 row(s) {code} > TPCDS query 87 returns wrong results compared to TPC official result set > - > > Key: SPARK-13865 > URL: https://issues.apache.org/jira/browse/SPARK-13865 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 87 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 47555, answer set expects 47298. 
> Actual results: > {noformat} > [47555] > {noformat} > {noformat} > Expected: > +---+ > | 1 | > +---+ > | 47298 | > +---+ > {noformat} > Query used: > {noformat} > -- start query 87 in stream 0 using template query87.tpl and seed > QUALIFICATION > select count(*) > from > (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as > ddate1, 1 as notnull1 >from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp1 >left outer join > (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as > ddate2, 1 as notnull2 >from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk >where > d_month_seq between 1200 and 1200+11 >) tmp2 > on (tmp1.cln1 = tmp2.cln2) > and (tmp1.cfn1 = tmp2.cfn2) > and (tmp1.ddate1= tmp2.ddate2) >left outer join > (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as > ddate3, 1 as notnull3 >from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk >where
[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set
[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200465#comment-15200465 ] Dilip Biswal commented on SPARK-13859: -- Hello, Just checked the original spec for this query from tpcds website. Here is the template for Q38. {code} [_LIMITA] select [_LIMITB] count(*) from ( select distinct c_last_name, c_first_name, d_date from store_sales, date_dim, customer where store_sales.ss_sold_date_sk = date_dim.d_date_sk and store_sales.ss_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS] + 11 intersect select distinct c_last_name, c_first_name, d_date from catalog_sales, date_dim, customer where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS] + 11 intersect select distinct c_last_name, c_first_name, d_date from web_sales, date_dim, customer where web_sales.ws_sold_date_sk = date_dim.d_date_sk and web_sales.ws_bill_customer_sk = customer.c_customer_sk and d_month_seq between [DMS] and [DMS] + 11 ) hot_cust [_LIMITC]; {code} In this case the query in spec uses intersect operator where the implicitly generated join conditions use null safe comparison. In other-words, if we ran the query as is from spec then it would have worked. However the query in this JIRA has user supplied join conditions and uses "=". In my knowledge in SQL, the semantics of equal operator is well defined. So i don't think its a spark SQL issue. [~rxin] [~marmbrus] Please let us know your thoughts.. > TPCDS query 38 returns wrong results compared to TPC official result set > - > > Key: SPARK-13859 > URL: https://issues.apache.org/jira/browse/SPARK-13859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: tpcds-result-mismatch > > Testing Spark SQL using TPC queries. Query 38 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL returns count of 0, answer set reports 107. 
> Actual results: > {noformat} > [0] > {noformat} > Expected: > {noformat} > +-+ > | 1 | > +-+ > | 107 | > +-+ > {noformat} > query used: > {noformat} > -- start query 38 in stream 0 using template query38.tpl and seed > QUALIFICATION > select count(*) from ( > select distinct c_last_name, c_first_name, d_date > from store_sales > JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk > JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp1 > JOIN > (select distinct c_last_name, c_first_name, d_date > from catalog_sales > JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk > JOIN customer ON catalog_sales.cs_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = > tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and > (tmp1.d_date = tmp2.d_date) > JOIN > ( > select distinct c_last_name, c_first_name, d_date > from web_sales > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN customer ON web_sales.ws_bill_customer_sk = > customer.c_customer_sk > where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = > tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and > (tmp1.d_date = tmp3.d_date) > limit 100 > ; > -- end query 38 in stream 0 using template query38.tpl > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201828#comment-15201828 ] Dilip Biswal commented on SPARK-13821: -- [~roycecil] Thanks Roy !! > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 Fails to compile with the follwing Error Message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464 ] Dilip Biswal edited comment on SPARK-13821 at 3/16/16 4:41 AM: --- [~roycecil] Just tried the original query no. 20 against spark 2.0 posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 . I could see the same error that is reported in the JIRA. It seems that there is an extra comma in the projection list between two columns like following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure ? was (Author: dkbiswal): [~roycecil] Just tried the original query no. 20 against spark 2.0 posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 . I could see the same error that is reported in the JIRA. It seems the there is an extra comma in the projection list between two columns like following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure ? > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 Fails to compile with the follwing Error Message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile
[ https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464 ] Dilip Biswal commented on SPARK-13821: -- [~roycecil] Just tried the original query no. 20 against spark 2.0 posted at https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 . I could see the same error that is reported in the JIRA. It seems the there is an extra comma in the projection list between two columns like following. {code} select i_item_id, ,i_item_desc {code} Please note that we ran against 2.0 and not 1.6. Can you please re-run to make sure ? > TPC-DS Query 20 fails to compile > > > Key: SPARK-13821 > URL: https://issues.apache.org/jira/browse/SPARK-13821 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 > Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo) > Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 > 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Roy Cecil > > TPC-DS Query 20 Fails to compile with the follwing Error Message > {noformat} > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( > tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( > expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA > identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) > );]) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835) > at org.antlr.runtime.DFA.predict(DFA.java:80) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401) > at > org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate
Dilip Biswal created SPARK-13698: Summary: Fix Analysis Exceptions when Using Backticks in Generate Key: SPARK-13698 URL: https://issues.apache.org/jira/browse/SPARK-13698 Project: Spark Issue Type: Bug Components: SQL Reporter: Dilip Biswal Analysis exception occurs while running the following query. {code} SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints` {code} {code} Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7 'Project ['ints] +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8] +- SubqueryAlias nestedarray +- LocalRelation [a#0], 1,2,3 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
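A minimal sketch of how the failing plan above can be reproduced, assuming a nested schema where column a is a struct holding an array field b. The case classes and the table registration are my own reconstruction, not taken from the JIRA.
{code}
// Assumed setup (not from the JIRA): column `a` is a struct whose field `b` is an array of ints.
import sqlContext.implicits._
case class Inner(b: Seq[Int])
case class Nested(a: Inner)
Seq(Nested(Inner(Seq(1, 2, 3)))).toDF().registerTempTable("nestedArray")

// Reproduces the resolution failure on the backticked generator output described above.
sqlContext.sql("SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`").show()
{code}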
[jira] [Created] (SPARK-13651) Generator outputs are not resolved correctly resulting in runtime error
Dilip Biswal created SPARK-13651: Summary: Generator outputs are not resolved correctly resulting in runtime error Key: SPARK-13651 URL: https://issues.apache.org/jira/browse/SPARK-13651 Project: Spark Issue Type: Bug Components: SQL Reporter: Dilip Biswal Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src") sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 'key2', 200)) t1 AS key, value") Running above repro results in : java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98) at org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) at scala.collection.AbstractIterator.to(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) at org.apache.spark.scheduler.Task.run(Task.scala:82) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13427) Support USING clause in JOIN
Dilip Biswal created SPARK-13427: Summary: Support USING clause in JOIN Key: SPARK-13427 URL: https://issues.apache.org/jira/browse/SPARK-13427 Project: Spark Issue Type: Improvement Components: SQL Reporter: Dilip Biswal Support queries that join tables with a USING clause, for example: SELECT * from table1 JOIN table2 USING -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
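The example above is truncated at the USING keyword; below is an illustrative sketch of the standard SQL syntax being requested. The table names table1 and table2 come from the JIRA, while the join column id is a placeholder of my own.
{code}
// Placeholder column name; illustrative of the standard SQL USING syntax requested above.
sqlContext.sql("SELECT * FROM table1 JOIN table2 USING (id)")
// Roughly equivalent to the explicit equi-join form, except that USING also
// collapses the duplicated join column in the output:
sqlContext.sql("SELECT * FROM table1 JOIN table2 ON table1.id = table2.id")
{code}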
Re: Welcoming two new committers
Congratulations Wenchen and Herman !! Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com From: Xiao Li To: Corey Nolet Cc: Ted Yu , Matei Zaharia , dev Date: 02/08/2016 09:39 AM Subject:Re: Welcoming two new committers Congratulations! Herman and Wenchen! I am just so happy for you! You absolutely deserve it! 2016-02-08 9:35 GMT-08:00 Corey Nolet : Congrats guys! On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu wrote: Congratulations, Herman and Wenchen. On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia wrote: Hi all, The PMC has recently added two new Spark committers -- Herman van Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, adding new features, optimizations and APIs. Please join me in welcoming Herman and Wenchen. Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128902#comment-15128902 ] Dilip Biswal edited comment on SPARK-12988 at 2/2/16 7:56 PM: -- The subtle difference between column path and column name may not be very obvious to a common user of this API. val df = Seq((1, 1)).toDF("a_b", "a.b") df.select("`a.b`") df.drop("`a.b`") => the fact that one can not use back tick here , would it be that obvious to the user ? I believe that was the motivation to allow it but then i am not sure of its implications. was (Author: dkbiswal): The shuttle difference between column path and column name may not be very obvious to a common user of this API. val df = Seq((1, 1)).toDF("a_b", "a.b") df.select("`a.b`") df.drop("`a.b`") => the fact that one can not use back tick here , would it be that obvious to the user ? I believe that was the motivation to allow it but then i am not sure of its implications. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
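A short sketch of the distinction being discussed, written against the behavior the JIRA proposes (treat the argument of drop() as a literal column name). This is illustrative only, not the shipped semantics.
{code}
// Illustrative of the proposal above: select() resolves "`a.b`" as a quoted column *path*,
// while drop() would match its argument literally as a column *name*.
import sqlContext.implicits._
val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`").show()   // backticks are needed so the resolver does not treat the dot as a path
df.drop("a.b").show()       // proposed: matches the column named "a.b" literally, no backticks required
{code}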
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128902#comment-15128902 ] Dilip Biswal commented on SPARK-12988: -- The shuttle difference between column path and column name may not be very obvious to a common user of this API. val df = Seq((1, 1)).toDF("a_b", "a.b") df.select("`a.b`") df.drop("`a.b`") => the fact that one can not use back tick here , would it be that obvious to the user ? I believe that was the motivation to allow it but then i am not sure of its implications. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118775#comment-15118775 ] Dilip Biswal commented on SPARK-12988: -- [~marmbrus][~rxin] Thanks for your input. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118319#comment-15118319 ] Dilip Biswal edited comment on SPARK-12988 at 1/27/16 12:02 AM: [~marmbrus] Hi Michael, need your input on the semantics. Say we have a dataframe defined like following : val df = Seq((1, 1,1)).toDF("a_b", "a.c", "`a.c`") df.drop("a.c") => Should we remove the 2nd column here ? df.drop("`a.c`") => Should we remove the 3rd column here ? Regards, -- Dilip was (Author: dkbiswal): [~marmbrus] Hi Michael, need your input on the semantics. Say we have a dataframe defined like following : val df = Seq((1, 1,1,1,1,1)).toDF("a_b", "a.c", "`a.c`") df.drop("a.c") => Should we remove the 2nd column here ? df.drop("`a.c`") => Should we remove the 3rd column here ? Regards, -- Dilip > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118319#comment-15118319 ] Dilip Biswal commented on SPARK-12988: -- [~marmbrus] Hi Michael, need your input on the semantics. Say we have a dataframe defined like following : val df = Seq((1, 1,1,1,1,1)).toDF("a_b", "a.c", "`a.c`") df.drop("a.c") => Should we remove the 2nd column here ? df.drop("`a.c`") => Should we remove the 3rd column here ? Regards, -- Dilip > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117509#comment-15117509 ] Dilip Biswal commented on SPARK-12988: -- I would like to work on this one. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of theses works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074508#comment-15074508 ] Dilip Biswal commented on SPARK-12558: -- I would like to work on this one. > AnalysisException when multiple functions applied in GROUP BY clause > > > Key: SPARK-12558 > URL: https://issues.apache.org/jira/browse/SPARK-12558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > Hi, > I have following issue when trying to use functions in group by clause. > Example: > {code} > sqlCtx = HiveContext(sc) > rdd = sc.parallelize([{'test_date': 1451400761}]) > df = sqlCtx.createDataFrame(rdd) > df.registerTempTable("df") > {code} > Now, where I'm using single function it's OK. > {code} > sqlCtx.sql("select cast(test_date as timestamp) from df group by > cast(test_date as timestamp)").collect() > [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))] > {code} > Where I'm using more than one function I'm getting AnalysisException > {code} > sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by > date(cast(test_date as timestamp))").collect() > Py4JJavaError: An error occurred while calling o38.sql. > : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
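A Scala transcription of the failing pattern above, plus a possible workaround while the analyzer bug is open (aliasing the expression in a subquery). Both are illustrative sketches and not taken from the JIRA.
{code}
// Scala transcription of the PySpark repro above (illustrative).
sqlContext.sql(
  "SELECT date(cast(test_date AS timestamp)) FROM df " +
  "GROUP BY date(cast(test_date AS timestamp))")

// Possible workaround: alias the expression in a subquery and group by the alias
// instead of repeating the nested function calls in the GROUP BY clause.
sqlContext.sql(
  "SELECT d FROM (SELECT date(cast(test_date AS timestamp)) AS d FROM df) t GROUP BY d")
{code}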
[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067278#comment-15067278 ] Dilip Biswal commented on SPARK-12458: -- I would like to work on this one. > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12398) Smart truncation of DataFrame / Dataset toString
[ https://issues.apache.org/jira/browse/SPARK-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062527#comment-15062527 ] Dilip Biswal commented on SPARK-12398: -- [~rxin] Hi Reynold, are you working on this? If not, I would like to try to fix it. > Smart truncation of DataFrame / Dataset toString > > > Key: SPARK-12398 > URL: https://issues.apache.org/jira/browse/SPARK-12398 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > Labels: starter > > When a DataFrame or Dataset has a long schema, we should intelligently > truncate to avoid flooding the screen with unreadable information. > {code} > // Standard output > [a: int, b: int] > // Truncate many top level fields > [a: int, b, string ... 10 more fields] > // Truncate long inner structs > [a: struct] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12359) Add showString() to DataSet API.
Dilip Biswal created SPARK-12359: Summary: Add showString() to DataSet API. Key: SPARK-12359 URL: https://issues.apache.org/jira/browse/SPARK-12359 Project: Spark Issue Type: Bug Components: SQL Reporter: Dilip Biswal Priority: Minor SPARK-12105 exposed showString() and its variants as public APIs. This JIRA adds those APIs to the Dataset API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail
[ https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050319#comment-15050319 ] Dilip Biswal commented on SPARK-12257: -- Was able to reproduce this issue. Looking into it. > Non partitioned insert into a partitioned Hive table doesn't fail > - > > Key: SPARK-12257 > URL: https://issues.apache.org/jira/browse/SPARK-12257 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Mark Grover >Priority: Minor > > I am using Spark 1.5.1 but I anticipate this to be a problem with master as > well (will check later). > I have a dataframe, and a partitioned Hive table that I want to insert the > contents of the data frame into. > Let's say mytable is a non-partitioned Hive table and mytable_partitioned is > a partitioned Hive table. In Hive, if you try to insert from the > non-partitioned mytable table into mytable_partitioned without specifying the > partition, the query fails, as expected: > {quote} > INSERT INTO mytable_partitioned SELECT * FROM mytable; > {quote} > Error: Error while compiling statement: FAILED: SemanticException 1:12 Need > to specify partition columns because the destination table is partitioned. > Error encountered near token 'mytable_partitioned' (state=42000,code=4) > {quote} > However, if I do the same in Spark SQL: > {code} > val myDfTempTable = myDf.registerTempTable("my_df_temp_table") > sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM > my_df_temp_table") > {code} > This appears to succeed but does no insertion. This should fail with an error > stating the data is being inserted into a partitioned table without > specifying the name of the partition. > Of course, the name of the partition is explicitly specified, both Hive and > Spark SQL do the right thing and function correctly. > In hive: > {code} > INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable; > {code} > In Spark SQL: > {code} > val myDfTempTable = myDf.registerTempTable("my_df_temp_table") > sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * > FROM my_df_temp_table") > {code} > And, here are the definitions of my tables, as reference: > {code} > CREATE TABLE mytable(x INT); > CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT); > {code} > You will also need to insert some dummy data into mytable to ensure that the > insertion is actually not working: > {code} > #!/bin/bash > rm -rf data.txt; > for i in {0..9}; do > echo $i >> data.txt > done > sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11949) Query on DataFrame from cube gives wrong results
[ https://issues.apache.org/jira/browse/SPARK-11949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030718#comment-15030718 ] Dilip Biswal commented on SPARK-11949: -- I would like to work on this issue. > Query on DataFrame from cube gives wrong results > > > Key: SPARK-11949 > URL: https://issues.apache.org/jira/browse/SPARK-11949 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Veli Kerim Celik > Labels: dataframe, sql > > {code:title=Reproduce bug|borderStyle=solid} > case class fact(date: Int, hour: Int, minute: Int, room_name: String, temp: > Double) > val df0 = sc.parallelize(Seq > ( > fact(20151123, 18, 35, "room1", 18.6), > fact(20151123, 18, 35, "room2", 22.4), > fact(20151123, 18, 36, "room1", 17.4), > fact(20151123, 18, 36, "room2", 25.6) > )).toDF() > val cube0 = df0.cube("date", "hour", "minute", "room_name").agg(Map > ( > "temp" -> "avg" > )) > cube0.where("date IS NULL").show() > {code} > The query result is empty. It should not be, because cube0 contains the value > null several times in column 'date'. The issue arises because the cube > function reuses the schema information from df0. If I change the type of > parameters in the case class to Option[T] the query gives correct results. > Solution: The cube function should change the schema by changing the nullable > property to true, for the columns (dimensions) specified in the method call > parameters. > I am new at Scala and Spark. I don't know how to implement this. Somebody > please do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
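As the report itself notes, declaring the dimension fields as Option[T] sidesteps the problem because the input schema is then nullable to begin with. A sketch of that workaround, using the names from the report:
{code}
// Workaround sketch based on the report above: make the dimension columns nullable
// up front by declaring them as Option[_], so the cube output schema is nullable too.
import sqlContext.implicits._
case class fact(date: Option[Int], hour: Option[Int], minute: Option[Int],
                room_name: Option[String], temp: Double)

val df0 = sc.parallelize(Seq(
  fact(Some(20151123), Some(18), Some(35), Some("room1"), 18.6),
  fact(Some(20151123), Some(18), Some(35), Some("room2"), 22.4)
)).toDF()

df0.cube("date", "hour", "minute", "room_name").agg(Map("temp" -> "avg"))
  .where("date IS NULL").show()   // now returns the rolled-up rows as expected
{code}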
[jira] [Comment Edited] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column
[ https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334 ] Dilip Biswal edited comment on SPARK-11997 at 11/26/15 8:32 AM: I would like to work on this issue. Currently testing the patch. was (Author: dkbiswal): I would like to work on this issue. > NPE when save a DataFrame as parquet and partitioned by long column > --- > > Key: SPARK-11997 > URL: https://issues.apache.org/jira/browse/SPARK-11997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Blocker > > {code} > >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - > >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3") > 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job. > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.Range.foreach(Range.scala:141) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615) > at > org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131) > at > org.ap
[jira] [Commented] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column
[ https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334 ] Dilip Biswal commented on SPARK-11997: -- I would like to work on this issue. > NPE when save a DataFrame as parquet and partitioned by long column > --- > > Key: SPARK-11997 > URL: https://issues.apache.org/jira/browse/SPARK-11997 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Priority: Blocker > > {code} > >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - > >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3") > 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job. > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.Range.foreach(Range.scala:141) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616) > at > org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615) > at > org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > at > org.apache.spark.sql.execution.QueryExecution.
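For readers following along in Scala, the PySpark repro in the report above translates directly; the output path below is a placeholder of my own.
{code}
// Scala transcription of the repro above (illustrative; output path is a placeholder).
sqlContext.range(1L << 20)
  .selectExpr("if(id % 10 = 0, null, (id % 111) - 50) AS n", "id")
  .write.partitionBy("n").parquet("/tmp/myid3")
{code}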
[jira] [Created] (SPARK-11863) Unable to resolve order by if it contains mixture of aliases and real columns.
Dilip Biswal created SPARK-11863: Summary: Unable to resolve order by if it contains mixture of aliases and real columns. Key: SPARK-11863 URL: https://issues.apache.org/jira/browse/SPARK-11863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Dilip Biswal The analyzer is unable to resolve an ORDER BY clause when it contains a mixture of aliases and real column names. Example: var var3 = sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, c2 order by b, c1") This used to work in 1.4 and started failing in 1.5, affecting some TPC-DS queries (19, 55, 71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11584) The attribute of temporay table shows false
[ https://issues.apache.org/jira/browse/SPARK-11584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997265#comment-14997265 ] Dilip Biswal commented on SPARK-11584: -- I would like to work on this issue. > The attribute of temporay table shows false > > > Key: SPARK-11584 > URL: https://issues.apache.org/jira/browse/SPARK-11584 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jay >Priority: Minor > > After using command " create temporary table tableName" to create a table, > then I think the attribute(istemporary) of that table should be true, but > actually "show tables" indicates that the table's attribute is false. So, I > am confused and hope somebody can solve this problem. > Note: The command is just "create temporary table tableName" without "using". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11577) Handle code review comments for SPARK-11188
[ https://issues.apache.org/jira/browse/SPARK-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-11577: - Summary: Handle code review comments for SPARK-11188 (was: Suppress stacktraces in bin/spark-sql for AnalysisExceptions) > Handle code review comments for SPARK-11188 > --- > > Key: SPARK-11577 > URL: https://issues.apache.org/jira/browse/SPARK-11577 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Reporter: Dilip Biswal >Priority: Minor > Fix For: 1.5.2 > > > The fix for SPARK-11188 suppressed printing stack traces to the console when > encountering AnalysisException. This JIRA completes SPARK-11188 by addressing the code > review comments from Michael. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11577) Suppress stacktraces in bin/spark-sql for AnalysisExceptions
Dilip Biswal created SPARK-11577: Summary: Suppress stacktraces in bin/spark-sql for AnalysisExceptions Key: SPARK-11577 URL: https://issues.apache.org/jira/browse/SPARK-11577 Project: Spark Issue Type: Bug Affects Versions: 1.5.2 Reporter: Dilip Biswal Priority: Minor Fix For: 1.5.2 The fix for SPARK-11188 suppressed printing stack traces to the console when encountering AnalysisException. This JIRA completes SPARK-11188 by addressing the code review comments from Michael. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11544) sqlContext doesn't use PathFilter
[ https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995476#comment-14995476 ] Dilip Biswal edited comment on SPARK-11544 at 11/8/15 4:52 AM: --- I would like to work on this issue. was (Author: dkbiswal): I am looking into this issue. > sqlContext doesn't use PathFilter > - > > Key: SPARK-11544 > URL: https://issues.apache.org/jira/browse/SPARK-11544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: AWS EMR 4.1.0, Spark 1.5.0 >Reporter: Frank Dai > > When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the > underlying SparkContext > {code:java} > val sc = new SparkContext(conf) > sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", > classOf[TmpFileFilter], classOf[PathFilter]) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > {code} > The definition of {{TmpFileFilter}} is: > {code:title=TmpFileFilter.scala|borderStyle=solid} > import org.apache.hadoop.fs.{Path, PathFilter} > class TmpFileFilter extends PathFilter { > override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp") > } > {code} > When use {{sqlContext}} to read JSON files, e.g., > {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an > exception: > {quote} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp > {quote} > It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} can not, which > causes the above exception -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11544) sqlContext doesn't use PathFilter
[ https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995476#comment-14995476 ] Dilip Biswal commented on SPARK-11544: -- I am looking into this issue. > sqlContext doesn't use PathFilter > - > > Key: SPARK-11544 > URL: https://issues.apache.org/jira/browse/SPARK-11544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: AWS EMR 4.1.0, Spark 1.5.0 >Reporter: Frank Dai > > When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the > underlying SparkContext > {code:java} > val sc = new SparkContext(conf) > sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", > classOf[TmpFileFilter], classOf[PathFilter]) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > {code} > The definition of {{TmpFileFilter}} is: > {code:title=TmpFileFilter.scala|borderStyle=solid} > import org.apache.hadoop.fs.{Path, PathFilter} > class TmpFileFilter extends PathFilter { > override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp") > } > {code} > When use {{sqlContext}} to read JSON files, e.g., > {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an > exception: > {quote} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp > {quote} > It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} can not, which > causes the above exception -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Master build fails ?
Hello Ted, Thanks for your response. Here is the command i used : build/sbt clean build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0 -DskipTests assembly I am building on CentOS and on master branch. One other thing, i was able to build fine with the above command up until recently. I think i have stared to have problem after SPARK-11073 where the HashCodes import was added. Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com From: Ted Yu To: Dilip Biswal/Oakland/IBM@IBMUS Cc: Jean-Baptiste Onofré , "dev@spark.apache.org" Date: 11/05/2015 10:46 AM Subject:Re: Master build fails ? Dilip: Can you give the command you used ? Which release were you building ? What OS did you build on ? Cheers On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal wrote: Hello, I am getting the same build error about not being able to find com.google.common.hash.HashCodes. Is there a solution to this ? Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com From:Jean-Baptiste Onofré To:Ted Yu Cc:"dev@spark.apache.org" Date:11/03/2015 07:20 AM Subject:Re: Master build fails ? Hi Ted, thanks for the update. The build with sbt is in progress on my box. Regards JB On 11/03/2015 03:31 PM, Ted Yu wrote: > Interesting, Sbt builds were not all failing: > > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/ > > FYI > > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré <mailto:j...@nanthrax.net>> wrote: > > Hi Jacek, > > it works fine with mvn: the problem is with sbt. > > I suspect a different reactor order in sbt compare to mvn. > > Regards > JB > > On 11/03/2015 02:44 PM, Jacek Laskowski wrote: > > Hi, > > Just built the sources using the following command and it worked > fine. > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver > -DskipTests clean install > ... > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 14:15 min > [INFO] Finished at: 2015-11-03T14:40:40+01:00 > [INFO] Final Memory: 438M/1972M > [INFO] > > > ➜ spark git:(master) ✗ java -version > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) > > I'm on Mac OS. > > Pozdrawiam, > Jacek > > -- > Jacek Laskowski | http://blog.japila.pl| > http://blog.jaceklaskowski.pl > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski > > > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré > mailto:j...@nanthrax.net>> wrote: > > Thanks for the update, I used mvn to build but without hive > profile. > > Let me try with mvn with the same options as you and sbt also. > > I keep you posted. > > Regards > JB > > On 11/03/2015 12:55 PM, Jeff Zhang wrote: > > > I found it is due to SPARK-11073. > > Here's the command I used to build > > build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive > -Phive-thriftserver > -Psparkr > > On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré > mailto:j...@nanthrax.net> > <mailto:j...@nanthrax.net<mailto:j...@nanthrax.net>>> wrote: > > Hi Jeff, > > it works for me (with skipping the tests). > > Let me try again, just to be sure. > > Regards > JB > > > On 11/03/2015 11:50 AM, Jeff Zhang wrote: > > Looks like it's due to guava version > conflicts, I see both guava > 14.0.1 > and 16.0.1 under lib_managed/bundles. Anyone > meet this issue too ? > > [error] > > /Users/jzhang/github/spark_apache/cor
Re: Master build fails ?
Hello, I am getting the same build error about not being able to find com.google.common.hash.HashCodes. Is there a solution to this ? Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com From: Jean-Baptiste Onofré To: Ted Yu Cc: "dev@spark.apache.org" Date: 11/03/2015 07:20 AM Subject:Re: Master build fails ? Hi Ted, thanks for the update. The build with sbt is in progress on my box. Regards JB On 11/03/2015 03:31 PM, Ted Yu wrote: > Interesting, Sbt builds were not all failing: > > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/ > > FYI > > On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré <mailto:j...@nanthrax.net>> wrote: > > Hi Jacek, > > it works fine with mvn: the problem is with sbt. > > I suspect a different reactor order in sbt compare to mvn. > > Regards > JB > > On 11/03/2015 02:44 PM, Jacek Laskowski wrote: > > Hi, > > Just built the sources using the following command and it worked > fine. > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver > -DskipTests clean install > ... > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 14:15 min > [INFO] Finished at: 2015-11-03T14:40:40+01:00 > [INFO] Final Memory: 438M/1972M > [INFO] > > > ➜ spark git:(master) ✗ java -version > java version "1.8.0_66" > Java(TM) SE Runtime Environment (build 1.8.0_66-b17) > Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) > > I'm on Mac OS. > > Pozdrawiam, > Jacek > > -- > Jacek Laskowski | http://blog.japila.pl | > http://blog.jaceklaskowski.pl > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski > > > On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré > mailto:j...@nanthrax.net>> wrote: > > Thanks for the update, I used mvn to build but without hive > profile. > > Let me try with mvn with the same options as you and sbt also. > > I keep you posted. > > Regards > JB > > On 11/03/2015 12:55 PM, Jeff Zhang wrote: > > > I found it is due to SPARK-11073. > > Here's the command I used to build > > build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive > -Phive-thriftserver > -Psparkr > > On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré > mailto:j...@nanthrax.net> > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote: > > Hi Jeff, > > it works for me (with skipping the tests). > > Let me try again, just to be sure. > > Regards > JB > > > On 11/03/2015 11:50 AM, Jeff Zhang wrote: > > Looks like it's due to guava version > conflicts, I see both guava > 14.0.1 > and 16.0.1 under lib_managed/bundles. Anyone > meet this issue too ? > > [error] > > /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26: > object HashCodes is not a member of package > com.google.common.hash > [error] import com.google.common.hash.HashCodes > [error]^ > [info] Resolving > org.apache.commons#commons-math;2.2 ... > [error] > > /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384: > not found: value HashCodes > [error] val cookie = > HashCodes.fromBytes(secret).toString() > [error] ^ > > > > > -- > Best Regards > >
Re: SPARK SQL Error
Hi Giri, You are perhaps missing the "--files" option before the supplied hdfs file name ? spark-submit --master yarn --class org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar --files hdfs://quickstart.cloudera:8020/people_csv Please refer to Ritchard's comments on why the --files option may be redundant in your case. Regards, Dilip Biswal Tel: 408-463-4980 dbis...@us.ibm.com From: Giri To: user@spark.apache.org Date: 10/15/2015 02:44 AM Subject:Re: SPARK SQL Error Hi Ritchard, Thank you so much again for your input.This time I ran the command in the below way spark-submit --master yarn --class org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar hdfs://quickstart.cloudera:8020/people_csv But I am facing the new error "Could not parse Master URL: 'hdfs://quickstart.cloudera:8020/people_csv'" file path is correct hadoop fs -ls hdfs://quickstart.cloudera:8020/people_csv -rw-r--r-- 1 cloudera supergroup 29 2015-10-10 00:02 hdfs://quickstart.cloudera:8020/people_csv Can you help me to fix this new error 15/10/15 02:24:39 INFO spark.SparkContext: Added JAR file:/home/cloudera/Desktop/TestMain.jar at http://10.0.2.15:40084/jars/TestMain.jar with timestamp 1444901079484 Exception in thread "main" org.apache.spark.SparkException: Could not parse Master URL: 'hdfs://quickstart.cloudera:8020/people_csv' at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244) at org.apache.spark.SparkContext.(SparkContext.scala:361) at org.apache.spark.SparkContext.(SparkContext.scala:154) at org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10) at org.spark.apache.CsvDataSource.main(CsvDataSource.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks & Regards, Giri. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25075.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958109#comment-14958109 ] Dilip Biswal commented on SPARK-10943: -- Hi Jason, >From the parquet format page , here are the data types thats supported in >parquet. BOOLEAN: 1 bit boolean INT32: 32 bit signed ints INT64: 64 bit signed ints INT96: 96 bit signed ints FLOAT: IEEE 32-bit floating point values DOUBLE: IEEE 64-bit floating point values BYTE_ARRAY: arbitrarily long byte arrays. In your test case , you are trying to write an un-typed null value and there is no mapping between this type (NullType) to the builtin types supported by parquet. [~marmbrus] is this a valid scenario ? Regards, -- Dilip > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > //FAIL - Try writing a NullType column (where all the values are NULL) > data02.write.parquet("/tmp/celtra-test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.Stru
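A common way around the limitation described above is to give the all-null column an explicit type so it maps onto one of the listed Parquet primitives. An illustrative sketch using the query and output path from the report (the cast to string is my own suggestion):
{code}
// Illustrative workaround: cast the untyped NULL to a concrete type (string here)
// so the column maps onto a supported Parquet type instead of NullType.
val data02 = sqlContext.sql(
  "select 1 as id, \"cat in the hat\" as text, cast(null as string) as comments")
data02.write.parquet("/tmp/celtra-test/dataset2")   // path taken from the report
{code}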
[jira] [Created] (SPARK-11024) Optimize NULL in by folding it to Literal(null)
Dilip Biswal created SPARK-11024: Summary: Optimize NULL in by folding it to Literal(null) Key: SPARK-11024 URL: https://issues.apache.org/jira/browse/SPARK-11024 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Dilip Biswal Priority: Minor Add a rule to the optimizer to convert NULL [NOT] IN (expr1,...,expr2) to Literal(null). This is a follow-up to SPARK-8654, as suggested by Wenchen Fan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
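A minimal sketch of what such a rule could look like, written against the 1.5-era Catalyst API. This is my own illustration of the technique, not the patch attached to this JIRA.
{code}
import org.apache.spark.sql.catalyst.expressions.{In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Sketch only: fold NULL [NOT] IN (...) to a null boolean literal, since the
// result is NULL regardless of the contents of the IN list.
object FoldNullIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(Literal(null, _), _)      => Literal.create(null, BooleanType)
    case Not(In(Literal(null, _), _)) => Literal.create(null, BooleanType)
  }
}
{code}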
[jira] [Commented] (SPARK-11024) Optimize NULL in by folding it to Literal(null)
[ https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950124#comment-14950124 ] Dilip Biswal commented on SPARK-11024: -- I am currently working on a PR for this issue. > Optimize NULL in by folding it to Literal(null) > > > Key: SPARK-11024 > URL: https://issues.apache.org/jira/browse/SPARK-11024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Dilip Biswal >Priority: Minor > > Add a rule in optimizer to convert NULL [NOT] IN (expr1,...,expr2) to > Literal(null). > This is a follow up defect to SPARK-8654 and suggested by Wenchen Fan. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement
[ https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944462#comment-14944462 ] Dilip Biswal commented on SPARK-10534: -- I would like to work on this. > ORDER BY clause allows only columns that are present in SELECT statement > > > Key: SPARK-10534 > URL: https://issues.apache.org/jira/browse/SPARK-10534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michal Cwienczek > > When invoking query SELECT EmployeeID from Employees order by YEAR(HireDate) > Spark 1.5 throws exception: > {code} > cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given > input columns EmployeeID; line 2 pos 14 StackTrace: > org.apache.spark.sql.AnalysisException: cannot resolve > 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns > EmployeeID; line 2 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
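The failing query from the report, followed by a possible workaround that keeps the sort expression resolvable. Table and column names come from the report; the alias hire_year and the workaround itself are illustrative and not from the JIRA.
{code}
// Failing pattern from the report (ORDER BY references a column that is not in the select list).
sqlContext.sql("SELECT EmployeeID FROM Employees ORDER BY YEAR(HireDate)")

// Possible workaround while the bug is open: project the sort key, order by its alias,
// then drop it from the result.
sqlContext.sql("SELECT EmployeeID, YEAR(HireDate) AS hire_year FROM Employees ORDER BY hire_year")
  .drop("hire_year")
{code}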
[jira] [Commented] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934731#comment-14934731 ] Dilip Biswal commented on SPARK-8654: - I would like to work on this issue.. > Analysis exception when using "NULL IN (...)": invalid cast > --- > > Key: SPARK-8654 > URL: https://issues.apache.org/jira/browse/SPARK-8654 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > > The following query throws an analysis exception: > {code} > SELECT * FROM t WHERE NULL NOT IN (1, 2, 3); > {code} > The exception is: > {code} > org.apache.spark.sql.AnalysisException: invalid cast from int to null; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52) > {code} > Here is a test that can be added to AnalysisSuite to check the issue: > {code} > test("SPARK- regression test") { > val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), > "a")() :: Nil, > LocalRelation() > ) > caseInsensitiveAnalyze(plan) > } > {code} > Note that this kind of query is a corner case, but it is still valid SQL. An > expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL > as a result, even if the list contains NULL. So it is safe to translate these > expressions to Literal(null) during analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
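The point in the description above, that NULL [NOT] IN (...) always evaluates to NULL and therefore filters out every row, can be checked with a quick query once the analysis exception is resolved. The table t is taken from the report; everything else is illustrative.
{code}
// Illustration of the semantics described above: the predicate is NULL for every row,
// so once the analysis exception is fixed this filter should keep nothing.
// (Assumes a table t is already registered, as in the report.)
sqlContext.sql("SELECT * FROM t WHERE NULL NOT IN (1, 2, 3)").count()   // expected: 0
{code}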