[jira] [Created] (SPARK-14953) LocalBackend should revive offers periodically
Cheng Lian created SPARK-14953: -- Summary: LocalBackend should revive offers periodically Key: SPARK-14953 URL: https://issues.apache.org/jira/browse/SPARK-14953 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Cheng Lian {{LocalBackend}} only revives offers when tasks are submitted, succeed, or fail. This may lead to deadlock due to delayed scheduling. A case study is provided in [this PR comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425]. Basically, a job may have a task whose scheduling is delayed due to a locality mismatch. The default delay timeout is 3s. If all other tasks finish during this period, {{LocalBackend}} won't revive any offers after the timeout, since no tasks are submitted, succeed, or fail from then on. Thus, the delayed task will never be scheduled again and the job never completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
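A minimal sketch of the periodic revival this ticket asks for, assuming a scheduled daemon thread inside {{LocalBackend}} (the helper names follow the Spark code base, but the exact wiring here is illustrative, not the merged patch):

{code}
import java.util.concurrent.TimeUnit

import org.apache.spark.util.{ThreadUtils, Utils}

// Revive offers on a fixed interval so that a task waiting out its locality
// delay is eventually re-offered resources, even when no task submission,
// success, or failure event occurs in the meantime.
private val reviveThread =
  ThreadUtils.newDaemonSingleThreadScheduledExecutor("local-revive-thread")

override def start() {
  reviveThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      reviveOffers()
    }
  }, 1000, 1000, TimeUnit.MILLISECONDS)
}

override def stop() {
  reviveThread.shutdownNow()
}
{code}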
[jira] [Resolved] (SPARK-14918) ExternalCatalog.TablePartitionSpec doesn't preserve partition column order
[ https://issues.apache.org/jira/browse/SPARK-14918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14918. Resolution: Not A Problem > ExternalCatalog.TablePartitionSpec doesn't preserve partition column order > -- > > Key: SPARK-14918 > URL: https://issues.apache.org/jira/browse/SPARK-14918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > The Hive equivalent of {{ExternalCatalog.TablePartitionSpec}} is the > {{LinkedHashMap}} returned by {{Partition.getSpec()}}, which preserves > partition column order. > However, we are using a {{scala.immutable.Map}} to store the result, which no > longer preserves the original order. What makes it worse, Scala specializes > immutable maps with fewer than 5 elements, and these specialized versions do > preserve order, which hides this issue in test cases since we never use more > than 4 partition columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
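For reference, a plain-Scala demonstration of the specialization behavior described above (no Spark required):

{code}
// Maps with up to 4 entries use specialized classes (Map.Map1 .. Map.Map4),
// which happen to preserve insertion order:
println(Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4).keys.mkString(", "))
// prints: a, b, c, d

// With 5 or more entries Scala falls back to a HashMap, which does not
// preserve insertion order (the exact output is implementation-dependent):
println(Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 5).keys.mkString(", "))
// prints the keys in hash order, e.g. "e, a, b, c, d"
{code}

This is why test cases with at most 4 partition columns never exposed the problem.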
[jira] [Commented] (SPARK-14918) ExternalCatalog.TablePartitionSpec doesn't preserve partition column order
[ https://issues.apache.org/jira/browse/SPARK-14918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259906#comment-15259906 ] Cheng Lian commented on SPARK-14918: Decided to leave this one as "not an issue" since the problem I hit in PR #1 can be fixed inside the PR itself. The fact that Scala's immutable map doesn't preserve order doesn't have any negative effect anywhere else. > ExternalCatalog.TablePartitionSpec doesn't preserve partition column order > -- > > Key: SPARK-14918 > URL: https://issues.apache.org/jira/browse/SPARK-14918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The Hive equivalent of {{ExternalCatalog.TablePartitionSpec}} is the > {{LinkedHashMap}} returned by {{Partition.getSpec()}}, which preserves > partition column order. > However, we are using a {{scala.immutable.Map}} to store the result, which no > longer preserves the original order. What makes it worse, Scala specializes > immutable maps with fewer than 5 elements, and these specialized versions do > preserve order, which hides this issue in test cases since we never use more > than 4 partition columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14445) Show columns/partitions
[ https://issues.apache.org/jira/browse/SPARK-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14445. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 1 [https://github.com/apache/spark/pull/1] > Show columns/partitions > --- > > Key: SPARK-14445 > URL: https://issues.apache.org/jira/browse/SPARK-14445 > Project: Spark > Issue Type: Sub-task >Reporter: Dilip Biswal >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > 1. Support native execution of SHOW COLUMNS > 2. Support native execution of SHOW PARTITIONS > The syntax of these SHOW commands is described at the following link: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Show -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
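Illustrative usage of the two commands (the table name and partition spec here are hypothetical; the syntax follows the Hive manual linked above):

{code}
sqlContext.sql("SHOW COLUMNS IN sales")
sqlContext.sql("SHOW PARTITIONS sales")
sqlContext.sql("SHOW PARTITIONS sales PARTITION (dt='2016-04-26')")
{code}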
[jira] [Comment Edited] (SPARK-13983) HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since 1.6 version (both multi-session and single session)
[ https://issues.apache.org/jira/browse/SPARK-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258139#comment-15258139 ] Cheng Lian edited comment on SPARK-13983 at 4/26/16 4:44 PM: - Here's my (incomplete) findings: Configurations set using {{-hiveconf}} and {{-hivevar}} are set to the current {{SessionState}} after [calling SessionManager.openSession here|https://github.com/apache/spark/blob/branch-1.6/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L68-L70]. In 1.5, these configurations are populated implicitly since {{SessionState}} is thread-local. In 1.6, we create a new {{HiveContext}} using {{HiveContext.newSession}} under multi-session mode, which then [creates a new execution Hive client|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L119]. My theory is that {{ClientWrapper.newSession}} ignores the current {{SessionState}} and simply creates a new one, so configurations set via CLI flags are dropped. I haven't completely verified the last point though. was (Author: lian cheng): Here's my (incomplete) findings: Configurations set using {{--hiveconf}} and {{--hivevar}} are set to the current {{SessionState}} after [calling SessionManager.openSession here|https://github.com/apache/spark/blob/branch-1.6/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L68-L70]. In 1.5, these configurations are populated implicitly since {{SessionState}} is thread-local. In 1.6, we create a new {{HiveContext}} using {{HiveContext.newSession}} under multi-session mode, which then [creates a new execution Hive client|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L119]. My theory is that {{ClientWrapper.newSession}} ignores the current {{SessionState}} and simply creates a new one, so configurations set via CLI flags are dropped. I haven't completely verified the last point though. > HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since > 1.6 version (both multi-session and single session) > -- > > Key: SPARK-13983 > URL: https://issues.apache.org/jira/browse/SPARK-13983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1 > Environment: ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone > (tried spark branch-1.6 snapshot as well) > compiled with scala 2.10.5 and hadoop 2.6 > (-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver) >Reporter: Teng Qiu >Assignee: Cheng Lian > > HiveThriftServer2 should be able to get "\--hiveconf" or "\-\-hivevar" > variables from the JDBC client, either from a command line parameter of beeline, > such as > {{beeline --hiveconf spark.sql.shuffle.partitions=3 --hivevar > db_name=default}} > or from the JDBC connection string, like > {{jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default}} > this worked in spark version 1.5.x, but after upgrading to 1.6, it doesn't > work. 
> to reproduce this issue, try to connect to HiveThriftServer2 with beeline: > {code} > bin/beeline -u jdbc:hive2://localhost:1 \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > {code} > or > {code} > bin/beeline -u > jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default > {code} > you will get the following results: > {code} > 0: jdbc:hive2://localhost:1> set spark.sql.shuffle.partitions; > +-------------------------------+--------+ > | key | value | > +-------------------------------+--------+ > | spark.sql.shuffle.partitions | 200 | > +-------------------------------+--------+ > 1 row selected (0.192 seconds) > 0: jdbc:hive2://localhost:1> use ${db_name}; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '$' '{' 'db_name' in switch database statement; line 1 pos 4 (state=,code=0) > {code} > - > but this bug does not affect current versions of the spark-sql CLI; the following > commands work: > {code} > bin/spark-sql --master local[2] \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > spark-sql> set spark.sql.shuffle.partitions > spark.sql.shuffle.partitions 3 > Time taken: 1.037 seconds, Fetched 1 row(s) > spark-sql> use ${db_name};
[jira] [Commented] (SPARK-14918) ExternalCatalog.TablePartitionSpec doesn't preserve partition column order
[ https://issues.apache.org/jira/browse/SPARK-14918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258191#comment-15258191 ] Cheng Lian commented on SPARK-14918: Hit this issue while reviewing https://github.com/apache/spark/pull/1 However, I checked all usages of {{TablePartitionSpec}} throughout the current master branch, and none of them assumes that {{TablePartitionSpec}} preserves the order. I tend to fix the problematic PR by no longer relying on this wrong assumption. > ExternalCatalog.TablePartitionSpec doesn't preserve partition column order > -- > > Key: SPARK-14918 > URL: https://issues.apache.org/jira/browse/SPARK-14918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > The Hive equivalent of {{ExternalCatalog.TablePartitionSpec}} is the > {{LinkedHashMap}} returned by {{Partition.getSpec()}}, which preserves > partition column order. > However, we are using a {{scala.immutable.Map}} to store the result, which no > longer preserves the original order. What makes it worse, Scala specializes > immutable maps with fewer than 5 elements, and these specialized versions do > preserve order, which hides this issue in test cases since we never use more > than 4 partition columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14918) ExternalCatalog.TablePartitionSpec doesn't preserve partition column order
Cheng Lian created SPARK-14918: -- Summary: ExternalCatalog.TablePartitionSpec doesn't preserve partition column order Key: SPARK-14918 URL: https://issues.apache.org/jira/browse/SPARK-14918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian The Hive equivalent of {{ExternalCatalog.TablePartitionSpec}} is the {{LinkedHashMap}} returned by {{Partition.getSpec()}}, which preserves partition column order. However, we are using a {{scala.immutable.Map}} to store the result, which no longer preserves the original order. What makes it worse, Scala specializes immutable maps with fewer than 5 elements, and these specialized versions do preserve order, which hides this issue in test cases since we never use more than 4 partition columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13983) HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since 1.6 version (both multi-session and single session)
[ https://issues.apache.org/jira/browse/SPARK-13983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258139#comment-15258139 ] Cheng Lian commented on SPARK-13983: Here's my (incomplete) findings: Configurations set using {{--hiveconf}} and {{--hivevar}} are set to the current {{SessionState}} after [calling SessionManager.openSession here|https://github.com/apache/spark/blob/branch-1.6/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L68-L70]. In 1.5, these configurations are populated implicitly since {{SessionState}} is thread-local. In 1.6, we create a new {{HiveContext}} using {{HiveContext.newSession}} under multi-session mode, which then [creates a new execution Hive client|https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L119]. My theory is that {{ClientWrapper.newSession}} ignores the current {{SessionState}} and simply creates a new one, so configurations set via CLI flags are dropped. I haven't completely verified the last point though. > HiveThriftServer2 can not get "--hiveconf" or "--hivevar" variables since > 1.6 version (both multi-session and single session) > -- > > Key: SPARK-13983 > URL: https://issues.apache.org/jira/browse/SPARK-13983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1 > Environment: ubuntu, spark 1.6.0 standalone, spark 1.6.1 standalone > (tried spark branch-1.6 snapshot as well) > compiled with scala 2.10.5 and hadoop 2.6 > (-Phadoop-2.6 -Psparkr -Phive -Phive-thriftserver) >Reporter: Teng Qiu >Assignee: Cheng Lian > > HiveThriftServer2 should be able to get "\--hiveconf" or "\-\-hivevar" > variables from the JDBC client, either from a command line parameter of beeline, > such as > {{beeline --hiveconf spark.sql.shuffle.partitions=3 --hivevar > db_name=default}} > or from the JDBC connection string, like > {{jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default}} > this worked in spark version 1.5.x, but after upgrading to 1.6, it doesn't > work. 
> to reproduce this issue, try to connect to HiveThriftServer2 with beeline: > {code} > bin/beeline -u jdbc:hive2://localhost:1 \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > {code} > or > {code} > bin/beeline -u > jdbc:hive2://localhost:1?spark.sql.shuffle.partitions=3#db_name=default > {code} > you will get the following results: > {code} > 0: jdbc:hive2://localhost:1> set spark.sql.shuffle.partitions; > +-------------------------------+--------+ > | key | value | > +-------------------------------+--------+ > | spark.sql.shuffle.partitions | 200 | > +-------------------------------+--------+ > 1 row selected (0.192 seconds) > 0: jdbc:hive2://localhost:1> use ${db_name}; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '$' '{' 'db_name' in switch database statement; line 1 pos 4 (state=,code=0) > {code} > - > but this bug does not affect current versions of the spark-sql CLI; the following > commands work: > {code} > bin/spark-sql --master local[2] \ > --hiveconf spark.sql.shuffle.partitions=3 \ > --hivevar db_name=default > spark-sql> set spark.sql.shuffle.partitions > spark.sql.shuffle.partitions 3 > Time taken: 1.037 seconds, Fetched 1 row(s) > spark-sql> use ${db_name}; > OK > Time taken: 1.697 seconds > {code} > so I think it may be caused by this change: > https://github.com/apache/spark/pull/8909 ( [SPARK-10810] [SPARK-10902] [SQL] > Improve session management in SQL ) > perhaps by calling {{hiveContext.newSession}}, the variables from > {{sessionConf}} were not loaded into the new session? > (https://github.com/apache/spark/pull/8909/files#diff-8f8b7f4172e8a07ff20a4dbbbcc57b1dR69) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
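Following the theory in the comment above, a hypothetical sketch of the direction a fix could take in {{SparkSQLSessionManager.openSession}} (identifiers simplified; this is not the actual patch): copy the per-session overrides onto the new context explicitly instead of relying on the thread-local {{SessionState}}.

{code}
import scala.collection.JavaConverters._

// sessionConf carries the --hiveconf / connection-string settings
// supplied by the JDBC client for this session.
val ctx = if (multiSessionMode) hiveContext.newSession() else hiveContext
if (sessionConf != null) {
  sessionConf.asScala.foreach { case (key, value) =>
    ctx.setConf(key, value)  // propagate instead of dropping the overrides
  }
}
{code}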
[jira] [Resolved] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14875. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12652 [https://github.com/apache/spark/pull/12652] > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Reporter: Cheng Lian > Assignee: Cheng Lian > Fix For: 2.0.0 > > > Existing packages like spark-avro need to access > {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255249#comment-15255249 ] Cheng Lian commented on SPARK-14875: Checked with [~cloud_fan]; it was accidentally made private while adding the bucketing feature. I'm removing this qualifier. > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Reporter: Cheng Lian > Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
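For illustration, the fix amounts to dropping the qualifier on the factory method (the parameter list below is abbreviated; see the actual class for the full signature):

{code}
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.types.StructType

abstract class OutputWriterFactory extends Serializable {
  // Was: private[sql] def newInstance(...): OutputWriter
  // Making it public again lets external data sources such as spark-avro
  // implement and call it.
  def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter
}
{code}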
[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
[ https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255248#comment-15255248 ] Cheng Lian commented on SPARK-14875: [~marmbrus] Is there any reason why we made it private in Spark 2.0? > OutputWriterFactory.newInstance shouldn't be private[sql] > - > > Key: SPARK-14875 > URL: https://issues.apache.org/jira/browse/SPARK-14875 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Reporter: Cheng Lian > Assignee: Cheng Lian > > Existing packages like spark-avro need to access > {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in > Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]
Cheng Lian created SPARK-14875: -- Summary: OutputWriterFactory.newInstance shouldn't be private[sql] Key: SPARK-14875 URL: https://issues.apache.org/jira/browse/SPARK-14875 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Existing packages like spark-avro need to access {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in Spark 2.0. Should make it public again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14843) Error while encoding: java.lang.ClassCastException with LibSVMRelation
[ https://issues.apache.org/jira/browse/SPARK-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14843. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12611 [https://github.com/apache/spark/pull/12611] > Error while encoding: java.lang.ClassCastException with LibSVMRelation > -- > > Key: SPARK-14843 > URL: https://issues.apache.org/jira/browse/SPARK-14843 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, SQL >Reporter: Nick Pentreath > Fix For: 2.0.0 > > > While trying to run some example ML linear regression code, I came across the > following. In fact this error occurs when doing {{./bin/run-example > ml.LinearRegressionWithElasticNetExample}}. > {code} > scala> import org.apache.spark.ml.regression.LinearRegression > import org.apache.spark.ml.regression.LinearRegression > scala> import org.apache.spark.mllib.linalg.Vector > import org.apache.spark.mllib.linalg.Vector > scala> import org.apache.spark.sql.Row > import org.apache.spark.sql.Row > scala> val data = > sqlContext.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt") > data: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> val lr = new LinearRegression() > scala> val model = lr.fit(data) > {code} > Stack trace: > {code} > Driver stacktrace: > ... > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.take(RDD.scala:1250) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1290) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.first(RDD.scala:1289) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:165) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:69) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > ... 
48 elided > Caused by: java.lang.RuntimeException: Error while encoding: > java.lang.ClassCastException: java.lang.Double cannot be cast to > org.apache.spark.mllib.linalg.Vector > if (input[0, org.apache.spark.sql.Row].isNullAt) null else newInstance(class > org.apache.spark.mllib.linalg.VectorUDT).serialize > :- input[0, org.apache.spark.sql.Row].isNullAt > : :- input[0, org.apache.spark.sql.Row] > : +- 0 > :- null > +- newInstance(class org.apache.spark.mllib.linalg.VectorUDT).serialize >:- newInstance(class org.apache.spark.mllib.linalg.VectorUDT) >+- input[0, org.apache.spark.sql.Row].get > :- input[0, org.apache.spark.sql.Row] > +- 0 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:230) > at > org.apache.spark.ml.source.libsvm.DefaultSource$$anonfun$buildReader$1$$anonfun$8.apply(LibSVMRelation.scala:209) > at > org.apache.spark.ml.source.libsvm.DefaultSource$$anonfun$buildReader$1$$anonfun$8.apply(LibSVMRelation.scala:207) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:90) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$7$$anon$1.hasNext(WholeStageCodegen.scala:362) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >
[jira] [Updated] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
[ https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13928: --- Target Version/s: 2.0.0 > Move org.apache.spark.Logging into org.apache.spark.internal.Logging > > > Key: SPARK-13928 > URL: https://issues.apache.org/jira/browse/SPARK-13928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Logging was made private in Spark 2.0. If we move it, then users would be > able to create a Logging trait themselves to avoid changing their own code. > Alternatively, we can also provide a compatibility package that adds > logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
[ https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13928: --- Assignee: Wenchen Fan > Move org.apache.spark.Logging into org.apache.spark.internal.Logging > > > Key: SPARK-13928 > URL: https://issues.apache.org/jira/browse/SPARK-13928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Logging was made private in Spark 2.0. If we move it, then users would be > able to create a Logging trait themselves to avoid changing their own code. > Alternatively, we can also provide a compatibility package that adds > logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
[ https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13928: --- Fix Version/s: 2.0.0 > Move org.apache.spark.Logging into org.apache.spark.internal.Logging > > > Key: SPARK-13928 > URL: https://issues.apache.org/jira/browse/SPARK-13928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > Logging was made private in Spark 2.0. If we move it, then users would be > able to create a Logging trait themselves to avoid changing their own code. > Alternatively, we can also provide a compatibility package that adds > logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
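A minimal sketch of the user-side workaround mentioned in the description, assuming slf4j is on the classpath: applications can define their own {{Logging}} trait instead of depending on the now-internal one.

{code}
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)
}
{code}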
Re: parquet data corruption
(cc dev@parquet.apache.org) Hey Shushant, This kind of error can be tricky to debug. Could you please provide the following information: - The tool used to write those Parquet files (possibly Hive 0.13, since you mentioned hive-exec 0.13?) - The tool used to read those Parquet files (should be Hive according to the stack trace, but which version?) - What is the "complex" query? - Schema of those Parquet files (can be checked using parquet-tools), as well as the corresponding schema of the user application (table schema for Hive) - If possible, the code snippet you used to write the files - Are there files of different schemata mixed up? Some tools, like Hive, don't handle schema evolution well. I see that the file name in the stack trace consists of a timestamp. This isn't the naming convention used by Hive. Did you move files written somewhere else into the target directory? Cheng On 4/22/16 10:56 AM, Shushant Arora wrote: Hi I am writing to a parquet table using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13). Data is being written correctly, and when I do count(1) or select * with limit I get proper results. But when I run a complex query on the table it throws the exception below: Diagnostic Messages for this Task: Error: java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121) at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77) at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255) at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163) Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121) at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344) at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101) at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122) at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:253) ... 11 more Caused by: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:48) at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339) ... 15 more Caused by: parquet.io.ParquetDecodingException: Can't read value in column [sessionid] BINARY at value 18 out of 18, 18 out of 18 in currentPage. repetition level: 0, definition level: 1 at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450) at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)
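For the schema check mentioned above, parquet-tools can dump both the logical schema and the row group metadata; for example (substitute one of your actual files for the placeholder path):

parquet-tools schema /path/to/20160421032223.parquet
parquet-tools meta /path/to/20160421032223.parquet

Comparing that output against the Hive table schema is usually the quickest way to spot a mismatch on a column such as [sessionid].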
[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240640#comment-15240640 ] Cheng Lian commented on SPARK-14463: Should we simply throw an exception when the text data source is used together with partitioning? > read.text broken for partitioned tables > --- > > Key: SPARK-14463 > URL: https://issues.apache.org/jira/browse/SPARK-14463 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} > breaks when trying to load a partitioned table (or any table where the path > looks partitioned) > {code} > Seq((1, "test")) > .toDF("a", "b") > .write > .format("text") > .partitionBy("a") > .save("/home/michael/text-part-bug") > sqlContext.read.text("/home/michael/text-part-bug") > {code} > {code} > org.apache.spark.sql.AnalysisException: Try to map struct<value:string,a:int> > to Tuple1, but failed as the number of fields does not line up. > - Input schema: struct<value:string,a:int> > - Target schema: struct; > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279) > at org.apache.spark.sql.Dataset.(Dataset.scala:197) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57) > at org.apache.spark.sql.Dataset.as(Dataset.scala:357) > at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
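For anyone hitting this in the meantime, a hedged workaround sketch: go through the untyped reader and drop the partition column before converting to {{Dataset\[String]}} (assumes {{sqlContext.implicits._}} is imported for the String encoder).

{code}
import sqlContext.implicits._

val ds = sqlContext.read
  .format("text")
  .load("/home/michael/text-part-bug")
  .select("value")  // keep only the text column, dropping partition column `a`
  .as[String]
{code}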
[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239687#comment-15239687 ] Cheng Lian commented on SPARK-14389: An exception thrown by UnsafeRow.copy() inside a BNL join doesn't necessarily indicate that the BNL join ate all the memory. It's possible that something else ate all the memory, so that the BNL join couldn't acquire enough memory and became the victim. > OOM during BroadcastNestedLoopJoin > -- > > Key: SPARK-14389 > URL: https://issues.apache.org/jira/browse/SPARK-14389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OS: Amazon Linux AMI 2015.09 > EMR: 4.3.0 > Hadoop: Amazon 2.7.1 > Spark 1.6.0 > Ganglia 3.7.2 > Master: m3.xlarge > Core: m3.xlarge > m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD >Reporter: Steve Johnston > Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, > sample_script.py, stdout.txt > > > When executing the attached sample_script.py in client mode with a single > executor, an exception occurs, "java.lang.OutOfMemoryError: Java heap space", > during the self join of a small table, TPC-H lineitem generated for a 1M > dataset. Also see the execution log stdout.txt attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239674#comment-15239674 ] Cheng Lian commented on SPARK-14495: This ticket is for branch-1.6. > Distinct aggregation cannot be used in the having clause > > > Key: SPARK-14495 > URL: https://issues.apache.org/jira/browse/SPARK-14495 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yin Huai > > {code} > select date, count(distinct id) > from (select '2010-01-01' as date, 1 as id) tmp > group by date > having count(distinct id) > 0; > org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559 > missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if > ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], > [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:133) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13753) Column nullable is derived incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239637#comment-15239637 ] Cheng Lian commented on SPARK-13753: [~jingweilu] Could you please provide the schema of the tables involved in the SQL query you provided so that we can reproduce this issue more easily? Also, it would be really helpful if you could help derive a minimized query that reproduces this issue. Thanks! > Column nullable is derived incorrectly > -- > > Key: SPARK-13753 > URL: https://issues.apache.org/jira/browse/SPARK-13753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Jingwei Lu >Priority: Critical > > There is a problem in Spark SQL where a column's nullability is derived > incorrectly and then used in optimization. In the following query: > {code} > select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0] > from ( > select explode(map(a.frontend[0], > ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), > ",action:", COALESCE(action, "null")), ".p50"), > a.frontend[1], > ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), > ",action:", COALESCE(action, "null")), ".p90"), > a.backend[0], ARRAY(concat("metric:backend", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p50"), > a.backend[1], ARRAY(concat("metric:backend", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p90"), > a.render[0], ARRAY(concat("metric:render", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p50"), > a.render[1], ARRAY(concat("metric:render", > ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, > "null")), ".p90"), > a.page_load_time[0], > ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p50"), > a.page_load_time[1], > ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p90"), > a.total_load_time[0], > ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p50"), > a.total_load_time[1], > ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, > "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags) > from ( > select data.controller as controller, data.action as > action, > percentile(data.frontend, array(0.5, 0.9)) as > frontend, > percentile(data.backend, array(0.5, 0.9)) as > backend, > percentile(data.render, array(0.5, 0.9)) as render, > percentile(data.page_load_time, array(0.5, 0.9)) as > page_load_time, > percentile(data.total_load_time, array(0.5, 0.9)) > as total_load_time > from air_events_rt > where type='air_events' and data.event_name='pageload' > group by data.controller, data.action > ) a > ) b > where b.value is not null > {code} > b.value is incorrectly derived as not nullable. The "b.value is not null" > predicate will be ignored by the optimizer, which causes the query to return > incorrect results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239617#comment-15239617 ] Cheng Lian commented on SPARK-14463: It seems that this is because {{buildReader()}} doesn't append partition columns the way other data sources do. > read.text broken for partitioned tables > --- > > Key: SPARK-14463 > URL: https://issues.apache.org/jira/browse/SPARK-14463 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > > Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} > breaks when trying to load a partitioned table (or any table where the path > looks partitioned) > {code} > Seq((1, "test")) > .toDF("a", "b") > .write > .format("text") > .partitionBy("a") > .save("/home/michael/text-part-bug") > sqlContext.read.text("/home/michael/text-part-bug") > {code} > {code} > org.apache.spark.sql.AnalysisException: Try to map struct<value:string,a:int> > to Tuple1, but failed as the number of fields does not line up. > - Input schema: struct<value:string,a:int> > - Target schema: struct; > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279) > at org.apache.spark.sql.Dataset.(Dataset.scala:197) > at org.apache.spark.sql.Dataset.(Dataset.scala:168) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57) > at org.apache.spark.sql.Dataset.as(Dataset.scala:357) > at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema
[ https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237690#comment-15237690 ] Cheng Lian edited comment on SPARK-14566 at 4/12/16 6:25 PM: - This bug is exposed after fixing SPARK-14458. These two bugs together happened to cheat all our existing test cases. was (Author: lian cheng): This bug is exposed after fixing SPARK-14458. These two bugs together happened to cheated all our existing test cases. > When appending to partitioned persisted table, we should apply a projection > over input query plan using existing metastore schema > - > > Key: SPARK-14566 > URL: https://issues.apache.org/jira/browse/SPARK-14566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Take the following snippets slightly modified from test case > "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example: > {code} > val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j") > df1.write.partitionBy("i").saveAsTable("tbl11453") > val df2 = Seq("3" -> "30").toDF("i", "j") > df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453") > {code} > Although {{df1.schema}} is {{<i:STRING, j:STRING>}}, the schema of the persisted > table {{tbl11453}} is actually {{<j:STRING, i:STRING>}} because {{i}} is a > partition column, which is always appended after all data columns. Thus, when > appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} > actually differ. > In the current master branch, {{CreateMetastoreDataSourceAsSelect}} simply > applies the existing metastore schema to the input query plan ([see > here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]), > which is wrong. A projection should be used instead to adjust the column order > here. > In branch-1.6, [this projection is added in > {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104], > but was removed in Spark 2.0. Replacing the aforementioned line in > {{CreateMetastoreDataSourceAsSelect}} with a projection would be more > preferable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema
[ https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237690#comment-15237690 ] Cheng Lian commented on SPARK-14566: This bug is exposed after fixing SPARK-14458. These two bugs together happened to cheated all our existing test cases. > When appending to partitioned persisted table, we should apply a projection > over input query plan using existing metastore schema > - > > Key: SPARK-14566 > URL: https://issues.apache.org/jira/browse/SPARK-14566 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Take the following snippets slightly modified from test case > "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example: > {code} > val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j") > df1.write.partitionBy("i").saveAsTable("tbl11453") > val df2 = Seq("3" -> "30").toDF("i", "j") > df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453") > {code} > Although {{df1.schema}} is {{<i:STRING, j:STRING>}}, the schema of the persisted > table {{tbl11453}} is actually {{<j:STRING, i:STRING>}} because {{i}} is a > partition column, which is always appended after all data columns. Thus, when > appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} > actually differ. > In the current master branch, {{CreateMetastoreDataSourceAsSelect}} simply > applies the existing metastore schema to the input query plan ([see > here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]), > which is wrong. A projection should be used instead to adjust the column order > here. > In branch-1.6, [this projection is added in > {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104], > but was removed in Spark 2.0. Replacing the aforementioned line in > {{CreateMetastoreDataSourceAsSelect}} with a projection would be more > preferable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema
Cheng Lian created SPARK-14566: -- Summary: When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema Key: SPARK-14566 URL: https://issues.apache.org/jira/browse/SPARK-14566 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Take the following snippets slightly modified from test case "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example: {code} val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j") df1.write.partitionBy("i").saveAsTable("tbl11453") val df2 = Seq("3" -> "30").toDF("i", "j") df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453") {code} Although {{df1.schema}} is {{<i:STRING, j:STRING>}}, the schema of the persisted table {{tbl11453}} is actually {{<j:STRING, i:STRING>}} because {{i}} is a partition column, which is always appended after all data columns. Thus, when appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} actually differ. In the current master branch, {{CreateMetastoreDataSourceAsSelect}} simply applies the existing metastore schema to the input query plan ([see here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]), which is wrong. A projection should be used instead to adjust the column order here. In branch-1.6, [this projection is added in {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104], but was removed in Spark 2.0. Replacing the aforementioned line in {{CreateMetastoreDataSourceAsSelect}} with a projection would be more preferable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
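A hedged sketch of the projection described above (Catalyst identifiers are illustrative and name resolution is simplified; this is not the actual patch):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Reorder the input query's columns to match the metastore schema
// (data columns first, partition columns last), instead of force-applying
// the metastore schema to the input plan.
def adjustColumnOrder(query: LogicalPlan, metastoreColumnNames: Seq[String]): LogicalPlan = {
  val inputByName = query.output.map(a => a.name.toLowerCase -> a).toMap
  Project(metastoreColumnNames.map(name => inputByName(name.toLowerCase)), query)
}
{code}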
[jira] [Resolved] (SPARK-14493) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." should always be used with a user defined path
[ https://issues.apache.org/jira/browse/SPARK-14493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14493. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12303 [https://github.com/apache/spark/pull/12303] > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." should always be used > with a user defined path > --- > > Key: SPARK-14493 > URL: https://issues.apache.org/jira/browse/SPARK-14493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > In the current Spark 2.0 master, the following DDL command doesn't specify a > user-defined path, and writes the query result to the default Hive warehouse location: > {code} > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > {code} > In Spark 1.6, it results in the following exception, which is the expected > behavior: > {noformat} > scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * > FROM x" > java.util.NoSuchElementException: key not found: path > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
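For illustration, the expected usage after this fix: the DDL must carry an explicit path option (the path below is hypothetical).

{code}
sqlContext.sql(
  """CREATE TEMPORARY TABLE y
    |USING PARQUET
    |OPTIONS (path '/tmp/y')
    |AS SELECT * FROM x
  """.stripMargin)
{code}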
[jira] [Resolved] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14488. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12303 [https://github.com/apache/spark/pull/12303] > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 2.0.0 > > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +---------+-----------+ > |tableName|isTemporary| > +---------+-----------+ > | y| false| > | x| true| > +---------+-----------+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, > [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, > [id#0L]| > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14372) Dataset.randomSplit() needs a Java version
[ https://issues.apache.org/jira/browse/SPARK-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234779#comment-15234779 ] Cheng Lian commented on SPARK-14372: Ah, sorry for the late reply. It's already taken by others. (At first I thought it was your PR, but later realized that it wasn't.) > Dataset.randomSplit() needs a Java version > -- > > Key: SPARK-14372 > URL: https://issues.apache.org/jira/browse/SPARK-14372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Rekha Joshi > Fix For: 2.0.0 > > > {{Dataset.randomSplit()}} now returns {{Array\[Dataset\[T\]\]}}, which > doesn't work for Java users since Java methods can't return generic arrays. > We may want something like {{randomSplitAsList()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
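A minimal sketch of the Java-friendly variant, assuming it simply wraps the existing {{randomSplit()}} result (the body here is illustrative):

{code}
import java.util.{List => JList}

// Java methods can't return generic arrays, so expose the splits as a
// java.util.List instead.
def randomSplitAsList(weights: Array[Double], seed: Long): JList[Dataset[T]] = {
  val values = randomSplit(weights, seed)
  java.util.Arrays.asList(values: _*)
}
{code}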
[jira] [Resolved] (SPARK-14372) Dataset.randomSplit() needs a Java version
[ https://issues.apache.org/jira/browse/SPARK-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14372. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12184 [https://github.com/apache/spark/pull/12184] > Dataset.randomSplit() needs a Java version > -- > > Key: SPARK-14372 > URL: https://issues.apache.org/jira/browse/SPARK-14372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Rekha Joshi > Fix For: 2.0.0 > > > {{Dataset.randomSplit()}} now returns {{Array\[Dataset\[T\]\]}}, which > doesn't work for Java users since Java methods can't return generic arrays. > We may want something like {{randomSplitAsList()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14372) Dataset.randomSplit() needs a Java version
[ https://issues.apache.org/jira/browse/SPARK-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14372: --- Assignee: Rekha Joshi (was: Subhobrata Dey) > Dataset.randomSplit() needs a Java version > -- > > Key: SPARK-14372 > URL: https://issues.apache.org/jira/browse/SPARK-14372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Rekha Joshi > > {{Dataset.randomSplit()}} now returns {{Array\[Dataset\[T\]\]}}, which > doesn't work for Java users since Java methods can't return generic arrays. > We may want something like {{randomSplitAsList()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14372) Dataset.randomSplit() needs a Java version
[ https://issues.apache.org/jira/browse/SPARK-14372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14372: --- Assignee: Subhobrata Dey > Dataset.randomSplit() needs a Java version > -- > > Key: SPARK-14372 > URL: https://issues.apache.org/jira/browse/SPARK-14372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Subhobrata Dey > > {{Dataset.randomSplit()}} now returns {{Array\[Dataset\[T\]\]}}, which > doesn't work for Java users since Java methods can't return generic arrays. > We may want something like {{randomSplitAsList()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-14476: -- Assignee: Cheng Lian > Show table name or path in string of DataSourceScan > --- > > Key: SPARK-14476 > URL: https://issues.apache.org/jira/browse/SPARK-14476 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > Assignee: Cheng Lian > > Right now, the string of DataSourceScan is only "HadoopFiles xxx", without > any information about the table name or path. > Since we had that in 1.6, this is kind of a regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
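To make the request concrete, here is a self-contained toy model (not Spark source code; all names are illustrative) of the change being asked for: the scan node's string form should surface the path or table it reads from instead of a bare "HadoopFiles" tag.
{code}
// Toy stand-in for the physical scan node.
case class DataSourceScan(output: Seq[String], metadata: Map[String, String]) {
  // Before: the string was just s"HadoopFiles ${output.mkString(", ")}".
  override def toString: String = {
    val source = metadata.get("path")
      .orElse(metadata.get("table"))
      .getOrElse("unknown")
    s"Scan $source ${output.mkString("[", ", ", "]")}"
  }
}

// DataSourceScan(Seq("id"), Map("path" -> "/tmp/events")).toString
// ==> Scan /tmp/events [id]
{code}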
[jira] [Created] (SPARK-14493) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." should always be used with a user defined path
Cheng Lian created SPARK-14493: -- Summary: "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." should always be used with a user defined path Key: SPARK-14493 URL: https://issues.apache.org/jira/browse/SPARK-14493 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian In current Spark 2.0 master, the following DDL command doesn't specify a user-defined path, yet writes the query result to the default Hive warehouse location: {code} sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" {code} In Spark 1.6, the same statement results in the following exception, which is the expected behavior: {noformat} scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" java.util.NoSuchElementException: key not found: path {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
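For reference, the usage this ticket implies looks like the following sketch, where the {{path}} data source option supplies the user-defined location (the path itself is illustrative):
{code}
sqlContext.sql(
  """CREATE TEMPORARY TABLE y
    |USING PARQUET
    |OPTIONS (path '/tmp/y')
    |AS SELECT * FROM x
  """.stripMargin)
{code}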
[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232422#comment-15232422 ] Cheng Lian commented on SPARK-14488: Yea, that's why I came to this DDL command, because this command seems to be the only way to trigger {{CreateTempTableUsingAsSelect}}. However, the physical plan doesn't use it. Will look into this. Thanks! > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, > [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, > [id#0L]| > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232414#comment-15232414 ] Cheng Lian edited comment on SPARK-14488 at 4/8/16 4:17 PM: Discussed with [~yhuai] offline, and here's the summary: {{CreateTempTableUsingAsSelect}} existed since 1.3 (I'm surprised that I never noticed it!). Its semantics is: # Execute the {{SELECT}} query. # Store query result to a user specified position in filesystem. Note that this means the {{PATH}} data source option should always be set when using this DDL command. # Create a temporary table using written files. Basically, it can be used to dump query results to the filesystem without creating persisted tables. It's indeed a confusing command and is kinda equivalent to the following DDL sequence: - {{INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...}} - {{CREATE TEMPORARY TABLE ... USING ... OPTION (PATH ...)}} However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the long run, we should implement it and deprecate this confusing DDL command. Ticket title and description were updated accordingly. was (Author: lian cheng): Discussed with [~yhuai] offline, and here's the summary: {{CreateTempTableUsingAsSelect}} existed since 1.3 (I'm surprised that I never noticed it!). Its semantics is: # Execute the {{SELECT}} query. # Store query result to a user specified position in filesystem. Note that this means the {{PATH}} data source option should always be set when using this DDL command. # Create a temporary table using written files. Basically, it can be used to dump query results to the filesystem without creating persisted tables. It's indeed a confusing and is kinda equivalent to the following DDL sequence: - {{INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...}} - {{CREATE TEMPORARY TABLE ... USING ... OPTION (PATH ...)}} However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the long run, we should implement it and deprecate this confusing DDL command. Ticket title and description were updated accordingly. > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. 
> {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, > [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, > [id#0L]| > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232414#comment-15232414 ] Cheng Lian commented on SPARK-14488: Discussed with [~yhuai] offline, and here's the summary: {{CreateTempTableUsingAsSelect}} existed since 1.3 (I'm surprised that I never noticed it!). Its semantics is: # Execute the {{SELECT}} query. # Store query result to a user specified position in filesystem. Note that this means the {{PATH}} data source option should always be set when using this DDL command. # Create a temporary table using written files. Basically, it can be used to dump query results to the filesystem without creating persisted tables. It's indeed a confusing and is kinda equivalent to the following DDL sequence: - {{INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...}} - {{CREATE TEMPORARY TABLE ... USING ... OPTION (PATH ...)}} However, Spark hasn't implemented {{INSERT OVERWRITE DIRECTORY}} yet. In the long run, we should implement it and deprecate this confusing DDL command. Ticket title and description were updated accordingly. > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, > [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, > [id#0L]| > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
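The two-step equivalent described in the comment above can be sketched as follows. Since {{INSERT OVERWRITE DIRECTORY}} is not implemented, the first step is written with the DataFrame writer API instead; the path is illustrative.
{code}
// Step 1: dump the query result to the filesystem (stand-in for the missing
// INSERT OVERWRITE DIRECTORY ... STORE AS ... SELECT ...).
sqlContext.sql("SELECT * FROM x").write.parquet("/tmp/y")

// Step 2: register a temporary table over the files written above.
sqlContext.sql("CREATE TEMPORARY TABLE y USING PARQUET OPTIONS (path '/tmp/y')")
{code}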
[jira] [Commented] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232401#comment-15232401 ] Cheng Lian commented on SPARK-14488: Ah, sorry, the logical plan class {{CreateTableUsingAsSelect}} uses a boolean flag to indicate whether the table is temporary or not, while physical plan uses two different classes {{CreateTempTableUsingAsSelect}} and {{CreateTableUsingAsSelect}}. Then something is probably wrong in the planner. > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, > [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, > [id#0L]| > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
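A self-contained toy model of the suspected bug (names mirror the ticket, but this is not Spark's actual planner code): the logical node carries a boolean {{temporary}} flag, and the planner has to branch on it when picking the physical node. SPARK-14488 behaves as if the first branch below were never taken.
{code}
sealed trait PhysicalPlan
case class CreateTempTableExec(table: String) extends PhysicalPlan
case class CreatePersistedTableExec(table: String) extends PhysicalPlan

// Logical node: a single class with a boolean flag for TEMPORARY.
case class CreateTableUsingAsSelect(table: String, temporary: Boolean)

// The branch the planner must take based on the flag.
def plan(c: CreateTableUsingAsSelect): PhysicalPlan =
  if (c.temporary) CreateTempTableExec(c.table)
  else CreatePersistedTableExec(c.table)

// plan(CreateTableUsingAsSelect("y", temporary = true)) == CreateTempTableExec("y")
{code}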
[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: The following Spark shell snippet reproduces this bug: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}. Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} rather than {{CreateTempTableUsingAsSelect}}. {noformat} == Parsed Logical Plan == 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- 'Project [*] +- 'UnresolvedRelation `x`, None == Analyzed Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Project [id#0L] +- SubqueryAlias x +- Range 0, 10, 1, 1, [id#0L] == Optimized Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Range 0, 10, 1, 1, [id#0L] == Physical Plan == ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, [id#0L]| {noformat} was: The following Spark shell snippet reproduces this bug: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}. Explain shows that parser probably drops {{TEMPORARY}} while parsing this statement: {noformat} == Parsed Logical Plan == 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- 'Project [*] +- 'UnresolvedRelation `x`, None == Analyzed Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Project [id#0L] +- SubqueryAlias x +- Range 0, 10, 1, 1, [id#0L] == Optimized Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Range 0, 10, 1, 1, [id#0L] == Physical Plan == ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, [id#0L]| {noformat} > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. 
> Explain shows that the physical plan node is {{CreateTableUsingAsSelect}} > rather than {{CreateTempTableUsingAsSelect}}. > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Project [id#0L] >+- SubqueryAlias x > +- Range 0, 10, 1, 1, [id#0L] > == Optimized Logical Plan == > CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- Range 0, 10, 1, 1, [id#0L] > == Physical Plan == > Exe
[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: The following Spark shell snippet reproduces this bug: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}. Explain shows that parser probably drops {{TEMPORARY}} while parsing this statement: {noformat} == Parsed Logical Plan == 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- 'Project [*] +- 'UnresolvedRelation `x`, None == Analyzed Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Project [id#0L] +- SubqueryAlias x +- Range 0, 10, 1, 1, [id#0L] == Optimized Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Range 0, 10, 1, 1, [id#0L] == Physical Plan == ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, [id#0L]| {noformat} was: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], which exactly maps to this combination? # If it is, what is the expected semantics? > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." 
creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet reproduces this bug: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}. > Explain shows that parser probably drops {{TEMPORARY}} while parsing this > statement: > {noformat} > == Parsed Logical Plan == > 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, > None, Overwrite, Map() > +- 'Project [*] >+- 'UnresolvedRelation `x`, None > == Analyzed Log
[jira] [Updated] (SPARK-14488) "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Summary: "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table (was: Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ...") > "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." creates persisted table > > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on > my local machine. > *Weird semantics* > Secondly, even if this DDL statement does create a temporary table, the > semantics is still somewhat weird: > # It has a {{AS SELECT ...}} clause, which is supposed to run a given query > instead of loading data from existing files. > # It has a {{USING }} clause, which is supposed to, I guess, > converting the result of the above query into the given format. And by > "converting", we have to write out the data into file system. > # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an > in-memory temporary table using the files written above? > The main questions: > # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a > valid one? > # If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} > command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], > which exactly maps to this combination? > # If it is, what is the expected semantics? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232092#comment-15232092 ] Cheng Lian edited comment on SPARK-14488 at 4/8/16 12:27 PM: - Tried the same snippet using Spark 1.6, and got the following exception, which makes sense. I tend to believe that the combination described in the ticket is invalid and should be rejected by either parser or analyzer. {noformat} scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" java.util.NoSuchElementException: key not found: path at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:150) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:150) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:230) at org.apache.spark.sql.execution.datasources.CreateTempTableUsingAsSelect.run(ddl.scala:112) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) at $iwC$$iwC$$iwC$$iwC$$iwC.(:35) at $iwC$$iwC$$iwC$$iwC.(:37) at $iwC$$iwC$$iwC.(:39) at $iwC$$iwC.(:41) at $iwC.(:43) at (:45) at .(:49) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccess
[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], which exactly maps to this combination? # If it is, what is the expected semantics? was: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? 
If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], which exactly maps to this combination? # If it is, what is the expected semantics? > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > I
[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232118#comment-15232118 ] Cheng Lian commented on SPARK-14488: Result of {{EXPLAIN EXTENDED CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x}}: {noformat} == Parsed Logical Plan == 'CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- 'Project [*] +- 'UnresolvedRelation `x`, None == Analyzed Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Project [id#0L] +- SubqueryAlias x +- Range 0, 10, 1, 1, [id#0L] == Optimized Logical Plan == CreateTableUsingAsSelect `y`, PARQUET, true, [Ljava.lang.String;@4d001a14, None, Overwrite, Map() +- Range 0, 10, 1, 1, [id#0L] == Physical Plan == ExecutedCommand CreateMetastoreDataSourceAsSelect `y`, PARQUET, [Ljava.lang.String;@4d001a14, None, Overwrite, Map(), Range 0, 10, 1, 1, [id#0L]| {noformat} So it seems that the parser drops {{TEMPORARY}}. > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on > my local machine. > *Weird semantics* > Secondly, even if this DDL statement does create a temporary table, the > semantics is still somewhat weird: > # It has a {{AS SELECT ...}} clause, which is supposed to run a given query > instead of loading data from existing files. > # It has a {{USING }} clause, which is supposed to, I guess, > converting the result of the above query into the given format. And by > "converting", we have to write out the data into file system. > # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an > in-memory temporary table using the files written above? > The main questions: > # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a > valid one? If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} > command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], > which exactly maps to this combination? > # If it is, what is the expected semantics? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? If it's not, why do we have a [{{CreateTempTableUsingAsSelect}} command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], which exactly maps to this combination? # If it is, what is the expected semantics? was: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it is, what is the expected semantics? > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." 
> -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > *Weird behavior* > Note that {{y}} is N
[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232111#comment-15232111 ] Cheng Lian commented on SPARK-14488: However, if {{TEMPORARY + USING + AS SELECT}} is an invalid combination, why do we have a [{{CreateTempTableUsingAsSelect}} command|https://github.com/apache/spark/blob/583b5e05309adb73cdffd974a810d6bfb5f2ff95/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L116], which exactly maps to this combination? > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on > my local machine. > *Weird semantics* > Secondly, even if this DDL statement does create a temporary table, the > semantics is still somewhat weird: > # It has a {{AS SELECT ...}} clause, which is supposed to run a given query > instead of loading data from existing files. > # It has a {{USING }} clause, which is supposed to, I guess, > converting the result of the above query into the given format. And by > "converting", we have to write out the data into file system. > # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an > in-memory temporary table using the files written above? > The main questions: > # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a > valid one? > # If it is, what is the expected semantics? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232092#comment-15232092 ] Cheng Lian commented on SPARK-14488: Tried the same snippet using Spark 1.6, and got the following exception, which makes sense: {noformat} scala> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" java.util.NoSuchElementException: key not found: path at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.default(ddl.scala:150) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.apply(ddl.scala:150) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:230) at org.apache.spark.sql.execution.datasources.CreateTempTableUsingAsSelect.run(ddl.scala:112) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) at $iwC$$iwC$$iwC$$iwC$$iwC.(:35) at $iwC$$iwC$$iwC$$iwC.(:37) at $iwC$$iwC$$iwC.(:39) at $iwC$$iwC.(:41) at $iwC.(:43) at (:45) at .(:49) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSu
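The trace bottoms out in a plain map lookup: {{CreateTempTableUsingAsSelect.run}} reads the {{path}} option from a {{CaseInsensitiveMap}} (via {{ResolvedDataSource}}), and Scala's {{Map.apply}} on a missing key is exactly what raises this {{NoSuchElementException}}. A minimal illustration (the path below is illustrative):
{code}
// The data source options of the failing statement contain no "path" entry.
val options = Map.empty[String, String]

// Map.apply on a missing key raises the error seen in the trace above:
//   java.util.NoSuchElementException: key not found: path
// options("path")

// With the option supplied, the lookup succeeds.
val withPath = options + ("path" -> "/tmp/y")
withPath("path") // "/tmp/y"
{code}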
[jira] [Commented] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232078#comment-15232078 ] Cheng Lian commented on SPARK-14488: Updated title and description of this ticket. > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird > semantics. > Let's try the following Spark shell snippet: > {code} > sqlContext range 10 registerTempTable "x" > // The problematic DDL statement: > sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" > sqlContext.tables().show() > {code} > It shows the following result: > {noformat} > +-+---+ > |tableName|isTemporary| > +-+---+ > |y| false| > |x| true| > +-+---+ > {noformat} > *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on > my local machine. > *Weird semantics* > Secondly, even if this DDL statement does create a temporary table, the > semantics is still somewhat weird: > # It has a {{AS SELECT ...}} clause, which is supposed to run a given query > instead of loading data from existing files. > # It has a {{USING }} clause, which is supposed to, I guess, > converting the result of the above query into the given format. And by > "converting", we have to write out the data into file system. > # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an > in-memory temporary table using the files written above? > The main questions: > # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a > valid one? > # If it is, what is the expected semantics? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it is, what is the expected semantics? was: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which imposes weird behavior and weird semantics. Let's try the following Spark shell snippet: {code} sqlContext range 10 registerTempTable "x" // The problematic DDL statement: sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x" sqlContext.tables().show() {code} It shows the following result: {noformat} +-+---+ |tableName|isTemporary| +-+---+ |y| false| |x| true| +-+---+ {noformat} *Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under default Hive warehouse location, whichi is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics is still somewhat weird: # It has a {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING }} clause, which is supposed to, I guess, converting the result of the above query into the given format. And by "converting", we have to write out the data into file system. # It has a {{TEMPORARY}} key word, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it is, what is the expected semantics? > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... 
AS SELECT ...}}, which leads to weird behavior and weird > semantics. > Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +---------+-----------+
> |tableName|isTemporary|
> +---------+-----------+
> |        y|      false|
> |        x|       true|
> +---------+-----------+
> {noformat}
> *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under the default Hive warehouse location, which is {{/user/hive/warehouse/y}} on > my local machine. > *Weird semantics* > Secon
[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Description: Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY TABLE ... USING ... AS SELECT ...}}, which leads to weird behavior and weird semantics. Let's try the following Spark shell snippet:
{code}
sqlContext range 10 registerTempTable "x"
// The problematic DDL statement:
sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
sqlContext.tables().show()
{code}
It shows the following result:
{noformat}
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|        y|      false|
|        x|       true|
+---------+-----------+
{noformat}
*Weird behavior* Note that {{y}} is NOT temporary although it's created using {{CREATE TEMPORARY TABLE ...}}, and the query result is written in Parquet format under the default Hive warehouse location, which is {{/user/hive/warehouse/y}} on my local machine. *Weird semantics* Secondly, even if this DDL statement does create a temporary table, the semantics are still somewhat weird: # It has an {{AS SELECT ...}} clause, which is supposed to run a given query instead of loading data from existing files. # It has a {{USING ...}} clause, which is supposed to, I guess, convert the result of the above query into the given format. And by "converting", we have to write the data out to the file system. # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an in-memory temporary table using the files written above? The main questions: # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a valid one? # If it is, what are the expected semantics? was: The following Spark shell snippet shows that currently temporary table creation writes files to the file system:
{code}
sqlContext range 10 registerTempTable "t"
sqlContext sql "create temporary table s using parquet as select * from t"
{code}
The problematic code is [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137]. > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Spark 2.0 master allows DDL statements like {{CREATE TEMPORARY > TABLE ... USING ... AS SELECT ...}}, which leads to weird behavior and weird > semantics. > Let's try the following Spark shell snippet:
> {code}
> sqlContext range 10 registerTempTable "x"
> // The problematic DDL statement:
> sqlContext sql "CREATE TEMPORARY TABLE y USING PARQUET AS SELECT * FROM x"
> sqlContext.tables().show()
> {code}
> It shows the following result:
> {noformat}
> +---------+-----------+
> |tableName|isTemporary|
> +---------+-----------+
> |        y|      false|
> |        x|       true|
> +---------+-----------+
> {noformat}
> *Weird behavior* > Note that {{y}} is NOT temporary although it's created using {{CREATE > TEMPORARY TABLE ...}}, and the query result is written in Parquet format > under the default Hive warehouse location, which is {{/user/hive/warehouse/y}} > on my local machine. > *Weird semantics* > Secondly, even if this DDL statement does create a temporary table, the > semantics are still somewhat weird: > # It has an {{AS SELECT ...}} clause, which is supposed to run a given query > instead of loading data from existing files.
> # It has a {{USING ...}} clause, which is supposed to, I guess, > convert the result of the above query into the given format. And by > "converting", we have to write the data out to the file system. > # It has a {{TEMPORARY}} keyword, which is supposed to, I guess, create an > in-memory temporary table using the files written above? > The main questions: > # Is the above combination ({{TEMPORARY}} + {{USING}} + {{AS SELECT}}) a > valid one? > # If it is, what are the expected semantics? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14488) Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..."
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14488: --- Summary: Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." (was: Creating temporary table using SQL DDL shouldn't write files to file system) > Weird behavior of DDL "CREATE TEMPORARY TABLE ... USING ... AS SELECT ..." > -- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet shows that currently temporary table > creation writes files to the file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is > [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232030#comment-15232030 ] Cheng Lian edited comment on SPARK-14488 at 4/8/16 11:10 AM: - Oh, wait... Since there's a {{USING}} in the DDL statement, are we supposed to write the query result using the given data source format to disk, and use the written files to create a temporary table? So basically this DDL is used to save a query result using a specific data source format to disk? I find this one quite confusing... cc [~yhuai] [~marmbrus] was (Author: lian cheng): Oh, wait... Since there's a {{USING}} in the DDL statement, are we supposed to write the query result using the given data source format to disk, and use the written files to create a temporary table? So basically this DDL is used to save a query result using a specific data source format to disk? I find this one quite confusing... > Creating temporary table using SQL DDL shouldn't write files to file system > --- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet shows that currently temporary table > creation writes files to the file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is > [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system
[ https://issues.apache.org/jira/browse/SPARK-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232030#comment-15232030 ] Cheng Lian commented on SPARK-14488: Oh, wait... Since there's a {{USING}} in the DDL statement, are we supposed to write the query result using the given data source format to disk, and use the written files to create a temporary table? So basically this DDL is used to save a query result using a specific data source format to disk? I find this one quite confusing... > Creating temporary table using SQL DDL shouldn't write files to file system > --- > > Key: SPARK-14488 > URL: https://issues.apache.org/jira/browse/SPARK-14488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet shows that currently temporary table > creation writes files to the file system:
> {code}
> sqlContext range 10 registerTempTable "t"
> sqlContext sql "create temporary table s using parquet as select * from t"
> {code}
> The problematic code is > [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14488) Creating temporary table using SQL DDL shouldn't write files to file system
Cheng Lian created SPARK-14488: -- Summary: Creating temporary table using SQL DDL shouldn't write files to file system Key: SPARK-14488 URL: https://issues.apache.org/jira/browse/SPARK-14488 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian The following Spark shell snippet shows that currently temporary table creation writes files to the file system:
{code}
sqlContext range 10 registerTempTable "t"
sqlContext sql "create temporary table s using parquet as select * from t"
{code}
The problematic code is [here|https://github.com/apache/spark/blob/73b56a3c6c5c590219b42884c8bbe88b0a236987/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L137]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14458) Wrong data schema is passed to FileFormat data sources that can't infer schema
Cheng Lian created SPARK-14458: -- Summary: Wrong data schema is passed to FileFormat data sources that can't infer schema Key: SPARK-14458 URL: https://issues.apache.org/jira/browse/SPARK-14458 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian When instantiating a {{FileFormat}} data source that is not able to infer its schema from data files, {{DataSource}} passes the full schema, including partition columns, to {{HadoopFsRelation}}. We should filter out partition columns and preserve only the data columns that actually live in data files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
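A minimal sketch of the filtering described above, assuming hypothetical helper and parameter names (not the actual patch): given the full schema and the partition column names, keep only the fields physically stored in data files.
{code}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: drop partition columns from the full schema so that
// only columns that actually live in data files reach HadoopFsRelation.
def dataSchema(fullSchema: StructType, partitionColumns: Seq[String]): StructType =
  StructType(fullSchema.filterNot(f => partitionColumns.contains(f.name)))
{code}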
[jira] [Resolved] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
[ https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13589. Resolution: Resolved Fixed by SPARK-13537 > Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType > --- > > Key: SPARK-13589 > URL: https://issues.apache.org/jira/browse/SPARK-13589 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 > Reporter: Cheng Lian > Labels: flaky-test > > Here are a few sample build failures caused by this test case: > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > # > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/ > (I've pinned these builds on Jenkins so that they won't be cleaned up.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14407) Hide HadoopFsRelation related data source API to execution package
Cheng Lian created SPARK-14407: -- Summary: Hide HadoopFsRelation related data source API to execution package Key: SPARK-14407 URL: https://issues.apache.org/jira/browse/SPARK-14407 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14404) HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't recover it
[ https://issues.apache.org/jira/browse/SPARK-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14404. Resolution: Not A Bug The scheme of {{FakeFileSystem}} is randomly generated, so it doesn't interfere with the normal execution of other tests. > HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't > recover it > -- > > Key: SPARK-14404 > URL: https://issues.apache.org/jira/browse/SPARK-14404 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Test case {{HDFSMetadataLog: fallback from FileContext to FileSystem}} > doesn't recover the original {{FileSystem}} implementation after overriding it > with {{FakeFileSystem}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14404) HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't recover it
Cheng Lian created SPARK-14404: -- Summary: HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't recover it Key: SPARK-14404 URL: https://issues.apache.org/jira/browse/SPARK-14404 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Test case {{HDFSMetadataLog: fallback from FileContext to FileSystem}} doesn't recover the original {{FileSystem}} implementation after overriding it with {{FakeFileSystem}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14404) HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't recover it
[ https://issues.apache.org/jira/browse/SPARK-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14404: --- Component/s: SQL > HDFSMetadataLogSuite overrides Hadoop FileSystem implementation but doesn't > recover it > -- > > Key: SPARK-14404 > URL: https://issues.apache.org/jira/browse/SPARK-14404 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Test case {{HDFSMetadataLog: fallback from FileContext to FileSystem}} > doesn't recover the original {{FileSystem}} implementation after overriding it > with {{FakeFileSystem}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
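For reference, a minimal sketch of how a test can restore an overridden Hadoop {{FileSystem}} implementation; the {{withFsOverride}} helper below is hypothetical wiring, not the suite's actual code.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Hypothetical helper: override the FileSystem implementation for a scheme,
// run the test body, then restore whatever was configured before.
def withFsOverride[T](conf: Configuration, scheme: String, fsClass: Class[_])(body: => T): T = {
  val key = s"fs.$scheme.impl"
  val original = conf.get(key) // null if the key was never set
  conf.setClass(key, fsClass, classOf[FileSystem])
  try body
  finally if (original == null) conf.unset(key) else conf.set(key, original)
}
{code}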
[jira] [Created] (SPARK-14372) Dataset.randomSplit() needs a Java version
Cheng Lian created SPARK-14372: -- Summary: Dataset.randomSplit() needs a Java version Key: SPARK-14372 URL: https://issues.apache.org/jira/browse/SPARK-14372 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian {{Dataset.randomSplit()}} now returns {{Array\[Dataset\[T\]\]}}, which doesn't work for Java users since Java methods can't return generic arrays. We may want something like {{randomSplitAsList()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
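A minimal sketch of what such a Java-friendly variant could look like, written here as a thin wrapper over the existing method; the name and placement are assumptions based on the description, not the final API.
{code}
import java.util.{Arrays, List => JList}
import org.apache.spark.sql.Dataset

// Hypothetical wrapper: expose the split as a java.util.List instead of a
// generic array, which Java callers can consume without array headaches.
def randomSplitAsList[T](ds: Dataset[T], weights: Array[Double], seed: Long): JList[Dataset[T]] =
  Arrays.asList(ds.randomSplit(weights, seed): _*)
{code}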
[jira] [Created] (SPARK-14369) Implement preferredLocations() for FileScanRDD
Cheng Lian created SPARK-14369: -- Summary: Implement preferredLocations() for FileScanRDD Key: SPARK-14369 URL: https://issues.apache.org/jira/browse/SPARK-14369 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Implement {{FileScanRDD.preferredLocations()}} to add locality support for {{HadoopFsRelation}}-based data sources. We should avoid extra block-location RPC costs for S3, which doesn't provide valid locality information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
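As a rough illustration of the intended locality logic (simplified types; {{FileBlock}} and the host-ranking rule are assumptions, not the actual implementation), the idea is to rank hosts by how many bytes of the partition's blocks they hold, ignoring placeholder locations reported by filesystems like S3.
{code}
// Hypothetical, simplified model of one file block of a partition.
case class FileBlock(hosts: Seq[String], length: Long)

def preferredHosts(blocks: Seq[FileBlock], topN: Int = 3): Seq[String] = {
  // Total bytes each host holds across all blocks of this partition.
  val bytesByHost = blocks
    .flatMap(b => b.hosts.map(h => h -> b.length))
    .groupBy(_._1)
    .map { case (host, pairs) => host -> pairs.map(_._2).sum }
  // S3-like filesystems report no meaningful locality, often as "localhost".
  bytesByHost
    .filter { case (host, _) => host != "localhost" }
    .toSeq
    .sortBy { case (_, bytes) => -bytes }
    .take(topN)
    .map(_._1)
}
{code}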
[jira] [Created] (SPARK-14295) buildReader implementation for LibSVM
Cheng Lian created SPARK-14295: -- Summary: buildReader implementation for LibSVM Key: SPARK-14295 URL: https://issues.apache.org/jira/browse/SPARK-14295 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14274) Add FileFormat.prepareRead to collect necessary global information
[ https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14274: --- Summary: Add FileFormat.prepareRead to collect necessary global information (was: Replaces inferSchema with prepareRead to collect necessary global information) > Add FileFormat.prepareRead to collect necessary global information > -- > > Key: SPARK-14274 > URL: https://issues.apache.org/jira/browse/SPARK-14274 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > One problem of our newly introduced {{FileFormat.buildReader()}} method is > that it only sees pieces of input files. On the other hand, data sources like > CSV and LibSVM require some sort of global information: > - CSV: the content of the header line if the {{header}} option is set to true, so > that we can filter out header lines within each input file. This is > considered global information because it's possible that the header > appears in the middle of a file after blocks of comments and empty lines, > although this is just a rare/contrived corner case. > - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset > to infer the total number of features to construct result {{LabeledPoint}} > instances. > Unfortunately, with our current API, this kind of global information can't be > gathered. > The solution proposed here is to add a {{prepareRead}} method, which accepts > the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which > contains an {{Option\[StructType\]}} for the inferred schema and a > {{Map\[String, Any\]}} for any gathered global information. This > {{ReadContext}} is then passed to {{buildReader()}}. By default, > {{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema > itself can be considered a sort of global information). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14274) Replaces inferSchema with prepareRead to collect necessary global information
[ https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14274: --- Description: One problem of our newly introduced {{FileFormat.buildReader()}} method is that it only sees pieces of input files. On the other hand, data sources like CSV and LibSVM require some sort of global information: - CSV: the content of the header line if the {{header}} option is set to true, so that we can filter out header lines within each input file. This is considered global information because it's possible that the header appears in the middle of a file after blocks of comments and empty lines, although this is just a rare/contrived corner case. - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to infer the total number of features to construct result {{LabeledPoint}} instances. Unfortunately, with our current API, this kind of global information can't be gathered. The solution proposed here is to add a {{prepareRead}} method, which accepts the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which contains an {{Option\[StructType\]}} for the inferred schema and a {{Map\[String, Any\]}} for any gathered global information. This {{ReadContext}} is then passed to {{buildReader()}}. By default, {{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema itself can be considered a sort of global information). was: One problem of our newly introduced {{FileFormat.buildReader()}} method is that it only sees pieces of input files. On the other hand, data sources like CSV and LibSVM require some sort of global information: - CSV: the content of the header line if the {{header}} option is set to true, so that we can filter out header lines within each input file. This is considered global information because it's possible that the header appears in the middle of a file after blocks of comments and empty lines, although this is just a rare/contrived corner case. - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to infer the total number of features to construct result {{LabeledPoint}}s. Unfortunately, with our current API, this kind of global information can't be gathered. The solution proposed here is to add a {{prepareRead}} method, which accepts the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which contains an {{Option\[StructType\]}} for the inferred schema and a {{Map\[String, Any\]}} for any gathered global information. This {{ReadContext}} is then passed to {{buildReader()}}. By default, {{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema itself can be considered a sort of global information). > Replaces inferSchema with prepareRead to collect necessary global information > - > > Key: SPARK-14274 > URL: https://issues.apache.org/jira/browse/SPARK-14274 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > One problem of our newly introduced {{FileFormat.buildReader()}} method is > that it only sees pieces of input files. On the other hand, data sources like > CSV and LibSVM require some sort of global information: > - CSV: the content of the header line if the {{header}} option is set to true, so > that we can filter out header lines within each input file.
This is > considered global information because it's possible that the header > appears in the middle of a file after blocks of comments and empty lines, > although this is just a rare/contrived corner case. > - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset > to infer the total number of features to construct result {{LabeledPoint}} > instances. > Unfortunately, with our current API, this kind of global information can't be > gathered. > The solution proposed here is to add a {{prepareRead}} method, which accepts > the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which > contains an {{Option\[StructType\]}} for the inferred schema and a > {{Map\[String, Any\]}} for any gathered global information. This > {{ReadContext}} is then passed to {{buildReader()}}. By default, > {{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema > itself can be considered a sort of global information). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
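A compact sketch of the proposed shape, with {{ReadContext}} as described above; the {{FileFormat}} trait here is a stand-in with simplified signatures, not Spark's actual interface.
{code}
import org.apache.spark.sql.types.StructType

// ReadContext as described: the inferred schema plus any gathered global information.
case class ReadContext(schema: Option[StructType], globalInfo: Map[String, Any])

// Stand-in trait with simplified signatures, for illustration only.
trait FileFormat {
  def inferSchema(files: Seq[String]): Option[StructType]

  // Default implementation simply wraps inferSchema; formats such as CSV or
  // LibSVM would override this to also collect header lines, feature counts, etc.
  def prepareRead(files: Seq[String]): ReadContext =
    ReadContext(inferSchema(files), Map.empty)
}
{code}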
[jira] [Created] (SPARK-14274) Replaces inferSchema with prepareRead to collect necessary global information
Cheng Lian created SPARK-14274: -- Summary: Replaces inferSchema with prepareRead to collect necessary global information Key: SPARK-14274 URL: https://issues.apache.org/jira/browse/SPARK-14274 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian One problem of our newly introduced {{FileFormat.buildReader()}} method is that it only sees pieces of input files. On the other hand, data sources like CSV and LibSVM require some sort of global information: - CSV: the content of the header line if the {{header}} option is set to true, so that we can filter out header lines within each input file. This is considered global information because it's possible that the header appears in the middle of a file after blocks of comments and empty lines, although this is just a rare/contrived corner case. - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to infer the total number of features to construct result {{LabeledPoint}}s. Unfortunately, with our current API, this kind of global information can't be gathered. The solution proposed here is to add a {{prepareRead}} method, which accepts the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which contains an {{Option\[StructType\]}} for the inferred schema and a {{Map\[String, Any\]}} for any gathered global information. This {{ReadContext}} is then passed to {{buildReader()}}. By default, {{prepareRead}} simply calls {{inferSchema}} (actually the inferred schema itself can be considered a sort of global information). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14273) Add FileFormat.isSplittable to indicate whether a format is splittable
Cheng Lian created SPARK-14273: -- Summary: Add FileFormat.isSplittable to indicate whether a format is splittable Key: SPARK-14273 URL: https://issues.apache.org/jira/browse/SPARK-14273 Project: Spark Issue Type: Sub-task Affects Versions: 2.0.0 Reporter: Cheng Lian {{FileSourceStrategy}} assumes that all data source formats are splittable and always splits data files by a fixed partition size. However, not all HDFS-based formats are splittable. We need a flag to indicate that and ensure that non-splittable files won't be split into multiple Spark partitions. (PS: Is it "splitable" or "splittable"? Probably the latter one? Hadoop uses the former one though...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
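To make the intent concrete, here is a small sketch of how such a flag would guard splitting; the types and the splitting rule are simplified assumptions, not the actual strategy code.
{code}
// Hypothetical, simplified slice of a file handed to one Spark partition.
case class FileSlice(path: String, start: Long, length: Long)

def planSlices(path: String, fileLen: Long, splittable: Boolean, maxSplitBytes: Long): Seq[FileSlice] =
  if (!splittable) {
    // A non-splittable format must be read as a single slice covering the whole file.
    Seq(FileSlice(path, 0L, fileLen))
  } else {
    (0L until fileLen by maxSplitBytes).map { start =>
      FileSlice(path, start, math.min(maxSplitBytes, fileLen - start))
    }
  }
{code}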
[jira] [Resolved] (SPARK-14114) implement buildReader for text data source
[ https://issues.apache.org/jira/browse/SPARK-14114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14114. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11934 [https://github.com/apache/spark/pull/11934] > implement buildReader for text data source > -- > > Key: SPARK-14114 > URL: https://issues.apache.org/jira/browse/SPARK-14114 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14244) Physical Window operator uses global SizeBasedWindowFunction.n attribute generated on both driver and executor side
Cheng Lian created SPARK-14244: -- Summary: Physical Window operator uses global SizeBasedWindowFunction.n attribute generated on both driver and executor side Key: SPARK-14244 URL: https://issues.apache.org/jira/browse/SPARK-14244 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1, 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian To reproduce this issue, first start a local cluster with at least one worker. Then try the following Spark shell snippet: {code} import org.apache.spark.sql.expressions._ import org.apache.spark.sql.functions._ sqlContext. range(10). select( 'id, cume_dist() over (Window orderBy 'id) as 'cdist ). orderBy('cdist). show() {code} Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 11, 192.168.1.101): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: window__partition__size#4 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:92) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:301) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) at scala.collection.AbstractIterator.to(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:350) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:301) at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) at scala.collection.AbstractIterator.to(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:350) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:248) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:91
[jira] [Resolved] (SPARK-14208) Rename "spark.sql.parquet.fileScan"
[ https://issues.apache.org/jira/browse/SPARK-14208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14208. Resolution: Fixed Issue resolved by pull request 12003 [https://github.com/apache/spark/pull/12003] > Rename "spark.sql.parquet.fileScan" > --- > > Key: SPARK-14208 > URL: https://issues.apache.org/jira/browse/SPARK-14208 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 2.0.0 > > > This option should be renamed since {{FileScanRDD}} is now used by all > {{HadoopFsRelation}} based data sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations
Cheng Lian created SPARK-14237: -- Summary: De-duplicate partition value appending logic in various buildReader() implementations Key: SPARK-14237 URL: https://issues.apache.org/jira/browse/SPARK-14237 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Various data sources share approximately the same code for partition value appending. It would be nice to factor this logic into a utility method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
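A minimal sketch of what such a shared helper could look like, using the external {{Row}} API for clarity (the real read path works on internal rows; the names here are assumptions).
{code}
import org.apache.spark.sql.Row

// Hypothetical utility: append the partition values to every data row read
// from a file, so each data source doesn't have to reimplement this loop.
def appendPartitionValues(dataRows: Iterator[Row], partitionValues: Row): Iterator[Row] =
  dataRows.map(row => Row.fromSeq(row.toSeq ++ partitionValues.toSeq))
{code}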
[jira] [Updated] (SPARK-14206) buildReader implementation for CSV
[ https://issues.apache.org/jira/browse/SPARK-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14206: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Fix Version/s: (was: 2.0.0) > buildReader implementation for CSV > -- > > Key: SPARK-14206 > URL: https://issues.apache.org/jira/browse/SPARK-14206 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14206) buildReader implementation for CSV
[ https://issues.apache.org/jira/browse/SPARK-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-14206: -- Assignee: Cheng Lian > buildReader implementation for CSV > -- > > Key: SPARK-14206 > URL: https://issues.apache.org/jira/browse/SPARK-14206 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14208) Rename "spark.sql.parquet.fileScan"
Cheng Lian created SPARK-14208: -- Summary: Rename "spark.sql.parquet.fileScan" Key: SPARK-14208 URL: https://issues.apache.org/jira/browse/SPARK-14208 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor This option should be renamed since {{FileScanRDD}} is now used by all {{HadoopFsRelation}} based data sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14206) buildReader implementation for CSV
Cheng Lian created SPARK-14206: -- Summary: buildReader implementation for CSV Key: SPARK-14206 URL: https://issues.apache.org/jira/browse/SPARK-14206 Project: Spark Issue Type: Sub-task Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13456) Cannot create encoders for case classes defined in Spark shell after upgrading to Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-13456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13456. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11931 [https://github.com/apache/spark/pull/11931] > Cannot create encoders for case classes defined in Spark shell after > upgrading to Scala 2.11 > > > Key: SPARK-13456 > URL: https://issues.apache.org/jira/browse/SPARK-13456 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 2.0.0 > > > Spark 2.0 started to use Scala 2.11 by default since [PR > #10608|https://github.com/apache/spark/pull/10608]. Unfortunately, after > this upgrade, Spark fails to create encoders for case classes defined in REPL: > {code} > import sqlContext.implicits._ > case class T(a: Int, b: Double) > val ds = Seq(1 -> T(1, 1D), 2 -> T(2, 2D)).toDS() > {code} > Exception thrown: > {noformat} > org.apache.spark.sql.AnalysisException: Unable to generate an encoder for > inner class `T` without access to the scope that this class was defined in. > Try moving this class out of its parent class.; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:565) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$resolveDeserializer$1.applyOrElse(Analyzer.scala:561) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:262) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:304) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) > at scala.collection.AbstractIterator.to(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:353) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:333) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at scala.collection.mutable.ArrayBuffer
[jira] [Updated] (SPARK-14146) Imported implicits can't be found in Spark REPL in some cases
[ https://issues.apache.org/jira/browse/SPARK-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14146: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Component/s: Spark Core Summary: Imported implicits can't be found in Spark REPL in some cases (was: imported implicit can't be found in Spark REPL in some case) > Imported implicits can't be found in Spark REPL in some cases > - > > Key: SPARK-14146 > URL: https://issues.apache.org/jira/browse/SPARK-14146 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > >
> {code}
> class I(i: Int) {
>   def double: Int = i * 2
> }
> class Context {
>   implicit def toI(i: Int): I = new I(i)
> }
> val c = new Context
> import c._
> // OK
> 1.double
> // Fail
> class A; 1.double
> {code}
> The above code snippet works in the Scala REPL, however. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14114) implement buildReader for text data source
[ https://issues.apache.org/jira/browse/SPARK-14114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14114: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Fix Version/s: (was: 2.0.0) > implement buildReader for text data source > -- > > Key: SPARK-14114 > URL: https://issues.apache.org/jira/browse/SPARK-14114 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14116) buildReader implementation for ORC
Cheng Lian created SPARK-14116: -- Summary: buildReader implementation for ORC Key: SPARK-14116 URL: https://issues.apache.org/jira/browse/SPARK-14116 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13549) Refactor the Optimizer Rule CollapseProject
[ https://issues.apache.org/jira/browse/SPARK-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13549. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11427 [https://github.com/apache/spark/pull/11427] > Refactor the Optimizer Rule CollapseProject > --- > > Key: SPARK-13549 > URL: https://issues.apache.org/jira/browse/SPARK-13549 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 2.0.0 > > > Duplicate code exists in CollapseProject. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13549) Refactor the Optimizer Rule CollapseProject
[ https://issues.apache.org/jira/browse/SPARK-13549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13549: --- Assignee: Xiao Li > Refactor the Optimizer Rule CollapseProject > --- > > Key: SPARK-13549 > URL: https://issues.apache.org/jira/browse/SPARK-13549 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > > Duplicate code exists in CollapseProject. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13772. Resolution: Fixed Fix Version/s: 1.6.2 Issue resolved by pull request 11605 [https://github.com/apache/spark/pull/11605] > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > Fix For: 1.6.2 > > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Assignee: cen yuhai > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai >Assignee: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Target Version/s: 1.6.2 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13772) DataType mismatch about decimal
[ https://issues.apache.org/jira/browse/SPARK-13772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13772: --- Description: Code snippet to reproduce this issue using 1.6.0: {code} select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test {code} It will throw exceptions like this: {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 {noformat} I also tested: {code} select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; {code} {noformat} Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 {noformat} was: I found a bug: select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test It will throw exceptions like this: Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and decimal(10,0)).; line 1 pos 37 I also tested: select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > DataType mismatch about decimal > --- > > Key: SPARK-13772 > URL: https://issues.apache.org/jira/browse/SPARK-13772 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: spark1.6.0 hadoop2.2.0 jdk1.7.0_79 >Reporter: cen yuhai > > Code snippet to reproduce this issue using 1.6.0: > {code} > select if(1=1, cast(1 as double), cast(1.1 as decimal)) as a from test > {code} > It will throw exceptions like this: > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as double) else cast(1.1 > as decimal(10,0))' due to data type mismatch: differing types in 'if ((1 = > 1)) cast(1 as double) else cast(1.1 as decimal(10,0))' (double and > decimal(10,0)).; line 1 pos 37 > {noformat} > I also tested: > {code} > select if(1=1,cast(1 as decimal),cast(1 as decimal(19,6))) from test; > {code} > {noformat} > Error in query: cannot resolve 'if ((1 = 1)) cast(1 as decimal(10,0)) else > cast(1 as decimal(19,6))' due to data type mismatch: differing types in 'if > ((1 = 1)) cast(1 as decimal(10,0)) else cast(1 as decimal(19,6))' > (decimal(10,0) and decimal(19,6)).; line 1 pos 38 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
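For context, the usual way such branches are reconciled is to widen both sides to a common decimal type that keeps the larger integral part and the larger scale; a small sketch of that rule follows (illustrative only, not the actual resolver code).
{code}
// Hypothetical, simplified decimal type.
case class Dec(precision: Int, scale: Int)

// Widen two decimal types: keep the larger scale plus enough integral digits
// to hold either side, e.g. widerDecimal(Dec(10, 0), Dec(19, 6)) == Dec(19, 6).
def widerDecimal(a: Dec, b: Dec): Dec = {
  val scale = math.max(a.scale, b.scale)
  val integral = math.max(a.precision - a.scale, b.precision - b.scale)
  Dec(integral + scale, scale)
}
{code}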
[jira] [Resolved] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13774. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11775 [https://github.com/apache/spark/pull/11775] > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > Fix For: 2.0.0 > > > Think the error message should be improved for files that could not be found. > The {{Path}} seems given. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13774) IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
[ https://issues.apache.org/jira/browse/SPARK-13774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13774: --- Assignee: Sunitha Kambhampati > IllegalArgumentException: Can not create a Path from an empty string for > incorrect file path > > > Key: SPARK-13774 > URL: https://issues.apache.org/jira/browse/SPARK-13774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Sunitha Kambhampati >Priority: Minor > > > I think the error message should be improved for files that cannot be found. > The {{Path}} is clearly given, yet the error complains about an empty string. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT > /_/ > Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74) > Type in expressions to have them evaluated. > Type :help for more information. > scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv") > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.<init>(Path.java:134) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:976) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.take(RDD.scala:1246) > at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:352) > at org.apache.spark.rdd.RDD.first(RDD.scala:1285) > at >
org.apache.spark.sql.execution.datasources.csv.DefaultSource.findFirstLine(DefaultSource.scala:156) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:58) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$13.apply(DataSource.scala:213) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:212) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
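The fix in pull request 11775 replaces this failure with a clearer, path-aware error. Below is a minimal sketch of the validation idea, using only standard Hadoop filesystem calls; the helper name and wiring are hypothetical, not the actual code from the PR (which reports the problem through Spark's own exception types):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: fail fast with a readable message instead of letting
// an empty or missing Path surface as an IllegalArgumentException deep
// inside Hadoop's FileInputFormat.
def checkPathExists(pathString: String, hadoopConf: Configuration): Unit = {
  require(pathString.nonEmpty, "Input path must not be empty")
  val path = new Path(pathString)
  val fs: FileSystem = path.getFileSystem(hadoopConf)
  if (!fs.exists(path)) {
    throw new java.io.FileNotFoundException(s"Path does not exist: $path")
  }
}
{code}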
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Description: Release note update: {quote} Starting from 2.0.0, Spark SQL handles views natively by default. When a view is defined, Spark SQL now canonicalizes the view definition by generating a canonical SQL statement from the parsed logical query plan and stores that statement in the catalog. If you hit any problems, you can fall back to the old behavior by setting {{spark.sql.nativeView}} to false. {quote} > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When a view > is defined, Spark SQL now canonicalizes the view definition by generating a > canonical SQL statement from the parsed logical query plan and stores that > statement in the catalog. If you hit any problems, you can fall back to the old > behavior by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
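To make the fallback in the release note concrete, this is what flipping the flag looks like in a 2.0-era spark-shell (illustrative only):
{code}
// Revert to the pre-2.0, Hive-based view handling if native views misbehave.
sqlContext.setConf("spark.sql.nativeView", "false")

// Equivalently, through SQL:
sqlContext.sql("SET spark.sql.nativeView=false")
{code}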
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Labels: releasenotes (was: ) > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: releasenotes > > Release note update: > {quote} > Starting from 2.0.0, Spark SQL handles views natively by default. When a view > is defined, Spark SQL now canonicalizes the view definition by generating a > canonical SQL statement from the parsed logical query plan and stores that > statement in the catalog. If you hit any problems, you can fall back to the old > behavior by setting {{spark.sql.nativeView}} to false. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14038) Enable native view by default
[ https://issues.apache.org/jira/browse/SPARK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14038: --- Assignee: Wenchen Fan > Enable native view by default > - > > Key: SPARK-14038 > URL: https://issues.apache.org/jira/browse/SPARK-14038 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14000) case class with a tuple field can't work in Dataset
[ https://issues.apache.org/jira/browse/SPARK-14000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14000. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11816 [https://github.com/apache/spark/pull/11816] > case class with a tuple field can't work in Dataset > --- > > Key: SPARK-14000 > URL: https://issues.apache.org/jira/browse/SPARK-14000 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Wenchen Fan > Fix For: 2.0.0 > > > For example, given `case class TupleClass(data: (Int, String))`, we can create an > encoder for it, but creating a Dataset with it fails while validating the encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
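A minimal spark-shell reproduction of the failure mode described above, assuming a 2.0-era shell where {{sqlContext}} and its implicits are in scope:
{code}
case class TupleClass(data: (Int, String))

import sqlContext.implicits._

// Deriving the encoder alone succeeded, but constructing the Dataset
// failed during encoder validation before this fix:
val ds = Seq(TupleClass((1, "a")), TupleClass((2, "b"))).toDS()
ds.show()
{code}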
[jira] [Resolved] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14004. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11820 [https://github.com/apache/spark/pull/11820] > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 2.0.0 > > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
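As a standalone illustration (not the actual Spark code) of why taking only the first qualifier is sufficient, here is a sketch of SQL name construction under the assumption stated above, that at most one qualifier is ever present:
{code}
// Qualifiers are modeled as Seq[String], but at most one entry ever occurs,
// so joining all of them is wrong; take the head (if any) instead.
def toSQL(qualifier: Seq[String], name: String): String =
  (qualifier.headOption.toSeq :+ name).map(part => s"`$part`").mkString(".")

toSQL(Seq("t1"), "id")   // `t1`.`id`
toSQL(Seq.empty, "id")   // `id`
{code}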
[jira] [Updated] (SPARK-13972) hive tests should fail if SQL generation fails
[ https://issues.apache.org/jira/browse/SPARK-13972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13972: --- Assignee: Wenchen Fan > hive tests should fail if SQL generation fails > --- > > Key: SPARK-13972 > URL: https://issues.apache.org/jira/browse/SPARK-13972 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14001) support multi-children Union in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14001: --- Assignee: Wenchen Fan > support multi-children Union in SQLBuilder > -- > > Key: SPARK-14001 > URL: https://issues.apache.org/jira/browse/SPARK-14001 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)
[ https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12719: --- Assignee: Wenchen Fan > SQL generation support for generators (including UDTF) > -- > > Key: SPARK-12719 > URL: https://issues.apache.org/jira/browse/SPARK-12719 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Wenchen Fan > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
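For context, generators expand one input row into zero or more output rows, and SQL generation must be able to round-trip queries that use them. The kind of query this sub-task targets looks like the following (the table and column names are made up):
{code:sql}
-- SQLBuilder needs to regenerate LATERAL VIEW clauses like this one:
SELECT key, value
FROM t0
LATERAL VIEW explode(kv_map) kv AS key, value
{code}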
[jira] [Resolved] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
[ https://issues.apache.org/jira/browse/SPARK-14002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14002. Resolution: Duplicate Fix Version/s: 2.0.0 This issue is actually covered by SPARK-13976. > SQLBuilder should add subquery to Aggregate child when necessary > > > Key: SPARK-14002 > URL: https://issues.apache.org/jira/browse/SPARK-14002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > Add the following test case to {{LogicalPlanToSQLSuite}} to reproduce this > issue: > {code} > test("bug") { > checkHiveQl( > """SELECT COUNT(id) > |FROM > |( > | SELECT id FROM t0 > |) subq > """.stripMargin > ) > } > {code} > The generated (wrong) SQL is: > {code:sql} > SELECT `gen_attr_46` AS `count(id)` > FROM > ( > SELECT count(`gen_attr_45`) AS `gen_attr_46` > FROM -- a subquery wrapping the projection below is missing here > SELECT `gen_attr_45` > FROM > ( > SELECT `id` AS `gen_attr_45` > FROM `default`.`t0` > ) AS gen_subquery_0 > ) AS gen_subquery_1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
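For comparison, a correctly generated statement wraps the inner projection in its own subquery; the extra alias below is illustrative, not taken from the actual fix:
{code:sql}
SELECT `gen_attr_46` AS `count(id)`
FROM (
  SELECT count(`gen_attr_45`) AS `gen_attr_46`
  FROM (
    SELECT `gen_attr_45`
    FROM (
      SELECT `id` AS `gen_attr_45`
      FROM `default`.`t0`
    ) AS gen_subquery_0
  ) AS gen_subquery_2
) AS gen_subquery_1
{code}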
[jira] [Updated] (SPARK-13974) sub-query names do not need to be globally unique while generating SQL
[ https://issues.apache.org/jira/browse/SPARK-13974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-13974: --- Assignee: Wenchen Fan > sub-query names do not need to be globally unique while generating SQL > > > Key: SPARK-13974 > URL: https://issues.apache.org/jira/browse/SPARK-13974 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
[ https://issues.apache.org/jira/browse/SPARK-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14004: --- Priority: Minor (was: Major) > AttributeReference and Alias should only use their first qualifier to build > SQL representations > --- > > Key: SPARK-14004 > URL: https://issues.apache.org/jira/browse/SPARK-14004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > Current implementation joins all qualifiers, which is wrong. > However, this doesn't cause any real SQL generation bugs as there is always > at most one qualifier for any given {{AttributeReference}} or {{Alias}}. > We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to > represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14004) AttributeReference and Alias should only use their first qualifier to build SQL representations
Cheng Lian created SPARK-14004: -- Summary: AttributeReference and Alias should only use their first qualifier to build SQL representations Key: SPARK-14004 URL: https://issues.apache.org/jira/browse/SPARK-14004 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Current implementation joins all qualifiers, which is wrong. However, this doesn't cause any real SQL generation bugs as there is always at most one qualifier for any given {{AttributeReference}} or {{Alias}}. We can probably use {{Option\[String\]}} instead of {{Seq\[String\]}} to represent qualifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14002) SQLBuilder should add subquery to Aggregate child when necessary
Cheng Lian created SPARK-14002: -- Summary: SQLBuilder should add subquery to Aggregate child when necessary Key: SPARK-14002 URL: https://issues.apache.org/jira/browse/SPARK-14002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Add the following test case to {{LogicalPlanToSQLSuite}} to reproduce this issue: {code} test("bug") { checkHiveQl( """SELECT COUNT(id) |FROM |( | SELECT id FROM t0 |) subq """.stripMargin ) } {code} The generated (wrong) SQL is: {code:sql} SELECT `gen_attr_46` AS `count(id)` FROM ( SELECT count(`gen_attr_45`) AS `gen_attr_46` FROM -- a subquery wrapping the projection below is missing here SELECT `gen_attr_45` FROM ( SELECT `id` AS `gen_attr_45` FROM `default`.`t0` ) AS gen_subquery_0 ) AS gen_subquery_1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org