[jira] [Resolved] (SPARK-6546) Using the wrong code that will make spark compile failed!!
[ https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6546. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5198 [https://github.com/apache/spark/pull/5198] Using the wrong code that will make spark compile failed!! --- Key: SPARK-6546 URL: https://issues.apache.org/jira/browse/SPARK-6546 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: DoingDone9 Assignee: DoingDone9 Fix For: 1.4.0 Wrong code: {{val tmpDir = Files.createTempDir()}}; it should use {{Utils}} instead of {{Files}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6546) Using the wrong code that will make spark compile failed!!
[ https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6546: -- Affects Version/s: 1.4.0 [...]
[jira] [Commented] (SPARK-6546) Build failure caused by PR #5029 together with #4289
[ https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381606#comment-14381606 ] Cheng Lian commented on SPARK-6546: --- Updated ticket title and description to reflect the root cause. Build failure caused by PR #5029 together with #4289 Key: SPARK-6546 URL: https://issues.apache.org/jira/browse/SPARK-6546 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Pei, Zhongshuai Assignee: Pei, Zhongshuai Fix For: 1.4.0 PR [#4289|https://github.com/apache/spark/pull/4289] was using Guava's {{com.google.common.io.Files}} according to the first commit of that PR, see [here|https://github.com/jeanlyn/spark/blob/3b27af36f82580c2171df965140c9a14e62fd5f0/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L22]. However, [PR #5029|https://github.com/apache/spark/pull/5029] was merged earlier and replaced the Guava {{Files}} usage with {{Utils}}. The two combined caused this build failure. (There are no conflicts in the eyes of Git, but semantic conflicts do exist.)
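This is the classic "semantic conflict" failure mode: each PR compiles on its own, but one PR removes an API the other still calls, and Git sees no textual conflict when both merge. The helper below is a hypothetical stand-in (not Spark's actual {{Utils.createTempDir}}) sketching the kind of project-local temp-dir utility that replaced the Guava call: it wraps the JDK API and registers cleanup itself.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TempDirExample {
    // Hypothetical stand-in for a project-local temp-dir helper: unlike
    // Guava's Files.createTempDir(), it also schedules the directory for
    // deletion on JVM exit (a rough approximation of that behavior).
    static File createTempDir() throws IOException {
        File dir = Files.createTempDirectory("spark-test").toFile();
        dir.deleteOnExit();
        return dir;
    }

    public static void main(String[] args) throws Exception {
        File tmpDir = createTempDir();  // replaces Guava's Files.createTempDir()
        System.out.println(tmpDir.isDirectory());
    }
}
```

Call sites written against the Guava class fail to compile once the import is gone, which is exactly what the combined merge produced here.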
[jira] [Created] (SPARK-6555) Override equals and hashCode in MetastoreRelation
Cheng Lian created SPARK-6555: - Summary: Override equals and hashCode in MetastoreRelation Key: SPARK-6555 URL: https://issues.apache.org/jira/browse/SPARK-6555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2 Reporter: Cheng Lian This is a follow-up of SPARK-6450. As explained in [this comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499] of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 release, but overriding {{equals}} and {{hashCode}} is the proper fix to that problem.
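The contract the ticket asks for is the standard one: when {{equals}} is overridden, {{hashCode}} must be derived from the same fields, so that two logically identical relations compare equal and hash identically. A minimal sketch of that contract follows; the {{Relation}} class is hypothetical, not {{MetastoreRelation}} itself.

```java
import java.util.Objects;

// Hypothetical relation class illustrating the equals/hashCode contract:
// equality and the hash code are both derived from the same identity fields.
final class Relation {
    final String database;
    final String table;

    Relation(String database, String table) {
        this.database = database;
        this.table = table;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Relation)) return false;
        Relation r = (Relation) o;
        return database.equals(r.database) && table.equals(r.table);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: computed from the same fields.
        return Objects.hash(database, table);
    }

    public static void main(String[] args) {
        Relation a = new Relation("default", "t");
        Relation b = new Relation("default", "t");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

Without both overrides, two instances describing the same table are distinct under reference equality, which is what made the surgical fix in SPARK-6450 necessary in the first place.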
[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6554: -- Target Version/s: 1.3.1, 1.4.0 Cannot use partition columns in where clause Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. The directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I load a DF using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql("select probeTypeId from df limit 1"); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1"); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! 
at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... 
{noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon) INFO akka.event.slf4j.Slf4jLogger Slf4jLogger started INFO Remoting Starting remoting INFO Remoting Remoting started; listening on addresses
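Later comments on this ticket note that Parquet filter push-down is not enabled by default in Spark 1.3.0, so the failure above only appears after push-down has been switched on. A workaround is therefore to turn it back off; the configuration key below is assumed to be the Spark 1.3 setting for this feature, worth verifying against the documentation for your version:

```
# Assumed Spark SQL setting controlling Parquet filter push-down (Spark 1.3):
spark.sql.parquet.filterPushdown=false
```

The push-down optimization itself is what hands the `probeTypeId = 1` predicate to Parquet, which then rejects it because partition columns come from directory names and are absent from the file schema.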
[jira] [Assigned] (SPARK-6554) Cannot use partition columns in where clause
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6554: - Assignee: Cheng Lian [...]
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381703#comment-14381703 ] Cheng Lian commented on SPARK-6481: --- Maybe unrelated to this issue, but I saw a lot of JIRA notifications about Assignee updates, jumping between a normal user and Apache Spark. Is this behavior a side effect of the In Progress PR? (Seems caused by [this code block|https://github.com/databricks/spark-pr-dashboard/pull/49/files#diff-6f3562e8b8a773341837373ab53b5462R34].) Set In Progress when a PR is opened for an issue -- Key: SPARK-6481 URL: https://issues.apache.org/jira/browse/SPARK-6481 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Michael Armbrust Assignee: Nicholas Chammas [~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request.
[jira] [Resolved] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):
[ https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6465. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5191 [https://github.com/apache/spark/pull/5191] GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor): -- Key: SPARK-6465 URL: https://issues.apache.org/jira/browse/SPARK-6465 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark 1.3, YARN 2.6.0, CentOS Reporter: Earthson Lu Assignee: Michael Armbrust Priority: Critical Fix For: 1.3.1, 1.4.0 Original Estimate: 0.5h Remaining Estimate: 0.5h I cannot find an existing issue for this. The Kryo registration for GenericRowWithSchema is missing from org.apache.spark.sql.execution.SparkSqlSerializer. Is this the only thing we need to do? Here is the log: {code} 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050) at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062) at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {code}
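The exception in the log comes from Kryo's default FieldSerializer, which instantiates objects through a no-arg constructor; a class that only declares parameterized constructors fails unless a suitable serializer is registered for it. The sketch below reproduces the underlying failure mode with plain JDK reflection rather than Kryo itself; the {{Row}} class is a hypothetical stand-in for GenericRowWithSchema.

```java
// Illustrates (with plain reflection, not Kryo) why instantiating through a
// no-arg constructor fails on classes that lack one: declaring any
// constructor suppresses the implicit default constructor, so the lookup
// itself throws NoSuchMethodException.
public class NoArgCtorExample {
    // Hypothetical stand-in for GenericRowWithSchema: only a parameterized
    // constructor is declared.
    static class Row {
        final Object[] values;
        Row(Object[] values) { this.values = values; }
    }

    public static void main(String[] args) {
        try {
            Row.class.getDeclaredConstructor().newInstance();
            System.out.println("created");
        } catch (NoSuchMethodException e) {
            System.out.println("missing no-arg constructor");
        } catch (ReflectiveOperationException e) {
            System.out.println("other reflective failure");
        }
    }
}
```

Registering the class with a serializer that does not rely on a no-arg constructor (as the linked PR does for GenericRowWithSchema) sidesteps this requirement.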
[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6554: -- Description: I'm having trouble referencing partition columns in my queries with Parquet. [...]
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381850#comment-14381850 ] Cheng Lian commented on SPARK-6554: --- Marked this as critical rather than blocker mostly because Parquet filter push-down is not enabled by default in 1.3.0. Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical [...]
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381862#comment-14381862 ] Cheng Lian commented on SPARK-6554: --- Parquet filter push-down isn't enabled by default in 1.3.0 because the most recent Parquet version (1.6.0rc3) available as of the Spark 1.3.0 release suffers from two bugs (PARQUET-136 and PARQUET-173), so it is generally not yet recommended for production use. These two bugs have been fixed in Parquet master, and the official 1.6.0 release should be out pretty soon. We will probably upgrade to Parquet 1.6.0 in Spark 1.4.0. Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical [...]
[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6554: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-5463 Cannot use partition columns in where clause when Parquet filter push-down is enabled - Key: SPARK-6554 URL: https://issues.apache.org/jira/browse/SPARK-6554 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase Assignee: Cheng Lian Priority: Critical I'm having trouble referencing partition columns in my queries with Parquet. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this: {noformat} /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... {noformat} I see the column when I load a DataFrame using the /mydata directory and call df.printSchema(): {noformat} |-- probeTypeId: integer (nullable = true) {noformat} Parquet is also aware of the column: {noformat} optional int32 probeTypeId; {noformat} And this works fine: {code} sqlContext.sql("select probeTypeId from df limit 1"); {code} ...as does {{df.show()}} - it shows the correct values for the partition column. However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema: {noformat} sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1"); ... ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema! 
at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ... ... 
{noformat} Here's the full stack trace: {noformat} using local[*] for master 06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT] 06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT] 06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration. 06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point INFO org.apache.spark.SparkContext Running Spark version 1.3.0 WARN o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable INFO org.apache.spark.SecurityManager Changing view acls to: jon INFO org.apache.spark.SecurityManager Changing modify acls to: jon INFO org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon) INFO
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5463: -- Affects Version/s: 1.3.0 Fix Parquet filter push-down Key: SPARK-5463 URL: https://issues.apache.org/jira/browse/SPARK-5463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381811#comment-14381811 ] Cheng Lian commented on SPARK-6481: --- Aha, so I'm not the only one! Although I just started doing this pretty recently :P Set In Progress when a PR is opened for an issue -- Key: SPARK-6481 URL: https://issues.apache.org/jira/browse/SPARK-6481 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Michael Armbrust Assignee: Nicholas Chammas [~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request.
[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6554: -- Summary: Cannot use partition columns in where clause when Parquet filter push-down is enabled (was: Cannot use partition columns in where clause)
[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381846#comment-14381846 ] Cheng Lian commented on SPARK-6471: --- Bumped to blocker level since this is actually a regression from 1.2. Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Assignee: Yash Datta Priority: Blocker Fix For: 1.3.1, 1.4.0 Currently, in the parquet relation 2 implementation, an error is thrown when the merged schema is not exactly the same as the metastore schema. But to support cases like deleting a column using the REPLACE COLUMNS command, we can relax the restriction so that the query will work as long as the metastore schema is a subset of the merged Parquet schema.
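The relaxation SPARK-6471 describes amounts to a containment check between the two schemas rather than an equality check. A minimal sketch in plain Python (function and schema encoding are illustrative, not Spark's actual implementation):

```python
def metastore_compatible(metastore_schema, merged_parquet_schema):
    """Return True when every metastore column exists, with the same type,
    in the merged Parquet schema (schemas encoded as dicts of column -> type).
    Before the fix the check was strict equality; relaxing it to a subset
    test lets REPLACE COLUMNS drop columns without breaking reads."""
    return all(
        col in merged_parquet_schema and merged_parquet_schema[col] == typ
        for col, typ in metastore_schema.items()
    )

merged = {"id": "int", "event": "string", "extra": "string"}
# Metastore dropped 'extra' via REPLACE COLUMNS: still a valid subset.
print(metastore_compatible({"id": "int", "event": "string"}, merged))  # True
# Type mismatch is still rejected.
print(metastore_compatible({"id": "bigint"}, merged))                  # False
```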
[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6471: -- Priority: Blocker (was: Major) Target Version/s: 1.3.1, 1.4.0
[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381788#comment-14381788 ] Cheng Lian commented on SPARK-6554: --- Hi [~jonchase], did you happen to turn on Parquet filter push-down by setting spark.sql.parquet.filterPushdown to true? The reason behind this is that, in your case, the partition column doesn't exist in the Parquet data file, so the Parquet filter push-down logic sees it as an invalid column. We should remove all predicates that touch partition columns which don't exist in Parquet data files before doing the push-down optimization.
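The fix suggested in the comment above (stripping predicates that touch partition columns before handing filters to Parquet) can be sketched in plain Python; the function name and tuple encoding are illustrative, not Spark's internals:

```python
# Partition columns like probeTypeId live in the directory layout
# (/mydata/probeTypeId=1/...), not inside the Parquet data files, so any
# predicate referencing them must never reach the Parquet push-down path.

def split_pushdown_predicates(predicates, data_file_columns):
    """Split predicates into (pushable, kept_for_spark).

    predicates        -- list of (column, op, value) tuples
    data_file_columns -- set of columns physically present in the data files
    """
    pushable, kept_for_spark = [], []
    for pred in predicates:
        column = pred[0]
        if column in data_file_columns:
            pushable.append(pred)        # safe: Parquet knows this column
        else:
            kept_for_spark.append(pred)  # partition column: evaluate in Spark
    return pushable, kept_for_spark

# The failing query: probeTypeId = 1, where probeTypeId is a partition column.
preds = [("probeTypeId", "=", 1)]
pushable, kept = split_pushdown_predicates(preds, {"name", "value"})
print(pushable)  # [] -- nothing handed to Parquet, so no schema error
print(kept)      # [('probeTypeId', '=', 1)]
```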
[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause
[ https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6554: -- Priority: Critical (was: Major)
[jira] [Resolved] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6471. --- Resolution: Fixed Fix Version/s: 1.3.1 Issue resolved by pull request 5141 [https://github.com/apache/spark/pull/5141]
[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6471: -- Assignee: Yash Datta
[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6575: -- Description: Consider a metastore Parquet table that # doesn't have schema evolution issues # has lots of data files and/or partitions In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. was: Consider a metastore Parquet table that # doesn't have schema evolution issues # has lots of data files and/or partitions In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when coverting such a metastore Parquet table. Add configuration to disable schema merging while converting metastore Parquet tables - Key: SPARK-6575 URL: https://issues.apache.org/jira/browse/SPARK-6575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian
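What such a configuration would control can be sketched in plain Python (names and schema encoding are hypothetical, not Spark's actual code): with merging disabled, only a single file footer is consulted instead of every data file's.

```python
def resolve_table_schema(file_schemas, merge_schema=True):
    """file_schemas: list of dicts (one per Parquet file footer).

    With merge_schema=False only the first footer is consulted -- cheap,
    and correct whenever the table has no schema evolution. With merging
    enabled, every footer is unioned, which is what gets slow on tables
    with many files and/or partitions."""
    if not merge_schema:
        return dict(file_schemas[0])
    merged = {}
    for schema in file_schemas:
        for col, typ in schema.items():
            if merged.setdefault(col, typ) != typ:
                raise ValueError(f"conflicting types for column {col!r}")
    return merged

footers = [{"id": "int"}, {"id": "int", "added_later": "string"}]
print(resolve_table_schema(footers))                      # union of both footers
print(resolve_table_schema(footers, merge_schema=False))  # first footer only
```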
[jira] [Assigned] (SPARK-6608) Make DataFrame.rdd a lazy val
[ https://issues.apache.org/jira/browse/SPARK-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6608: - Assignee: Cheng Lian Make DataFrame.rdd a lazy val - Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Before 1.3.0, {{SchemaRDD.id}} worked as a unique identifier for each {{SchemaRDD}}. In 1.3.0, unlike {{SchemaRDD}}, {{DataFrame}} is no longer an RDD, and {{DataFrame.rdd}} is actually a function which always returns a new RDD instance. Making {{DataFrame.rdd}} a {{lazy val}} should bring the unique identifier back.
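The identity problem described in SPARK-6608 has a direct analog in Python, where functools.cached_property plays the role of a Scala lazy val; the classes below are illustrative stand-ins, not Spark code:

```python
from functools import cached_property

class EagerFrame:
    """Analog of DataFrame.rdd as a def: a fresh object on every call,
    so there is no stable identity to use as a cache key."""
    def rdd(self):
        return object()

class LazyFrame:
    """Analog of the proposed lazy val: computed once on first access,
    then the same instance (and id) forever after."""
    @cached_property
    def rdd(self):
        return object()

eager, lazy_ = EagerFrame(), LazyFrame()
print(eager.rdd() is eager.rdd())  # False -- two calls, two instances
print(lazy_.rdd is lazy_.rdd)      # True  -- id(lazy_.rdd) is a stable key
```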
[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6607: -- Assignee: Liang-Chi Hsieh Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema -- Key: SPARK-6607 URL: https://issues.apache.org/jira/browse/SPARK-6607 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh '(' and ')' are special characters used in Parquet schemas for type annotation. When we run an aggregation query, we obtain attribute names such as MAX(a). If we directly store the generated DataFrame as a Parquet file, this causes a failure when reading and parsing the stored schema string. Several methods could be adopted to solve this. This PR uses the simplest one: just replacing attribute names before generating the Parquet schema based on these attributes. Another possible method might be modifying all aggregation expression names from func(column) to func[column].
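The rename described above can be sketched as a tiny sanitizer; the bracket substitution mirrors the func[column] idea mentioned in the description, and the function name is illustrative, not Spark's actual helper:

```python
def sanitize_parquet_name(name, replacements=(("(", "["), (")", "]"))):
    """'(' and ')' carry type-annotation meaning in Parquet schema strings,
    so aggregate attribute names like MAX(a) must be rewritten before the
    schema is generated. The choice of brackets here is arbitrary -- any
    characters without special meaning in the schema grammar would do."""
    for bad, good in replacements:
        name = name.replace(bad, good)
    return name

print(sanitize_parquet_name("MAX(a)"))       # MAX[a]
print(sanitize_parquet_name("SUM(a + b)"))   # SUM[a + b]
```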
[jira] [Updated] (SPARK-6607) Aggregation attribute name including special chars '(' and ')' should be replaced before generating Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6607: -- Target Version/s: 1.4.0 Affects Version/s: 1.1.1 1.2.1 1.3.0
[jira] [Created] (SPARK-6608) Make DataFrame.rdd a lazy val
Cheng Lian created SPARK-6608: - Summary: Make DataFrame.rdd a lazy val Key: SPARK-6608 URL: https://issues.apache.org/jira/browse/SPARK-6608 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor
[jira] [Resolved] (SPARK-6595) DataFrame self joins with MetastoreRelations fail
[ https://issues.apache.org/jira/browse/SPARK-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6595. --- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/5251 DataFrame self joins with MetastoreRelations fail - Key: SPARK-6595 URL: https://issues.apache.org/jira/browse/SPARK-6595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker
[jira] [Updated] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4226: -- Description: I have a test table defined in Hive as follows: {code:sql} CREATE TABLE sparkbug ( id INT, event STRING ) STORED AS PARQUET; {code} and insert some sample data with ids 1, 2, 3. In a Spark shell, I then create a HiveContext and then execute the following HQL to test out subquery predicates: {code} val hc = HiveContext(hc) hc.hql(select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3))) {code} I get the following error: {noformat} java.lang.RuntimeException: Unsupported language features in query: select customerid from sparkbug where customerid in (select customerid from sparkbug where customerid in (2,3)) TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid scala.NotImplementedError: No parse rules for ASTNode type: 817, text: TOK_SUBQUERY_EXPR : TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP in TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME sparkbug TOK_INSERT TOK_DESTINATION TOK_DIR TOK_TMP_FILE TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL customerid TOK_WHERE TOK_FUNCTION in TOK_TABLE_OR_COL customerid 2 3 TOK_TABLE_OR_COL customerid + org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) {noformat} [This thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] also brings up the lack of subquery support in Spark SQL. It would be nice to have subquery predicate support in a near-future release (1.3, maybe?).
[jira] [Resolved] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6369. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5139 [https://github.com/apache/spark/pull/5139] InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter Key: SPARK-6369 URL: https://issues.apache.org/jira/browse/SPARK-6369 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.1, 1.4.0 Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
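The usual defense against racing speculative attempts, as in SPARK-6369 above, is for each task attempt to write to its own private temporary file and publish the result with an atomic rename, so a half-written file is never visible. A hedged sketch of that idea (not the actual SparkHadoopWriter code; the function name is ours):

```python
import os
import tempfile

def commit_task_output(data: bytes, final_path: str) -> None:
    # Each task attempt gets its own unique temp file, so competing
    # speculative attempts never write to the same target.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Atomic on POSIX: readers see either the previous file or the
    # complete new one, never a partially written mix of both.
    os.replace(tmp_path, final_path)
```

Whichever attempt renames last wins cleanly; without the temp-then-rename step, two attempts appending to one path can interleave and corrupt the output.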
[jira] [Assigned] (SPARK-6555) Override equals and hashCode in MetastoreRelation
[ https://issues.apache.org/jira/browse/SPARK-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6555: - Assignee: Cheng Lian Override equals and hashCode in MetastoreRelation - Key: SPARK-6555 URL: https://issues.apache.org/jira/browse/SPARK-6555 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian This is a follow-up of SPARK-6450. As explained in [this comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499] of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 release. But overriding {{equals}} and {{hashCode}} is the proper fix to that problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6618. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 HiveMetastoreCatalog.lookupRelation should use fine-grained lock Key: SPARK-6618 URL: https://issues.apache.org/jira/browse/SPARK-6618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.1, 1.4.0 Right now the entire HiveMetastoreCatalog.lookupRelation method holds a lock (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173), and the scope of the lock covers resolving data source tables (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). So lookupRelation can be extremely expensive when we are doing expensive operations like Parquet schema discovery. We should use a fine-grained lock for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
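The fine-grained locking SPARK-6618 asks for amounts to shrinking the critical section to the cache probe and the publish step, so the expensive resolution (e.g. Parquet schema discovery) runs unlocked. A sketch of the pattern under assumed names (not the HiveMetastoreCatalog code):

```python
import threading

class RelationCache:
    """Toy lookup cache illustrating coarse vs fine-grained locking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def lookup(self, name, resolve):
        # Short critical section #1: probe the cache.
        with self._lock:
            relation = self._cache.get(name)
        if relation is not None:
            return relation
        # Expensive work (e.g. schema discovery) happens with the lock
        # released, so other lookups are not serialized behind it.
        relation = resolve(name)
        # Short critical section #2: publish; the first writer wins.
        with self._lock:
            return self._cache.setdefault(name, relation)
```

The trade-off is that `resolve` may run more than once if two threads miss concurrently on the same name; `setdefault` keeps the cached result consistent regardless of which thread publishes first.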
[jira] [Resolved] (SPARK-6542) Add CreateStruct as an Expression
[ https://issues.apache.org/jira/browse/SPARK-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6542. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5195 [https://github.com/apache/spark/pull/5195] Add CreateStruct as an Expression - Key: SPARK-6542 URL: https://issues.apache.org/jira/browse/SPARK-6542 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Similar to CreateArray, we can add CreateStruct as an Expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Description: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) sql(DROP TABLE IF EXISTS table_with_partition ) sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}') sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData) // Add new columns to the table sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string)) sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData) sql(SELECT * FROM table_with_partition WHERE ds = '1').collect().foreach(println) {code} Actual result: {noformat} [1,1,null,null,1] [2,2,null,null,1] {noformat} Expected result: {noformat} [1,1,test,1.11,1] [2,2,test,1.11,1] {noformat} was: In hive,the schema of partition may be difference from the table schema. For example, we add new column. When we use spark-sql to query the data of partition which schema is difference from the table schema. 
Some problems have been solved at PR4289 (https://github.com/apache/spark/pull/4289), but if we add new column, and put new data into the old partition schema,new column value is NULL [According to the following steps]: -- case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) sql(DROP TABLE IF EXISTS table_with_partition ) sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ) sql(INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData) // add column to table sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string)) sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) sql(INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value,'test',1.11 FROM testData) sql(select * from table_with_partition where ds='1' ).collect().foreach(println) - result: [1,1,null,null,1] [2,2,null,null,1] result we expect: [1,1,test,1.11,1] [2,2,test,1.11,1] This bug will cause the wrong query number ,when we query : select count(1) from table_with_partition where key1 is not NULL [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL -- Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Summary: After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL (was: [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. 
The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
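The failure mode behind SPARK-6644 can be shown without Hive: if the reader projects rows through the stale per-partition schema instead of the schema the rows were actually written with, every column added by ALTER TABLE comes back as NULL. A toy model (column names match the snippet above; the function itself is ours, not Spark code):

```python
def project_row(raw_values, writer_schema, reader_schema):
    # Pair each value with the column name it was written under...
    row = dict(zip(writer_schema, raw_values))
    # ...then project to the reader's schema; unknown columns become None (NULL).
    return [row.get(col) for col in reader_schema]

table_schema = ["key", "value", "key1", "destlng"]
stale_partition_schema = ["key", "value"]  # partition metadata predates ALTER TABLE

# Bug: projecting through the stale partition schema silently drops the
# values written for the newly added columns.
buggy = project_row([1, "1", "test", 1.11], stale_partition_schema, table_schema)

# Fix: resolve columns against the schema the data was actually written with.
fixed = project_row([1, "1", "test", 1.11], table_schema, table_schema)
```

This mirrors the reported output: the buggy projection yields NULLs for {{key1}} and {{destlng}} even though real values were inserted.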
[jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6644: -- Description: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) sql(DROP TABLE IF EXISTS table_with_partition ) sql(sCREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}') sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData) // Add new columns to the table sql(ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)) sql(ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)) sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData) sql(SELECT * FROM table_with_partition WHERE ds = '1').collect().foreach(println) {code} Actual result: {noformat} [1,1,null,null,1] [2,2,null,null,1] {noformat} Expected result: {noformat} [1,1,test,1.11,1] [2,2,test,1.11,1] {noformat} was: In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. 
When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue: {code} case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 2).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) sql(DROP TABLE IF EXISTS table_with_partition ) sql(sCREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}') sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData) // Add new columns to the table sql(ALTER TABLE table_with_partition ADD COLUMNS(key1 string)) sql(ALTER TABLE table_with_partition ADD COLUMNS(destlng double)) sql(INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData) sql(SELECT * FROM table_with_partition WHERE ds = '1').collect().foreach(println) {code} Actual result: {noformat} [1,1,null,null,1] [2,2,null,null,1] {noformat} Expected result: {noformat} [1,1,test,1.11,1] [2,2,test,1.11,1] {noformat} After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL Key: SPARK-6644 URL: https://issues.apache.org/jira/browse/SPARK-6644 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: dongxu In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. 
[jira] [Commented] (SPARK-6566) Update Spark to use the latest version of Parquet libraries
[ https://issues.apache.org/jira/browse/SPARK-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383627#comment-14383627 ] Cheng Lian commented on SPARK-6566: --- Hi [~k.shaposhni...@gmail.com], as described in SPARK-5463, we do want to upgrade Parquet. However, we currently have two concerns: # The most recent Parquet RC release introduces subtle API incompatibilities related to filter push-down and Parquet metadata gathering, which I believe requires more work than the patch you provided if we want everything to work perfectly with the best performance. # We'd like to wait for the official release of Parquet 1.6.0. This is the first release of Parquet as an Apache top-level project, so it is taking more time than usual. We will probably first try to upgrade to the most recent 1.6.0 RC release in Spark master, and then switch to the official 1.6.0 release in Spark 1.4.0 (and Spark 1.3.2 if there is one). Update Spark to use the latest version of Parquet libraries --- Key: SPARK-6566 URL: https://issues.apache.org/jira/browse/SPARK-6566 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov There are a lot of bug fixes in the latest version of Parquet (1.6.0rc7), e.g. PARQUET-136. It would be good to update Spark to use the latest Parquet version.
The following changes are required:
{code}
diff --git a/pom.xml b/pom.xml
index 5ad39a9..095b519 100644
--- a/pom.xml
+++ b/pom.xml
@@ -132,7 +132,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>0.13.1</hive.version.short>
     <derby.version>10.10.1.1</derby.version>
-    <parquet.version>1.6.0rc3</parquet.version>
+    <parquet.version>1.6.0rc7</parquet.version>
     <jblas.version>1.2.3</jblas.version>
     <jetty.version>8.1.14.v20131031</jetty.version>
     <orbit.version>3.0.0.v201112011016</orbit.version>
{code}
and
{code}
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
@@ -480,7 +480,7 @@ private[parquet] class FilteringParquetRowInputFormat
     globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
       mergedMetadata, globalMetaData.getCreatedBy)
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
       new InitContext(configuration, globalMetaData.getKeyValueMetaData, globalMetaData.getSchema))
{code}
I am happy to prepare a pull request if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6565) Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF
Cheng Lian created SPARK-6565: - Summary: Deprecate jsonRDD and replace it by jsonDataFrame / jsonDF Key: SPARK-6565 URL: https://issues.apache.org/jira/browse/SPARK-6565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Since 1.3.0, {{SQLContext.jsonRDD}} actually returns a {{DataFrame}}, so the original name has become confusing. It would be better to deprecate it and add {{jsonDataFrame}} or {{jsonDF}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6587: -- Description: (Don't know if this is a functionality bug, error reporting bug or an RFE ...) I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder(hello)), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable(things) val all = sqlContext.sql(SELECT * from things) {code} I get the following stack trace: {noformat} Exception in thread main scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at 
org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered.
[jira] [Commented] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695 ] Cheng Lian commented on SPARK-6587: --- This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...)
I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder(hello)), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable(things) val all = sqlContext.sql(SELECT * from things) {code} I get the following stack trace: {noformat} Exception in thread main scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
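A common workaround for the limitation behind SPARK-6587 is to collapse the sealed hierarchy into a single concrete record with one nullable field per variant, so every row has the same concrete type with statically known field types. A sketch of the idea in plain Python (class and field names are ours, chosen to mirror the Scala example above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThingRow:
    # One concrete type for every row; each former subclass of MyHolder
    # occupies its own nullable slot, so field types are statically known.
    key: int
    s: Optional[str] = None
    i: Optional[int] = None
    b: Optional[bool] = None

rows = [ThingRow(1, i=42), ThingRow(2, s="hello"), ThingRow(3, b=False)]
```

Because every element is the same concrete type, schema inference by reflection has a single, fully typed shape to work from, at the cost of carrying NULLs in the unused variant columns.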
[jira] [Comment Edited] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
[ https://issues.apache.org/jira/browse/SPARK-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385695#comment-14385695 ] Cheng Lian edited comment on SPARK-6587 at 3/29/15 10:32 AM: - This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting an {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. was (Author: lian cheng): This behavior is expected. There are two problems in your case: # Because {{things}} contains instances of all three case classes, the type of {{things}} is {{Seq[MyHolder]}}. Since {{MyHolder}} doesn't extend {{Product}}, it can't be recognized by {{ScalaReflection}}. # You can only use a single concrete case class {{T}} when converting an {{RDD[T]}} or {{Seq[T]}} to a DataFrame. For {{things}}, we can't figure out what data type the {{foo}} field in the reflected schema should have. Inferring schema for case class hierarchy fails with mysterious message --- Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...) 
I define the following hierarchy:
{code}
private abstract class MyHolder
private case class StringHolder(s: String) extends MyHolder
private case class IntHolder(i: Int) extends MyHolder
private case class BooleanHolder(b: Boolean) extends MyHolder
{code}
and a top level case class:
{code}
private case class Thing(key: Integer, foo: MyHolder)
{code}
When I try to convert it:
{code}
val things = Seq(
  Thing(1, IntHolder(42)),
  Thing(2, StringHolder("hello")),
  Thing(3, BooleanHolder(false))
)
val thingsDF = sc.parallelize(things, 4).toDF()
thingsDF.registerTempTable("things")
val all = sqlContext.sql("SELECT * from things")
{code}
I get the following stack trace: {noformat} Exception in thread "main" scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
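[Editor's note] Cheng Lian's explanation above says schema inference needs a single concrete case class, so a common way out is to flatten the variant hierarchy into one record type with a tag column plus one nullable column per variant. The sketch below (Python, with hypothetical names; the real fix would be a single Scala case class) shows that tagged-union flattening, not anything from the ticket itself:

```python
def flatten_holder(holder):
    """Flatten a sum-type 'holder' value into one record with a tag
    plus a nullable column per variant, so every row has one schema."""
    kind, value = holder  # e.g. ("int", 42)
    row = {"kind": kind, "s": None, "i": None, "b": None}
    field = {"str": "s", "int": "i", "bool": "b"}[kind]
    row[field] = value
    return row

# Analog of Seq(Thing(1, IntHolder(42)), ...) from the report.
things = [(1, ("int", 42)), (2, ("str", "hello")), (3, ("bool", False))]
rows = [{"key": k, **flatten_holder(h)} for k, h in things]
```

Every row now carries the same fixed set of columns, which is exactly the property the reflection-based schema inference requires.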
[jira] [Commented] (SPARK-6579) save as parquet with overwrite failed
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385694#comment-14385694 ] Cheng Lian commented on SPARK-6579: --- Here's another Parquet issue with Hadoop 1.0.4: SPARK-6581. save as parquet with overwrite failed - Key: SPARK-6579 URL: https://issues.apache.org/jira/browse/SPARK-6579 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical
{code}
df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str'])
df.save("test_data", source="parquet", mode='overwrite')
df.save("test_data", source="parquet", mode='overwrite')
{code}
It failed with: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call toBytes() more than once without calling reset() at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) at parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) at parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} Run it again, and it failed with: {code} 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:402) at
[jira] [Updated] (SPARK-6579) save as parquet with overwrite failed when linking with Hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6579: -- Summary: save as parquet with overwrite failed when linking with Hadoop 1.0.4 (was: save as parquet with overwrite failed) save as parquet with overwrite failed when linking with Hadoop 1.0.4 Key: SPARK-6579 URL: https://issues.apache.org/jira/browse/SPARK-6579 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical
{code}
df = sc.parallelize(xrange(n), 4).map(lambda x: (x, str(x) * 2,)).toDF(['int', 'str'])
df.save("test_data", source="parquet", mode='overwrite')
df.save("test_data", source="parquet", mode='overwrite')
{code}
It failed with: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 6, localhost): java.lang.IllegalArgumentException: You cannot call toBytes() more than once without calling reset() at parquet.Preconditions.checkArgument(Preconditions.java:47) at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.toBytes(RunLengthBitPackingHybridEncoder.java:254) at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.getBytes(RunLengthBitPackingHybridValuesWriter.java:68) at parquet.column.impl.ColumnWriterImpl.writePage(ColumnWriterImpl.java:147) at parquet.column.impl.ColumnWriterImpl.flush(ColumnWriterImpl.java:236) at parquet.column.impl.ColumnWriteStoreImpl.flush(ColumnWriteStoreImpl.java:113) at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:663) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:677) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1399) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1360) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) {code} Run it again, and it failed with: {code} 15/03/27 13:26:16 WARN FSInputChecker: Problem opening checksum file: file:/Users/davies/work/spark/tmp/test_data/_temporary/_attempt_201503271324_0011_r_03_0/part-r-4.parquet. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at
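[Editor's note] The {{IllegalArgumentException}} above comes from a writer-state guard in Parquet's RLE encoder: {{toBytes()}} may be called only once per page until {{reset()}} clears the state. A toy Python sketch of that one-shot precondition pattern (illustrative only, not Parquet's actual code):

```python
class OneShotEncoder:
    """Toy encoder mimicking the guard in the stack trace above:
    to_bytes() is one-shot until reset() clears state for the next page."""
    def __init__(self):
        self._values = []
        self._flushed = False

    def write(self, v):
        self._values.append(v)

    def to_bytes(self):
        # Same precondition parquet.Preconditions.checkArgument enforces.
        if self._flushed:
            raise ValueError(
                "You cannot call to_bytes() more than once without calling reset()")
        self._flushed = True
        return bytes(self._values)

    def reset(self):
        self._values = []
        self._flushed = False

enc = OneShotEncoder()
enc.write(1)
enc.write(2)
page = enc.to_bytes()  # first call succeeds
```

The overwrite path evidently drove the real encoder through {{toBytes()}} twice without an intervening {{reset()}}, tripping this guard.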
[jira] [Updated] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6581: -- Target Version/s: 1.4.0 Metadata is missing when saving parquet file using hadoop 1.0.4 --- Key: SPARK-6581 URL: https://issues.apache.org/jira/browse/SPARK-6581 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: hadoop 1.0.4 Reporter: Pei-Lun Lee When saving a parquet file with {code}df.save("foo", "parquet"){code} it generates only _common_metadata, while _metadata is missing: {noformat}
-rwxrwxrwx 1 peilunlee staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx 1 peilunlee staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx 1 peilunlee staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat} If saving with {code}df.save("foo", "parquet", SaveMode.Overwrite){code} both _metadata and _common_metadata are missing: {noformat}
-rwxrwxrwx 1 peilunlee staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx 1 peilunlee staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx 1 peilunlee staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
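[Editor's note] A compact way to characterize this bug in a regression test is to diff the output directory listing against the summary files Parquet is expected to emit. A small Python sketch (hypothetical helper, not from the ticket):

```python
# Summary files a Parquet write normally leaves alongside the part files.
EXPECTED_SUMMARY_FILES = {"_SUCCESS", "_metadata", "_common_metadata"}

def missing_summary_files(listing):
    """Return which Parquet summary files are absent from an output
    directory listing (data part files are ignored)."""
    return sorted(EXPECTED_SUMMARY_FILES - set(listing))

# Listing from the first save() in the report: _metadata is gone.
first_save = ["_SUCCESS", "_common_metadata",
              "part-r-1.parquet", "part-r-2.parquet"]
# Listing after the SaveMode.Overwrite save: both summary files are gone.
overwrite_save = ["_SUCCESS", "part-r-1.parquet", "part-r-2.parquet"]
```

Run against the two listings in the report, the helper reproduces exactly the two failure modes Pei-Lun Lee describes.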
[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet
[ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6570: -- Target Version/s: 1.4.0 Spark SQL arrays: explode() fails and cannot save array type to Parquet - Key: SPARK-6570 URL: https://issues.apache.org/jira/browse/SPARK-6570 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jon Chase
{code}
@Rule
public TemporaryFolder tmp = new TemporaryFolder();

@Test
public void testPercentileWithExplode() throws Exception {
    StructType schema = DataTypes.createStructType(Lists.newArrayList(
        DataTypes.createStructField("col1", DataTypes.StringType, false),
        DataTypes.createStructField("col2s", DataTypes.createArrayType(DataTypes.IntegerType, true), true)
    ));

    JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
        RowFactory.create("test", new int[]{1, 2, 3})
    ));

    DataFrame df = sql.createDataFrame(rowRDD, schema);
    df.registerTempTable("df");
    df.printSchema();

    List<int[]> ints = sql.sql("select col2s from df").javaRDD()
        .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, ints.size());
    assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

    // fails: lateral view explode does not work:
    // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
        .map(row -> row.getInt(0)).collect();
    assertEquals(3, explodedInts.size());
    assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

    // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
    DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
    loadedDf.registerTempTable("loadedDf");

    List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
        .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, moreInts.size());
    assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
}
{code}
{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
	at org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
	at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
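[Editor's note] The {{ClassCastException: [I cannot be cast to scala.collection.Seq}} arises because {{Explode.eval}} expects a sequence type, while the row holds a raw Java primitive {{int[]}}. The failure mode can be sketched in Python (illustrative analog, not Spark's code), where a non-sequence column value triggers the same class of type error during explode:

```python
def explode(rows, col):
    """Emit one output row per element of a sequence-typed column,
    rejecting values that are not real sequences -- the analog of
    Explode.eval's failed cast to scala.collection.Seq."""
    out = []
    for row in rows:
        value = row[col]
        if not isinstance(value, (list, tuple)):
            raise TypeError(
                f"{type(value).__name__} cannot be cast to a sequence")
        for element in value:
            out.append({**row, col: element})
    return out

# A list-valued column explodes into one row per element.
ok = explode([{"col1": "test", "col2s": [1, 2, 3]}], "col2s")
```

Passing an opaque non-sequence value (the stand-in for {{int[]}}) raises instead of exploding, mirroring the stack trace above.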
[jira] [Commented] (SPARK-6450) Self joining query failure
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375343#comment-14375343 ] Cheng Lian commented on SPARK-6450: --- Ah, I see. Will handle this ASAP. [~chinnitv] Thanks for reporting this! Self joining query failure -- Key: SPARK-6450 URL: https://issues.apache.org/jira/browse/SPARK-6450 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Anand Mohan Tumuluri Assignee: Cheng Lian Priority: Blocker The below query was working fine till 1.3 commit 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd.(Yes it definitely works at this commit although this commit is completely unrelated) It got broken in 1.3.0 release with an AnalysisException: resolved attributes ... missing from (although this list contains the fields which it reports missing) {code} at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} {code} select Orders.Country, Orders.ProductCategory,count(1) from Orders join (select Orders.Country, count(1) CountryOrderCount from Orders where to_date(Orders.PlacedDate) '2015-01-01' group by Orders.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = Orders.Country where to_date(Orders.PlacedDate) '2015-01-01' group by Orders.Country,Orders.ProductCategory; {code} The temporary workaround is to add explicit alias for the table Orders {code} select o.Country, o.ProductCategory,count(1) from Orders o join (select r.Country, count(1) CountryOrderCount from Orders r where to_date(r.PlacedDate) '2015-01-01' group by r.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = o.Country where to_date(o.PlacedDate) '2015-01-01' group by 
o.Country,o.ProductCategory; {code} However this change not only affects self joins; it also seems to affect union queries. The below query, which was again working before (commit 9a151ce) but got broken, {code} select Orders.Country, null, count(1) OrderCount from Orders group by Orders.Country, null union all select null, Orders.ProductCategory, count(1) OrderCount from Orders group by null, Orders.ProductCategory {code} also fails with an AnalysisException. The workaround is to add different aliases for the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
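[Editor's note] The reported workaround is purely about naming: when both sides of a self-join expose the same attribute names, the analyzer needs distinct aliases to resolve each reference to one side. A toy Python sketch of that resolution rule (not Catalyst's implementation; names are hypothetical):

```python
def resolve(column, scopes):
    """Resolve 'alias.column' or a bare 'column' against named scopes.
    A bare name visible in several scopes is ambiguous -- the situation
    explicit aliases like 'o' and 'r' in the workaround avoid."""
    if "." in column:
        alias, name = column.split(".", 1)
        if name in scopes.get(alias, ()):
            return (alias, name)
        raise LookupError(f"unresolved attribute {column}")
    hits = [a for a, cols in scopes.items() if column in cols]
    if len(hits) != 1:
        raise LookupError(f"ambiguous or missing attribute {column}: {hits}")
    return (hits[0], column)

# Both scopes expose 'Country', as in the self-join from the report.
scopes = {"o": {"Country", "PlacedDate"}, "Top5Countries": {"Country"}}
```

A qualified reference like {{o.Country}} resolves cleanly, while the bare {{Country}} is ambiguous once the same relation appears twice, which is why distinct aliases fix the query.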
[jira] [Resolved] (SPARK-4985) Parquet support for date type
[ https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4985. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 3822 [https://github.com/apache/spark/pull/3822] Parquet support for date type - Key: SPARK-4985 URL: https://issues.apache.org/jira/browse/SPARK-4985 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Fix For: 1.3.1, 1.4.0 Parquet serde support for DATE type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4985) Parquet support for date type
[ https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4985: -- Assignee: Adrian Wang Parquet support for date type - Key: SPARK-4985 URL: https://issues.apache.org/jira/browse/SPARK-4985 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.1, 1.4.0 Parquet serde support for DATE type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4985) Parquet support for date type
[ https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4985: -- Target Version/s: 1.3.1, 1.4.0 Parquet support for date type - Key: SPARK-4985 URL: https://issues.apache.org/jira/browse/SPARK-4985 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.1, 1.4.0 Parquet serde support for DATE type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6020) Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344602#comment-14344602 ] Cheng Lian commented on SPARK-6020: --- Hey [~andrewor14], I think [PR #4835|https://github.com/apache/spark/pull/4835] has already fixed this issue. {{InMemoryColumnarTableScan}} uses accumulators to generate debugging information for testing purposes. Test failures related to this JIRA ticket showed that accumulator updates got lost nondeterministically, which was also exactly what PR #4835 fixed. Also, according to [your amazing statistics sheets|https://docs.google.com/spreadsheets/d/1VSCTXLBqnglk0XMd0R4IhvUPb2MEDQUaRNwxcHBtSlk/edit#gid=52877182], this test suite hadn't been flaky for a week. Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite - Key: SPARK-6020 URL: https://issues.apache.org/jira/browse/SPARK-6020 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Cheng Lian Priority: Critical Observed in the following builds, only one of which has something to do with SQL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27931/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27930/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27929/ org.apache.spark.sql.columnar.PartitionBatchPruningSuite.SELECT key FROM pruningData WHERE NOT (key IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)) {code} Error Message 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], 
MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == Stacktrace sbt.ForkMain$ForkError: 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan 
[key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at
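[Editor's note] The suite under discussion counts how many in-memory batches survive pruning: each cached batch keeps per-column min/max statistics, and a batch is skipped when the predicate is provably false over its whole value range. A hedged Python sketch of that pruning decision for the {{NOT key IN (...)}} case in the failing test (illustrative, not the {{InMemoryColumnarTableScan}} code):

```python
def prune_not_in(batches, excluded):
    """Keep only batches that might contain a key outside `excluded`.
    A batch of integer keys spanning [lo, hi] is skipped when every
    value in that range is excluded -- the NOT key IN (...) case."""
    kept = []
    for lo, hi in batches:
        if not all(k in excluded for k in range(lo, hi + 1)):
            kept.append((lo, hi))
    return kept

# Five batches of ten keys each; NOT key IN (1..30) rules out the
# first three batches entirely, so only two should be read.
batches = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]
kept = prune_not_in(batches, set(range(1, 31)))
```

The test failure ("8 did not equal 10") was about this read-batch count drifting, which PR #4835's accumulator fix addressed rather than the pruning logic itself.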
[jira] [Created] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities
Cheng Lian created SPARK-6136: - Summary: Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities Key: SPARK-6136 URL: https://issues.apache.org/jira/browse/SPARK-6136 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Integration test suites in the JDBC data sources ({{MySQLIntegration}} and {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary compatibility issues: {code} $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1 ... sql/test-only *.ParquetDataSourceOffIOSuite ... [info] ParquetDataSourceOffIOSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 milliseconds) [info] java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat [info] at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261) [info] at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437) [info] at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at 
scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67) [info] at 
org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at
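A common way to mitigate this kind of transitive-dependency clash is to exclude Guava from docker-client in the build definition, so that the Guava version Hadoop 2.4 was compiled against stays on the test classpath. A hypothetical sbt sketch only; the artifact coordinates are assumptions, and the actual resolution adopted in Spark may differ:

{code}
// build.sbt sketch (hypothetical): keep docker-client for the JDBC integration
// suites, but exclude its transitive Guava 17.0 so the Guava version that
// Hadoop expects remains in use at test time.
libraryDependencies += ("com.spotify" % "docker-client" % "2.7.5" % "test")
  .exclude("com.google.guava", "guava")
{code}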
[jira] [Updated] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
[ https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5707: -- Description: Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} SQL: 
{code:sql} INSERT INTO TABLE ${hiveconf:TEMP_TABLE} SELECT s_store_name, pr_review_date, pr_review_content FROM ( --select store_name for stores with flat or declining sales in 3 consecutive months. SELECT s_store_name FROM store s JOIN ( -- linear regression part SELECT temp.cat AS cat, --SUM(temp.x)as sumX, --SUM(temp.y)as sumY, --SUM(temp.xy)as sumXY, --SUM(temp.xx)as sumXSquared, --count(temp.x) as N, --N * sumXY - sumX * sumY AS numerator, --N * sumXSquared - sumX*sumX AS denom
[jira] [Updated] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities
[ https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6136: -- Description: Integration test suites in the JDBC data source ({{MySQLIntegration}} and {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary incompatibility issues when Spark is compiled against Hadoop 2.4. {code} $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1 ... sql/test-only *.ParquetDataSourceOffIOSuite ... [info] ParquetDataSourceOffIOSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 milliseconds) [info] java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.init()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat [info] at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261) [info] at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437) [info] at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105) [info] at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at
[jira] [Resolved] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5775. --- Resolution: Fixed Fix Version/s: 1.3.0 GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table -- Key: SPARK-5775 URL: https://issues.apache.org/jira/browse/SPARK-5775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Ayoub Benali Assignee: Cheng Lian Priority: Blocker Labels: hivecontext, nested, parquet, partition Fix For: 1.3.0 Using the LOAD SQL command in Hive context to load Parquet files into a partitioned table causes exceptions during query time. The bug requires the table to have a column of *type Array of struct* and to be *partitioned*. The example below shows how to reproduce the bug, and you can see that if the table is not partitioned the query works fine. {noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with curMem=296908, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block
[jira] [Resolved] (SPARK-6073) Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect
[ https://issues.apache.org/jira/browse/SPARK-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6073. --- Resolution: Fixed Fix Version/s: 1.3.0 Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect -- Key: SPARK-6073 URL: https://issues.apache.org/jira/browse/SPARK-6073 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 We should drop the metadata cache in CreateMetastoreDataSourceAsSelect after we append data. Otherwise, users have to manually call HiveContext.refreshTable to drop the cached metadata entry from the catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
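The manual workaround mentioned above can be sketched as follows. This is illustrative only: the API names are from the Spark 1.3-era DataFrame/HiveContext interfaces, and the DataFrame {{df}} and table name {{my_table}} are hypothetical:

{code}
import org.apache.spark.sql.SaveMode

// Until the fix, appending through the data source path leaves a stale
// cached metadata entry in the catalog, so refresh the table manually
// before querying it again.
df.saveAsTable("my_table", SaveMode.Append)   // hypothetical table name
hiveContext.refreshTable("my_table")          // drops the cached metadata entry
{code}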
[jira] [Created] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0
Cheng Lian created SPARK-6109: - Summary: Unit tests fail when compiled against Hive 0.12.0 Key: SPARK-6109 URL: https://issues.apache.org/jira/browse/SPARK-6109 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several Hive 0.13.1-specific test cases always fail against Hive 0.12.0. Need to blacklist them.
[jira] [Resolved] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true
[ https://issues.apache.org/jira/browse/SPARK-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6052. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4806 [https://github.com/apache/spark/pull/4806] In JSON schema inference, we should always set containsNull of an ArrayType to true --- Key: SPARK-6052 URL: https://issues.apache.org/jira/browse/SPARK-6052 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 We should not try to infer whether an array actually contains null: sampling may miss the arrays that do contain null, and future data may introduce nulls even when the current data has none.
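The rule described above can be sketched with the Spark 1.3 {{org.apache.spark.sql.types}} API (illustrative only; the element type shown is a placeholder):

{code}
import org.apache.spark.sql.types._

// Sketch: regardless of whether the sampled JSON arrays happened to contain
// nulls, schema inference should always produce containsNull = true.
val inferredElementType = IntegerType
val inferredArrayType = ArrayType(inferredElementType, containsNull = true)
{code}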
[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver
[ https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6153: -- Assignee: Adrian Wang intellij import from maven cannot debug sparksqlclidriver - Key: SPARK-6153 URL: https://issues.apache.org/jira/browse/SPARK-6153 Project: Spark Issue Type: Improvement Components: Deploy, SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 The {{hive-thriftserver}} module depends on Guava indirectly via the {{hive}} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This prevents developers from running the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. We should promote the Guava dependency scope to {{runtime}} for the {{hive-thriftserver}} module.
[jira] [Resolved] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver
[ https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6153. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4884 [https://github.com/apache/spark/pull/4884] intellij import from maven cannot debug sparksqlclidriver - Key: SPARK-6153 URL: https://issues.apache.org/jira/browse/SPARK-6153 Project: Spark Issue Type: Improvement Components: Deploy, SQL Reporter: Adrian Wang Priority: Minor Fix For: 1.3.0 The {{hive-thriftserver}} module depends on Guava indirectly via the {{hive}} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This prevents developers from running the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. We should promote the Guava dependency scope to {{runtime}} for the {{hive-thriftserver}} module.
[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver
[ https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6153: -- Description: The {{hive-thriftserver}} module depends on Guava indirectly via the {{hive}} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This prevents developers from running the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. We should promote the Guava dependency scope to {{runtime}} for the {{hive-thriftserver}} module. (was: The {{hive-thriftserver}} module depends on Guava indirectly via {{hive]} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This makes developers not able to run the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope to {{runtime}} for {{hive-thriftserver}} module.) intellij import from maven cannot debug sparksqlclidriver - Key: SPARK-6153 URL: https://issues.apache.org/jira/browse/SPARK-6153 Project: Spark Issue Type: Improvement Components: Deploy, SQL Reporter: Adrian Wang Priority: Minor Fix For: 1.3.0 The {{hive-thriftserver}} module depends on Guava indirectly via the {{hive}} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This prevents developers from running the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. We should promote the Guava dependency scope to {{runtime}} for the {{hive-thriftserver}} module.
[jira] [Updated] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver
[ https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6153: -- Description: The {{hive-thriftserver}} module depends on Guava indirectly via {{hive]} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This makes developers not able to run the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope to {{runtime}} for {{hive-thriftserver}} module. (was: we need to promote guava dependency to a proper level manually.) intellij import from maven cannot debug sparksqlclidriver - Key: SPARK-6153 URL: https://issues.apache.org/jira/browse/SPARK-6153 Project: Spark Issue Type: Improvement Components: Deploy, SQL Reporter: Adrian Wang Priority: Minor The {{hive-thriftserver}} module depends on Guava indirectly via {{hive]} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This makes developers not able to run the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. Should promote Guava dependency scope to {{runtime}} for {{hive-thriftserver}} module.
[jira] [Commented] (SPARK-6153) intellij import from maven cannot debug sparksqlclidriver
[ https://issues.apache.org/jira/browse/SPARK-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348407#comment-14348407 ] Cheng Lian commented on SPARK-6153: --- Updated JIRA description to reflect the actual change made in PR 4884. intellij import from maven cannot debug sparksqlclidriver - Key: SPARK-6153 URL: https://issues.apache.org/jira/browse/SPARK-6153 Project: Spark Issue Type: Improvement Components: Deploy, SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 The {{hive-thriftserver}} module depends on Guava indirectly via the {{hive}} module. However, the scope of Guava was explicitly set to {{provided}} in the root POM. This prevents developers from running the Spark SQL CLI tool within IntelliJ IDEA for debugging purposes. We should promote the Guava dependency scope to {{runtime}} for the {{hive-thriftserver}} module.
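The scope change described above can be expressed as an sbt-style sketch. This is illustrative only: the actual change in PR 4884 was made in the {{hive-thriftserver}} Maven POM rather than in sbt, and the Guava version shown is an assumption:

{code}
// Promote Guava from "provided" to "runtime" so that SparkSQLCLIDriver can be
// launched from within the IDE with Guava present on the runtime classpath.
libraryDependencies += "com.google.guava" % "guava" % "14.0.1" % "runtime"
{code}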
[jira] [Created] (SPARK-6147) Move JDBC data source integration tests to the Spark integration tests project
Cheng Lian created SPARK-6147: - Summary: Move JDBC data source integration tests to the Spark integration tests project Key: SPARK-6147 URL: https://issues.apache.org/jira/browse/SPARK-6147 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian In [PR #4872|https://github.com/apache/spark/pull/4872], we removed JDBC integration tests from Spark because of Guava dependency hell. These two test suites should be moved to the [Spark integration tests project|github.com/databricks/spark-integration-tests] because that's where we run complex / time-consuming integration tests, and that project is already well Dockerized.
[jira] [Created] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 because of runtime incompatibility issues caused by Guava (15?)
Cheng Lian created SPARK-6149: - Summary: Spark SQL CLI doesn't work when compiled against Hive 12 because of runtime incompatibility issues caused by Guava (15?) Key: SPARK-6149 URL: https://issues.apache.org/jira/browse/SPARK-6149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following description is based on [a recent master revision|https://github.com/apache/spark/tree/159b24a1e47e4fa8118e4b81049fbc7bc3406433]. {noformat} $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1 clean assembly/assembly ... $ ./bin/spark-sql ... spark-sql CREATE TABLE hive_test(key INT, value STRING); 15/03/03 21:28:08 ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:602) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3661) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:308) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280) at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:37) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:117) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:62) at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:596) ... 34 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at
[jira] [Commented] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15
[ https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346414#comment-14346414 ] Cheng Lian commented on SPARK-6149: --- Pointed out by [~pwendell], this is a Maven-vs-SBT issue. In Maven, Guava 15 is properly shaded and Guava 14.0.1 is used. However, SBT still chooses the highest version. Verified that the Maven build is not affected. Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15 -- Key: SPARK-6149 URL: https://issues.apache.org/jira/browse/SPARK-6149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker
[jira] [Updated] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava (15?)
[ https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6149: -- Summary: Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava (15?) (was: Spark SQL CLI doesn't work when compiled against Hive 12 because of runtime incompatibility issues caused by Guava (15?) ) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava (15?) -- Key: SPARK-6149 URL: https://issues.apache.org/jira/browse/SPARK-6149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker
[jira] [Updated] (SPARK-6149) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15
[ https://issues.apache.org/jira/browse/SPARK-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6149: -- Summary: Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15 (was: Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava (15?) ) Spark SQL CLI doesn't work when compiled against Hive 12 with SBT because of runtime incompatibility issues caused by Guava 15 -- Key: SPARK-6149 URL: https://issues.apache.org/jira/browse/SPARK-6149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker
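The Maven-vs-SBT gap described in the comment above comes down to conflict resolution: SBT's Ivy resolver keeps the highest Guava version it sees, while the Maven build shades Guava and keeps 14.0.1. A sketch of what pinning Guava back down looks like in an sbt build file; this is illustrative only and is not necessarily how Spark's build resolved the ticket:

```scala
// build.sbt (sketch). Ivy's default conflict manager takes the highest
// version, so a transitive Guava 15 wins over Hadoop's Guava 14.0.1.
// dependencyOverrides forces the resolved version back down without
// touching the dependency graph itself.
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
```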
[jira] [Resolved] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities
[ https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6136. --- Resolution: Fixed Fix Version/s: 1.3.0 Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities -- Key: SPARK-6136 URL: https://issues.apache.org/jira/browse/SPARK-6136 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 Integration test suites in the JDBC data source ({{MySQLIntegration}} and {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary incompatibility issues when Spark is compiled against Hadoop 2.4. {code} $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1 ... sql/test-only *.ParquetDataSourceOffIOSuite ... [info] ParquetDataSourceOffIOSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 milliseconds) [info] java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.init()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat [info] at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261) [info] at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437) [info] at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67) 
[info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83)
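Since docker-client only backs the integration test suites, one way to keep Guava 17.0 off the classpath is an explicit exclusion on that dependency. A sketch in sbt syntax, using the coordinates from the ticket; the exclusion itself is illustrative, not necessarily the committed fix:

```scala
// build.sbt (sketch): pull in docker-client for tests only, but drop its
// transitive Guava 17.0 so the Hadoop-compatible Guava 14.0.1 remains the
// version on the runtime classpath.
libraryDependencies += ("com.spotify" % "docker-client" % "2.7.5" % "test")
  .exclude("com.google.guava", "guava")
```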
[jira] [Resolved] (SPARK-6134) Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive
[ https://issues.apache.org/jira/browse/SPARK-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6134. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4870 [https://github.com/apache/spark/pull/4870] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive --- Key: SPARK-6134 URL: https://issues.apache.org/jira/browse/SPARK-6134 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Fix For: 1.3.0 In CodeGenerator, the casting on FloatType should use FloatType instead of IntegerType. Besides, defaultPrimitive for LongType should be -1L instead of 1L.
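The defaultPrimitive half of the fix is easy to state as code. A standalone sketch follows; the DataType objects stand in for Spark SQL's, and this defaultPrimitive is illustrative rather than the actual CodeGenerator method:

```scala
sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object FloatType extends DataType

// Sentinel literals emitted by generated code for "no value yet".
// SPARK-6134: LongType must produce "-1L"; the buggy version emitted "1L",
// a positive value that collides with legitimate data.
def defaultPrimitive(dt: DataType): String = dt match {
  case IntegerType => "-1"
  case LongType    => "-1L"   // was "1L" before the fix
  case FloatType   => "-1.0f" // illustrative sentinel for floats
}
```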
[jira] [Updated] (SPARK-5938) Generate row from json efficiently
[ https://issues.apache.org/jira/browse/SPARK-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5938: -- Assignee: Liang-Chi Hsieh Generate row from json efficiently -- Key: SPARK-5938 URL: https://issues.apache.org/jira/browse/SPARK-5938 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Generate row from json efficiently in JsonRDD object.
[jira] [Updated] (SPARK-6136) Docker client library introduces Guava 17.0, which causes runtime binary incompatibilities
[ https://issues.apache.org/jira/browse/SPARK-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6136: -- Description: Integration test suites in the JDBC data source ({{MySQLIntegration}} and {{PostgresIntegration}}) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing runtime binary compatibility issues {code} $ ./build/sbt -Pyarn,hadoop-2.4,hive,hive-0.12.0,scala-2.10 -Dhadoop.version=2.4.1 ... sql/test-only *.ParquetDataSourceOffIOSuite ... [info] ParquetDataSourceOffIOSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.parquet.ParquetDataSourceOffIOSuite *** ABORTED *** (134 milliseconds) [info] java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.init()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat [info] at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:261) [info] at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:277) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:437) [info] at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1525) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:797) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$checkParquetFile$1.apply(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetDataFrame$1.apply(ParquetTest.scala:105) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:94) [info] at org.apache.spark.sql.parquet.ParquetTest$$anonfun$withParquetFile$1.apply(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempPath(ParquetTest.scala:71) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withTempPath(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetFile(ParquetTest.scala:92) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetFile(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withParquetDataFrame(ParquetTest.scala:105) [info] at 
org.apache.spark.sql.parquet.ParquetIOSuiteBase.withParquetDataFrame(ParquetIOSuite.scala:67) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase.checkParquetFile(ParquetIOSuite.scala:76) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply$mcV$sp(ParquetIOSuite.scala:83) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.apache.spark.sql.parquet.ParquetIOSuiteBase$$anonfun$1.apply(ParquetIOSuite.scala:79) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at
[jira] [Created] (SPARK-5947) First class partitioning support in data sources API
Cheng Lian created SPARK-5947: - Summary: First class partitioning support in data sources API Key: SPARK-5947 URL: https://issues.apache.org/jira/browse/SPARK-5947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian For file system based data sources, implementing Hive-style partitioning support can be complex and error-prone. To be specific, partitioning support includes:
# Partition discovery: Given a directory organized similarly to Hive partitions, discover the directory structure and partitioning information automatically, including partition column names, data types, and values.
# Reading from partitioned tables
# Writing to partitioned tables
It would be good to have first class partitioning support in the data sources API. For example, add a {{FileBasedScan}} trait with callbacks and default implementations for these features.
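Partition discovery (item 1 above) is essentially path parsing: each directory segment shaped like col=value contributes one partition column. A minimal standalone sketch of the idea; the helper name and behavior are hypothetical, not the API this ticket proposes, and type inference over the raw values would be a separate step:

```scala
// Recover (columnName, rawValue) pairs from a Hive-style partitioned path,
// e.g. /data/t/pi=1/ps=foo/part-00000.parquet yields ("pi","1"), ("ps","foo").
def parsePartitionPath(path: String): Seq[(String, String)] =
  path.split('/').toSeq
    .filter(_.contains('='))          // keep only col=value segments
    .map { segment =>
      val Array(col, value) = segment.split("=", 2)
      col -> value
    }
```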
[jira] [Created] (SPARK-5948) Support writing to partitioned table for the Parquet data source
Cheng Lian created SPARK-5948: - Summary: Support writing to partitioned table for the Parquet data source Key: SPARK-5948 URL: https://issues.apache.org/jira/browse/SPARK-5948 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian In 1.3.0, we added support for reading partitioned tables declared in the Hive metastore for the Parquet data source. However, writing to partitioned tables is not supported yet. This feature should probably be built upon SPARK-5947.
[jira] [Created] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas
Cheng Lian created SPARK-6010: - Summary: Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas Key: SPARK-6010 URL: https://issues.apache.org/jira/browse/SPARK-6010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following test case added in {{ParquetPartitionDiscoverySuite}} can be used to reproduce this issue: {code}
test("read partitioned table - merging compatible schemas") {
  withTempDir { base =>
    makeParquetFile(
      (1 to 10).map(i => Tuple1(i)).toDF("intField"),
      makePartitionDir(base, defaultPartitionName, "pi" -> 1))

    makeParquetFile(
      (1 to 10).map(i => (i, i.toString)).toDF("intField", "stringField"),
      makePartitionDir(base, defaultPartitionName, "pi" -> 2))

    load(base.getCanonicalPath, "org.apache.spark.sql.parquet").registerTempTable("t")

    withTempTable("t") {
      checkAnswer(
        sql("SELECT * FROM t"),
        (1 to 10).map(i => Row(i, null, 1)) ++ (1 to 10).map(i => Row(i, i.toString, 2)))
    }
  }
}
{code} Exception thrown: {code} [info] java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{"type":"struct","fields":[{"name":"intField","type":"integer","nullable":false,"metadata":{}},{"name":"stringField","type":"string","nullable":true,"metadata":{}}]}, {"type":"struct","fields":[{"name":"intField","type":"integer","nullable":false,"metadata":{}}]}] [info] at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) [info] at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484) [info] at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) [info] at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
[info]   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518)
[info]   at org.apache.spark.rdd.RDD.collect(RDD.scala:813)
[info]   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
[info]   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790)
[info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337)
[info]   at org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325)
[info]   at org.apache.spark.sql.parquet.ParquetTest$class.withTempDir(ParquetTest.scala:82)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempDir(ParquetPartitionDiscoverySuite.scala:35)
[info]   at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:325)
[info]   at
[jira] [Updated] (SPARK-5968) Parquet warning in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5968: -- Description: This may happen in the case of schema evolution, namely appending new Parquet data with a different but compatible schema to existing Parquet files:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: all the files must be contained in the root rankings
	at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
	at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
	at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}
The reason is that the Spark SQL schemas stored in the Parquet key-value metadata differ. Parquet doesn't know how to merge these opaque user-defined metadata, so it just throws an exception and gives up writing summary files. Since the Parquet data source in Spark 1.3.0 supports schema merging, this is harmless, but it is kind of scary for the user. We should try to suppress this through the logger. was:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: all the files must be contained in the root rankings
	at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
	at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
	at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}
it is only a warning, but kind of scary for the user. We should try to suppress this through the logger. 
Parquet warning in spark-shell -- Key: SPARK-5968 URL: https://issues.apache.org/jira/browse/SPARK-5968 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical This may happen in the case of schema evolution, namely appending new Parquet data with a different but compatible schema to existing Parquet files:
{code}
15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file for rankings
parquet.io.ParquetEncodingException: file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet invalid: all the files must be contained in the root rankings
	at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
	at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
	at parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
{code}
The reason is that the Spark SQL schemas stored in the Parquet key-value metadata differ. Parquet doesn't know how to merge these opaque user-defined metadata, so it just throws an exception and gives up writing summary files. Since the Parquet data source in Spark 1.3.0 supports schema merging, this is harmless, but it is kind of scary for the user. We should try to suppress this through the logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
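Until the warning is suppressed inside Spark itself, it can be silenced from the user side through logging configuration, in the spirit the ticket suggests. A minimal sketch for Spark's default log4j setup, assuming the message is emitted under the `parquet.hadoop.ParquetOutputCommitter` logger named in the log line above:

```properties
# conf/log4j.properties -- raise the threshold for the Parquet output committer
# logger so the harmless "could not write summary file" warning is not printed.
log4j.logger.parquet.hadoop.ParquetOutputCommitter=ERROR
```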
[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335082#comment-14335082 ] Cheng Lian commented on SPARK-3850: --- Actually that was a Scala source file, where we put query strings and generate an MD5 hash from the content of those query strings. But I agree that was a really rare case. Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas Priority: Minor Background discussions: * https://github.com/apache/spark/pull/2619 * http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html If you look at [the PR Cheng opened|https://github.com/apache/spark/pull/2619], you'll see that a trailing white space seemed to mess up some SQL test. That's what spurred the creation of this issue. [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using this [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
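For reference, the {{WhitespaceEndOfLineChecker}} Ted Yu pointed to is enabled by adding a rule to the project's {{scalastyle-config.xml}}; a sketch based on the Scalastyle rules page linked above (the {{level}} value is a project choice):

```xml
<!-- Disallow trailing whitespace at the end of lines. -->
<check level="error" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="true"/>
```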
[jira] [Resolved] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6016. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4775 [https://github.com/apache/spark/pull/4775] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 saveAsTable is fine, and it seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown:
{code}
Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values:
	at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
	at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
	at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461)
	...
{code}
If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try...
{code}
import org.apache.spark.sql.SaveMode
import sqlContext._

sql("drop table if exists test")

val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"a":$i}"""), 2)) // we will save to 2 parquet files.
df1.saveAsTable("test", "parquet", SaveMode.Overwrite)
sql("select * from test").collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache

val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"b":$i}"""), 4)) // we will save to 4 parquet files.
df2.saveAsTable("test", "parquet", SaveMode.Overwrite)
sql("select * from test").collect.foreach(println)
{code}
For this example, we have two outdated footers for df1 in footerCache, and since we have four parquet files for the new test table, we pick up 2 new footers for df2. Then, we hit the bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
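The {{spark.sql.parquet.cacheMetadata}} flag named in the title doubles as a workaround until the fix is picked up; a sketch (spark-shell, Spark 1.3):

```scala
// Workaround sketch: disable the Parquet footer/metadata cache so stale
// footers are not reused after the table has been overwritten.
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")
sql("select * from test").collect.foreach(println)
```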
[jira] [Resolved] (SPARK-6023) ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2
[ https://issues.apache.org/jira/browse/SPARK-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6023. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4782 [https://github.com/apache/spark/pull/4782] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2 Key: SPARK-6023 URL: https://issues.apache.org/jira/browse/SPARK-6023 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0
{code}
import sqlContext._

sql("drop table if exists test")
val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i => s"""{"a":$i}""")))
df1.registerTempTable("jt")
sql("create table test (a bigint) stored as parquet")
sql("explain insert into table test select a from jt").collect.foreach(println)
{code}
The plan will be
{code}
[== Physical Plan ==]
[InsertIntoHiveTable (MetastoreRelation default, test, None), Map(), false]
[ PhysicalRDD [a#34L], MapPartitionsRDD[17] at map at JsonRDD.scala:41]
{code}
However, the write path should be converted to our own data source path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339010#comment-14339010 ] Cheng Lian commented on SPARK-5775: --- Hey [~avignon], sorry for the delay. I've left comments on the PR page. Thanks a lot for working on this! GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table -- Key: SPARK-5775 URL: https://issues.apache.org/jira/browse/SPARK-5775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Ayoub Benali Assignee: Cheng Lian Priority: Blocker Labels: hivecontext, nested, parquet, partition Using the LOAD sql command in Hive context to load parquet files into a partitioned table causes exceptions during query time. The bug requires the table to have a column of *type Array of struct* and to be *partitioned*. The example below shows how to reproduce the bug, and you can see that if the table is not partitioned the query works fine.
{noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with
[jira] [Resolved] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out
[ https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5751. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4720 [https://github.com/apache/spark/pull/4720] Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out -- Key: SPARK-5751 URL: https://issues.apache.org/jira/browse/SPARK-5751 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Labels: flaky-test Fix For: 1.3.0 The "Test JDBC query execution" test case times out occasionally; all other test cases are just fine. The failure output only contains the service startup command line, without any log output. Presumably the test case somehow misses the log file path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6037) Avoiding duplicate Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6037: -- Assignee: Liang-Chi Hsieh Avoiding duplicate Parquet schema merging - Key: SPARK-6037 URL: https://issues.apache.org/jira/browse/SPARK-6037 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor FilteringParquetRowInputFormat manually merges Parquet schemas before computing splits. However, this is duplicate work, because the schemas are already merged in ParquetRelation2; we don't need to re-merge them in the InputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6037) Avoiding duplicate Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6037. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4786 [https://github.com/apache/spark/pull/4786] Avoiding duplicate Parquet schema merging - Key: SPARK-6037 URL: https://issues.apache.org/jira/browse/SPARK-6037 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 FilteringParquetRowInputFormat manually merges Parquet schemas before computing splits. However, this is duplicate work, because the schemas are already merged in ParquetRelation2; we don't need to re-merge them in the InputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6572) When I build Spark 1.3 sbt gives me to following error : unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1
[ https://issues.apache.org/jira/browse/SPARK-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383999#comment-14383999 ] Cheng Lian commented on SPARK-6572: --- Would you please provide the exact command line you used to invoke SBT? When I build Spark 1.3 sbt gives me to following error: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.1.1: not found org.scalamacros#quasiquotes_2.11;2.0.1: not found [error] Total time: 27 s, completed 27-Mar-2015 14:24:39 Key: SPARK-6572 URL: https://issues.apache.org/jira/browse/SPARK-6572 Project: Spark Issue Type: Bug Reporter: Frank Domoney -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6575: -- Description: Consider a metastore Parquet table that # doesn't have schema evolution issue # has lots of data files and/or partitions In this case, driver schema merging can be both slow and unnecessary. Would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. Add configuration to disable schema merging while converting metastore Parquet tables - Key: SPARK-6575 URL: https://issues.apache.org/jira/browse/SPARK-6575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Consider a metastore Parquet table that # doesn't have schema evolution issue # has lots of data files and/or partitions In this case, driver schema merging can be both slow and unnecessary. Would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6575: -- Description: Consider a metastore Parquet table that # doesn't have schema evolution issue # has lots of data files and/or partitions In this case, driver schema merging can be both slow and unnecessary. Would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. was: Consider a metastore Parquet table that # doesn't have schema evolution issue # has lots of data files and/or partitions In this case, driver schema merging can be both slow and unnecessary. Would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. Add configuration to disable schema merging while converting metastore Parquet tables - Key: SPARK-6575 URL: https://issues.apache.org/jira/browse/SPARK-6575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Consider a metastore Parquet table that # doesn't have schema evolution issue # has lots of data files and/or partitions In this case, driver schema merging can be both slow and unnecessary. Would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
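A sketch of how such a switch could look from user code; the configuration key below is hypothetical, since this ticket only proposes adding it:

```scala
// Hypothetical key -- the ticket asks for a configuration along these lines.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
val df = sqlContext.table("some_metastore_parquet_table") // assumed table name
```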
[jira] [Resolved] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow
[ https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6483. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5154 [https://github.com/apache/spark/pull/5154] Spark SQL udf(ScalaUdf) is very slow Key: SPARK-6483 URL: https://issues.apache.org/jira/browse/SPARK-6483 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0 Environment: 1. Spark version is 1.3.0 2. 3 nodes (80G/20C each) 3. read 250G parquet files from hdfs Reporter: zzc Assignee: zzc Fix For: 1.4.0 Test case: 1. register the floor func with sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run the sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, floor(ts)"; *it takes 17 minutes.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23500], [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS c2#23495L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54)
  Aggregate true, [chan#23015,scalaUDF(ts#23016)], [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS PartialSum#23499L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at map at newParquet.scala:562
{quote}
2. run the sql "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 group by chan, (ts - ts % 300)"; *it takes only 5 minutes.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23349], [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS c2#23344L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)
  Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map at newParquet.scala:562
{quote}
3. use *HiveContext* with the sql "select chan, floor((ts - ts % 300)) as tt, sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))"; *it takes only 5 minutes too.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23108L], [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS _c2#23103L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54)
  Aggregate true, [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300)))], [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS PartialSum#23107L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map at newParquet.scala:562
{quote}
*Why is ScalaUdf so slow? How can it be improved?* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow
[ https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6483: -- Assignee: zzc Spark SQL udf(ScalaUdf) is very slow Key: SPARK-6483 URL: https://issues.apache.org/jira/browse/SPARK-6483 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0 Environment: 1. Spark version is 1.3.0 2. 3 nodes (80G/20C each) 3. read 250G parquet files from hdfs Reporter: zzc Assignee: zzc Fix For: 1.4.0 Test case: 1. register the floor func with sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run the sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, floor(ts)"; *it takes 17 minutes.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23500], [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS c2#23495L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54)
  Aggregate true, [chan#23015,scalaUDF(ts#23016)], [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS PartialSum#23499L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at map at newParquet.scala:562
{quote}
2. run the sql "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 group by chan, (ts - ts % 300)"; *it takes only 5 minutes.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23349], [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS c2#23344L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)
  Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map at newParquet.scala:562
{quote}
3. use *HiveContext* with the sql "select chan, floor((ts - ts % 300)) as tt, sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))"; *it takes only 5 minutes too.*
{quote}
== Physical Plan ==
Aggregate false, [chan#23015,PartialGroup#23108L], [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS _c2#23103L]
 Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54)
  Aggregate true, [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300)))], [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS PartialSum#23107L]
   PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map at newParquet.scala:562
{quote}
*Why is ScalaUdf so slow? How can it be improved?* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6538) Add missing nullable Metastore fields when merging a Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6538. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5214 [https://github.com/apache/spark/pull/5214] Add missing nullable Metastore fields when merging a Parquet schema --- Key: SPARK-6538 URL: https://issues.apache.org/jira/browse/SPARK-6538 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Adam Budde Fix For: 1.3.1, 1.4.0 When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore. In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an ALTER TABLE... ADD PARTITION... statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema. In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. 
The mergeMetastoreParquetSchema() method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore. I propose altering this implementation in order to allow for any missing metastore fields marked as nullable to be merged in to the Parquet file's schema before continuing with the checks present in mergeMetastoreParquetSchema(). Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you feel this should be an improvement or new feature instead, please feel free to reclassify this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
Cheng Lian created SPARK-6467: - Summary: Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis Key: SPARK-6467 URL: https://issues.apache.org/jira/browse/SPARK-6467 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Description: Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. (was: Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis.) Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481 ] Cheng Lian edited comment on SPARK-6456 at 3/23/15 7:21 AM: How many partitions are there? Also, what's the version of the Hive metastore? For now, Spark SQL only supports Hive 0.12.0 and 0.13.1. Spark 1.1 and prior versions only support Hive 0.12.0. was (Author: lian cheng): How many partitions are there? Spark Sql throwing exception on large partitioned data -- Key: SPARK-6456 URL: https://issues.apache.org/jira/browse/SPARK-6456 Project: Spark Issue Type: Bug Components: SQL Reporter: pankaj Fix For: 1.2.1 Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Summary: Exclude virtual columns from QueryPlan.missingInput (was: Override QueryPlan.missingInput when necessary and rely on CheckAnalysis) Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.
[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Affects Version/s: 1.3.0 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.
[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6397: -- Assignee: Yadong Qi Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.
[jira] [Commented] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375680#comment-14375680 ] Cheng Lian commented on SPARK-6397: --- Hey [~smolav], after some discussion with [~waterman] in his PRs, we decided to fix the GROUPING__ID virtual column issue first. So I updated the title and description of this JIRA ticket, and created SPARK-6467 for the original one. You may link your PR to that one. Thanks! I should have created another JIRA ticket for the fix introduced in [~waterman]'s PR, but I realized the problem too late after merging it. Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}.
[jira] [Resolved] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6397. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5132 [https://github.com/apache/spark/pull/5132] Exclude virtual columns from QueryPlan.missingInput --- Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Priority: Minor Fix For: 1.3.1, 1.4.0 Virtual columns like GROUPING__ID should never be considered as missing input, and thus should be excluded from {{QueryPlan.missingInput}}.
[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6456: -- Description: Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} was: Observation: Spark connects with hive Metastore. i am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at
[jira] [Commented] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481 ] Cheng Lian commented on SPARK-6456: --- How many partitions are there? Spark Sql throwing exception on large partitioned data -- Key: SPARK-6456 URL: https://issues.apache.org/jira/browse/SPARK-6456 Project: Spark Issue Type: Bug Components: SQL Reporter: pankaj Fix For: 1.2.1 Spark connects with Hive Metastore. I am able to run simple queries like show table and select. but throws below exception while running query on the hive Table having large number of partitions. {noformat} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at`enter code here` org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5203. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4004 [https://github.com/apache/spark/pull/4004] union with different decimal type report error -- Key: SPARK-5203 URL: https://issues.apache.org/jira/browse/SPARK-5203 Project: Spark Issue Type: Bug Components: SQL Reporter: guowei Fix For: 1.4.0 Test case like this: {code:sql} create table test (a decimal(10,1)); select a from test union all select a*2 from test; {code} Exception thrown: {noformat} 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: 'Project [*] 'Subquery _u1 'Union Project [a#1] MetastoreRelation default, test, None Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0] MetastoreRelation default, test, None at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) {noformat} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
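The SPARK-5203 failure above boils down to finding a common decimal type for the two sides of the UNION: the plan shows `a` as decimal(10,1) while `a*2` was widened to DecimalType(21,1). A minimal Python sketch of the usual decimal-widening rule (keep the larger integral part and the larger scale) illustrates what the union type coercion needs to compute; the function name is an assumption for illustration, and this is not Spark's actual Scala coercion code.

```python
# Illustrative sketch of widening two decimal(p, s) types for UNION:
# the result keeps the larger integral width and the larger scale.
def widen_decimal(p1, s1, p2, s2):
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)  # digits left of the decimal point
    return (integral + scale, scale)
```

For the ticket's test case, widening decimal(10,1) against decimal(21,1) yields decimal(21,1), which both branches can then be cast to.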
[jira] [Resolved] (SPARK-4521) Parquet fails to read columns with spaces in the name
[ https://issues.apache.org/jira/browse/SPARK-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4521. --- Resolution: Done This ticket is covered by SPARK-6607. Parquet fails to read columns with spaces in the name - Key: SPARK-4521 URL: https://issues.apache.org/jira/browse/SPARK-4521 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust I think this is actually a bug in parquet, but it would be good to track it here as well. To reproduce: {code} jsonRDD(sparkContext.parallelize("""{"number of clusters": 1}""" :: Nil)).saveAsParquetFile("test") parquetFile("test").collect() {code} {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 13, localhost): java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'of' at line 1: optional int32 number of at parquet.schema.MessageTypeParser.check(MessageTypeParser.java:209) at parquet.schema.MessageTypeParser.addPrimitiveType(MessageTypeParser.java:182) at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:108) at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96) at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89) at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79) at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:189) at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:135) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) {code}
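The stack trace above shows why the read fails: Parquet serializes its schema as whitespace-delimited text ("optional int32 number of clusters;"), so a space inside a field name breaks MessageTypeParser on read-back. Until the underlying parser issue is fixed, one defensive option is to sanitize column names before writing. The sketch below is an illustrative Python workaround under that assumption; the function name and the exact character set it rewrites are choices for illustration, not part of Spark or Parquet.

```python
# Illustrative sketch: rewrite characters that Parquet's text schema
# parser cannot round-trip (whitespace and schema delimiters).
import re

def sanitize_column_name(name):
    """Replace space/delimiter characters with underscores."""
    return re.sub(r"[ ;{}()\n\t=]", "_", name)
```

Applied to the reproduction above, the column "number of clusters" would be written as "number_of_clusters", which the schema parser handles without complaint.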