Re: Unable to use HiveContext in spark-shell
What version of Spark are you using? Did you compile Spark yourself and, if so, what compile options did you use? On 11/6/14, 9:22 AM, tridib tridib.sama...@live.com wrote: Help please! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18280.html
Re: Unable to use HiveContext in spark-shell
Those are the same options I used, except I had --tgz to package it, and I built off of the master branch. Unfortunately, my only guess is that these errors stem from your build environment. In your Spark assembly, do you have any classes that belong to the org.apache.hadoop.hive package? From: Tridib Samanta tridib.sama...@live.com Date: Thursday, November 6, 2014 at 9:49 AM To: Terry Siu terry@smartfocus.com, u...@spark.incubator.apache.org Subject: RE: Unable to use HiveContext in spark-shell I am using Spark 1.1.0. I built it using: ./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests My ultimate goal is to execute a query on a Parquet file with a nested structure and cast a date string to Date, which is required to calculate the age of a Person entity. But I am unable to get even past this line: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) I made sure that the org.apache.hadoop package is in the Spark assembly jar. Re-attaching the stack trace for quick reference. scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) error: bad symbolic reference. A signature in HiveContext.class refers to term hive in package org.apache.hadoop which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. error: while compiling: <console> during phase: erasure library version: version 2.10.4 compiler version: version 2.10.4 reconstructed args: last tree to typer: Apply(value $outer) symbol: value $outer (flags: method synthetic stable expandedname triedcooking) symbol definition: val $outer(): $iwC.$iwC.type tpe: $iwC.$iwC.type symbol owners: value $outer - class $iwC - class $iwC - class $iwC - class $read - package $line5 context owners: class $iwC - class $iwC - class $iwC - class $iwC - class $read - package $line5 == Enclosing template or block == ClassDef( // class $iwC extends Serializable 0 $iwC [] Template( // val local $iwC: notype, tree.tpe=$iwC java.lang.Object, scala.Serializable // parents ValDef( private _ tpt empty ) // 5 statements DefDef( // def <init>(arg$outer: $iwC.$iwC.$iwC.type): $iwC method triedcooking <init> [] // 1 parameter list ValDef( // $outer: $iwC.$iwC.$iwC.type $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) tpt // tree.tpe=$iwC Block( // tree.tpe=Unit Apply( // def <init>(): Object in class Object, tree.tpe=Object $iwC.super.<init> // def <init>(): Object in class Object, tree.tpe=()Object Nil ) () ) ) ValDef( // private[this] val sqlContext: org.apache.spark.sql.hive.HiveContext private local triedcooking sqlContext tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext Apply( // def <init>(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext new org.apache.spark.sql.hive.HiveContext.<init> // def <init>(sc: org.apache.spark.SparkContext): org.apache.spark.sql.hive.HiveContext in class HiveContext, tree.tpe=(sc: org.apache.spark.SparkContext)org.apache.spark.sql.hive.HiveContext Apply( // val sc(): org.apache.spark.SparkContext, tree.tpe=org.apache.spark.SparkContext $iwC.this.$line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$iwC$$$outer().$line5$$read$$iwC$$iwC$$$outer().$VAL1().$iw().$iw().sc // val sc(): org.apache.spark.SparkContext,
tree.tpe=()org.apache.spark.SparkContext Nil ) ) ) DefDef( // val sqlContext(): org.apache.spark.sql.hive.HiveContext method stable accessor sqlContext [] List(Nil) tpt // tree.tpe=org.apache.spark.sql.hive.HiveContext $iwC.this.sqlContext // private[this] val sqlContext: org.apache.spark.sql.hive.HiveContext, tree.tpe=org.apache.spark.sql.hive.HiveContext ) ValDef( // protected val $outer: $iwC.$iwC.$iwC.type protected synthetic paramaccessor triedcooking $outer tpt // tree.tpe=$iwC.$iwC.$iwC.type empty ) DefDef( // val $outer(): $iwC.$iwC.$iwC.type method synthetic stable expandedname triedcooking $line5$$read$$iwC$$iwC$$iwC$$iwC$$$outer [] List(Nil) tpt // tree.tpe=Any $iwC.this.$outer // protected val $outer: $iwC.$iwC.$iwC.type, tree.tpe=$iwC.$iwC.$iwC.type ) ) ) == Expanded type of tree == ThisType(class $iwC) uncaught exception during compilation: scala.reflect.internal.Types$TypeError scala.reflect.internal.Types
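Terry's question about the assembly contents is the right first check. A minimal, hedged way to answer it from the same spark-shell session (a sketch, not from the original thread) is to probe for the classes directly; a ClassNotFoundException for the Hive class below strongly suggests the assembly on the classpath was built without -Phive:

scala> Class.forName("org.apache.spark.sql.hive.HiveContext")   // the Spark side; this one does load in the report above, since the REPL found HiveContext.class
scala> Class.forName("org.apache.hadoop.hive.conf.HiveConf")    // a core Hive class; a failure here matches the "term hive in package org.apache.hadoop" error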
NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile
I just built the 1.2 snapshot current as of commit 76386e1a23c using: $ ./make-distribution.sh --tgz --name my-spark --skip-java-test -DskipTests -Phadoop-2.4 -Phive -Phive-0.13.1 -Pyarn I drop my Hive configuration files into the conf directory, launch spark-shell, and then create my HiveContext, hc. I then issue a "use db" command: scala> hc.hql("use db") and receive the following class-not-found error: java.lang.NoClassDefFoundError: com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy at org.apache.hadoop.hive.ql.exec.Utilities.<clinit>(Utilities.java:925) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1224) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:315) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:286) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:424) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:424) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:111) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:115) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:46) at $iwC$$iwC$$iwC.<init>(<console>:48) at $iwC$$iwC.<init>(<console>:50) at $iwC.<init>(<console>:52) at <init>(<console>:54) at .<init>(<console>:58) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIva:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:8 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILola:968) at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scal at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scal at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoadla:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIva:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:353) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at
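One hedged way to narrow this down (a sketch, not from the thread) is to probe, in the same spark-shell session and before issuing any HiveQL, for the exact class named in the error above:

scala> Class.forName("com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy")
// A ClassNotFoundException here means the shaded Kryo/Objenesis classes that hive-exec 0.13.1
// references from Utilities never made it into the assembly, i.e. a packaging problem rather
// than a Hive configuration problem, which appears to be what the pull request mentioned in
// the reply below addresses.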
ParquetFilters and StringType support for GT, GTE, LT, LTE
Is there any reason why StringType is not a supported type for the GT, GTE, LT, LTE operations? I was previously able to have a predicate where my column type was a string and execute a filter with one of the above operators in SparkSQL without any problems. However, I synced up to the latest code this morning and now the same query gives me a MatchError for this column of string type. Thanks, -Terry
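For reference, a minimal reproduction of the kind of query involved would look like the following sketch; the file, table, and column names are illustrative (not taken from the original report), and it assumes an existing SQLContext named sqlContext:

scala> val people = sqlContext.parquetFile("/tmp/people.parquet")   // hypothetical Parquet data with a string column "name"
scala> people.registerAsTable("people")
scala> sqlContext.sql("SELECT name FROM people WHERE name > 'M'").collect()
// This worked before the sync; afterwards, building the Parquet predicate for the string
// comparison fails with a MatchError in ParquetFilters.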
Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile
Thanks, Kousuke. I’ll wait till this pull request makes it into the master branch. -Terry From: Kousuke Saruta saru...@oss.nttdata.co.jp Date: Monday, November 3, 2014 at 11:11 AM To: Terry Siu terry@smartfocus.com, user@spark.apache.org Subject: Re: NoClassDefFoundError encountered in Spark 1.2-snapshot build with hive-0.13.1 profile Hi Terry, I think the issue you mentioned will be resolved by the following PR: https://github.com/apache/spark/pull/3072 - Kousuke
Re: ParquetFilters and StringType support for GT, GTE, LT, LTE
Done. https://issues.apache.org/jira/browse/SPARK-4213 Thanks, -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, November 3, 2014 at 1:37 PM To: Terry Siu terry@smartfocus.com Cc: user@spark.apache.org Subject: Re: ParquetFilters and StringType support for GT, GTE, LT, LTE That sounds like a regression. Could you open a JIRA with steps to reproduce (https://issues.apache.org/jira/browse/SPARK)? We'll want to fix this before the 1.2 release.
Spark Build
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully previously and am trying to rebuild again to take advantage of the new Hive 0.13.1 profile. I execute the following command: $ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package The build fails at the following stage: INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 5 Scala sources to /home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes... [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20: object MemLimitLogger is not a member of package org.apache.spark.deploy.yarn [ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._ [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29: not found: value memLimitExceededLogMessage [ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics, VMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30: not found: value memLimitExceededLogMessage [ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics, PMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] three errors found [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.758s] [INFO] Spark Project Common Network Code . SUCCESS [6.716s] [INFO] Spark Project Core SUCCESS [2:46.610s] [INFO] Spark Project Bagel ... SUCCESS [16.776s] [INFO] Spark Project GraphX .. SUCCESS [52.159s] [INFO] Spark Project Streaming ... SUCCESS [1:09.883s] [INFO] Spark Project ML Library .. SUCCESS [1:18.932s] [INFO] Spark Project Tools ... SUCCESS [10.210s] [INFO] Spark Project Catalyst SUCCESS [1:12.499s] [INFO] Spark Project SQL . SUCCESS [1:10.561s] [INFO] Spark Project Hive SUCCESS [1:08.571s] [INFO] Spark Project REPL SUCCESS [32.377s] [INFO] Spark Project YARN Parent POM . SUCCESS [1.317s] [INFO] Spark Project YARN Stable API . FAILURE [25.918s] [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume Sink . SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 11:15.889s [INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014 [INFO] Final Memory: 73M/829M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-yarn_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-yarn_2.10 I could not find MemLimitLogger anywhere in the Spark code. 
Anybody else seen/encounter this? Thanks, -Terry
Re: Spark Build
Thanks for the update, Shivaram. -Terry On 10/31/14, 12:37 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah, looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon.
Re: Ambiguous references to id : what does it mean ?
Found this as I am having the same issue. I have exactly the same usage as shown in Michael's join example. I tried executing a SQL statement against the joined data set with two columns that have the same name and tried to disambiguate the column name with the table alias, but I would still get an Unresolved attributes error back. Is there any way around this short of renaming the columns in the join sources? Thanks, -Terry Michael Armbrust wrote: Yes, but if both tagCollection and selectedVideos have a column named id then Spark SQL does not know which one you are referring to in the where clause. Here's an example with aliases: val x = testData2.as('x) val y = testData2.as('y) val join = x.join(y, Inner, Some("x.a".attr === "y.a".attr)) On Wed, Jul 16, 2014 at 2:47 AM, Jaonary Rabarisoa <jaonary@...> wrote: My query is just a simple query that uses the Spark SQL DSL: tagCollection.join(selectedVideos).where('videoId === 'id) On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai <huaiyin.thu@...> wrote: Hi Jao, It seems the SQL analyzer cannot resolve the references in the join condition. What is your query? Did you use the Hive parser (your query was submitted through hql(...)) or the basic SQL parser (your query was submitted through sql(...))? Thanks, Yin On Tue, Jul 15, 2014 at 8:52 AM, Jaonary Rabarisoa <jaonary@...> wrote: Hi all, When running a join operation with Spark SQL I got the following error: Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to id: (id#303,List()),(id#0,List()), tree: Filter ('videoId = 'id) Join Inner, None ParquetRelation data/tags.parquet Filter (name#1 = P1/cam1) ParquetRelation data/videos.parquet What does it mean? Cheers, Jao
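For anyone else hitting this, the column-renaming workaround Terry mentions can be sketched in plain SQL before the join, so that the joined schema never exposes two columns named id. This is illustrative only, reusing the table and column names from the thread, assuming both tables are already registered with a SQLContext named sqlContext, and assuming your build's SQL parser accepts JOIN ... ON:

scala> val tags = sqlContext.sql("SELECT id AS tag_id, videoId FROM tagCollection")   // rename the clashing column on one side
scala> tags.registerAsTable("tagsRenamed")
scala> val joined = sqlContext.sql("SELECT t.tag_id, v.id FROM tagsRenamed t JOIN selectedVideos v ON t.videoId = v.id")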
Re: SparkSQL - TreeNodeException for unresolved attributes
Just to follow up, the queries worked against master and I got my whole flow rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with the next release of CDH5 :P -Terry
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Yin, Sorry for the delay. I’ll try the code change when I get a chance, but Michael’s initial response did solve my problem. In the meantime, I’m hitting another issue with SparkSQL, which I will probably post in another message if I can’t figure out a workaround. Thanks, -Terry From: Yin Huai huaiyin@gmail.com Date: Thursday, October 16, 2014 at 7:08 AM To: Terry Siu terry@smartfocus.com Cc: Michael Armbrust mich...@databricks.com, user@spark.apache.org Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet Hello Terry, I guess you hit this bug: https://issues.apache.org/jira/browse/SPARK-3559. The list of needed column ids was messed up. Can you try the master branch, or apply the code change https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4 to your 1.1, and see if the problem is resolved? Thanks, Yin
SparkSQL - TreeNodeException for unresolved attributes
Hi all, I’m getting a TreeNodeException for unresolved attributes when I do a simple select from a schemaRDD generated by a join in Spark 1.1.0. A little background first. I am using a HiveContext (against Hive 0.12) to grab two tables, join them, and then perform multiple INSERT-SELECT with GROUP BY to write back out to a Hive rollup table that has two partitions. This task is an effort to simulate the unsupported GROUPING SETS functionality in SparkSQL. In my first attempt, I got really close using SchemaRDD.groupBy until I realized that the SchemaRDD.insertInto API does not support partitioned tables yet. This prompted my second attempt to pass SQL to the HiveContext.sql API instead. Here’s a rundown of the commands I executed in the spark-shell: val hc = new HiveContext(sc) hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") // For implicit conversions to Expression val sqlContext = new SQLContext(sc) import sqlContext._ val segCusts = hc.hql("select ...") val segTxns = hc.hql("select ...") val sc = segCusts.as('sc) val st = segTxns.as('st) // Join the segCusts and segTxns tables val rup = sc.join(st, Inner, Some("sc.segcustomerid".attr === "st.customerid".attr)) rup.registerAsTable("rupbrand") If I do a printSchema on rup, I get: root |-- segcustomerid: string (nullable = true) |-- sales: double (nullable = false) |-- tx_count: long (nullable = false) |-- storeid: string (nullable = true) |-- transdate: long (nullable = true) |-- transdate_ts: string (nullable = true) |-- transdate_dt: string (nullable = true) |-- unitprice: double (nullable = true) |-- translineitem: string (nullable = true) |-- offerid: string (nullable = true) |-- customerid: string (nullable = true) |-- customerkey: string (nullable = true) |-- sku: string (nullable = true) |-- quantity: double (nullable = true) |-- returnquantity: double (nullable = true) |-- channel: string (nullable = true) |-- unitcost: double (nullable = true) |-- transid: string (nullable = true) |-- productid: string (nullable = true) |-- id: string (nullable = true) |-- campaign_campaigncost: double (nullable = true) |-- campaign_begindate: long (nullable = true) |-- campaign_begindate_ts: string (nullable = true) |-- campaign_begindate_dt: string (nullable = true) |-- campaign_enddate: long (nullable = true) |-- campaign_enddate_ts: string (nullable = true) |-- campaign_enddate_dt: string (nullable = true) |-- campaign_campaigntitle: string (nullable = true) |-- campaign_campaignname: string (nullable = true) |-- campaign_id: string (nullable = true) |-- product_categoryid: string (nullable = true) |-- product_company: string (nullable = true) |-- product_brandname: string (nullable = true) |-- product_vendorid: string (nullable = true) |-- product_color: string (nullable = true) |-- product_brandid: string (nullable = true) |-- product_description: string (nullable = true) |-- product_size: string (nullable = true) |-- product_subcategoryid: string (nullable = true) |-- product_departmentid: string (nullable = true) |-- product_productname: string (nullable = true) |-- product_categoryname: string (nullable = true) |-- product_vendorname: string (nullable = true) |-- product_sku: string (nullable = true) |-- product_subcategoryname: string (nullable = true) |-- product_status: string (nullable = true) |-- product_departmentname: string (nullable = true) |-- product_style: string (nullable = true) |-- product_id: string (nullable = true) |-- customer_lastname: string (nullable = true)
|-- customer_familystatus: string (nullable = true) |-- customer_customertype: string (nullable = true) |-- customer_city: string (nullable = true) |-- customer_country: string (nullable = true) |-- customer_state: string (nullable = true) |-- customer_region: string (nullable = true) |-- customer_customergroup: string (nullable = true) |-- customer_maritalstatus: string (nullable = true) |-- customer_agerange: string (nullable = true) |-- customer_zip: string (nullable = true) |-- customer_age: double (nullable = true) |-- customer_address2: string (nullable = true) |-- customer_incomerange: string (nullable = true) |-- customer_gender: string (nullable = true) |-- customer_customerkey: string (nullable = true) |-- customer_address1: string (nullable = true) |-- customer_email: string (nullable = true) |-- customer_education: string (nullable = true) |-- customer_birthdate: long (nullable = true) |-- customer_birthdate_ts: string (nullable = true) |-- customer_birthdate_dt: string (nullable = true) |-- customer_id: string (nullable = true) |-- customer_firstname: string (nullable = true) |-- transnum: long (nullable = true) |-- transmonth: string (nullable = true) Nothing but a flat schema with no duplicated column names. I then
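For readers reconstructing the flow, the per-grouping INSERT-SELECT being described would look roughly like the sketch below; the rollup table name, partition column, and chosen grouping are hypothetical stand-ins, while rupbrand and the product/sales columns come from the schema above:

scala> hc.sql("INSERT OVERWRITE TABLE rollup_brand PARTITION (grp = 'brand') SELECT product_brandname, SUM(sales), SUM(tx_count) FROM rupbrand GROUP BY product_brandname")
// One such statement per simulated grouping, each writing into its own partition of the
// hypothetical rollup table, stands in for the unsupported GROUPING SETS.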
Re: SparkSQL - TreeNodeException for unresolved attributes
Hi Michael, Thanks again for the reply. Was hoping it was something I was doing wrong in 1.1.0, but I’ll try master. Thanks, -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, October 20, 2014 at 12:11 PM To: Terry Siu terry@smartfocus.com Cc: user@spark.apache.org Subject: Re: SparkSQL - TreeNodeException for unresolved attributes Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release.
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Yin, pqt_rdt_snappy has 76 columns. These two Parquet tables were created via Hive 0.12 from existing Avro data using CREATE TABLE followed by an INSERT OVERWRITE. These are partitioned tables: pqt_rdt_snappy has one partition while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I noticed that when I populated it with a single INSERT OVERWRITE over all the partitions and then executed the Spark code, it would report an illegal index value of 29. However, if I manually did an INSERT OVERWRITE for every single partition, I would get an illegal index value of 21. I don’t know if this will help in debugging, but here’s the DESCRIBE output for pqt_segcust_snappy:
OK
col_name              data_type    comment
customer_id           string       from deserializer
age_range             string       from deserializer
gender                string       from deserializer
last_tx_date          bigint       from deserializer
last_tx_date_ts       string       from deserializer
last_tx_date_dt       string       from deserializer
first_tx_date         bigint       from deserializer
first_tx_date_ts      string       from deserializer
first_tx_date_dt      string       from deserializer
second_tx_date        bigint       from deserializer
second_tx_date_ts     string       from deserializer
second_tx_date_dt     string       from deserializer
third_tx_date         bigint       from deserializer
third_tx_date_ts      string       from deserializer
third_tx_date_dt      string       from deserializer
frequency             double       from deserializer
tx_size               double       from deserializer
recency               double       from deserializer
rfm                   double       from deserializer
tx_count              bigint       from deserializer
sales                 double       from deserializer
coll_def_id           string       None
seg_def_id            string       None
# Partition Information
# col_name            data_type    comment
coll_def_id           string       None
seg_def_id            string       None
Time taken: 0.788 seconds, Fetched: 29 row(s)
As you can see, I have 21 data columns, followed by the 2 partition columns, coll_def_id and seg_def_id. The output shows 29 rows, but that looks like it is just counting the rows of console output. Let me know if you need more information. Thanks, -Terry From: Yin Huai huaiyin@gmail.com Date: Tuesday, October 14, 2014 at 6:29 PM To: Terry Siu terry@smartfocus.com Cc: Michael Armbrust mich...@databricks.com, user@spark.apache.org Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin
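A quick way to see the same 21-plus-2 split from the Spark side, without re-running the failing join, is to look at the schema Spark SQL resolves for the table. A hedged sketch against the spark-shell session from the original post in this thread (where hc is the HiveContext):

scala> val segcust = hc.sql("select * from pqt_segcust_snappy")
scala> segcust.schema.fields.length                    // 23: the 21 data columns plus the 2 partition columns
scala> segcust.schema.fields.map(_.name).foreach(println)
// The Parquet files themselves hold only the 21 data columns (indices 0..20), so a lookup of
// index 21 is asking the file for something that exists only as a partition column in the metastore.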
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Michael, That worked for me. At least I’m further along now than I was. Thanks for the tip! -Terry From: Michael Armbrust mich...@databricks.com Date: Monday, October 13, 2014 at 5:05 PM To: Terry Siu terry@smartfocus.com Cc: user@spark.apache.org Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet There are some known bugs with the Parquet SerDe and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use its built-in Parquet support when the SerDe looks like Parquet.
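Applied to the session in the original post below, Michael's suggestion would look like this sketch; the setting just needs to be in place on the HiveContext before the Parquet-backed tables are queried:

scala> val hc = new HiveContext(sc)
scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 132537600 and transdate <= 134006399")
// With the flag set, Spark SQL reads the metastore Parquet tables with its native Parquet
// support instead of the Hive SerDe path that throws the IndexOutOfBoundsException.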
SparkSQL IndexOutOfBoundsException when reading from Parquet
I am currently using Spark 1.1.0, compiled against Hadoop 2.3. Our cluster is CDH5.1.2, which runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed with Snappy), which were converted over from Avro, if that matters. I am trying to perform a join with these two Hive tables, but am encountering an exception. In a nutshell, I launch a spark-shell, create my HiveContext (pointing to the correct metastore on our cluster), and then proceed to do the following: scala> val hc = new HiveContext(sc) scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 132537600 and transdate <= 134006399") scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'") scala> txn.registerAsTable("segTxns") scala> segcust.registerAsTable("segCusts") scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id") Straightforward enough, but I get the following exception: 14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51) java.lang.IndexOutOfBoundsException: Index: 21, Size: 21 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67) at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) My table, pqt_segcust_snappy, has 21 columns and two partitions defined. Does this error look familiar to anyone? Could my usage of SparkSQL with Hive be incorrect, or is support for Hive/Parquet/partitioning still buggy at this point in Spark 1.1.0? Thanks, -Terry