Re: Contribute to Spark - Need a mentor.
Hi Michael,

Unfortunately the Apache mailing list filters out attachments. That said, you can usually just start by looking at the Spark JIRA, finding issues tagged with the "starter" tag, and working on them. You can submit pull requests to the GitHub repo or email the dev list for feedback on specific issues.

https://github.com/apache/spark

On Tue, Jun 17, 2014 at 3:58 PM, Michael Giannakopoulos <miccagi...@gmail.com> wrote:

Hi all,

My name is Michael Giannakopoulos and I am a recent M.Sc. graduate from the University of Toronto, majoring in Computer Science. I would like to contribute to the development of this open source project. Is it possible to work under the supervision of a mentor? I specialize in Data Analytics, Data Management and Machine Learning. I am currently learning the Scala language and I have experience using Java, C++, C and Matlab. I have already read the Spark and Shark papers. Attached to this mail you will find my resume.

Thank you so much for your time and your help,
Michael
Re: Java IO Stream Corrupted - Invalid Type AC?
Patrick,

My team is using shuffle consolidation but not speculation. We are also using persist(DISK_ONLY) for caching.

Here are some config changes that are in our work in progress. We've been trying for 2 weeks to get our production flow (maybe around 50-70 stages, a few forks and joins with up to 20 branches in the forks) to run end to end without any success, running into other problems besides this one as well. For example, we have run into situations where saving to HDFS just hangs on a couple of tasks, which print nothing in their logs and take no CPU. For testing, our input data is 10 GB across 320 input splits and generates maybe around 200-300 GB of intermediate and final data.

conf.set("spark.executor.memory", "14g") // TODO make this configurable

// shuffle configs
conf.set("spark.default.parallelism", "320") // TODO make this configurable
conf.set("spark.shuffle.consolidateFiles", "true")
conf.set("spark.shuffle.file.buffer.kb", "200")
conf.set("spark.reducer.maxMbInFlight", "96")
conf.set("spark.rdd.compress", "true")

// we ran into a problem with the default timeout of 60 seconds
// this is also being set in the master's spark-env.sh. Not sure if it
// needs to be in both places
conf.set("spark.worker.timeout", "180")

// akka settings
conf.set("spark.akka.threads", "300")
conf.set("spark.akka.timeout", "180")
conf.set("spark.akka.frameSize", "100")
conf.set("spark.akka.batchSize", "30")
conf.set("spark.akka.askTimeout", "30")

// block manager
conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
conf.set("spark.blockManagerHeartBeatMs", "8")

-Suren

On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell <pwend...@gmail.com> wrote:

Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption.

On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

Matt/Ryan,

Did you make any headway on this? My team is running into this also.
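As an aside on the "TODO make this configurable" notes above, a hedged sketch of one way to do that is to read overrides from JVM system properties, falling back to the current hard-coded values. The "flow.*" property names here are made up for illustration:

```scala
import org.apache.spark.SparkConf

// Read tunables from -Dflow.* system properties, falling back to the
// values currently hard-coded in the flow.
val conf = new SparkConf()
conf.set("spark.executor.memory",
  sys.props.getOrElse("flow.executor.memory", "14g"))
conf.set("spark.default.parallelism",
  sys.props.getOrElse("flow.parallelism", "320"))
```

This keeps the defaults in code while letting a launch script override them per run.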
Doesn't happen on smaller datasets. Our input set is about 10 GB but we generate 100s of GBs in the flow itself.

-Suren

On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton <compton.r...@gmail.com> wrote:

Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to yours: https://imgur.com/7iBA4nJ ).

14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed 4 times; aborting job
14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at VertexRDD.scala:100
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.599:39 failed 4 times, most recent failure: Exception failure in TID 29735 on host node18: java.io.StreamCorruptedException: invalid type code: AC
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
    org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
    org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
    org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
    org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
    org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    org.apache.spark.scheduler.Task.run(Task.scala:51)
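For anyone trying to reproduce the failure above, a minimal sketch of the job shape Ryan describes (the input path here is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

// Load an edge list from HDFS and run connected components, as in the
// failing ~200GB job; works fine at small scale.
val sc = new SparkContext(new SparkConf().setAppName("cc-repro"))
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
val components = graph.connectedComponents().vertices
components.take(5).foreach(println)
```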
Re: Java IO Stream Corrupted - Invalid Type AC?
On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

Patrick,

My team is using shuffle consolidation but not speculation. We are also using persist(DISK_ONLY) for caching.

Use of shuffle consolidation is probably what is causing the issue. It would be a good idea to try again with that turned off (which is the default). It will most likely get fixed in the 1.1 timeframe.

Regards,
Mridul
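To test Mridul's suggestion in isolation, the only change needed in the flow's config sketch above would be flipping consolidation back to its default:

```scala
// Disable shuffle file consolidation (false is the default in 1.0),
// leaving the rest of the tuning unchanged, and re-run the flow.
conf.set("spark.shuffle.consolidateFiles", "false")
```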
question about Hive compatibility tests
Hi all,

Does a "Failed to generate golden answer for query" message from HiveComparisonTests indicate that it isn't possible to run the query in question under Hive from Spark's test suite, rather than anything about Spark's implementation of HiveQL? The stack trace I'm getting implicates Hive code and not Spark code, but I wanted to make sure I wasn't missing something.

thanks,
wb
Re: Java IO Stream Corrupted - Invalid Type AC?
Just wondering, do you get this particular exception if you are not consolidating shuffle data?

On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan <mri...@gmail.com> wrote:

Use of shuffle consolidation is probably what is causing the issue. It would be a good idea to try again with that turned off (which is the default). It will most likely get fixed in the 1.1 timeframe.

Regards,
Mridul
Re: Java IO Stream Corrupted - Invalid Type AC?
Good question. At this point, I'd have to re-run it to know for sure. We've been trying various different things, so I'd have to reset the flow config back to that state.

I can say that by removing persist(DISK_ONLY), the flows are running more stably, probably due to removing disk contention. We won't be able to run our full production flows without some type of disk persistence, but for testing, this is how we are proceeding for now.

I can try tomorrow if you'd like.

-Suren

On Wed, Jun 18, 2014 at 8:35 PM, Patrick Wendell <pwend...@gmail.com> wrote:

Just wondering, do you get this particular exception if you are not consolidating shuffle data?
Re: Run ScalaTest inside Intellij IDEA
Here's the JIRA on this known issue: https://issues.apache.org/jira/browse/SPARK-1835

tl;dr: manually delete mesos-0.18.1.jar from lib_managed/jars after running sbt/sbt gen-idea. You should be able to run units inside Intellij after doing so.

Doris

On Tue, Jun 17, 2014 at 6:10 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:

I got stuck on this one too after doing a git pull from master. Have not been able to resolve it yet =(

- Henry

On Wed, Jun 11, 2014 at 6:51 AM, Yijie Shen <henry.yijies...@gmail.com> wrote:

Thx Qiuzhuang, the problems disappeared after I added the assembly jar at the head of the list of dependencies in *.iml, but while running tests in Spark SQL (SQLQuerySuite in sql-core), another two errors occur:

Error 1:

Error:scalac: while compiling: /Users/yijie/code/apache.spark.master/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
during phase: jvm
library version: version 2.10.4
compiler version: version 2.10.4
reconstructed args: -Xmax-classfile-name 120 -deprecation -P:genjavadoc:out=/Users/yijie/code/apache.spark.master/sql/core/target/java -feature -classpath /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Conte… … ...
/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/classes:/Users/yijie/code/apache.spark.master/lib_managed/jars/scala-library-2.10.4.jar -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar

last tree to typer: Literal(Constant(parquet.io.api.Converter))
symbol: null
symbol definition: null
tpe: Class(classOf[parquet.io.api.Converter])
symbol owners:
context owners: object TestSQLContext -> package test

== Enclosing template or block ==

Template( // val local TestSQLContext: notype in object TestSQLContext, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
  org.apache.spark.sql.SQLContext // parents
  ValDef(
    private
    _
    tpt
    empty
  )
  // 2 statements
  DefDef( // private def readResolve(): Object in object TestSQLContext
    method private synthetic
    readResolve
    []
    List(Nil)
    tpt // tree.tpe=Object
    test.this.TestSQLContext // object TestSQLContext in package test, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
  )
  DefDef( // def init(): org.apache.spark.sql.test.TestSQLContext.type in object TestSQLContext
    method
    init
    []
    List(Nil)
    tpt // tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
    Block( // tree.tpe=Unit
      Apply( // def init(sparkContext: org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class SQLContext, tree.tpe=org.apache.spark.sql.SQLContext
        TestSQLContext.super.init // def init(sparkContext: org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class SQLContext, tree.tpe=(sparkContext: org.apache.spark.SparkContext)org.apache.spark.sql.SQLContext
        Apply( // def init(master: String, appName: String, conf: org.apache.spark.SparkConf): org.apache.spark.SparkContext in class SparkContext, tree.tpe=org.apache.spark.SparkContext
          new org.apache.spark.SparkContext.init // def init(master: String, appName: String, conf: org.apache.spark.SparkConf): org.apache.spark.SparkContext in class SparkContext, tree.tpe=(master: String, appName: String, conf: org.apache.spark.SparkConf)org.apache.spark.SparkContext
          // 3 arguments
          local TestSQLContext
          Apply( // def init(): org.apache.spark.SparkConf in class SparkConf, tree.tpe=org.apache.spark.SparkConf
            new org.apache.spark.SparkConf.init // def init(): org.apache.spark.SparkConf in class SparkConf, tree.tpe=()org.apache.spark.SparkConf
            Nil
          )
        )
      )
      ()
    )
  )
)

== Expanded type of tree ==

ConstantType(value = Constant(parquet.io.api.Converter))

uncaught exception during compilation: java.lang.AssertionError

Error 2:
Re: question about Hive compatibility tests
I assume you are adding tests? Because that is the only time you should see that message.

That error could mean a couple of things:
1) The query is invalid and Hive threw an exception.
2) Your Hive setup is bad.

Regarding #2, you need to have the source for Hive 0.12.0 available and built, as well as a Hadoop installation. You also have to have the environment vars set as specified here: https://github.com/apache/spark/tree/master/sql

Michael

On Thu, Jun 19, 2014 at 12:22 AM, Will Benton <wi...@redhat.com> wrote:

Hi all,

Does a "Failed to generate golden answer for query" message from HiveComparisonTests indicate that it isn't possible to run the query in question under Hive from Spark's test suite, rather than anything about Spark's implementation of HiveQL? The stack trace I'm getting implicates Hive code and not Spark code, but I wanted to make sure I wasn't missing something.

thanks,
wb
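For context on "adding tests": a hedged sketch of how a case is typically added in the sql/hive suites. createQueryTest is the HiveComparisonTest helper that triggers golden-answer generation on first run; the suite name and query below are made up for illustration:

```scala
// Hypothetical suite: each createQueryTest case is run against real
// Hive once to produce a cached "golden answer", then compared with
// Spark SQL's result on subsequent runs.
class HavingSuite extends HiveComparisonTest {
  createQueryTest("having clause",
    "SELECT key, count(*) AS c FROM src GROUP BY key HAVING c > 1")
}
```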
Re: question about Hive compatibility tests
> I assume you are adding tests? Because that is the only time you should see that message.

Yes, I had added the HAVING test to the whitelist.

> That error could mean a couple of things:
> 1) The query is invalid and Hive threw an exception.
> 2) Your Hive setup is bad.
>
> Regarding #2, you need to have the source for Hive 0.12.0 available and built, as well as a Hadoop installation. You also have to have the environment vars set as specified here: https://github.com/apache/spark/tree/master/sql

Thanks! The other Hive compatibility tests seem to work, so I'll dig in a bit more to see if I can figure out what's happening here.

best,
wb