Re: Contribute to Spark - Need a mentor.

2014-06-18 Thread Reynold Xin
Hi Michael,

Unfortunately, the Apache mailing list filters out attachments. That said,
you can usually just start by browsing the Spark JIRA for issues tagged with
the starter tag and working on them. You can submit pull requests to the
GitHub repo or email the dev list for feedback on specific issues.
https://github.com/apache/spark


On Tue, Jun 17, 2014 at 3:58 PM, Michael Giannakopoulos 
miccagi...@gmail.com wrote:

 Hi all,

 My name is Michael Giannakopoulos and I am a recent M.Sc. graduate from the
 University of Toronto, majoring in Computer Science. I would like to contribute
 to the development of this open source project. Is it possible to work under
 the supervision of a mentor? I specialize in Data Analytics, Data Management
 and Machine Learning. I am currently learning the Scala language and I have
 experience using Java, C++, C and Matlab. I have already read the Spark and
 Shark papers.

 Together with this mail you will find attached my resume.

 Thank you so much for your time and your help,
 Michael



Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Patrick,

My team is using shuffle consolidation but not speculation. We are also
using persist(DISK_ONLY) for caching.
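Roughly, the relevant pieces look like this (a minimal sketch of what I mean
rather than our actual code; paths and names are placeholders, and the full set
of conf changes we are running with is below):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("flow").setMaster("local[2]") // master set inline only to keep the sketch self-contained
conf.set("spark.shuffle.consolidateFiles", "true") // shuffle consolidation on
// speculation (spark.speculation) is left at its default, i.e. off
val sc = new SparkContext(conf)

// intermediate data is cached straight to disk
val cached = sc.textFile("hdfs:///placeholder/input").persist(StorageLevel.DISK_ONLY)
cached.count()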

Here are some config changes that are in our work-in-progress.

We've been trying for 2 weeks to get our production flow (maybe around
50-70 stages, a few forks and joins with up to 20 branches in the forks) to
run end to end without any success, running into other problems besides
this one as well. For example, we have run into situations where saving to
HDFS just hangs on a couple of tasks, which are printing out nothing in
their logs and not taking any CPU. For testing, our input data is 10 GB
across 320 input splits and generates maybe around 200-300 GB of
intermediate and final data.


conf.set("spark.executor.memory", "14g") // TODO make this configurable

// shuffle configs
conf.set("spark.default.parallelism", "320") // TODO make this configurable
conf.set("spark.shuffle.consolidateFiles", "true")

conf.set("spark.shuffle.file.buffer.kb", "200")
conf.set("spark.reducer.maxMbInFlight", "96")

conf.set("spark.rdd.compress", "true")

// we ran into a problem with the default timeout of 60 seconds
// this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
conf.set("spark.worker.timeout", "180")

// akka settings
conf.set("spark.akka.threads", "300")
conf.set("spark.akka.timeout", "180")
conf.set("spark.akka.frameSize", "100")
conf.set("spark.akka.batchSize", "30")
conf.set("spark.akka.askTimeout", "30")

// block manager
conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
conf.set("spark.blockManagerHeartBeatMs", "8")

-Suren



On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
 
  

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Mridul Muralidharan
On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
 Patrick,

 My team is using shuffle consolidation but not speculation. We are also
 using persist(DISK_ONLY) for caching.


Use of shuffle consolidation is probably what is causing the issue.
It would be a good idea to try again with that turned off (which is the default).

It should most likely get fixed in the 1.1 timeframe.
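For reference, turning it off is a one-line change to your conf (a minimal
sketch; simply leaving the property unset has the same effect, since false is
the default):

conf.set("spark.shuffle.consolidateFiles", "false")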


Regards,
Mridul



 Here are some config changes that are in our work-in-progress.

 We've been trying for 2 weeks to get our production flow (maybe around
 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
 run end to end without any success, running into other problems besides
 this one as well. For example, we have run into situations where saving to
 HDFS just hangs on a couple of tasks, which are printing out nothing in
 their logs and not taking any CPU. For testing, our input data is 10 GB
 across 320 input splits and generates maybe around 200-300 GB of
 intermediate and final data.


 conf.set("spark.executor.memory", "14g") // TODO make this configurable

 // shuffle configs
 conf.set("spark.default.parallelism", "320") // TODO make this configurable
 conf.set("spark.shuffle.consolidateFiles", "true")

 conf.set("spark.shuffle.file.buffer.kb", "200")
 conf.set("spark.reducer.maxMbInFlight", "96")

 conf.set("spark.rdd.compress", "true")

 // we ran into a problem with the default timeout of 60 seconds
 // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
 conf.set("spark.worker.timeout", "180")

 // akka settings
 conf.set("spark.akka.threads", "300")
 conf.set("spark.akka.timeout", "180")
 conf.set("spark.akka.frameSize", "100")
 conf.set("spark.akka.batchSize", "30")
 conf.set("spark.akka.askTimeout", "30")

 // block manager
 conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
 conf.set("spark.blockManagerHeartBeatMs", "8")

 -Suren



 On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  

question about Hive compatiblilty tests

2014-06-18 Thread Will Benton
Hi all,

Does a "Failed to generate golden answer for query" message from
HiveComparisonTests indicate that it isn't possible to run the query in
question under Hive from Spark's test suite, rather than anything about Spark's
implementation of HiveQL? The stack trace I'm getting implicates Hive code and
not Spark code, but I wanted to make sure I wasn't missing something.


thanks,
wb


Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Patrick Wendell
Just wondering, do you get this particular exception if you are not
consolidating shuffle data?

On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com wrote:
 On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
 Patrick,

 My team is using shuffle consolidation but not speculation. We are also
 using persist(DISK_ONLY) for caching.


 Use of shuffle consolidation is probably what is causing the issue.
 It would be a good idea to try again with that turned off (which is the default).

 It should most likely get fixed in the 1.1 timeframe.


 Regards,
 Mridul



 Here are some config changes that are in our work-in-progress.

 We've been trying for 2 weeks to get our production flow (maybe around
 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
 run end to end without any success, running into other problems besides
 this one as well. For example, we have run into situations where saving to
 HDFS just hangs on a couple of tasks, which are printing out nothing in
 their logs and not taking any CPU. For testing, our input data is 10 GB
 across 320 input splits and generates maybe around 200-300 GB of
 intermediate and final data.


 conf.set("spark.executor.memory", "14g") // TODO make this configurable

 // shuffle configs
 conf.set("spark.default.parallelism", "320") // TODO make this configurable
 conf.set("spark.shuffle.consolidateFiles", "true")

 conf.set("spark.shuffle.file.buffer.kb", "200")
 conf.set("spark.reducer.maxMbInFlight", "96")

 conf.set("spark.rdd.compress", "true")

 // we ran into a problem with the default timeout of 60 seconds
 // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
 conf.set("spark.worker.timeout", "180")

 // akka settings
 conf.set("spark.akka.threads", "300")
 conf.set("spark.akka.timeout", "180")
 conf.set("spark.akka.frameSize", "100")
 conf.set("spark.akka.batchSize", "30")
 conf.set("spark.akka.askTimeout", "30")

 // block manager
 conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
 conf.set("spark.blockManagerHeartBeatMs", "8")

 -Suren



 On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Good question. At this point, I'd have to re-run it to know for sure. We've
been trying various things, so I'd have to reset the flow config back to
that state.

I can say that by removing persist(DISK_ONLY), the flows are running more
stably, probably because it removes disk contention. We won't be able to run
our full production flows without some type of disk persistence, but for
testing, this is how we are proceeding for now.
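For testing, that change amounts to either dropping the persist() call or
falling back to a less disk-heavy storage level. A rough, self-contained sketch
of the two options (the tiny local context and dummy RDD are only there to make
it runnable):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-test").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 1000)

// what we had in production: cached partitions go straight to disk
// rdd.persist(StorageLevel.DISK_ONLY)

// what we are trying for now: keep partitions in memory and spill to disk
// only if they don't fit (or drop persist entirely)
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()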

I can try tomorrow if you'd like.

-Suren



On Wed, Jun 18, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote:

 Just wondering, do you get this particular exception if you are not
 consolidating shuffle data?

 On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
  On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
  suren.hira...@velos.io wrote:
  Patrick,
 
  My team is using shuffle consolidation but not speculation. We are also
  using persist(DISK_ONLY) for caching.
 
 
  Use of shuffle consolidation is probably what is causing the issue.
  It would be a good idea to try again with that turned off (which is the default).

  It should most likely get fixed in the 1.1 timeframe.
 
 
  Regards,
  Mridul
 
 
 
  Here are some config changes that are in our work-in-progress.
 
  We've been trying for 2 weeks to get our production flow (maybe around
  50-70 stages, a few forks and joins with up to 20 branches in the
 forks) to
  run end to end without any success, running into other problems besides
  this one as well. For example, we have run into situations where saving
 to
  HDFS just hangs on a couple of tasks, which are printing out nothing in
  their logs and not taking any CPU. For testing, our input data is 10 GB
  across 320 input splits and generates maybe around 200-300 GB of
  intermediate and final data.
 
 
  conf.set("spark.executor.memory", "14g") // TODO make this configurable

  // shuffle configs
  conf.set("spark.default.parallelism", "320") // TODO make this configurable
  conf.set("spark.shuffle.consolidateFiles", "true")

  conf.set("spark.shuffle.file.buffer.kb", "200")
  conf.set("spark.reducer.maxMbInFlight", "96")

  conf.set("spark.rdd.compress", "true")

  // we ran into a problem with the default timeout of 60 seconds
  // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
  conf.set("spark.worker.timeout", "180")

  // akka settings
  conf.set("spark.akka.threads", "300")
  conf.set("spark.akka.timeout", "180")
  conf.set("spark.akka.frameSize", "100")
  conf.set("spark.akka.batchSize", "30")
  conf.set("spark.akka.askTimeout", "30")

  // block manager
  conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
  conf.set("spark.blockManagerHeartBeatMs", "8")
 
  -Suren
 
 
 
  On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Out of curiosity - are you guys using speculation, shuffle
  consolidation, or any other non-default option? If so that would help
  narrow down what's causing this corruption.
 
  On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
  suren.hira...@velos.io wrote:
   Matt/Ryan,
  
   Did you make any headway on this? My team is running into this also.
   Doesn't happen on smaller datasets. Our input set is about 10 GB but
 we
   generate 100s of GBs in the flow itself.
  
   -Suren
  
  
  
  
   On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 
  wrote:
  
   Just ran into this today myself. I'm on branch-1.0 using a CDH3
   cluster (no modifications to Spark or its dependencies). The error
   appeared trying to run GraphX's .connectedComponents() on a ~200GB
   edge list (GraphX worked beautifully on smaller data).
  
   Here's the stacktrace (it's quite similar to yours
   https://imgur.com/7iBA4nJ ).
  
   14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39
 failed
   4 times; aborting job
   14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce
 at
   VertexRDD.scala:100
   Exception in thread main org.apache.spark.SparkException: Job
   aborted due to stage failure: Task 5.599:39 failed 4 times, most
   recent failure: Exception failure in TID 29735 on host node18:
   java.io.StreamCorruptedException: invalid type code: AC
  
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
  
  
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  
  
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  
  org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  
  
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  
  
 
 

Re: Run ScalaTest inside Intellij IDEA

2014-06-18 Thread Doris Xin
Here's the JIRA on this known issue:
https://issues.apache.org/jira/browse/SPARK-1835

tl;dr: manually delete mesos-0.18.1.jar from lib_managed/jars after
running sbt/sbt gen-idea. You should be able to run unit tests inside
Intellij after doing so.

Doris


On Tue, Jun 17, 2014 at 6:10 PM, Henry Saputra henry.sapu...@gmail.com
wrote:

 I got stuck on this one too after doing a git pull from master.

 Have not been able to resolve it yet =(


 - Henry

 On Wed, Jun 11, 2014 at 6:51 AM, Yijie Shen henry.yijies...@gmail.com
 wrote:
  Thx Qiuzhuang, the problems disappeared after I added the assembly jar at the
 head of the dependency list in *.iml, but while running tests in Spark
 SQL (SQLQuerySuite in sql-core), another two errors occur:
 
  Error 1:
  Error:scalac:
   while compiling:
 /Users/yijie/code/apache.spark.master/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
  during phase: jvm
   library version: version 2.10.4
  compiler version: version 2.10.4
reconstructed args: -Xmax-classfile-name 120 -deprecation
 -P:genjavadoc:out=/Users/yijie/code/apache.spark.master/sql/core/target/java
 -feature -classpath
 /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Conte…
  …
  ...
 
 /Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/classes:/Users/yijie/code/apache.spark.master/lib_managed/jars/scala-library-2.10.4.jar
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
last tree to typer: Literal(Constant(parquet.io.api.Converter))
symbol: null
 symbol definition: null
   tpe: Class(classOf[parquet.io.api.Converter])
 symbol owners:
context owners: object TestSQLContext - package test
  == Enclosing template or block ==
  Template( // val local TestSQLContext: notype in object
 TestSQLContext, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
org.apache.spark.sql.SQLContext // parents
ValDef(
  private
  _
  tpt
  empty
)
// 2 statements
DefDef( // private def readResolve(): Object in object TestSQLContext
  method private synthetic
  readResolve
  []
  List(Nil)
  tpt // tree.tpe=Object
  test.this.TestSQLContext // object TestSQLContext in package test,
 tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
)
DefDef( // def init(): org.apache.spark.sql.test.TestSQLContext.type
 in object TestSQLContext
  method
  init
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
  Block( // tree.tpe=Unit
Apply( // def init(sparkContext: org.apache.spark.SparkContext):
 org.apache.spark.sql.SQLContext in class SQLContext,
 tree.tpe=org.apache.spark.sql.SQLContext
  TestSQLContext.super.init // def init(sparkContext:
 org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class
 SQLContext, tree.tpe=(sparkContext:
 org.apache.spark.SparkContext)org.apache.spark.sql.SQLContext
  Apply( // def init(master: String,appName: String,conf:
 org.apache.spark.SparkConf): org.apache.spark.SparkContext in class
 SparkContext, tree.tpe=org.apache.spark.SparkContext
new org.apache.spark.SparkContext.init // def
 init(master: String,appName: String,conf: org.apache.spark.SparkConf):
 org.apache.spark.SparkContext in class SparkContext, tree.tpe=(master:
 String, appName: String, conf:
 org.apache.spark.SparkConf)org.apache.spark.SparkContext
// 3 arguments
local
TestSQLContext
Apply( // def init(): org.apache.spark.SparkConf in class
 SparkConf, tree.tpe=org.apache.spark.SparkConf
  new org.apache.spark.SparkConf.init // def init():
 org.apache.spark.SparkConf in class SparkConf,
 tree.tpe=()org.apache.spark.SparkConf
  Nil
)
  )
)
()
  )
)
  )
  == Expanded type of tree ==
  ConstantType(value = Constant(parquet.io.api.Converter))
  uncaught exception during compilation: java.lang.AssertionError
 
  Error 2:
 
 

Re: question about Hive compatiblilty tests

2014-06-18 Thread Michael Armbrust
I assume you are adding tests, because that is the only time you should
see that message.

That error could mean a couple of things:
 1) The query is invalid and Hive threw an exception.
 2) Your Hive setup is bad.

Regarding #2, you need to have the source for Hive 0.12.0 available and
built, as well as a Hadoop installation. You also need the environment
variables set as specified here:
https://github.com/apache/spark/tree/master/sql
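For example, a quick sanity check along these lines can catch a missing
variable before running the comparison tests (a rough sketch; the variable
names used here, HIVE_HOME, HIVE_DEV_HOME and HADOOP_HOME, are illustrative, so
check the README above for the authoritative list):

object HiveTestEnvCheck {
  def main(args: Array[String]): Unit = {
    // Assumed variable names; see the sql/ README for the authoritative set.
    val required = Seq("HIVE_HOME", "HIVE_DEV_HOME", "HADOOP_HOME")
    val missing = required.filterNot(sys.env.contains)
    if (missing.isEmpty) println("Hive comparison test environment looks set up.")
    else sys.error("Missing environment variables: " + missing.mkString(", "))
  }
}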

Michael


On Thu, Jun 19, 2014 at 12:22 AM, Will Benton wi...@redhat.com wrote:

 Hi all,

 Does a "Failed to generate golden answer for query" message from
 HiveComparisonTests indicate that it isn't possible to run the query in
 question under Hive from Spark's test suite, rather than anything about
 Spark's implementation of HiveQL? The stack trace I'm getting implicates
 Hive code and not Spark code, but I wanted to make sure I wasn't missing
 something.


 thanks,
 wb



Re: question about Hive compatiblilty tests

2014-06-18 Thread Will Benton
 I assume you are adding tests, because that is the only time you should
 see that message.

Yes, I had added the HAVING test to the whitelist.

 That error could mean a couple of things:
  1) The query is invalid and Hive threw an exception.
  2) Your Hive setup is bad.

 Regarding #2, you need to have the source for Hive 0.12.0 available and
 built, as well as a Hadoop installation. You also need the environment
 variables set as specified here:
 https://github.com/apache/spark/tree/master/sql

Thanks!  The other Hive compatibility tests seem to work, so I'll dig in a bit 
more to see if I can figure out what's happening here.


best,
wb