Re: Contribute to Spark - Need a mentor.

2014-06-18 Thread Reynold Xin
Hi Michael,

Unfortunately, the Apache mailing list filters out attachments. That said,
you can usually just start by browsing the Spark JIRA for issues tagged with
the starter tag and working on them. You can submit pull requests to the
GitHub repo or email the dev list for feedback on specific issues.
https://github.com/apache/spark


On Tue, Jun 17, 2014 at 3:58 PM, Michael Giannakopoulos 
miccagi...@gmail.com wrote:

 Hi all,

 My name is Michael Giannakopoulos and I am a recent M.Sc. graduate from the
 University of Toronto, majoring in Computer Science. I would like to contribute
 to the development of this open source project. Is it possible to work under
 the supervision of a mentor? I specialize in Data Analytics, Data Management
 and Machine Learning. I am currently learning the Scala language and I have
 experience using Java, C++, C and Matlab. I have already read the Spark and
 Shark papers.

 Together with this mail you will find attached my resume.

 Thank you so much for your time and your help,
 Michael



Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Patrick,

My team is using shuffle consolidation but not speculation. We are also
using persist(DISK_ONLY) for caching.
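Roughly, the relevant pieces look like this (a minimal sketch of what I mean
rather than our actual code; paths and names are placeholders, and the full set
of conf changes we are running with is below):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("flow").setMaster("local[2]") // master set inline only to keep the sketch self-contained
conf.set("spark.shuffle.consolidateFiles", "true") // shuffle consolidation on
// speculation (spark.speculation) is left at its default, i.e. off
val sc = new SparkContext(conf)

// intermediate data is cached straight to disk
val cached = sc.textFile("hdfs:///placeholder/input").persist(StorageLevel.DISK_ONLY)
cached.count()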

Here are some config changes that are in our work-in-progress.

We've been trying for 2 weeks to get our production flow (maybe around
50-70 stages, a few forks and joins with up to 20 branches in the forks) to
run end to end without any success, running into other problems besides
this one as well. For example, we have run into situations where saving to
HDFS just hangs on a couple of tasks, which are printing out nothing in
their logs and not taking any CPU. For testing, our input data is 10 GB
across 320 input splits and generates maybe around 200-300 GB of
intermediate and final data.


conf.set("spark.executor.memory", "14g") // TODO make this configurable

// shuffle configs
conf.set("spark.default.parallelism", "320") // TODO make this configurable
conf.set("spark.shuffle.consolidateFiles", "true")

conf.set("spark.shuffle.file.buffer.kb", "200")
conf.set("spark.reducer.maxMbInFlight", "96")

conf.set("spark.rdd.compress", "true")

// we ran into a problem with the default timeout of 60 seconds
// this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
conf.set("spark.worker.timeout", "180")

// akka settings
conf.set("spark.akka.threads", "300")
conf.set("spark.akka.timeout", "180")
conf.set("spark.akka.frameSize", "100")
conf.set("spark.akka.batchSize", "30")
conf.set("spark.akka.askTimeout", "30")

// block manager
conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
conf.set("spark.blockManagerHeartBeatMs", "8")

-Suren



On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
 
  

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Mridul Muralidharan
On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
 Patrick,

 My team is using shuffle consolidation but not speculation. We are also
 using persist(DISK_ONLY) for caching.


Use of shuffle consolidation is probably what is causing the issue.
It would be a good idea to try again with that turned off (which is the default).

It should most likely get fixed in the 1.1 timeframe.
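For reference, turning it off is a one-line change to your conf (a minimal
sketch; simply leaving the property unset has the same effect, since false is
the default):

conf.set("spark.shuffle.consolidateFiles", "false")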


Regards,
Mridul



 Here are some config changes that are in our work-in-progress.

 We've been trying for 2 weeks to get our production flow (maybe around
 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
 run end to end without any success, running into other problems besides
 this one as well. For example, we have run into situations where saving to
 HDFS just hangs on a couple of tasks, which are printing out nothing in
 their logs and not taking any CPU. For testing, our input data is 10 GB
 across 320 input splits and generates maybe around 200-300 GB of
 intermediate and final data.


 conf.set("spark.executor.memory", "14g") // TODO make this configurable

 // shuffle configs
 conf.set("spark.default.parallelism", "320") // TODO make this configurable
 conf.set("spark.shuffle.consolidateFiles", "true")

 conf.set("spark.shuffle.file.buffer.kb", "200")
 conf.set("spark.reducer.maxMbInFlight", "96")

 conf.set("spark.rdd.compress", "true")

 // we ran into a problem with the default timeout of 60 seconds
 // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
 conf.set("spark.worker.timeout", "180")

 // akka settings
 conf.set("spark.akka.threads", "300")
 conf.set("spark.akka.timeout", "180")
 conf.set("spark.akka.frameSize", "100")
 conf.set("spark.akka.batchSize", "30")
 conf.set("spark.akka.askTimeout", "30")

 // block manager
 conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
 conf.set("spark.blockManagerHeartBeatMs", "8")

 -Suren



 On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  

question about Hive compatiblilty tests

2014-06-18 Thread Will Benton
Hi all,

Does a "Failed to generate golden answer for query" message from
HiveComparisonTests indicate that it isn't possible to run the query in
question under Hive from Spark's test suite, rather than anything about Spark's
implementation of HiveQL? The stack trace I'm getting implicates Hive code and
not Spark code, but I wanted to make sure I wasn't missing something.


thanks,
wb


Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Patrick Wendell
Just wondering, do you get this particular exception if you are not
consolidating shuffle data?

On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com wrote:
 On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
 Patrick,

 My team is using shuffle consolidation but not speculation. We are also
 using persist(DISK_ONLY) for caching.


 Use of shuffle consolidation is probably what is causing the issue.
 It would be a good idea to try again with that turned off (which is the default).

 It should most likely get fixed in the 1.1 timeframe.


 Regards,
 Mridul



 Here are some config changes that are in our work-in-progress.

 We've been trying for 2 weeks to get our production flow (maybe around
 50-70 stages, a few forks and joins with up to 20 branches in the forks) to
 run end to end without any success, running into other problems besides
 this one as well. For example, we have run into situations where saving to
 HDFS just hangs on a couple of tasks, which are printing out nothing in
 their logs and not taking any CPU. For testing, our input data is 10 GB
 across 320 input splits and generates maybe around 200-300 GB of
 intermediate and final data.


 conf.set("spark.executor.memory", "14g") // TODO make this configurable

 // shuffle configs
 conf.set("spark.default.parallelism", "320") // TODO make this configurable
 conf.set("spark.shuffle.consolidateFiles", "true")

 conf.set("spark.shuffle.file.buffer.kb", "200")
 conf.set("spark.reducer.maxMbInFlight", "96")

 conf.set("spark.rdd.compress", "true")

 // we ran into a problem with the default timeout of 60 seconds
 // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
 conf.set("spark.worker.timeout", "180")

 // akka settings
 conf.set("spark.akka.threads", "300")
 conf.set("spark.akka.timeout", "180")
 conf.set("spark.akka.frameSize", "100")
 conf.set("spark.akka.batchSize", "30")
 conf.set("spark.akka.askTimeout", "30")

 // block manager
 conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
 conf.set("spark.blockManagerHeartBeatMs", "8")

 -Suren



 On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com wrote:

 Out of curiosity - are you guys using speculation, shuffle
 consolidation, or any other non-default option? If so that would help
 narrow down what's causing this corruption.

 On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
 suren.hira...@velos.io wrote:
  Matt/Ryan,
 
  Did you make any headway on this? My team is running into this also.
  Doesn't happen on smaller datasets. Our input set is about 10 GB but we
  generate 100s of GBs in the flow itself.
 
  -Suren
 
 
 
 
  On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 wrote:
 
  Just ran into this today myself. I'm on branch-1.0 using a CDH3
  cluster (no modifications to Spark or its dependencies). The error
  appeared trying to run GraphX's .connectedComponents() on a ~200GB
  edge list (GraphX worked beautifully on smaller data).
 
  Here's the stacktrace (it's quite similar to yours
  https://imgur.com/7iBA4nJ ).
 
  14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39 failed
  4 times; aborting job
  14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce at
  VertexRDD.scala:100
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 5.599:39 failed 4 times, most
  recent failure: Exception failure in TID 29735 on host node18:
  java.io.StreamCorruptedException: invalid type code: AC
 
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
 
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 
 org.apache.spark.graphx.impl.VertexPartitionBaseOps.innerJoinKeepLeft(VertexPartitionBaseOps.scala:192)
 
 
 org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:78)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
 
 
 org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
  

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Good question. At this point, I'd have to re-run it to know for sure. We've
been trying various things, so I'd have to reset the flow config back to
that state.

I can say that by removing persist(DISK_ONLY), the flows are running more
stably, probably because it removes disk contention. We won't be able to run
our full production flows without some type of disk persistence, but for
testing, this is how we are proceeding for now.
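For testing, that change amounts to either dropping the persist() call or
falling back to a less disk-heavy storage level. A rough, self-contained sketch
of the two options (the tiny local context and dummy RDD are only there to make
it runnable):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-test").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 1000)

// what we had in production: cached partitions go straight to disk
// rdd.persist(StorageLevel.DISK_ONLY)

// what we are trying for now: keep partitions in memory and spill to disk
// only if they don't fit (or drop persist entirely)
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()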

I can try tomorrow if you'd like.

-Suren



On Wed, Jun 18, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote:

 Just wondering, do you get this particular exception if you are not
 consolidating shuffle data?

 On Wed, Jun 18, 2014 at 12:15 PM, Mridul Muralidharan mri...@gmail.com
 wrote:
  On Wed, Jun 18, 2014 at 6:19 PM, Surendranauth Hiraman
  suren.hira...@velos.io wrote:
  Patrick,
 
  My team is using shuffle consolidation but not speculation. We are also
  using persist(DISK_ONLY) for caching.
 
 
  Use of shuffle consolidation is probably what is causing the issue.
  It would be a good idea to try again with that turned off (which is the default).

  It should most likely get fixed in the 1.1 timeframe.
 
 
  Regards,
  Mridul
 
 
 
  Here are some config changes that are in our work-in-progress.
 
  We've been trying for 2 weeks to get our production flow (maybe around
  50-70 stages, a few forks and joins with up to 20 branches in the
 forks) to
  run end to end without any success, running into other problems besides
  this one as well. For example, we have run into situations where saving
 to
  HDFS just hangs on a couple of tasks, which are printing out nothing in
  their logs and not taking any CPU. For testing, our input data is 10 GB
  across 320 input splits and generates maybe around 200-300 GB of
  intermediate and final data.
 
 
  conf.set("spark.executor.memory", "14g") // TODO make this configurable

  // shuffle configs
  conf.set("spark.default.parallelism", "320") // TODO make this configurable
  conf.set("spark.shuffle.consolidateFiles", "true")

  conf.set("spark.shuffle.file.buffer.kb", "200")
  conf.set("spark.reducer.maxMbInFlight", "96")

  conf.set("spark.rdd.compress", "true")

  // we ran into a problem with the default timeout of 60 seconds
  // this is also being set in the master's spark-env.sh. Not sure if it needs to be in both places
  conf.set("spark.worker.timeout", "180")

  // akka settings
  conf.set("spark.akka.threads", "300")
  conf.set("spark.akka.timeout", "180")
  conf.set("spark.akka.frameSize", "100")
  conf.set("spark.akka.batchSize", "30")
  conf.set("spark.akka.askTimeout", "30")

  // block manager
  conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
  conf.set("spark.blockManagerHeartBeatMs", "8")
 
  -Suren
 
 
 
  On Wed, Jun 18, 2014 at 1:42 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Out of curiosity - are you guys using speculation, shuffle
  consolidation, or any other non-default option? If so that would help
  narrow down what's causing this corruption.
 
  On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
  suren.hira...@velos.io wrote:
   Matt/Ryan,
  
   Did you make any headway on this? My team is running into this also.
   Doesn't happen on smaller datasets. Our input set is about 10 GB but
 we
   generate 100s of GBs in the flow itself.
  
   -Suren
  
  
  
  
   On Fri, Jun 6, 2014 at 5:19 PM, Ryan Compton compton.r...@gmail.com
 
  wrote:
  
   Just ran into this today myself. I'm on branch-1.0 using a CDH3
   cluster (no modifications to Spark or its dependencies). The error
   appeared trying to run GraphX's .connectedComponents() on a ~200GB
   edge list (GraphX worked beautifully on smaller data).
  
   Here's the stacktrace (it's quite similar to yours
   https://imgur.com/7iBA4nJ ).
  
   14/06/05 20:02:28 ERROR scheduler.TaskSetManager: Task 5.599:39
 failed
   4 times; aborting job
   14/06/05 20:02:28 INFO scheduler.DAGScheduler: Failed to run reduce
 at
   VertexRDD.scala:100
   Exception in thread main org.apache.spark.SparkException: Job
   aborted due to stage failure: Task 5.599:39 failed 4 times, most
   recent failure: Exception failure in TID 29735 on host node18:
   java.io.StreamCorruptedException: invalid type code: AC
  
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1355)
  
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
  
  
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  
  
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  
  org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  
  
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  
  
 
 

Re: Run ScalaTest inside Intellij IDEA

2014-06-18 Thread Doris Xin
Here's the JIRA on this known issue:
https://issues.apache.org/jira/browse/SPARK-1835

tl;dr: manually delete mesos-0.18.1.jar from lib_managed/jars after
running sbt/sbt gen-idea. You should be able to run unit tests inside
Intellij after doing so.

Doris


On Tue, Jun 17, 2014 at 6:10 PM, Henry Saputra henry.sapu...@gmail.com
wrote:

 I got stuck on this one too after doing a git pull from master.

 Have not been able to resolve it yet =(


 - Henry

 On Wed, Jun 11, 2014 at 6:51 AM, Yijie Shen henry.yijies...@gmail.com
 wrote:
  Thx Qiuzhuang, the problems disappeared after I added the assembly jar at the
 head of the dependency list in *.iml, but while running tests in Spark
 SQL (SQLQuerySuite in sql-core), another two errors occur:
 
  Error 1:
  Error:scalac:
   while compiling:
 /Users/yijie/code/apache.spark.master/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
  during phase: jvm
   library version: version 2.10.4
  compiler version: version 2.10.4
reconstructed args: -Xmax-classfile-name 120 -deprecation
 -P:genjavadoc:out=/Users/yijie/code/apache.spark.master/sql/core/target/java
 -feature -classpath
 /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-doclet.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Conte…
  …
  ...
 
 /Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre/classes:/Users/yijie/code/apache.spark.master/lib_managed/jars/scala-library-2.10.4.jar
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
 -Xplugin:/Users/yijie/code/apache.spark.master/lib_managed/jars/genjavadoc-plugin_2.10.4-0.5.jar
last tree to typer: Literal(Constant(parquet.io.api.Converter))
symbol: null
 symbol definition: null
   tpe: Class(classOf[parquet.io.api.Converter])
 symbol owners:
context owners: object TestSQLContext - package test
  == Enclosing template or block ==
  Template( // val local TestSQLContext: notype in object
 TestSQLContext, tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
org.apache.spark.sql.SQLContext // parents
ValDef(
  private
  _
  tpt
  empty
)
// 2 statements
DefDef( // private def readResolve(): Object in object TestSQLContext
  method private synthetic
  readResolve
  []
  List(Nil)
  tpt // tree.tpe=Object
  test.this.TestSQLContext // object TestSQLContext in package test,
 tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
)
DefDef( // def init(): org.apache.spark.sql.test.TestSQLContext.type
 in object TestSQLContext
  method
  init
  []
  List(Nil)
  tpt // tree.tpe=org.apache.spark.sql.test.TestSQLContext.type
  Block( // tree.tpe=Unit
Apply( // def init(sparkContext: org.apache.spark.SparkContext):
 org.apache.spark.sql.SQLContext in class SQLContext,
 tree.tpe=org.apache.spark.sql.SQLContext
  TestSQLContext.super.init // def init(sparkContext:
 org.apache.spark.SparkContext): org.apache.spark.sql.SQLContext in class
 SQLContext, tree.tpe=(sparkContext:
 org.apache.spark.SparkContext)org.apache.spark.sql.SQLContext
  Apply( // def init(master: String,appName: String,conf:
 org.apache.spark.SparkConf): org.apache.spark.SparkContext in class
 SparkContext, tree.tpe=org.apache.spark.SparkContext
new org.apache.spark.SparkContext.init // def
 init(master: String,appName: String,conf: org.apache.spark.SparkConf):
 org.apache.spark.SparkContext in class SparkContext, tree.tpe=(master:
 String, appName: String, conf:
 org.apache.spark.SparkConf)org.apache.spark.SparkContext
// 3 arguments
local
TestSQLContext
Apply( // def init(): org.apache.spark.SparkConf in class
 SparkConf, tree.tpe=org.apache.spark.SparkConf
  new org.apache.spark.SparkConf.init // def init():
 org.apache.spark.SparkConf in class SparkConf,
 tree.tpe=()org.apache.spark.SparkConf
  Nil
)
  )
)
()
  )
)
  )
  == Expanded type of tree ==
  ConstantType(value = Constant(parquet.io.api.Converter))
  uncaught exception during compilation: java.lang.AssertionError
 
  Error 2:
 
 

Re: question about Hive compatiblilty tests

2014-06-18 Thread Michael Armbrust
I assume you are adding tests, because that is the only time you should
see that message.

That error could mean a couple of things:
 1) The query is invalid and Hive threw an exception.
 2) Your Hive setup is bad.

Regarding #2, you need to have the source for Hive 0.12.0 available and
built, as well as a Hadoop installation. You also need the environment
variables set as specified here:
https://github.com/apache/spark/tree/master/sql
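For example, a quick sanity check along these lines can catch a missing
variable before running the comparison tests (a rough sketch; the variable
names used here, HIVE_HOME, HIVE_DEV_HOME and HADOOP_HOME, are illustrative, so
check the README above for the authoritative list):

object HiveTestEnvCheck {
  def main(args: Array[String]): Unit = {
    // Assumed variable names; see the sql/ README for the authoritative set.
    val required = Seq("HIVE_HOME", "HIVE_DEV_HOME", "HADOOP_HOME")
    val missing = required.filterNot(sys.env.contains)
    if (missing.isEmpty) println("Hive comparison test environment looks set up.")
    else sys.error("Missing environment variables: " + missing.mkString(", "))
  }
}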

Michael


On Thu, Jun 19, 2014 at 12:22 AM, Will Benton wi...@redhat.com wrote:

 Hi all,

 Does a "Failed to generate golden answer for query" message from
 HiveComparisonTests indicate that it isn't possible to run the query in
 question under Hive from Spark's test suite, rather than anything about
 Spark's implementation of HiveQL? The stack trace I'm getting implicates
 Hive code and not Spark code, but I wanted to make sure I wasn't missing
 something.


 thanks,
 wb



Re: question about Hive compatiblilty tests

2014-06-18 Thread Will Benton
 I assume you are adding tests, because that is the only time you should
 see that message.

Yes, I had added the HAVING test to the whitelist.

 That error could mean a couple of things:
  1) The query is invalid and Hive threw an exception.
  2) Your Hive setup is bad.

 Regarding #2, you need to have the source for Hive 0.12.0 available and
 built, as well as a Hadoop installation. You also need the environment
 variables set as specified here:
 https://github.com/apache/spark/tree/master/sql

Thanks!  The other Hive compatibility tests seem to work, so I'll dig in a bit 
more to see if I can figure out what's happening here.


best,
wb