java.lang.NoClassDefFoundError: org/apache/spark/util/Vector

2014-03-27 Thread Kal El
I am getting this error when I try to run K-Means in spark-0.9.0:
java.lang.NoClassDefFoundError: org/apache/spark/util/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at 
scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:67)
        at 
scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
        at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
        at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
        at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
        at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
        at 
scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:74)
        at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
        at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
        at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.util.Vector
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 14 more


I have run the same code on different machines, but I haven't seen this error 
before and I haven't found a solution yet.

Thanks

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On 27 March 2014 09:47, andy petrella andy.petre...@gmail.com wrote:

 I hijack the thread, but my2c is that this feature is also important to
enable ad-hoc queries which is done at runtime. It doesn't remove interest
in such a macro for precompiled jobs of course, but it may not be the first
use case envisioned with this Spark SQL.


I'm not sure I see what you call ad-hoc queries... Any sample?

 Again, only my0.2c (ok I divided by 10 after writing my thoughts ^^)

 Andy

 On Thu, Mar 27, 2014 at 9:16 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 Hi,
 Quite interesting!

 Suggestion: why not go even fancier and parse SQL queries at compile-time
with a macro? ;)

 Pascal



 On Wed, Mar 26, 2014 at 10:58 PM, Michael Armbrust 
mich...@databricks.com wrote:

 Hey Everyone,

 This already went out to the dev list, but I wanted to put a pointer
here as well to a new feature we are pretty excited about for Spark 1.0.


http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

 Michael





Re: Change print() in JavaNetworkWordCount

2014-03-27 Thread Eduardo Costa Alfaia

Thank you very much  Sourav

BR

On 3/26/14 17:29, Sourav Chandra wrote:

def print() {
    def foreachFunc = (rdd: RDD[T], time: Time) => {
      val total = rdd.collect().toList
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      total.foreach(println)
      // val first11 = rdd.take(11)
      // println("-------------------------------------------")
      // println("Time: " + time)
      // println("-------------------------------------------")
      // first11.take(10).foreach(println)
      // if (first11.size > 10) println("...")
      println()
    }
    new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
  }



--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella andy.petre...@gmail.com wrote:

 I just mean queries sent at runtime ^^, like for any RDBMS.
 In our project we have such requirement to have a layer to play with the
 data (custom and low level service layer of a lambda arch), and something
 like this is interesting.


Ok that's what I thought! But for these runtime queries, is a macro useful
for you?









Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P)


On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:




 On Thu, Mar 27, 2014 at 10:22 AM, andy petrella 
 andy.petre...@gmail.comwrote:

 I just mean queries sent at runtime ^^, like for any RDBMS.
 In our project we have such requirement to have a layer to play with the
 data (custom and low level service layer of a lambda arch), and something
 like this is interesting.


 Ok that's what I thought! But for these runtime queries, is a macro useful
 for you?










Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 11:08 AM, andy petrella andy.petre...@gmail.com wrote:

 nope (what I said :-P)


That's also my answer to my own question :D

but I didn't understand that in your sentence: "my2c is that this feature
is also important to enable ad-hoc queries which is done at runtime."











Re: Announcing Spark SQL

2014-03-27 Thread yana
Does Shark not suit your needs? That's what we use at the moment and it's been 
good


Sent from my Samsung Galaxy S®4

 Original message 
From: andy petrella andy.petre...@gmail.com 
Date:03/27/2014  6:08 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Announcing Spark SQL 

nope (what I said :-P)






Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
Yes it could, of course. I didn't say that there is no tool to do it,
though ;-).

Andy


On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote:

 Does Shark not suit your needs? That's what we use at the moment and it's
 been good









Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
when there is something new, it's also cool to let imagination fly far away
;)


On Thu, Mar 27, 2014 at 2:20 PM, andy petrella andy.petre...@gmail.com wrote:

 Yes it could, of course. I didn't say that there is no tool to do it,
 though ;-).

 Andy


 On Thu, Mar 27, 2014 at 12:49 PM, yana yana.kadiy...@gmail.com wrote:

 Does Shark not suit your needs? That's what we use at the moment and it's
 been good










WikipediaPageRank Data Set

2014-03-27 Thread Niko Stahl
Hello,

I would like to run the WikipediaPageRank example
(https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala),
but the Wikipedia dump XML files are no longer available on
Freebase. Does anyone know an alternative source for the data?

Thanks,
Niko


spark streaming: what is awaitTermination()?

2014-03-27 Thread Diana Carroll
The API docs for ssc.awaitTermination say simply "Wait for the execution to
stop. Any exceptions that occurs during the execution will be thrown in
this thread."

Can someone help me understand what this means?  What causes execution to
stop?  Why do we need to wait for that to happen?

I tried removing it from my simple NetworkWordCount example (running
locally, not on a cluster) and nothing changed.  In both cases, I end my
program by hitting Ctrl-C.
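
For what it's worth, a minimal driver skeleton may make the call's purpose
clearer; this is my reading rather than something stated in the docs, and the
port is made up. In spark-shell the REPL keeps the JVM alive, which is
presumably why removing the call changed nothing, but a compiled driver would
otherwise reach the end of main while the streaming job is still running on
background threads:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AwaitTerminationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AwaitTerminationSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Any simple stream will do for illustration.
    ssc.socketTextStream("localhost", 9999).print()

    ssc.start()             // start the receivers and the job scheduler
    ssc.awaitTermination()  // block this thread until ssc.stop() or an exception
  }
}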

Thanks for any insight you can give me.

Diana


spark streaming and the spark shell

2014-03-27 Thread Diana Carroll
I'm working with spark streaming using spark-shell, and hoping folks could
answer a few questions I have.

I'm doing WordCount on a socket stream:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
var ssc = new StreamingContext(sc, Seconds(5))
var mystream = ssc.socketTextStream("localhost", )
var words = mystream.flatMap(line => line.split(" "))
var wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
wordCounts.print()
ssc.start()



1.  I'm assuming that using spark shell is an edge case, and that spark
streaming is really intended mostly for batch use.  True?

2.   I notice that once I call ssc.start(), my stream starts processing
and continues indefinitely...even if I close the socket on the server end
(I'm using the unix command nc to mimic a server as explained in the
streaming programming guide.)  Can I tell my stream to detect if it's lost
a connection and therefore stop executing?  (Or even better, to attempt to
re-establish the connection?)

3.  I tried entering ssc.stop which resulted in an error:

Exception in thread Thread-43 org.apache.spark.SparkException: Job
cancelled because SparkContext was shut down
14/03/27 07:36:13 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found

But it did stop the DStream execution.

4.  Then I tried restarting the ssc again (ssc.start) and got another error:
org.apache.spark.SparkException: JobScheduler already started
Is restarting an ssc supported?

5.  When I perform an operation like wordCounts.print(), that operation
will execute on each batch, every n seconds.  Is there a way I can undo
that operation?  That is, I want it to *stop* executing that print every n
seconds...without having to stop the stream.

What I'm really asking is...can I explore DStreams interactively the way I
can explore my data in regular Spark.  In regular Spark, I might perform
various operations on an RDD to see what happens.  So at first, I might
have used split(" ") to tokenize my input text, but now I want to try
using split(",") instead, after the stream has already started running.
 Can I do that?

I did find out that if I add a new operation to an existing dstream (say,
words.print()) *after* the ssc.start() it works. It *will* add the second
print() call to the execution list every n seconds.

but if I try to add new dstreams, e.g.
...

ssc.start()

var testpairs = words.map(x = (x, TEST))
testpairs.print()


I get an error:

14/03/27 07:57:50 ERROR JobScheduler: Error generating jobs for time
139593227 ms
java.lang.Exception:
org.apache.spark.streaming.dstream.MappedDStream@84f0f92 has not been
initialized


Is this sort of interactive use just not supported?

Thanks!

Diana


StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Found this transform fn in StreamingContext which takes in a DStream[_] and a 
function which acts on each of its RDDs
Unfortunately I can't figure out how to transform my DStream[(String,Int)] into 
DStream[_]


/*** Create a new DStream in which each RDD is generated by applying a function 
on RDDs of the DStreams. */
  def transform[T: ClassTag](
      dstreams: Seq[DStream[_]],
      transformFunc: (Seq[RDD[_]], Time) => RDD[T]
    ): DStream[T] = {
    new TransformedDStream[T](dstreams, sparkContext.clean(transformFunc))
  }

-Adrian



RE: StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Please disregard, I didn't see the Seq wrapper.
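
For anyone else landing here, a minimal sketch of wrapping a single DStream in a
Seq for StreamingContext.transform; the host, port and filter below are made up:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.StreamingContext._

object TransformSeqSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TransformSeqSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream[(String, Int)], e.g. word counts from a socket stream.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // Wrap the single DStream in a Seq; the function receives a Seq[RDD[_]]
    // and casts each element back to its concrete element type.
    val filtered = ssc.transform(Seq(counts), (rdds: Seq[RDD[_]], time: Time) => {
      val rdd = rdds.head.asInstanceOf[RDD[(String, Int)]]
      rdd.filter { case (_, c) => c > 1 }
    })

    filtered.print()
    ssc.start()
    ssc.awaitTermination()
  }
}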

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 11:57 AM
To: u...@spark.incubator.apache.org
Subject: StreamingContext.transform on a DStream




Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly does rdd.mapPartitions get executed once in each VM?

I am running mapPartitions and the call function does not seem to execute the
code:

JavaPairRDD<String, String> twos = input.map(new
Split()).sortByKey().partitionBy(new HashPartitioner(k));
twos.values().saveAsTextFile(args[2]);

JavaRDD<String> ls = twos.values().mapPartitions(new
FlatMapFunction<Iterator<String>, String>() {
@Override
public Iterable<String> call(Iterator<String> arg0) throws Exception {
   System.out.println("Usage should call my jar once: " + arg0);
   return Lists.newArrayList();}
});



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: GC overhead limit exceeded

2014-03-27 Thread Ognen Duzlevski
Look at the tuning guide on Spark's webpage for strategies to cope with 
this.
I have run into quite a few memory issues like these, some are resolved 
by changing the StorageLevel strategy and employing things like Kryo, 
some are solved by specifying the number of tasks to break down a given 
operation into etc.
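
A sketch of the kinds of changes mentioned above (0.9-style configuration; the
input path, key extraction and task count are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

object StorageAndTasksSketch {
  def main(args: Array[String]): Unit = {
    // Switch the serializer to Kryo and use a serialized, disk-backed storage level.
    val conf = new SparkConf()
      .setAppName("StorageAndTasksSketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val events = sc.textFile("hdfs:///data/events")
      .map(line => (line.split("\t")(0), 1))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Pass an explicit number of reduce tasks so each task holds less in memory.
    val counts = events.reduceByKey(_ + _, 200)
    counts.saveAsTextFile("hdfs:///data/event-counts")

    sc.stop()
  }
}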


Ognen

On 3/27/14, 10:21 AM, Sai Prasanna wrote:

java.lang.OutOfMemoryError: GC overhead limit exceeded

What is the problem. The same code, i run, one instance it runs in 8 
second, next time it takes really long time, say 300-500 seconds...
I see the logs a lot of GC overhead limit exceeded is seen. What 
should be done ??


Please can someone throw some light on it ??



--
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*
*
Entire water in the ocean can never sink a ship, Unless it gets inside.
All the pressures of life can never hurt you, Unless you let them in.*




Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
I think now that this is because spark.local.dir is defaulting to /tmp, and
since the tasks are not running on the same machine, the file is not found
when the second task takes over.

How do you set spark.local.dir appropriately when running on mesos?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Kafka-Mesos-Marathon-strangeness-tp3285p3356.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: GC overhead limit exceeded

2014-03-27 Thread Andrew Or
Are you caching a lot of RDD's? If so, maybe you should unpersist() the
ones that you're not using. Also, if you're on 0.9, make sure
spark.shuffle.spill is enabled (which it is by default). This allows your
application to spill in-memory content to disk if necessary.

How much memory are you giving to your executors? The default,
spark.executor.memory is 512m, which is quite low. Consider raising this.
Checking the web UI is a good way to figure out your runtime memory usage.
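
A rough illustration of those knobs (not from the thread itself; the 0.9-style
SparkConf usage, sizes and path are assumptions on my part):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CacheTuningSketch")
      .set("spark.executor.memory", "4g")   // raise from the 512m default
      .set("spark.shuffle.spill", "true")   // already the default in 0.9, shown for clarity
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/input-1g.txt")
    val cleaned = data.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)

    // ... jobs that reuse `cleaned` go here ...

    cleaned.unpersist()   // drop it from the cache once it is no longer needed
    sc.stop()
  }
}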


On Thu, Mar 27, 2014 at 9:22 AM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:

  Look at the tuning guide on Spark's webpage for strategies to cope with
 this.
 I have run into quite a few memory issues like these, some are resolved by
 changing the StorageLevel strategy and employing things like Kryo, some are
 solved by specifying the number of tasks to break down a given operation
 into etc.

 Ognen


 On 3/27/14, 10:21 AM, Sai Prasanna wrote:

 java.lang.OutOfMemoryError: GC overhead limit exceeded

  What is the problem. The same code, i run, one instance it runs in 8
 second, next time it takes really long time, say 300-500 seconds...
 I see the logs a lot of GC overhead limit exceeded is seen. What should be
 done ??

  Please can someone throw some light on it ??



  --
  *Sai Prasanna. AN*
 *II M.Tech (CS), SSSIHL*


 * Entire water in the ocean can never sink a ship, Unless it gets inside.
 All the pressures of life can never hurt you, Unless you let them in.*





KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
I have a simple streaming job that creates a kafka input stream on a topic
with 8 partitions, and does a forEachRDD

The job and tasks are running on mesos, and there are two tasks running, but
only 1 task doing anything.

I also set spark.streaming.concurrentJobs=8  but still there is only 1 task
doing work. I would have expected that each task took a subset of the
partitions.

Is there a way to make more than one task share the work here?  Are my
expectations off here?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
Yea it's in a standalone mode and I did use SparkContext.addJar method and
tried setting setExecutorEnv SPARK_CLASSPATH, etc. but none of it worked.

I finally made it work by modifying the ClientBase.scala code where I set
'appMasterOnly' to false before the addJars contents were added to
distCacheMgr. But this is not what I should be doing, right?

Is there a problem with addJar method in 0.9.0?


On Wed, Mar 26, 2014 at 1:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Sung,

 Are you using yarn-standalone mode?  Have you specified the --addJars
 option with your external jars?

 -Sandy


 On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung coded...@cs.stanford.edu
  wrote:

 Hello, (this is Yarn related)

 I'm able to load an external jar and use its classes within
 ApplicationMaster. I wish to use this jar within worker nodes, so I added
 sc.addJar(pathToJar) and ran.

 I get the following exception:

 org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4 times 
 (most recent failure: Exception failure: java.lang.NoClassDefFoundError: 
 org/opencv/objdetect/HOGDescriptor)
 Job aborted: Task 0.0:1 failed 4 times (most recent failure: Exception 
 failure: java.lang.NoClassDefFoundError: org/opencv/objdetect/HOGDescriptor)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 scala.Option.foreach(Option.scala:236)
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 akka.actor.ActorCell.invoke(ActorCell.scala:456)
 akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 akka.dispatch.Mailbox.run(Mailbox.scala:219)
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 And in worker node containers' stderr log (nothing in stdout log), I
 don't see any reference to loading jars:

 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/home/gpphddata/1/yarn/nm-local-dir/usercache/yarn/filecache/7394400996676014282/spark-assembly-0.9.0-incubating-hadoop2.0.2-alpha-gphd-2.0.1.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/usr/lib/gphd/hadoop-2.0.2_alpha_gphd_2_0_1_0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 14/03/26 13:12:18 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/03/26 13:12:18 INFO Remoting: Starting remoting
 14/03/26 13:12:18 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@alpinenode6.alpinenow.local:44006]
 14/03/26 13:12:18 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@alpinenode6.alpinenow.local:44006]
 14/03/26 13:12:18 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: 
 akka.tcp://spark@alpinenode5.alpinenow.local:10314/user/CoarseGrainedScheduler
 14/03/26 13:12:18 ERROR executor.CoarseGrainedExecutorBackend: Driver 
 Disassociated [akka.tcp://sparkExecutor@alpinenode6.alpinenow.local:44006] 
 - [akka.tcp://spark@alpinenode5.alpinenow.local:10314] disassociated! 
 Shutting down.






 Any idea what's going on?





Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
No, I am running on 0.8.1.
Yes, I am caching a lot. I am benchmarking a simple piece of code in Spark where
512mb, 1g and 2g text files are taken, some basic intermediate operations
are done, and the intermediate results that will be used in subsequent
operations are cached.

I thought that we need not manually unpersist: if I need to cache
something and the cache is found full, space will automatically be created
by evicting earlier entries. Do I need to unpersist?

Moreover, if I run several times, will the previously cached RDDs still
remain in the cache? If so, can I flush them out manually before the next
run? [something like a complete cache flush]


On Thu, Mar 27, 2014 at 11:16 PM, Andrew Or and...@databricks.com wrote:

 Are you caching a lot of RDD's? If so, maybe you should unpersist() the
 ones that you're not using. Also, if you're on 0.9, make sure
 spark.shuffle.spill is enabled (which it is by default). This allows your
 application to spill in-memory content to disk if necessary.

 How much memory are you giving to your executors? The default,
 spark.executor.memory is 512m, which is quite low. Consider raising this.
 Checking the web UI is a good way to figure out your runtime memory usage.






-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside.All
the pressures of life can never hurt you, Unless you let them in.*


Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
Hey Rohit,

I think external tables based on Cassandra or other datastores will work
out-of-the-box if you build Catalyst with Hive support.

Michael may have feelings about this, but I'd guess the longer-term design
for having schema support for Cassandra/HBase etc. likely wouldn't rely on
Hive external tables, because it's an unnecessary layer of indirection.

Spark should be able to directly load a SchemaRDD from Cassandra by just
letting the user give relevant information about the Cassandra schema. And
it should let you write back to Cassandra by giving a mapping of fields to
the respective Cassandra columns. I think all of this would be fairly easy
to implement on SchemaRDD and likely will make it into Spark 1.1.
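
For readers of the archive, a minimal sketch of what SchemaRDD usage looks like
in the announced API, based on the Spark 1.0-era examples; it assumes a
spark-shell session where sc is in scope, and the case class, file path and
query are made up:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion RDD[Person] -> SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")    // Spark 1.0 naming; later versions renamed this

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)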

- Patrick


On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote:

 Great work guys! Have been looking forward to this . . .

 In the blog it mentions support for reading from Hbase/Avro... What will
 be the recommended approach for this? Will it be writing custom wrappers
 for SQLContext like in HiveContext or using Hive's EXTERNAL TABLE support?

 I ask this because a few days back (based on your pull request in github)
 I started analyzing what it would take to support Spark SQL on Cassandra.
 One obvious approach will be to use Hive External Table support with our
 cassandra-hive handler. But second approach sounds tempting as it will give
 more fidelity.

 Regards,
 Rohit

 *Founder & CEO, Tuplejump, Inc.*
 
 www.tuplejump.com
 *The Data Engineering Platform*


 On Thu, Mar 27, 2014 at 9:12 AM, Michael Armbrust 
 mich...@databricks.com wrote:

 Any plans to make the SQL typesafe using something like Slick (
 http://slick.typesafe.com/)


 I would really like to do something like that, and maybe we will in a
 couple of months. However, in the near term, I think the top priorities are
 going to be performance and stability.

 Michael





Re:

2014-03-27 Thread Mayur Rustagi
You have to raise the global limit as root. Also you have to do that on the
whole cluster.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Mar 27, 2014 at 4:07 AM, Hahn Jiang hahn.jiang@gmail.com wrote:

 I set ulimit -n 10 in conf/spark-env.sh, is it too small?


 On Thu, Mar 27, 2014 at 3:36 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Hahn,

 What's the ulimit on your systems? Please check the following link for a
 discussion on the too many files open.


 http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccangvg8qpn_wllsrcjegdb7hmza2ux7myxzhfvtz+b-sdxdk...@mail.gmail.com%3E


 Sent from my iPad

  On Mar 27, 2014, at 12:15 PM, Hahn Jiang hahn.jiang@gmail.com
 wrote:
 
  Hi, all
 
  I write a spark program on yarn. When I use a small input file, my
 program runs well. But my job fails if the input size is more than 40G.
 
  the error log:
  java.io.FileNotFoundException (java.io.FileNotFoundException:
 /home/work/data12/yarn/nodemanager/usercache/appcache/application_1392894597330_86813/spark-local-20140327144433-716b/24/shuffle_0_22_890
 (Too many open files))
  java.io.FileOutputStream.openAppend(Native Method)
  java.io.FileOutputStream.init(FileOutputStream.java:192)
 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:113)
 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.foreach(ExternalAppendOnlyMap.scala:239)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
  org.apache.spark.scheduler.Task.run(Task.scala:53)
 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  java.lang.Thread.run(Thread.java:662)
 
 
  my object:
  object Test {

    def main(args: Array[String]) {
      val sc = new SparkContext(args(0), "Test",
        System.getenv("SPARK_HOME"),
        SparkContext.jarOfClass(this.getClass))

      val mg = sc.textFile("/user/.../part-*")
      val mct = sc.textFile("/user/.../part-*")

      val pair1 = mg.map { s =>
        val cols = s.split("\t")
        (cols(0), cols(1))
      }
      val pair2 = mct.map { s =>
        val cols = s.split("\t")
        (cols(0), cols(1))
      }
      val merge = pair1.union(pair2)
      val result = merge.reduceByKey(_ + _)
      val outputPath = new Path("/user/xxx/temp/spark-output")
      outputPath.getFileSystem(new Configuration()).delete(outputPath, true)
      result.saveAsTextFile(outputPath.toString)

      System.exit(0)
    }

  }
 
  My spark version is 0.9 and I run my job use this command
 /opt/soft/spark/bin/spark-class org.apache.spark.deploy.yarn.Client --jar
 ./spark-example_2.10-0.1-SNAPSHOT.jar --class Test --queue default --args
 yarn-standalone --num-workers 500 --master-memory 7g --worker-memory 7g
 --worker-cores 2
 





Re: GC overhead limit exceeded

2014-03-27 Thread Syed A. Hashmi
Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For
large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better.

You can call unpersist on an RDD to remove it from the cache, though.


On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:

 No i am running on 0.8.1.
 Yes i am caching a lot, i am benchmarking a simple code in spark where in
 512mb, 1g and 2g text files are taken, some basic intermediate operations
 are done while the intermediate result which will be used in subsequent
 operations are cached.

 I thought that, we need not manually unpersist, if i need to cache
 something and if cache is found full, automatically space will be created
 by evacuating the earlier. Do i need to unpersist?

 Moreover, if i run several times, will the previously cached RDDs still
 remain in the cache? If so can i flush them manually out before the next
 run? [something like complete cache flush]





how to create a DStream from bunch of RDDs

2014-03-27 Thread Adrian Mocanu
I create several RDDs by merging several consecutive RDDs from a DStream. Is 
there a way to add these new RDDs to a DStream?
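
As an aside that is not raised in the thread: queueStream is one existing way to
turn a collection of pre-built RDDs into a DStream, each queued RDD becoming one
batch. A minimal sketch, with made-up names:

import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Feed already-materialized RDDs into the streaming engine one batch at a time.
def rddsAsStream[T: ClassTag](ssc: StreamingContext, rdds: Seq[RDD[T]]): DStream[T] = {
  val queue = new mutable.Queue[RDD[T]]()
  queue ++= rdds
  ssc.queueStream(queue)
}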

-Adrian



Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
Heh sorry that wasn't a clear question, I know 'how' to set it but don't know
what value to use in a mesos cluster, since the processes are running in lxc
containers they won't be sharing a filesystem (or machine for that matter).

I can't use an s3n:// url for local dir, can I?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Kafka-Mesos-Marathon-strangeness-tp3285p3373.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Actually looking closer it is stranger than I thought,

in the spark UI, one executor has executed 4 tasks, and one has executed
1928

Can anyone explain the workings of a KafkaInputStream wrt kafka partitions
and mapping to spark executors and tasks?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
 
 2.   I notice that once I start ssc.start(), my stream starts processing and
 continues indefinitely...even if I close the socket on the server end (I'm
 using unix command nc to mimic a server as explained in the streaming
 programming guide .)  Can I tell my stream to detect if it's lost a
 connection and therefore stop executing?  (Or even better, to attempt to
 re-establish the connection?)
 
 
 
 Currently, not yet. But I am aware of this and this behavior will be
 improved in the future.

Now I understand why our spark streaming job starts to generate zero-sized rdds 
from kafkainput, 
when one worker gets OOM or crashes.

And we can’t detect it! Great. So spark streaming just doesn’t suit yet for 
24/7 operation =\

Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sung Hwan Chung
Well, it says that the jar was successfully added but can't reference
classes from it. Does this have anything to do with this bug?

http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar


On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 I just tried this in CDH (only a few patches ahead of 0.9.0) and was able
 to include a dependency with --addJars successfully.

 Can you share how you're invoking SparkContext.addJar?  Anything
 interesting in the application master logs?

 -Sandy














Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 00:34, Scott Clasen scott.cla...@gmail.com wrote:
Actually looking closer it is stranger than I thought,

in the spark UI, one executor has executed 4 tasks, and one has executed
1928

Can anyone explain the workings of a KafkaInputStream wrt kafka partitions
and mapping to spark executors and tasks?


Well, there are some issues with kafkainput now.
When you do KafkaUtils.createStream, it creates a kafka high-level consumer on 
one node!
I don’t really know how many rdds it will generate during the batch window.
But when these rdds are created, spark schedules consecutive transformations on 
that one node,
because of locality.

You can try to repartition() those rdds. Sometimes it helps.

To try to consume from kafka on multiple machines you can do 
(1 to N).map(KafkaUtils.createStream)
But then arises an issue with the kafka high-level consumer! 
Those consumers operate in one consumer group, and they try to decide which 
consumer consumes which partitions.
And it may just fail to do syncpartitionrebalance, and then you have only a few 
consumers really consuming. 
To mitigate this problem, you can set rebalance retries very high, and pray it 
helps.
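
A rough sketch of that workaround, assuming the Spark 0.9-era API; the ZooKeeper
address, consumer group, topic name and partition counts below are made up, and
repartition() is the DStream method mentioned above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaFanInSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaFanInSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each createStream call gets its own receiver, so several of them can be
    // scheduled on different executors.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 2))
    }

    // Union the receiver streams and repartition so the downstream
    // transformations are not pinned to the receiver nodes.
    val unified = ssc.union(streams).repartition(16)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}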

Then arises yet another feature — if your receiver dies (OOM, hardware failure), 
you just stop receiving from kafka!
Brilliant.
And another feature — if you ask spark’s kafkainput to begin with 
auto.offset.reset = smallest, it will reset your offsets every time you run the 
application!
It does not comply with the documentation (reset to earliest offsets if it does not 
find offsets on zookeeper),
it just erases your offsets and starts again from zero!

And remember that you should restart your streaming app when there is any 
failure on the receiver!

So, at the bottom — kafka input stream just does not work.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Evgeniy Shishkin wrote
 So, at the bottom — kafka input stream just does not work.


That was the conclusion I was coming to as well.  Are there open tickets
around fixing this up?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3379.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: YARN problem using an external jar in worker nodes

2014-03-27 Thread Sandy Ryza
That bug only appears to apply to spark-shell.

Do things work in yarn-client mode or on a standalone cluster?  Are you
passing a path with parent directories to addJar?


On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:

 Well, it says that the jar was successfully added but can't reference
 classes from it. Does this have anything to do with this bug?


 http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar



Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the
worker has not been given enough memory or the allocation of the memory to
the RDD storage needs to be fixed. If configured correctly, the Spark
workers should not get OOMs.



On Thu, Mar 27, 2014 at 2:52 PM, Evgeny Shishkin itparan...@gmail.com wrote:


 2.   I notice that once I start ssc.start(), my stream starts processing
 and
 continues indefinitely...even if I close the socket on the server end (I'm
 using unix command nc to mimic a server as explained in the streaming
 programming guide .)  Can I tell my stream to detect if it's lost a
 connection and therefore stop executing?  (Or even better, to attempt to
 re-establish the connection?)



 Currently, not yet. But I am aware of this and this behavior will be
 improved in the future.


 Now i understand why out spark streaming job starts to generate zero sized
 rdds from kafkainput,
 when one worker get OOM or crashes.

 And we can't detect it! Great. So spark streaming just doesn't suit yet
 for 24/7 operation =\



Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote:

 The more I think about it, the problem is not about /tmp; it's more about the 
 workers not having enough memory. Blocks of received data could be falling 
 out of memory before they get processed. 
 BTW, what is the storage level that you are using for your input stream? If 
 you are using MEMORY_ONLY, then try MEMORY_AND_DISK. That is safer because it 
 ensures that if received data falls out of memory, it will at least be saved to 
 disk.
 
 TD
 

And I saw such errors because of spark.cleaner.ttl,
which erases everything, even needed RDDs.
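
A minimal sketch of the MEMORY_AND_DISK suggestion quoted above, assuming the Spark
0.9 Kafka receiver; the ZooKeeper quorum, group id and topic name are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // MEMORY_AND_DISK lets received blocks spill to disk instead of being
    // dropped when executor memory runs low.
    val lines = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181",          // ZooKeeper quorum (hypothetical)
      "my-consumer-group",          // consumer group id (hypothetical)
      Map("events" -> 2),           // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK)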



 
 On Thu, Mar 27, 2014 at 2:29 PM, Scott Clasen scott.cla...@gmail.com wrote:
 Heh, sorry, that wasn't a clear question. I know 'how' to set it but don't know
 what value to use in a Mesos cluster, since the processes are running in LXC
 containers and won't be sharing a filesystem (or machine, for that matter).
 
 I can't use an s3n:// URL for the local dir, can I?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Kafka-Mesos-Marathon-strangeness-tp3285p3373.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 



Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Thanks everyone for the discussion.

Just to note, I restarted the job yet again, and this time there are indeed
tasks being executed by both worker nodes. So the behavior does seem
inconsistent/broken atm.

Then I added a third node to the cluster, and a third executor came up, and
everything broke :|



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher

Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.

Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?

@dmpour 
rdd.mapPartitions would still work, as the code would only be executed
once in each VM, but I was wondering if there is a more efficient way of doing
this by using a generated RDD with one partition per executor. 
That remark was misleading; what I meant was that, in conjunction with the
TaskNonce pattern, my function would be called only once per executor as
long as the RDD had at least one partition on each executor.

Deenar
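
A minimal sketch of the pattern being discussed, with a hypothetical RunOnce object
standing in for the TaskNonce singleton; sc is the shell's SparkContext and the
partition count is illustrative:

    import java.util.concurrent.atomic.AtomicBoolean

    // Hypothetical once-per-JVM guard: the body runs at most once per executor
    // JVM, because the object is loaded independently in each VM.
    object RunOnce {
      private val done = new AtomicBoolean(false)
      def doThisOnce(body: => Unit): Unit =
        if (done.compareAndSet(false, true)) body
    }

    // Touch every executor by running a throwaway job with many partitions, so
    // each executor is very likely to receive at least one of them.
    sc.parallelize(1 to 1000, 100).mapPartitions { iter =>
      RunOnce.doThisOnce { println("per-executor initialisation goes here") }
      iter
    }.count()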





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3393.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin

On 28 Mar 2014, at 02:10, Scott Clasen scott.cla...@gmail.com wrote:

 Thanks everyone for the discussion.
 
 Just to note, I restarted the job yet again, and this time there are indeed
 tasks being executed by both worker nodes. So the behavior does seem
 inconsistent/broken atm.
 
 Then I added a third node to the cluster, and a third executor came up, and
 everything broke :|
 
 

This is Kafka’s high-level consumer. Try to raise the rebalance retries.

Also, as this consumer is threaded, it has some protection against this 
failure - first it waits some time, and then rebalances.
But for a Spark cluster I think this time is not enough.
If there were a way to wait for every Spark executor to start and rebalance, and only 
then start to consume, this issue would be less visible.   
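
A minimal sketch of raising the rebalance retries, assuming the Spark 0.9 Kafka
receiver and an existing StreamingContext ssc; the connection details and values
are illustrative:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // rebalance.max.retries and rebalance.backoff.ms are standard Kafka 0.8
    // high-level consumer settings, passed to the consumer via kafkaParams.
    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk1:2181",
      "group.id"              -> "my-consumer-group",
      "rebalance.max.retries" -> "10",
      "rebalance.backoff.ms"  -> "5000")

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK)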



 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of-partitions-to-tasks-tp3360p3391.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right?

My doubt was: between two different experiments, do the RDDs cached in
memory need to be unpersisted?
Or does it not matter?
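
A minimal sketch, assuming the shell's sc and illustrative paths, of choosing a
storage level explicitly and unpersisting a cached RDD between experiments:

    import org.apache.spark.storage.StorageLevel

    val raw = sc.textFile("hdfs:///data/input-1g.txt")      // placeholder path
    val intermediate = raw.map(_.split(" ")).filter(_.nonEmpty)

    // cache() is MEMORY_ONLY by default; MEMORY_AND_DISK spills partitions that
    // do not fit in memory to disk instead of recomputing them.
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)
    intermediate.count()

    // Drop the cached blocks before the next experiment so they do not linger
    // in the block manager.
    intermediate.unpersist()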


On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi shas...@cloudera.comwrote:

 Which storage scheme are you using? I am guessing it is MEMORY_ONLY. In
 large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better.

 You can call unpersist on an RDD to remove it from Cache though.


 On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 No, I am running on 0.8.1.
 Yes, I am caching a lot. I am benchmarking simple code in Spark in which
 512 MB, 1 GB and 2 GB text files are taken, some basic intermediate operations
 are done, and the intermediate results that will be used in subsequent
 operations are cached.

 I thought that we need not manually unpersist: if I need to cache
 something and the cache is full, space will automatically be created
 by evicting earlier entries. Do I need to unpersist?

 Moreover, if I run several times, will the previously cached RDDs still
 remain in the cache? If so, can I flush them out manually before the next
 run? [something like a complete cache flush]


 On Thu, Mar 27, 2014 at 11:16 PM, Andrew Or and...@databricks.comwrote:

 Are you caching a lot of RDD's? If so, maybe you should unpersist() the
 ones that you're not using. Also, if you're on 0.9, make sure
 spark.shuffle.spill is enabled (which it is by default). This allows your
 application to spill in-memory content to disk if necessary.

 How much memory are you giving to your executors? The default,
 spark.executor.memory is 512m, which is quite low. Consider raising this.
 Checking the web UI is a good way to figure out your runtime memory usage.


 On Thu, Mar 27, 2014 at 9:22 AM, Ognen Duzlevski 
 og...@plainvanillagames.com wrote:

  Look at the tuning guide on Spark's webpage for strategies to cope
 with this.
 I have run into quite a few memory issues like these, some are resolved
 by changing the StorageLevel strategy and employing things like Kryo, some
 are solved by specifying the number of tasks to break down a given
 operation into etc.

 Ognen


 On 3/27/14, 10:21 AM, Sai Prasanna wrote:

 java.lang.OutOfMemoryError: GC overhead limit exceeded

 What is the problem? I run the same code: one time it runs in 8
 seconds, the next time it takes a really long time, say 300-500 seconds...
 In the logs I see a lot of GC overhead limit exceeded. What should
 be done??

  Please can someone throw some light on it ??














-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside.All
the pressures of life can never hurt you, Unless you let them in.*


Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll 
try to look into these, seems like a serious error.

Matei

On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote:

 Thanks, Matei.  I am running Spark 1.0.0-SNAPSHOT built for Hadoop
 1.0.4 from GitHub on 2014-03-18.
 
 I tried batchSizes of 512, 10, and 1 and each got me further but none
 have succeeded.
 
 I can get this to work -- with manual interventions -- if I omit
 `parsed.persist(StorageLevel.MEMORY_AND_DISK)` and set batchSize=1.  5
 of the 175 executors hung, and I had to kill the python process to get
 things going again.  The only indication of this in the logs was `INFO
 python.PythonRDD: stdin writer to Python finished early`.
 
 With batchSize=1 and persist, a new memory error came up in several
 tasks before the app failed:
 
 14/03/28 01:51:15 ERROR executor.Executor: Uncaught exception in
 thread Thread[stdin writer for python,5,main]
 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
        at java.lang.String.<init>(String.java:203)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
at java.nio.CharBuffer.toString(CharBuffer.java:1201)
at org.apache.hadoop.io.Text.decode(Text.java:350)
at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
at 
 org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
at 
 org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
at 
 org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
 
 There are other exceptions, but I think they all stem from the above,
 eg. org.apache.spark.SparkException: Error sending message to
 BlockManagerMaster
 
 Let me know if there are other settings I should try, or if I should
 try a newer snapshot.
 
 Thanks again!
 
 
 On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Hey Jim,
 
 In Spark 0.9 we added a batchSize parameter to PySpark that makes it group 
 multiple objects together before passing them between Java and Python, but 
 this may be too high by default. Try passing batchSize=10 to your 
 SparkContext constructor to lower it (the default is 1024). Or even 
 batchSize=1 to match earlier versions.
 
 Matei
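
A minimal sketch of the batchSize suggestion above, assuming the PySpark
SparkContext constructor from that era; the master URL and S3 path are placeholders:

    from pyspark import SparkContext, StorageLevel

    # batchSize controls how many Python objects are grouped into one Java object
    # when shuttling data between the JVM and the Python workers; smaller batches
    # use less memory per record at the cost of some throughput.
    sc = SparkContext("spark://master:7077", "count-sessions", batchSize=10)

    session = sc.textFile("s3n://bucket/path/sessions.json.gz")
    parsed = session.map(lambda l: l.split("\t", 1))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)
    print(parsed.count())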
 
 On Mar 21, 2014, at 6:18 PM, Jim Blomo jim.bl...@gmail.com wrote:
 
 Hi all, I'm wondering if there are any settings I can use to reduce the
 memory needed by the PythonRDD when computing simple stats.  I am
 getting OutOfMemoryError exceptions while calculating count() on big,
 but not absurd, records.  It seems like PythonRDD is trying to keep
 too many of these records in memory, when all that is needed is to
 stream through them and count.  Any tips for getting through this
 workload?
 
 
 Code:
 session = sc.textFile('s3://...json.gz') # ~54GB of compressed data
 
 # the biggest individual text line is ~3MB
 parsed = session.map(lambda l: l.split("\t", 1)).map(lambda (y, s):
 (loads(y), loads(s)))
 parsed.persist(StorageLevel.MEMORY_AND_DISK)
 
 parsed.count()
 # will never finish: executor.Executor: Uncaught exception will FAIL
 all executors
 
 Incidentally the whole app appears to be killed, but this error is not
 propagated to the shell.
 
 Cluster:
 15 m2.xlarges (17GB memory, 17GB swap, spark.executor.memory=10GB)
 
 Exception:
 java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
 



Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Anyone can help?

How can I configure a different spark.local.dir for each executor?


On 23 Mar, 2014, at 12:11 am, Tsai Li Ming mailingl...@ltsai.com wrote:

 Hi,
 
 Each of my worker node has its own unique spark.local.dir.
 
 However, when I run spark-shell, the shuffle writes are always written to 
 /tmp, despite spark.local.dir being set when the worker node is started.
 
 Specifying spark.local.dir for the driver program seems to 
 override the executors' setting. Is there a way to define it properly on the worker 
 node?
 
 Thanks!



Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi,

My worker nodes have more memory than the host I’m submitting my driver 
program from, but it seems that SPARK_MEM also sets the -Xmx of the spark shell?

$ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell

Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x7f736e13, 205634994176, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 205634994176 bytes for 
committing reserved memory.

I want to allocate at least 100GB of memory per executor. The allocated memory 
on the executor seems to depend on the -Xmx heap size of the driver?

Thanks!
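
A minimal sketch, assuming Spark 0.9, of requesting large executors from a driver
program without also raising the driver's own heap; the values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory controls the heap requested for each executor on the
    // workers; the driver JVM keeps whatever -Xmx it was launched with, so a
    // small submission host can still ask for large executors.
    val conf = new SparkConf()
      .setMaster("spark://XXX:7077")
      .setAppName("big-executors")
      .set("spark.executor.memory", "100g")
    val sc = new SparkContext(conf)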