Re: java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-22 Thread Akhil Das
Hi Prabeesh,

Do an export _JAVA_OPTIONS="-Xmx10g" before starting Shark. You can also run
ps aux | grep shark to see how much memory it has been allocated; most likely
it is 512 MB, in which case increase the limit.
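
If you would rather raise the memory from the application side instead of the
shell environment, a minimal sketch using SparkConf is below (the property name
spark.executor.memory is the standard one; the app name and sizes are just
illustrative, adjust them to your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the driver heap still comes from the JVM options above; this
// sets the per-executor heap the Mesos workers will use.
val conf = new SparkConf()
  .setAppName("shark-join")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)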

Thanks
Best Regards


On Fri, May 23, 2014 at 10:22 AM, prabeesh k  wrote:

>
> Hi,
>
> I am trying to apply an inner join in Shark using 64MB and 27MB files. I am
> able to run the following queries on Mesos:
>
>
>- "SELECT * FROM geoLocation1 "
>
>
>
>- """ SELECT * FROM geoLocation1  WHERE  country =  '"US"' """
>
>
> But while trying the inner join as
>
>  "SELECT * FROM geoLocation1 g1 INNER JOIN geoBlocks1 g2 ON (g1.locId =
> g2.locId)"
>
>
>
> I am getting the following error:
>
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
> Task 1.0:7 failed 4 times (most recent failure: Exception failure:
> java.lang.OutOfMemoryError: Java heap space)
>  at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>  at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>  at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>  at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>  at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Please help me to resolve this.
>
> Thanks in adv
>
> regards,
> prabeesh
>


Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Hi,
I tried clearing the Maven and Ivy caches and I am a bit confused at this point
in time.

1) Running the example from the Spark directory using bin/run-example works
fine and prints the word counts.

2) Trying to run the same code as a separate job:
   *) With the latest 1.0.0-SNAPSHOT it doesn't work and throws an exception.
   *) With 0.9.1 it doesn't throw any exception but also doesn't print any word
counts.

Thanks,
Shrikar


On Thu, May 22, 2014 at 9:19 PM, Soumya Simanta wrote:

> Try cleaning your maven (.m2) and ivy cache.
>
>
>
> On May 23, 2014, at 12:03 AM, Shrikar archak  wrote:
>
> Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
>
> Thanks,
> Shrikar
>
>
> On Thu, May 22, 2014 at 8:53 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you
>> publish Spark locally which allowed you to use it as a dependency?
>>
>> This is weird indeed. SBT should take care of all of Spark's dependencies.
>>
>> In any case, you can try the last released Spark 0.9.1 and see if the
>> problem persists.
>>
>>
>> On Thu, May 22, 2014 at 3:59 PM, Shrikar archak wrote:
>>
>>> I am running as sbt run. I am running it locally .
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Thu, May 22, 2014 at 3:53 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 How are you launching the application? sbt run ? spark-submit? local
 mode or Spark standalone cluster? Are you packaging all your code into
 a jar?
 Looks to me that you seem to have spark classes in your execution
 environment but missing some of Spark's dependencies.

 TD



 On Thu, May 22, 2014 at 2:27 PM, Shrikar archak 
 wrote:
 > Hi All,
 >
 > I am trying to run the network count example as a separate standalone
 job
 > and running into some issues.
 >
 > Environment:
 > 1) Mac Mavericks
 > 2) Latest spark repo from Github.
 >
 >
 > I have a structure like this
 >
 > Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
 > .
 > ./simple.sbt
 > ./src
 > ./src/main
 > ./src/main/scala
 > ./src/main/scala/NetworkWordCount.scala
 > ./src/main/scala/SimpleApp.scala.bk
 >
 >
 > simple.sbt
 > name := "Simple Project"
 >
 > version := "1.0"
 >
 > scalaVersion := "2.10.3"
 >
 > libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
 > "1.0.0-SNAPSHOT",
 > "org.apache.spark" %% "spark-streaming" %
 > "1.0.0-SNAPSHOT")
 >
 > resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
 >
 >
 > I am able to run the SimpleApp which is mentioned in the doc but when
 I try
 > to run the NetworkWordCount app I get error like this am I missing
 > something?
 >
 > [info] Running com.shrikar.sparkapps.NetworkWordCount
 > 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to:
 shrikar
 > 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
 > authentication disabled; ui acls disabled; users with view
 permissions:
 > Set(shrikar)
 > 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
 > 14/05/22 14:26:48 INFO Remoting: Starting remoting
 > 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on
 addresses
 > :[akka.tcp://spark@192.168.10.88:49963]
 > 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
 > [akka.tcp://spark@192.168.10.88:49963]
 > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
 > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
 > 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local
 directory at
 >
 /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
 > 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
 > capacity 911.6 MB.
 > 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to
 port 49964
 > with id = ConnectionManagerId(192.168.10.88,49964)
 > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
 > BlockManager
 > 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block
 manager
 > 192.168.10.88:49964 with 911.6 MB RAM
 > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered
 BlockManager
 > 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
 > [error] (run-main) java.lang.NoClassDefFoundError:
 > javax/servlet/http/HttpServletResponse
 > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 > at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 > at
 >
 org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
 > at
 >
 org.apache.spark.broadcast.HttpBroadcast$.ini

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread Chester
This is something we are interested in as well. We are planning to investigate
this further. If someone has suggestions, we would love to hear them.

Chester

Sent from my iPad
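
For anyone experimenting with this, here is a rough sketch of a task-level
progress listener (the callback and event names follow the
org.apache.spark.scheduler listener API as I understand it in 0.9/1.0, so
double-check them against your version; the counters and the println are
placeholders for whatever your UI needs):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Counts submitted stages and finished tasks; a UI could poll these counters
// to display coarse-grained progress for the currently running action.
class ProgressListener extends SparkListener {
  val stagesSubmitted = new AtomicInteger(0)
  val tasksFinished = new AtomicInteger(0)

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted) {
    stagesSubmitted.incrementAndGet()
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    println("tasks finished so far: " + tasksFinished.incrementAndGet())
  }
}

// Wiring it up on an existing SparkContext:
// sc.addSparkListener(new ProgressListener)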

On May 22, 2014, at 8:02 AM, Pierre B 
 wrote:

> Hi Andy!
> 
> Yes, the Spark UI provides a lot of interesting information for debugging 
> purposes.
> 
> Here I’m trying to integrate a simple progress monitoring in my app ui.
> 
> I’m typically running a few “jobs” (or rather actions), and I’d like to be 
> able to display the progress of each of those in my ui.
> 
> I don’t really see how I could do that using SparkListener for the moment …
> 
> Thanks for your help!
> 
> Cheers!
> 
> 
> 
> 
> Pierre Borckmans
> Software team
> 
> RealImpact Analytics | Brussels Office
> www.realimpactanalytics.com | [hidden email]
> 
> FR +32 485 91 87 31 | Skype pierre.borckmans
> 
> 
> 
> 
> 
> 
> On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] wrote:
> 
>> SparkListener offers good stuff.
>> I also complemented it with some metrics machinery of my own that uses Akka 
>> to aggregate metrics from anywhere I'd like to collect them (with no 
>> dependency on Ganglia, just on Codahale).
>> However, that was useful for gathering custom metrics (from within the 
>> tasks), not really for collecting overall monitoring information about the 
>> Spark internals themselves.
>> For that, the Spark UI already offers pretty good insight, no?
>> 
>> Cheers,
>> 
>> aℕdy ℙetrella
>> about.me/noootsab
>> 
>> 
>> 
>> 
>> On Thu, May 22, 2014 at 4:51 PM, Pierre B wrote:
>> Is there a simple way to monitor the overall progress of an action using
>> SparkListener or anything else?
>> 
>> I see that one can name an RDD... Could that be used to determine which
>> action triggered a stage, ... ?
>> 
>> 
>> Thanks
>> 
>> Pierre
>> 
>> 
>> 
> 
> 


java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-22 Thread prabeesh k
Hi,

I am trying to apply an inner join in Shark using 64MB and 27MB files. I am
able to run the following queries on Mesos:


   - "SELECT * FROM geoLocation1 "



   - """ SELECT * FROM geoLocation1  WHERE  country =  '"US"' """


But while trying the inner join as

 "SELECT * FROM geoLocation1 g1 INNER JOIN geoBlocks1 g2 ON (g1.locId =
g2.locId)"



I am getting the following error:


Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 1.0:7 failed 4 times (most recent failure: Exception failure:
java.lang.OutOfMemoryError: Java heap space)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
 at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Please help me to resolve this.

Thanks in adv

regards,
prabeesh


Re: Unable to run a Standalone job

2014-05-22 Thread Soumya Simanta
Try cleaning your maven (.m2) and ivy cache. 



> On May 23, 2014, at 12:03 AM, Shrikar archak  wrote:
> 
> Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
> 
> Thanks,
> Shrikar
> 
> 
>> On Thu, May 22, 2014 at 8:53 PM, Tathagata Das  
>> wrote:
>> How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you publish 
>> Spark locally which allowed you to use it as a dependency?
>> 
>> This is weird indeed. SBT should take care of all of Spark's dependencies.
>> 
>> In any case, you can try the last released Spark 0.9.1 and see if the 
>> problem persists.
>> 
>> 
>>> On Thu, May 22, 2014 at 3:59 PM, Shrikar archak  wrote:
>>> I am running as sbt run. I am running it locally .
>>> 
>>> Thanks,
>>> Shrikar
>>> 
>>> 
 On Thu, May 22, 2014 at 3:53 PM, Tathagata Das 
  wrote:
 How are you launching the application? sbt run ? spark-submit? local
 mode or Spark standalone cluster? Are you packaging all your code into
 a jar?
 Looks to me that you seem to have spark classes in your execution
 environment but missing some of Spark's dependencies.
 
 TD
 
 
 
 On Thu, May 22, 2014 at 2:27 PM, Shrikar archak  
 wrote:
 > Hi All,
 >
 > I am trying to run the network count example as a separate standalone job
 > and running into some issues.
 >
 > Environment:
 > 1) Mac Mavericks
 > 2) Latest spark repo from Github.
 >
 >
 > I have a structure like this
 >
 > Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
 > .
 > ./simple.sbt
 > ./src
 > ./src/main
 > ./src/main/scala
 > ./src/main/scala/NetworkWordCount.scala
 > ./src/main/scala/SimpleApp.scala.bk
 >
 >
 > simple.sbt
 > name := "Simple Project"
 >
 > version := "1.0"
 >
 > scalaVersion := "2.10.3"
 >
 > libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
 > "1.0.0-SNAPSHOT",
 > "org.apache.spark" %% "spark-streaming" %
 > "1.0.0-SNAPSHOT")
 >
 > resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
 >
 >
 > I am able to run the SimpleApp which is mentioned in the doc but when I 
 > try
 > to run the NetworkWordCount app I get error like this am I missing
 > something?
 >
 > [info] Running com.shrikar.sparkapps.NetworkWordCount
 > 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to: 
 > shrikar
 > 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
 > authentication disabled; ui acls disabled; users with view permissions:
 > Set(shrikar)
 > 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
 > 14/05/22 14:26:48 INFO Remoting: Starting remoting
 > 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses
 > :[akka.tcp://spark@192.168.10.88:49963]
 > 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
 > [akka.tcp://spark@192.168.10.88:49963]
 > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
 > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
 > 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory 
 > at
 > /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
 > 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
 > capacity 911.6 MB.
 > 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port 
 > 49964
 > with id = ConnectionManagerId(192.168.10.88,49964)
 > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
 > BlockManager
 > 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block 
 > manager
 > 192.168.10.88:49964 with 911.6 MB RAM
 > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered 
 > BlockManager
 > 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
 > [error] (run-main) java.lang.NoClassDefFoundError:
 > javax/servlet/http/HttpServletResponse
 > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 > at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 > at
 > org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
 > at
 > org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
 > at
 > org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
 > at
 > org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
 > at
 > org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
 > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 > at org.apache.spark.SparkContext.(SparkContext.scala:202)
 > at
 > org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:5

Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.

Thanks,
Shrikar


On Thu, May 22, 2014 at 8:53 PM, Tathagata Das
wrote:

> How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you
> publish Spark locally which allowed you to use it as a dependency?
>
> This is weird indeed. SBT should take care of all of Spark's dependencies.
>
> In any case, you can try the last released Spark 0.9.1 and see if the
> problem persists.
>
>
> On Thu, May 22, 2014 at 3:59 PM, Shrikar archak wrote:
>
>> I am running as sbt run. I am running it locally .
>>
>> Thanks,
>> Shrikar
>>
>>
>> On Thu, May 22, 2014 at 3:53 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> How are you launching the application? sbt run ? spark-submit? local
>>> mode or Spark standalone cluster? Are you packaging all your code into
>>> a jar?
>>> Looks to me that you seem to have spark classes in your execution
>>> environment but missing some of Spark's dependencies.
>>>
>>> TD
>>>
>>>
>>>
>>> On Thu, May 22, 2014 at 2:27 PM, Shrikar archak 
>>> wrote:
>>> > Hi All,
>>> >
>>> > I am trying to run the network count example as a separate standalone
>>> job
>>> > and running into some issues.
>>> >
>>> > Environment:
>>> > 1) Mac Mavericks
>>> > 2) Latest spark repo from Github.
>>> >
>>> >
>>> > I have a structure like this
>>> >
>>> > Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
>>> > .
>>> > ./simple.sbt
>>> > ./src
>>> > ./src/main
>>> > ./src/main/scala
>>> > ./src/main/scala/NetworkWordCount.scala
>>> > ./src/main/scala/SimpleApp.scala.bk
>>> >
>>> >
>>> > simple.sbt
>>> > name := "Simple Project"
>>> >
>>> > version := "1.0"
>>> >
>>> > scalaVersion := "2.10.3"
>>> >
>>> > libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
>>> > "1.0.0-SNAPSHOT",
>>> > "org.apache.spark" %% "spark-streaming" %
>>> > "1.0.0-SNAPSHOT")
>>> >
>>> > resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>>> >
>>> >
>>> > I am able to run the SimpleApp which is mentioned in the doc but when
>>> I try
>>> > to run the NetworkWordCount app I get error like this am I missing
>>> > something?
>>> >
>>> > [info] Running com.shrikar.sparkapps.NetworkWordCount
>>> > 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to:
>>> shrikar
>>> > 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
>>> > authentication disabled; ui acls disabled; users with view permissions:
>>> > Set(shrikar)
>>> > 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> > 14/05/22 14:26:48 INFO Remoting: Starting remoting
>>> > 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on
>>> addresses
>>> > :[akka.tcp://spark@192.168.10.88:49963]
>>> > 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
>>> > [akka.tcp://spark@192.168.10.88:49963]
>>> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
>>> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
>>> > 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local
>>> directory at
>>> >
>>> /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
>>> > 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
>>> > capacity 911.6 MB.
>>> > 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port
>>> 49964
>>> > with id = ConnectionManagerId(192.168.10.88,49964)
>>> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
>>> > BlockManager
>>> > 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block
>>> manager
>>> > 192.168.10.88:49964 with 911.6 MB RAM
>>> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered
>>> BlockManager
>>> > 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
>>> > [error] (run-main) java.lang.NoClassDefFoundError:
>>> > javax/servlet/http/HttpServletResponse
>>> > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>> > at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>> > at
>>> >
>>> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
>>> > at
>>> >
>>> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
>>> > at
>>> >
>>> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>>> > at
>>> >
>>> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>>> > at
>>> >
>>> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
>>> > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> > at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> > at
>>> >
>>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
>>> > at
>>> >
>>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
>>> > at
>>> >
>>> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:91)
>>> >

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you getting Spark with 1.0.0-SNAPSHOT through Maven? Did you
publish Spark locally, which allowed you to use it as a dependency?

This is weird indeed. SBT should take care of all of Spark's dependencies.

In any case, you can try the last released Spark 0.9.1 and see if the
problem persists.


On Thu, May 22, 2014 at 3:59 PM, Shrikar archak  wrote:

> I am running as sbt run. I am running it locally .
>
> Thanks,
> Shrikar
>
>
> On Thu, May 22, 2014 at 3:53 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> How are you launching the application? sbt run ? spark-submit? local
>> mode or Spark standalone cluster? Are you packaging all your code into
>> a jar?
>> Looks to me that you seem to have spark classes in your execution
>> environment but missing some of Spark's dependencies.
>>
>> TD
>>
>>
>>
>> On Thu, May 22, 2014 at 2:27 PM, Shrikar archak 
>> wrote:
>> > Hi All,
>> >
>> > I am trying to run the network count example as a separate standalone
>> job
>> > and running into some issues.
>> >
>> > Environment:
>> > 1) Mac Mavericks
>> > 2) Latest spark repo from Github.
>> >
>> >
>> > I have a structure like this
>> >
>> > Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
>> > .
>> > ./simple.sbt
>> > ./src
>> > ./src/main
>> > ./src/main/scala
>> > ./src/main/scala/NetworkWordCount.scala
>> > ./src/main/scala/SimpleApp.scala.bk
>> >
>> >
>> > simple.sbt
>> > name := "Simple Project"
>> >
>> > version := "1.0"
>> >
>> > scalaVersion := "2.10.3"
>> >
>> > libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
>> > "1.0.0-SNAPSHOT",
>> > "org.apache.spark" %% "spark-streaming" %
>> > "1.0.0-SNAPSHOT")
>> >
>> > resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>> >
>> >
>> > I am able to run the SimpleApp which is mentioned in the doc but when I
>> try
>> > to run the NetworkWordCount app I get error like this am I missing
>> > something?
>> >
>> > [info] Running com.shrikar.sparkapps.NetworkWordCount
>> > 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to:
>> shrikar
>> > 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
>> > authentication disabled; ui acls disabled; users with view permissions:
>> > Set(shrikar)
>> > 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> > 14/05/22 14:26:48 INFO Remoting: Starting remoting
>> > 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on
>> addresses
>> > :[akka.tcp://spark@192.168.10.88:49963]
>> > 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
>> > [akka.tcp://spark@192.168.10.88:49963]
>> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
>> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
>> > 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local
>> directory at
>> >
>> /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
>> > 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
>> > capacity 911.6 MB.
>> > 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port
>> 49964
>> > with id = ConnectionManagerId(192.168.10.88,49964)
>> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
>> > BlockManager
>> > 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block
>> manager
>> > 192.168.10.88:49964 with 911.6 MB RAM
>> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered
>> BlockManager
>> > 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
>> > [error] (run-main) java.lang.NoClassDefFoundError:
>> > javax/servlet/http/HttpServletResponse
>> > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>> > at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>> > at
>> >
>> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
>> > at
>> >
>> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
>> > at
>> >
>> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>> > at
>> >
>> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>> > at
>> >
>> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
>> > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>> > at org.apache.spark.SparkContext.(SparkContext.scala:202)
>> > at
>> >
>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
>> > at
>> >
>> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
>> > at
>> >
>> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:91)
>> > at
>> com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
>> > at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke

Unsubscribe

2014-05-22 Thread Donna-M Fernandez
Unsubscribe 


Broadcast Variables

2014-05-22 Thread Puneet Lakhina
Hi,

I'm confused about the right way to use broadcast variables from Java.

My code looks something like this:

Map<String, Integer> val = ... // build the Map to be broadcast
Broadcast<Map<String, Integer>> broadcastVar = sc.broadcast(val);

sc.textFile(...).map(new SomeFunction());
// SomeFunction does something here using broadcastVar

My question is, should I pass the broadcastVar to SomeFunction as a
constructor parameter that it can keep around as an instance variable, i.e.

sc.textFile(...).map(new SomeFunction(broadcastVar));

class SomeFunction extends Function<String, T> {
  private final Broadcast<Map<String, Integer>> var;

  public SomeFunction(Broadcast<Map<String, Integer>> var) {
    this.var = var;
  }

  public T call(String input) {
    // do something using var.value()
  }
}

Is the above the right way to use broadcast variables when not using
anonymous inner classes as functions?
-- 
Regards,
Puneet
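
For comparison, here is a minimal, self-contained sketch of the same pattern in
Scala (the data, path, and app name are made up). As far as I know, passing the
Broadcast handle into a named function's constructor, as above, is the usual
approach from Java: the handle itself is small and serializable, and the data
is only fetched when .value is read on the executors.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

    // Build the lookup table once on the driver and broadcast it.
    val lookup = Map("spark" -> 1, "shark" -> 2)
    val broadcastVar = sc.broadcast(lookup)

    // The closure captures the Broadcast handle; .value is read on the workers.
    val total = sc.textFile("input.txt")
      .map(line => broadcastVar.value.getOrElse(line, 0))
      .reduce(_ + _)
    println(total)

    sc.stop()
  }
}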


Re: accessing partition i+1 from mapper of partition i

2014-05-22 Thread Mohit Jaggi
Austin,
I made up a mock example; my real use case is more complex. I used
foreach() instead of collect/cache...that forces the accumulable to be
evaluated. On another thread Xiangrui pointed me to a sliding-window RDD in
MLlib that is a great alternative (although I did not switch to using it).

Mohit.
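
For anyone following along, here is a rough, untested Scala sketch of the
sort/collect-boundaries/broadcast approach Brian outlines below. It assumes an
RDD[Int] named rdd with distinct values and non-empty partitions; every name
here is illustrative:

// Sort, then record each partition's maximum element.
val sorted = rdd.map(x => (x, ())).sortByKey().keys

val maxPerPart = sorted.mapPartitionsWithIndex { (id, it) =>
  if (it.hasNext) Iterator(id -> it.max) else Iterator.empty
}.collect().toMap
val boundaries = sorted.context.broadcast(maxPerPart)

// Each partition emits every integer from just after the previous partition's
// max (or from its own first element, for partition 0) up to its last element,
// which fills the gaps both inside and across partitions.
val filled = sorted.mapPartitionsWithIndex { (id, it) =>
  val elems = it.toArray
  if (elems.isEmpty) Iterator.empty
  else {
    val start = boundaries.value.get(id - 1).map(_ + 1).getOrElse(elems.head)
    (start to elems.last).iterator
  }
}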


On Thu, May 22, 2014 at 2:30 PM, Austin Gibbons wrote:

> Mohit, if you want to end up with (1 .. N) , why don't you skip the logic
> for finding missing values, and generate it directly?
>
> val max = myCollection.reduce(math.max)
> sc.parallelize(1 to max)
>
> In either case, you don't need to call cache, which will force it into
> memory - you can do something like "count" which will not necessarily store
> the RDD in memory.
>
> Additionally, instead of an accumulable, you could consider mapping that
> value directly:
>
> rdd.mapPartitionsWithIndex { case (index, partition) =>
>   Iterator(index -> partition.reduce(math.max)) }.collectAsMap()
>
>
> On Mon, May 19, 2014 at 9:50 PM, Mohit Jaggi  wrote:
>
>> Thanks Brian. This works. I used Accumulable to do the "collect" in step
>> B. While doing that I found that Accumulable.value is not a Spark "action",
>> I need to call "cache" in the underlying RDD for "value" to work. Not sure
>> if that is intentional or a bug.
>> The "collect" of Step B can be done as a new RDD too.
>>
>>
>> On Thu, May 15, 2014 at 5:47 PM, Brian Gawalt  wrote:
>>
>>> I don't think there's a direct way of bleeding elements across
>>> partitions. But you could write it yourself relatively succinctly:
>>>
>>> A) Sort the RDD
>>> B) Look at the sorted RDD's partitions with the
>>> .mapParititionsWithIndex( ) method. Map each partition to its partition ID,
>>> and its maximum element. Collect the (partID, maxElements) in the driver.
>>> C) Broadcast the collection of (partID, part's max element) tuples
>>> D) Look again at the sorted RDD's partitions with
>>> mapPartitionsWithIndex( ). For each partition *K:*
>>> D1) Find the immediately-preceding partition* K -1 , *and its
>>> associated maximum value. Use that to decide how many values are missing
>>> between the last element of part *K-1 *and the first element of part *K*
>>> .
>>> D2) Step through part *K*'s elements and find the rest of the missing
>>> elements in that part
>>>
>>> This approach sidesteps worries you might have over the hack of using
>>> .filter to remove the first element, as well as the zipping.
>>>
>>> --Brian
>>>
>>>
>>>
>>> On Tue, May 13, 2014 at 9:31 PM, Mohit Jaggi wrote:
>>>
 Hi,
 I am trying to find a way to fill in missing values in an RDD. The RDD
 is a sorted sequence.
 For example, (1, 2, 3, 5, 8, 11, ...)
 I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)

 One way to do this is to "slide and zip"
 rdd1 =  sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
 x = rdd1.first
 rdd2 = rdd1 filter (_ != x)
 rdd3 = rdd2 zip rdd1
 rdd4 = rdd3 flatmap { (x, y) => generate missing elements between x and
 y }

 Another method which I think is more efficient is to use
 mapParititions() on rdd1 to be able to iterate on elements of rdd1 in each
 partition. However, that leaves the boundaries of the partitions to be
 "unfilled". *Is there a way within the function passed to
 mapPartitions, to read the first element in the next partition?*

 The latter approach also appears to work for a general "sliding window"
 calculation on the RDD. The former technique requires a lot of "sliding and
 zipping" and I believe it is not efficient. If only I could read the next
 partition...I have tried passing a pointer to rdd1 to the function passed
 to mapPartitions but the rdd1 pointer turns out to be NULL, I guess because
 Spark cannot deal with a mapper calling another mapper (since it happens on
 a worker not the driver)

 Mohit.


>>>
>>
>
>
> --
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Austin Gibbons
> Research | quantiFind  | 708 601 4894
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>


Re: How to turn off MetadataCleaner?

2014-05-22 Thread Tathagata Das
The cleaner should remain up while the SparkContext is still active (not
stopped). Here, however, it seems you are stopping the SparkContext
("ssc.stop(true)"), so the cleaner should be stopped as well. That said, there
was a bug earlier where some of the cleaners may not have been stopped when the
context was stopped. What version are you using? If it is 0.9.1, I can see that
the cleaner in ShuffleBlockManager is not stopped, so it is a bug.

TD


On Thu, May 22, 2014 at 9:24 AM, Adrian Mocanu wrote:

>  Hi
>
> After using Spark's TestSuiteBase to run some tests I’ve noticed that at
> the end, after finishing all tests, the cleaner is still running and
> periodically outputs the following:
>
> INFO  o.apache.spark.util.MetadataCleaner  - Ran metadata cleaner for
> SHUFFLE_BLOCK_MANAGER
>
>
>
> I use the testOperation method and I’ve changed it so that it stores the
> reference to ssc after running setupStreams. I then use that reference to
> shut things down, but the cleaner remains up.
>
>
>
> How to shut down all of spark, including cleaner?
>
>
>
> Here is how I changed the testOperation method (my changes: capturing the
> ssc returned by setupStreams, and the awaitTermination/stop calls at the
> end):
>
>   def testOperation[U: ClassTag, V: ClassTag](
>       input: Seq[Seq[U]],
>       operation: DStream[U] => DStream[V],
>       expectedOutput: Seq[Seq[V]],
>       numBatches: Int,
>       useSet: Boolean
>     ) {
>     val numBatches_ = if (numBatches > 0) numBatches else expectedOutput.size
>     val ssc = setupStreams[U, V](input, operation)
>     val output = runStreams[V](ssc, numBatches_, expectedOutput.size)
>     verifyOutput[V](output, expectedOutput, useSet)
>     ssc.awaitTermination(500)
>     ssc.stop(true)
>   }
>
>
>
> -Adrian
>
>
>


Re: How to Run Machine Learning Examples

2014-05-22 Thread Krishna Sankar
I couldn't find the classification.SVM class.

   - Most probably the command is something of the order of:
     bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification
       examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
   - For more details try ./bin/run-example mllib.BinaryClassification
     - Usage: BinaryClassification [options] <input>
     -   --numIterations   number of iterations
     -   --stepSize        initial step size, default: 1.0
     -   --algorithm       algorithm (SVM,LR), default: LR
     -   --regType         regularization type (L1,L2), default: L2
     -   --regParam        regularization parameter, default: 0.1
     -   <input>           input paths to labeled examples in LIBSVM format

HTH.

Cheers



P.S: I am using 1.0.0 rc10. Even for an earlier release, just run the
classification class and it will tell you what the parameters are. Most
probably SVM is an algorithm parameter, not a class by itself.
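
If the goal is simply to train an SVM on your own data from a driver program
rather than through the example runner, a rough sketch against the 0.9.x MLlib
API is below (assuming sc is an existing SparkContext; the CSV layout and path
are assumptions, and in 1.0 LabeledPoint takes a Vectors.dense(...) instead of
a raw array):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Assumed CSV layout: label,feature1,feature2,...
val data = sc.textFile("train.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, parts.tail)
}
val model = SVMWithSGD.train(data, 20)  // 20 iterations, as in the command above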


On Thu, May 22, 2014 at 2:12 PM, yxzhao  wrote:

> Thanks Stephen,
> I used the following command line to run the SVM, but it seems that the
> path is not correct. What should the right path or command line be? Thanks.
> ./bin/run-example org.apache.spark.mllib.classification.SVM
> spark://100.1.255.193:7077 train.csv 20
> org/apache/spark/mllib/classification/SVM
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.mllib.classification.SVM
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Could not find the main class: org.apache.spark.mllib.classification.SVM.
>  Program will exit.
>
>
>
>
>
>
>
> On Thu, May 22, 2014 at 3:05 PM, Stephen Boesch [via Apache Spark User
> List] wrote:
> >
> > There is a bin/run-example.sh  []
> >
> >
> > 2014-05-22 12:48 GMT-07:00 yxzhao <[hidden email]>:
> >>
> >> I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
> >> following directory on my data set. But I did not find the sample
> command
> >> line to run them. Anybody help? Thanks.
> >>
> >>
> spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
> >>
> >>
> >>
> >
> >
> >
> >
>
>
>
>


Re: Error while launching ec2 spark cluster with HVM (r3.large)

2014-05-22 Thread Xiangrui Meng
Was the error message the same as the one you posted when you used `root` as
the user id? Could you try this:

1) Do not specify the user id. (The default would be `root`.)
2) If it fails in the middle, try `spark-ec2  --resume launch ` to
continue launching the cluster.

Best,
Xiangrui

On Thu, May 22, 2014 at 12:44 PM, adparker  wrote:
> I had this problem too and fixed it by setting the wait time-out to a larger
> value: --wait
>
> For example, in standalone mode with default values, a timeout of 480
> seconds worked for me:
>
> $ cd spark-0.9.1/ec2
> $ ./spark-ec2 --key-pair= --identity-file= --instance-type=r3.large
> --wait=480  launch 
>
>
>
>


Re: Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
I am running it with sbt run. I am running it locally.

Thanks,
Shrikar


On Thu, May 22, 2014 at 3:53 PM, Tathagata Das
wrote:

> How are you launching the application? sbt run ? spark-submit? local
> mode or Spark standalone cluster? Are you packaging all your code into
> a jar?
> Looks to me that you seem to have spark classes in your execution
> environment but missing some of Spark's dependencies.
>
> TD
>
>
>
> On Thu, May 22, 2014 at 2:27 PM, Shrikar archak 
> wrote:
> > Hi All,
> >
> > I am trying to run the network count example as a separate standalone job
> > and running into some issues.
> >
> > Environment:
> > 1) Mac Mavericks
> > 2) Latest spark repo from Github.
> >
> >
> > I have a structure like this
> >
> > Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
> > .
> > ./simple.sbt
> > ./src
> > ./src/main
> > ./src/main/scala
> > ./src/main/scala/NetworkWordCount.scala
> > ./src/main/scala/SimpleApp.scala.bk
> >
> >
> > simple.sbt
> > name := "Simple Project"
> >
> > version := "1.0"
> >
> > scalaVersion := "2.10.3"
> >
> > libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
> > "1.0.0-SNAPSHOT",
> > "org.apache.spark" %% "spark-streaming" %
> > "1.0.0-SNAPSHOT")
> >
> > resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
> >
> >
> > I am able to run the SimpleApp which is mentioned in the doc but when I
> try
> > to run the NetworkWordCount app I get error like this am I missing
> > something?
> >
> > [info] Running com.shrikar.sparkapps.NetworkWordCount
> > 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to:
> shrikar
> > 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
> > authentication disabled; ui acls disabled; users with view permissions:
> > Set(shrikar)
> > 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
> > 14/05/22 14:26:48 INFO Remoting: Starting remoting
> > 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses
> > :[akka.tcp://spark@192.168.10.88:49963]
> > 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
> > [akka.tcp://spark@192.168.10.88:49963]
> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
> > 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
> > 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory
> at
> >
> /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
> > 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
> > capacity 911.6 MB.
> > 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port
> 49964
> > with id = ConnectionManagerId(192.168.10.88,49964)
> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
> > BlockManager
> > 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block
> manager
> > 192.168.10.88:49964 with 911.6 MB RAM
> > 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered
> BlockManager
> > 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
> > [error] (run-main) java.lang.NoClassDefFoundError:
> > javax/servlet/http/HttpServletResponse
> > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
> > at org.apache.spark.HttpServer.start(HttpServer.scala:54)
> > at
> >
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
> > at
> >
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
> > at
> >
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
> > at
> >
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
> > at
> >
> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
> > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
> > at org.apache.spark.SparkContext.(SparkContext.scala:202)
> > at
> >
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
> > at
> >
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
> > at
> >
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:91)
> > at
> com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
> > at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> >
> >
> > Thanks,
> > Shrikar
> >
>


Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you launching the application? sbt run? spark-submit? Local
mode or a Spark standalone cluster? Are you packaging all your code into
a jar?
It looks like you have Spark classes in your execution environment but are
missing some of Spark's dependencies.

TD
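
One workaround that has helped with similar NoClassDefFoundError:
javax/servlet/... failures under sbt run is to add the servlet API to the
project explicitly. This is only a guess at the missing piece, so treat the
snippet as a sketch to merge into simple.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-streaming" % "1.0.0-SNAPSHOT",
  // hypothesis: sbt run is missing the servlet classes Jetty needs at runtime
  "javax.servlet"     % "javax.servlet-api" % "3.0.1"
)

Running in a forked JVM (fork in run := true) or packaging an assembly jar and
launching it with spark-submit are other common ways around classpath issues
with sbt run.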



On Thu, May 22, 2014 at 2:27 PM, Shrikar archak  wrote:
> Hi All,
>
> I am trying to run the network count example as a separate standalone job
> and running into some issues.
>
> Environment:
> 1) Mac Mavericks
> 2) Latest spark repo from Github.
>
>
> I have a structure like this
>
> Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
> .
> ./simple.sbt
> ./src
> ./src/main
> ./src/main/scala
> ./src/main/scala/NetworkWordCount.scala
> ./src/main/scala/SimpleApp.scala.bk
>
>
> simple.sbt
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.10.3"
>
> libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
> "1.0.0-SNAPSHOT",
> "org.apache.spark" %% "spark-streaming" %
> "1.0.0-SNAPSHOT")
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
>
> I am able to run the SimpleApp which is mentioned in the doc but when I try
> to run the NetworkWordCount app I get error like this am I missing
> something?
>
> [info] Running com.shrikar.sparkapps.NetworkWordCount
> 14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to: shrikar
> 14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(shrikar)
> 14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 14/05/22 14:26:48 INFO Remoting: Starting remoting
> 14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@192.168.10.88:49963]
> 14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@192.168.10.88:49963]
> 14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
> 14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
> 14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory at
> /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
> 14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
> capacity 911.6 MB.
> 14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port 49964
> with id = ConnectionManagerId(192.168.10.88,49964)
> 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block manager
> 192.168.10.88:49964 with 911.6 MB RAM
> 14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
> [error] (run-main) java.lang.NoClassDefFoundError:
> javax/servlet/http/HttpServletResponse
> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
> at
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
> at
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
> at
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
> at
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
> at
> org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
> at org.apache.spark.SparkContext.(SparkContext.scala:202)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
> at
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
> at
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:91)
> at com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
> at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
>
>
> Thanks,
> Shrikar
>


Re: accessing partition i+1 from mapper of partition i

2014-05-22 Thread Austin Gibbons
Mohit, if you want to end up with (1 .. N) , why don't you skip the logic
for finding missing values, and generate it directly?

val max = myCollection.reduce(math.max)
sc.parallelize(1 to max)

In either case, you don't need to call cache, which will force it into
memory - you can do something like "count" which will not necessarily store
the RDD in memory.

Additionally, instead of an accumulable, you could consider mapping that
value directly:

rdd.mapPartitionsWithIndex { case (index, partition) =>
  Iterator(index -> partition.reduce(math.max)) }.collectAsMap()


On Mon, May 19, 2014 at 9:50 PM, Mohit Jaggi  wrote:

> Thanks Brian. This works. I used Accumulable to do the "collect" in step
> B. While doing that I found that Accumulable.value is not a Spark "action",
> I need to call "cache" in the underlying RDD for "value" to work. Not sure
> if that is intentional or a bug.
> The "collect" of Step B can be done as a new RDD too.
>
>
> On Thu, May 15, 2014 at 5:47 PM, Brian Gawalt  wrote:
>
>> I don't think there's a direct way of bleeding elements across
>> partitions. But you could write it yourself relatively succinctly:
>>
>> A) Sort the RDD
>> B) Look at the sorted RDD's partitions with the .mapParititionsWithIndex(
>> ) method. Map each partition to its partition ID, and its maximum element.
>> Collect the (partID, maxElements) in the driver.
>> C) Broadcast the collection of (partID, part's max element) tuples
>> D) Look again at the sorted RDD's partitions with mapPartitionsWithIndex(
>> ). For each partition *K:*
>> D1) Find the immediately-preceding partition* K -1 , *and its associated
>> maximum value. Use that to decide how many values are missing between the
>> last element of part *K-1 *and the first element of part *K*.
>> D2) Step through part *K*'s elements and find the rest of the missing
>> elements in that part
>>
>> This approach sidesteps worries you might have over the hack of using
>> .filter to remove the first element, as well as the zipping.
>>
>> --Brian
>>
>>
>>
>> On Tue, May 13, 2014 at 9:31 PM, Mohit Jaggi wrote:
>>
>>> Hi,
>>> I am trying to find a way to fill in missing values in an RDD. The RDD
>>> is a sorted sequence.
>>> For example, (1, 2, 3, 5, 8, 11, ...)
>>> I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)
>>>
>>> One way to do this is to "slide and zip"
>>> rdd1 =  sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
>>> x = rdd1.first
>>> rdd2 = rdd1 filter (_ != x)
>>> rdd3 = rdd2 zip rdd1
>>> rdd4 = rdd3 flatmap { (x, y) => generate missing elements between x and
>>> y }
>>>
>>> Another method which I think is more efficient is to use
>>> mapParititions() on rdd1 to be able to iterate on elements of rdd1 in each
>>> partition. However, that leaves the boundaries of the partitions to be
>>> "unfilled". *Is there a way within the function passed to
>>> mapPartitions, to read the first element in the next partition?*
>>>
>>> The latter approach also appears to work for a general "sliding window"
>>> calculation on the RDD. The former technique requires a lot of "sliding and
>>> zipping" and I believe it is not efficient. If only I could read the next
>>> partition...I have tried passing a pointer to rdd1 to the function passed
>>> to mapPartitions but the rdd1 pointer turns out to be NULL, I guess because
>>> Spark cannot deal with a mapper calling another mapper (since it happens on
>>> a worker not the driver)
>>>
>>> Mohit.
>>>
>>>
>>
>


-- 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Austin Gibbons
Research | quantiFind  | 708 601 4894
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Unable to run a Standalone job

2014-05-22 Thread Shrikar archak
Hi All,

I am trying to run the network count example as a separate standalone job
and running into some issues.

Environment:
1) Mac Mavericks
2) Latest spark repo from Github.


I have a structure like this

Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/NetworkWordCount.scala
./src/main/scala/SimpleApp.scala.bk


simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" %
"1.0.0-SNAPSHOT",
"org.apache.spark" %% "spark-streaming" %
"1.0.0-SNAPSHOT")

resolvers += "Akka Repository" at "http://repo.akka.io/releases/";


I am able to run the SimpleApp which is mentioned in the docs, but when I try
to run the NetworkWordCount app I get the error below. Am I missing
something?

[info] Running com.shrikar.sparkapps.NetworkWordCount
14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to: shrikar
14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(shrikar)
14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/05/22 14:26:48 INFO Remoting: Starting remoting
14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@192.168.10.88:49963]
14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@192.168.10.88:49963]
14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory at
/var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
capacity 911.6 MB.
14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port
49964 with id = ConnectionManagerId(192.168.10.88,49964)
14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block manager
192.168.10.88:49964 with 911.6 MB RAM
14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered BlockManager
14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
[error] (run-main) java.lang.NoClassDefFoundError:
javax/servlet/http/HttpServletResponse
java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at
org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
 at
org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
at
org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
 at
org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
at
org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
at org.apache.spark.SparkContext.(SparkContext.scala:202)
 at
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
at
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
 at
org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:91)
at com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
 at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)


Thanks,
Shrikar


Re: How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
Thanks.
I used the following command line to run the SVM, but it seems that the path
is not correct. What should the right path or command line be? Thanks.
./bin/run-example org.apache.spark.mllib.classification.SVM
spark://100.1.255.193:7077 train.csv 20
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/mllib/classification/SVM
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.mllib.classification.SVM
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.mllib.classification.SVM. 
Program will exit.





Re: How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
Thanks Stephen,
I used the following command line to run the SVM, but it seems that the
path is not correct. What should the right path or command line be? Thanks.
./bin/run-example org.apache.spark.mllib.classification.SVM
spark://100.1.255.193:7077 train.csv 20
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/mllib/classification/SVM
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.mllib.classification.SVM
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.mllib.classification.SVM.
 Program will exit.







On Thu, May 22, 2014 at 3:05 PM, Stephen Boesch [via Apache Spark User
List]  wrote:
>
> There is a bin/run-example.sh  []
>
>
> 2014-05-22 12:48 GMT-07:00 yxzhao <[hidden email]>:
>>
>> I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
>> following directory on my data set. But I did not find the sample command
>> line to run them. Anybody help? Thanks.
>>
>>
spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
>>
>>
>>
>> --
>> View this message in context:
>>
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>
> 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277p6287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Andrew Ash
Spark uses the Twitter Chill library, which registers a bunch of core Scala
and Java classes by default.  I'm assuming that java.util.Date is
automatically registered by that, but Joda's DateTime is not.  We could
always take a look through the source to confirm too.

As far as the class name, my understanding was that it would have the class
name at the start of every serialized object, not just once.  I did some
tests at one point to confirm that, but it's a little fuzzy so I won't say
definitely that the class name is repeated.  Can you look at the
Kryo-serialized version of the classes at some point to see what actually
happens?
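
For completeness, a minimal sketch of registering Joda's DateTime explicitly, assuming the standard KryoRegistrator hook (registration avoids writing the class name per record; a purpose-built Joda serializer may still be worth adding on top):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register every class that ends up inside shuffled records
    kryo.register(classOf[org.joda.time.DateTime])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")  // fully-qualified name if it lives in a package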


On Thu, May 22, 2014 at 5:02 PM, Mohit Jaggi  wrote:

> Andrew,
> I did not register anything explicitly based on the belief that the class
> name is written out in full only once. I also wondered why that problem
> would be specific to JodaTime and not show up with java.util.Date...I guess
> it is possible based on internals of Joda time.
> If I remove DateTime from my RDD, the problem goes away.
> I will try explicit registration(and add DateTime back to my RDD) and see
> if that makes things better.
>
> Mohit.
>
>
>
>
> On Wed, May 21, 2014 at 8:36 PM, Andrew Ash  wrote:
>
>> Hi Mohit,
>>
>> The log line about the ExternalAppendOnlyMap is more of a symptom of
>> slowness than causing slowness itself.  The ExternalAppendOnlyMap is used
>> when a shuffle is causing too much data to be held in memory.  Rather than
>> OOM'ing, Spark writes the data out to disk in a sorted order and reads it
>> back from disk later on when it's needed.  That's the job of the
>> ExternalAppendOnlyMap.
>>
>> I wouldn't normally expect a conversion from Date to a Joda DateTime to
>> take significantly more memory.  But since you're using Kryo and classes
> should be registered with it, you may have forgotten to register DateTime
>> with Kryo.  If you don't register a class, it writes the class name at the
>> beginning of every serialized instance, which for DateTime objects of size
>> roughly 1 long, that's a ton of extra space and very inefficient.
>>
>> Can you confirm that DateTime is registered with Kryo?
>>
>> http://spark.apache.org/docs/latest/tuning.html#data-serialization
>>
>>
>> On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi wrote:
>>
>>> Hi,
>>>
>>> I changed my application to use Joda time instead of java.util.Date and
>>> I started getting this:
>>>
>>> WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
>>> time so far)
>>>
>>> What does this mean? How can I fix this? Due to this a small job takes
>>> forever.
>>>
>>> Mohit.
>>>
>>>
>>> P.S.: I am using Kryo serialization, have played around with several
>>> values of sparkRddMemFraction
>>>
>>
>>
>


Re: Spark / YARN classpath issues

2014-05-22 Thread Andrew Or
I think you should be able to drop "yarn-standalone" altogether. We
recently updated SparkPi to take in 1 argument (num slices, which you set
to 10). Previously, it took in 2 arguments, the master and num slices.

Glad you got it figured out.


2014-05-22 13:41 GMT-07:00 Jon Bender :

> Andrew,
>
> Brilliant!  I built on Java 7 but was still running our cluster on Java 6.
>  Upgraded the cluster and it worked (with slight tweaks to the args, I
> guess the app args come first then yarn-standalone comes last):
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
> \
>   ./bin/spark-class org.apache.spark.deploy.yarn.Client \
>   --jar
> examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
> \
>   --class org.apache.spark.examples.SparkPi \
>   --args 10 \
>   --args yarn-standalone \
>   --num-workers 3 \
>   --master-memory 4g \
>   --worker-memory 2g \
>   --worker-cores 1
>
> I'll make sure to use spark-submit from here on out.
>
> Thanks very much!
> Jon
>
>
> On Thu, May 22, 2014 at 12:40 PM, Andrew Or  wrote:
>
>> Hi Jon,
>>
>> Your configuration looks largely correct. I have very recently confirmed
>> that the way you launch SparkPi also works for me.
>>
>> I have run into the same problem a bunch of times. My best guess is that
>> this is a Java version issue. If the Spark assembly jar is built with Java
>> 7, it cannot be opened by Java 6 because the two versions use different
>> packaging schemes. This is a known issue:
>> https://issues.apache.org/jira/browse/SPARK-1520.
>>
>> The workaround is to either make sure that all your executor nodes are
>> running Java 7, and, very importantly, have JAVA_HOME point to this
>> version. You can achieve this through
>>
>> export SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java7/home"
>>
>> in spark-env.sh. Another safe alternative, of course, is to just build
>> the jar with Java 6. An additional debugging step is to review the launch
>> environment of all the containers. This is detailed in the last paragraph
>> of this section:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/running-on-yarn.html#debugging-your-application.
>> This may not be necessary, but I have personally found it immensely useful.
>>
>> One last thing, launching Spark applications through
>> org.apache.spark.deploy.yarn.Client is deprecated in Spark 1.0. You should
>> use bin/spark-submit instead. You can find information about its usage on
>> the docs I linked to you, or simply through the --help option.
>>
>> Cheers,
>> Andrew
>>
>>
>> 2014-05-22 11:38 GMT-07:00 Jon Bender :
>>
>> Hey all,
>>>
>>> I'm working through the basic SparkPi example on a YARN cluster, and i'm
>>> wondering why my containers don't pick up the spark assembly classes.
>>>
>>> I built the latest spark code against CDH5.0.0
>>>
>>> Then ran the following:
>>> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
>>> \
>>>   ./bin/spark-class org.apache.spark.deploy.yarn.Client \
>>>   --jar
>>> examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
>>> \
>>>   --class org.apache.spark.examples.SparkPi \
>>>   --args yarn-standalone \
>>>   --num-workers 3 \
>>>   --master-memory 4g \
>>>   --worker-memory 2g \
>>>   --worker-cores 1
>>>
>>> The job dies, and in the stderr from the containers I see
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/deploy/yarn/ApplicationMaster
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.deploy.yarn.ApplicationMaster
>>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
>>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
>>>
>>> my yarn-site.xml contains the following classpath:
>>>   <property>
>>>     <name>yarn.application.classpath</name>
>>>     <value>
>>>       /etc/hadoop/conf/,
>>>       /usr/lib/hadoop/*,/usr/lib/hadoop//lib/*,
>>>       /usr/lib/hadoop-hdfs/*,/user/lib/hadoop-hdfs/lib/*,
>>>       /usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,
>>>       /usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,
>>>       /usr/lib/avro/*
>>>     </value>
>>>   </property>
>>>
>>> I've confirmed that the spark-assembly JAR has this class.  Does it
>>> actually need to be defined in yarn.application.classpath or should the
>>> spark client take care of ensuring the necessary JARs are added during job
>>> submission?
>>>
>>> Any tips would be greatly appreciated!
>>> Cheers,
>>> Jon
>>>
>>
>>
>


Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-22 Thread Mohit Jaggi
Andrew,
I did not register anything explicitly based on the belief that the class
name is written out in full only once. I also wondered why that problem
would be specific to JodaTime and not show up with java.util.Date...I guess
it is possible based on internals of Joda time.
If I remove DateTime from my RDD, the problem goes away.
I will try explicit registration(and add DateTime back to my RDD) and see
if that makes things better.

Mohit.




On Wed, May 21, 2014 at 8:36 PM, Andrew Ash  wrote:

> Hi Mohit,
>
> The log line about the ExternalAppendOnlyMap is more of a symptom of
> slowness than causing slowness itself.  The ExternalAppendOnlyMap is used
> when a shuffle is causing too much data to be held in memory.  Rather than
> OOM'ing, Spark writes the data out to disk in a sorted order and reads it
> back from disk later on when it's needed.  That's the job of the
> ExternalAppendOnlyMap.
>
> I wouldn't normally expect a conversion from Date to a Joda DateTime to
> take significantly more memory.  But since you're using Kryo and classes
> should be registered with it, you may have forgotten to register DateTime
> with Kryo.  If you don't register a class, it writes the class name at the
> beginning of every serialized instance, which for DateTime objects of size
> roughly 1 long, that's a ton of extra space and very inefficient.
>
> Can you confirm that DateTime is registered with Kryo?
>
> http://spark.apache.org/docs/latest/tuning.html#data-serialization
>
>
> On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi  wrote:
>
>> Hi,
>>
>> I changed my application to use Joda time instead of java.util.Date and I
>> started getting this:
>>
>> WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1
>> time so far)
>>
>> What does this mean? How can I fix this? Due to this a small job takes
>> forever.
>>
>> Mohit.
>>
>>
>> P.S.: I am using Kryo serialization, have played around with several
>> values of sparkRddMemFraction
>>
>
>


Re: spark setting maximum available memory

2014-05-22 Thread Mayur Rustagi
Ideally you should use less; around 75% is a good rule of thumb, so you
leave enough room for shuffle-write scratch space and system processes.
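
For example, a small sketch of that rule of thumb on an 8 GB worker (the 6g figure is only roughly 75% and is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.memory", "6g")  // ~75% of an 8 GB worker, leaving headroom
val sc = new SparkContext(conf)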

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Fri, May 23, 2014 at 1:41 AM, Andrew Or  wrote:

> Hi Ibrahim,
>
> If your worker machines only have 8GB of memory, then launching executors
> with all the memory will leave no room for system processes. There is no
> guideline, but I usually leave around 1GB just to be safe, so
>
> conf.set("spark.executor.memory", "7g")
>
> Andrew
>
>
> 2014-05-22 7:23 GMT-07:00 İbrahim Rıza HALLAÇ 
> :
>
>  In my situation each slave has 8 GB memory.  I want to use the maximum
>> memory that I can: .set("spark.executor.memory", "?g")
>> How can I determine the amount of memory I should set ? It fails when I
>> set it to 8GB.
>>
>
>


Re: How to Run Machine Learning Examples

2014-05-22 Thread Marco Shaw
About run-example, I've tried the MapR, Hortonworks and Cloudera distributions with 
their Spark packages, and none of them seem to package it. 

Am I missing something?  Is this only provided with the Spark project pre-built 
binaries or from source installs?

Marco

> On May 22, 2014, at 5:04 PM, Stephen Boesch  wrote:
> 
> 
> There is a bin/run-example.sh  []
> 
> 
> 2014-05-22 12:48 GMT-07:00 yxzhao :
>> I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
>> following directory on my data set. But I did not find the sample command
>> line to run them. Anybody help? Thanks.
>> spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


Re: Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
Andrew,

Brilliant!  I built on Java 7 but was still running our cluster on Java 6.
 Upgraded the cluster and it worked (with slight tweaks to the args, I
guess the app args come first then yarn-standalone comes last):

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
\
  ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar
examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
\
  --class org.apache.spark.examples.SparkPi \
  --args 10 \
  --args yarn-standalone \
  --num-workers 3 \
  --master-memory 4g \
  --worker-memory 2g \
  --worker-cores 1

I'll make sure to use spark-submit from here on out.

Thanks very much!
Jon


On Thu, May 22, 2014 at 12:40 PM, Andrew Or  wrote:

> Hi Jon,
>
> Your configuration looks largely correct. I have very recently confirmed
> that the way you launch SparkPi also works for me.
>
> I have run into the same problem a bunch of times. My best guess is that
> this is a Java version issue. If the Spark assembly jar is built with Java
> 7, it cannot be opened by Java 6 because the two versions use different
> packaging schemes. This is a known issue:
> https://issues.apache.org/jira/browse/SPARK-1520.
>
> The workaround is to either make sure that all your executor nodes are
> running Java 7, and, very importantly, have JAVA_HOME point to this
> version. You can achieve this through
>
> export SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java7/home"
>
> in spark-env.sh. Another safe alternative, of course, is to just build the
> jar with Java 6. An additional debugging step is to review the launch
> environment of all the containers. This is detailed in the last paragraph
> of this section:
> http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/running-on-yarn.html#debugging-your-application.
> This may not be necessary, but I have personally found it immensely useful.
>
> One last thing, launching Spark applications through
> org.apache.spark.deploy.yarn.Client is deprecated in Spark 1.0. You should
> use bin/spark-submit instead. You can find information about its usage on
> the docs I linked to you, or simply through the --help option.
>
> Cheers,
> Andrew
>
>
> 2014-05-22 11:38 GMT-07:00 Jon Bender :
>
> Hey all,
>>
>> I'm working through the basic SparkPi example on a YARN cluster, and i'm
>> wondering why my containers don't pick up the spark assembly classes.
>>
>> I built the latest spark code against CDH5.0.0
>>
>> Then ran the following:
>> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
>> \
>>   ./bin/spark-class org.apache.spark.deploy.yarn.Client \
>>   --jar
>> examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
>> \
>>   --class org.apache.spark.examples.SparkPi \
>>   --args yarn-standalone \
>>   --num-workers 3 \
>>   --master-memory 4g \
>>   --worker-memory 2g \
>>   --worker-cores 1
>>
>> The job dies, and in the stderr from the containers I see
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/deploy/yarn/ApplicationMaster
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.deploy.yarn.ApplicationMaster
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
>> at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
>>
>> my yarn-site.xml contains the following classpath:
>>   <property>
>>     <name>yarn.application.classpath</name>
>>     <value>
>>       /etc/hadoop/conf/,
>>       /usr/lib/hadoop/*,/usr/lib/hadoop//lib/*,
>>       /usr/lib/hadoop-hdfs/*,/user/lib/hadoop-hdfs/lib/*,
>>       /usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,
>>       /usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,
>>       /usr/lib/avro/*
>>     </value>
>>   </property>
>>
>> I've confirmed that the spark-assembly JAR has this class.  Does it
>> actually need to be defined in yarn.application.classpath or should the
>> spark client take care of ensuring the necessary JARs are added during job
>> submission?
>>
>> Any tips would be greatly appreciated!
>> Cheers,
>> Jon
>>
>
>


Re: Spark Job Server first steps

2014-05-22 Thread Gerard Maas
Hi Michael,

Thanks for the tip on the /tmp dir. I had unzipped all the jars before
uploading to check for the class. The issue is that the jars were not
uploaded correctly.
I was not familiar with the '@' syntax of curl and omitted it, resulting in
a Jar file containing only the jar's name.

curl --data-binary @sparkjobservertest_2.10-0.1.jar localhost:8090/jars/test

Definitely a case of PEBKAC, or 'PICNIC' as I used to know it. :-)

Thanks!

-Gerard.


On Thu, May 22, 2014 at 7:52 PM, Michael Cutler  wrote:

> Hi Gerard,
>
> We're using the Spark Job Server in production, from GitHub [master]
> running against a recent Spark-1.0 snapshot so it definitely works.  I'm
> afraid the only time we've seen a similar error was an unfortunate case of
> PEBKAC .
>
> First and foremost, have you tried doing an unzip -l "
> /tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar" on
> the JAR uploaded to the server to make sure the class is where you're
> expecting it to be?
>
> It's not uncommon for a package statement to be neglected when moving
> classes around in an IDE like Eclipse.
>
> Best,
>
> Michael
>
>
>
>
>  *Michael Cutler*
> Founder, CTO
>
>
>
>
> On 22 May 2014 18:25, Gerard Maas  wrote:
>
>> Hi,
>>
>> I'm starting to explore the Spark Job Server contributed by Ooyala [1],
>> running from the master branch.
>>
>> I started by developing and submitting a simple job and the JAR check
>> gave me errors on a seemingly good jar. I disabled the fingerprint checking
>> on the jar and  I could submit it, but when I tried to submit the job, it
>> could not find its classpath. Therefore I decided to take a couple of
>> steps backwards and go through the example in the docs.
>>
>> Using the (Hello)WordCount example, upload is OK and the jar is in the UI
>> as well, but when I submit the job, I get the same classpathNotFound error
>> as before:
>>
>> 19:07 $ curl -d ""
>> 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
>> {
>>   "status": "ERROR",
>>   "result": "classPath spark.jobserver.WordCountExample not found"
>> }
>>
>> I'm not sure where it goes wrong.  Here's what seems to be the relevant
>> snippet in the server logs:
>>
>> [2014-05-22 19:17:28,891] INFO  .apache.spark.SparkContext []
>> [akka://JobServer/user/context-
>> supervisor/666d021a-spark.jobserver.WordCountExample] - Added JAR
>> /tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar at
>> http://172.17.42.1:37978/jars/test-2014-05-22T18:44:09.254+02:00.jar with 
>> timestamp 1400779048891
>> [2014-05-22 19:17:28,891] INFO  util.ContextURLClassLoader []
>> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
>> - Added URL
>> file:/tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar
>> to ContextURLClassLoader
>> [2014-05-22 19:17:28,891] INFO  spark.jobserver.JarUtils$ []
>> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
>> - Loading object spark.jobserver.WordCountExample$ using loader
>> spark.jobserver.util.ContextURLClassLoader@5deae1b7
>> [2014-05-22 19:17:28,892] INFO  spark.jobserver.JarUtils$ []
>> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
>> - Loading class spark.jobserver.WordCountExample using loader
>> spark.jobserver.util.ContextURLClassLoader@5deae1b7
>>
>> * all OK until here and then ...*
>>
>> [2014-05-22 19:17:28,892] INFO  ocalContextSupervisorActor []
>> [akka://JobServer/user/context-supervisor] - Shutting down context
>> 666d021a-spark.jobserver.WordCountExample
>>
>> Any ideas? Something silly I might be doing?  btw, I'm running in dev
>> mode using sbt and default config (local).
>>
>> -kr, Gerard.
>>
>>
>> [1] https://github.com/ooyala/spark-jobserver
>>
>
>


Re: spark setting maximum available memory

2014-05-22 Thread Andrew Or
Hi Ibrahim,

If your worker machines only have 8GB of memory, then launching executors
with all the memory will leave no room for system processes. There is no
guideline, but I usually leave around 1GB just to be safe, so

conf.set("spark.executor.memory", "7g")

Andrew


2014-05-22 7:23 GMT-07:00 İbrahim Rıza HALLAÇ :

> In my situation each slave has 8 GB memory.  I want to use the maximum
> memory that I can: .set("spark.executor.memory", "?g")
> How can I determine the amount of memory I should set ? It fails when I
> set it to 8GB.
>


Re: How to Run Machine Learning Examples

2014-05-22 Thread Stephen Boesch
There is a bin/run-example.sh  []


2014-05-22 12:48 GMT-07:00 yxzhao :

> I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
> following directory on my data set. But I did not find the sample command
> line to run them. Anybody help? Thanks.
>
> spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


How to Run Machine Learning Examples

2014-05-22 Thread yxzhao
I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
following directory on my data set. But I did not find the sample command
line to run them. Anybody help? Thanks.
spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Error while launching ec2 spark cluster with HVM (r3.large)

2014-05-22 Thread adparker
I had this problem too and fixed it by setting the wait time-out to a larger
value: --wait

For example, in "stand alone" mode with default values, a time out of 480
seconds worked for me:

$ cd spark-0.9.1/ec2
$ ./spark-ec2 --key-pair= --identity-file= --instance-type=r3.large
--wait=480  launch 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-launching-ec2-spark-cluster-with-HVM-r3-large-tp5862p6276.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark / YARN classpath issues

2014-05-22 Thread Andrew Or
Hi Jon,

Your configuration looks largely correct. I have very recently confirmed
that the way you launch SparkPi also works for me.

I have run into the same problem a bunch of times. My best guess is that
this is a Java version issue. If the Spark assembly jar is built with Java
7, it cannot be opened by Java 6 because the two versions use different
packaging schemes. This is a known issue:
https://issues.apache.org/jira/browse/SPARK-1520.

The workaround is to either make sure that all your executor nodes are
running Java 7, and, very importantly, have JAVA_HOME point to this
version. You can achieve this through

export SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java7/home"

in spark-env.sh. Another safe alternative, of course, is to just build the
jar with Java 6. An additional debugging step is to review the launch
environment of all the containers. This is detailed in the last paragraph
of this section:
http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/running-on-yarn.html#debugging-your-application.
This may not be necessary, but I have personally found it immensely useful.

One last thing, launching Spark applications through
org.apache.spark.deploy.yarn.Client is deprecated in Spark 1.0. You should
use bin/spark-submit instead. You can find information about its usage on
the docs I linked to you, or simply through the --help option.

Cheers,
Andrew


2014-05-22 11:38 GMT-07:00 Jon Bender :

> Hey all,
>
> I'm working through the basic SparkPi example on a YARN cluster, and i'm
> wondering why my containers don't pick up the spark assembly classes.
>
> I built the latest spark code against CDH5.0.0
>
> Then ran the following:
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
> \
>   ./bin/spark-class org.apache.spark.deploy.yarn.Client \
>   --jar
> examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
> \
>   --class org.apache.spark.examples.SparkPi \
>   --args yarn-standalone \
>   --num-workers 3 \
>   --master-memory 4g \
>   --worker-memory 2g \
>   --worker-cores 1
>
> The job dies, and in the stderr from the containers I see
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/deploy/yarn/ApplicationMaster
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.deploy.yarn.ApplicationMaster
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
>
> my yarn-site.xml contains the following classpath:
>   <property>
>     <name>yarn.application.classpath</name>
>     <value>
>       /etc/hadoop/conf/,
>       /usr/lib/hadoop/*,/usr/lib/hadoop//lib/*,
>       /usr/lib/hadoop-hdfs/*,/user/lib/hadoop-hdfs/lib/*,
>       /usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,
>       /usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,
>       /usr/lib/avro/*
>     </value>
>   </property>
>
> I've confirmed that the spark-assembly JAR has this class.  Does it
> actually need to be defined in yarn.application.classpath or should the
> spark client take care of ensuring the necessary JARs are added during job
> submission?
>
> Any tips would be greatly appreciated!
> Cheers,
> Jon
>


Spark / YARN classpath issues

2014-05-22 Thread Jon Bender
Hey all,

I'm working through the basic SparkPi example on a YARN cluster, and i'm
wondering why my containers don't pick up the spark assembly classes.

I built the latest spark code against CDH5.0.0

Then ran the following:
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
\
  ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar
examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.3.0-cdh5.0.0.jar
\
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --num-workers 3 \
  --master-memory 4g \
  --worker-memory 2g \
  --worker-cores 1

The job dies, and in the stderr from the containers I see
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/yarn/ApplicationMaster
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.ApplicationMaster
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)

my yarn-site.xml contains the following classpath:
  <property>
    <name>yarn.application.classpath</name>
    <value>
      /etc/hadoop/conf/,
      /usr/lib/hadoop/*,/usr/lib/hadoop//lib/*,
      /usr/lib/hadoop-hdfs/*,/user/lib/hadoop-hdfs/lib/*,
      /usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,
      /usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,
      /usr/lib/avro/*
    </value>
  </property>

I've confirmed that the spark-assembly JAR has this class.  Does it
actually need to be defined in yarn.application.classpath or should the
spark client take care of ensuring the necessary JARs are added during job
submission?

Any tips would be greatly appreciated!
Cheers,
Jon


Re: Spark Job Server first steps

2014-05-22 Thread Michael Cutler
Hi Gerard,

We're using the Spark Job Server in production, from GitHub [master]
running against a recent Spark-1.0 snapshot so it definitely works.  I'm
afraid the only time we've seen a similar error was an unfortunate case of
PEBKAC .

First and foremost, have you tried doing an unzip -l "/tmp/spark-jobserver/
filedao/data/test-2014-05-22T18:44:09.254+02:00.jar" on the JAR uploaded to
the server to make sure the class is where you're expecting it to be?

It's not uncommon for a package statement to be neglected when moving
classes around in an IDE like Eclipse.
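
To make the packaging point concrete, a rough sketch of the minimal job shape, with the package matching the classPath passed to /jobs (the trait and method signatures follow the ooyala/spark-jobserver docs of that time, so treat them as assumptions):

package spark.jobserver

import com.typesafe.config.Config
import org.apache.spark.SparkContext

object WordCountExample extends SparkJob {
  // Cheap sanity check of the job config before the job runs
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // classPath in the curl call must then be exactly spark.jobserver.WordCountExample
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .countByValue()
}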

Best,

Michael




*Michael Cutler*
Founder, CTO




On 22 May 2014 18:25, Gerard Maas  wrote:

> Hi,
>
> I'm starting to explore the Spark Job Server contributed by Ooyala [1],
> running from the master branch.
>
> I started by developing and submitting a simple job and the JAR check gave
> me errors on a seemingly good jar. I disabled the fingerprint checking on
> the jar and  I could submit it, but when I tried to submit the job, it
> could not find its classpath. Therefore I decided to take a couple of
> steps backwards and go through the example in the docs.
>
> Using the (Hello)WordCount example, upload is OK and the jar is in the UI
> as well, but when I submit the job, I get the same classpathNotFound error
> as before:
>
> 19:07 $ curl -d ""
> 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
> {
>   "status": "ERROR",
>   "result": "classPath spark.jobserver.WordCountExample not found"
> }
>
> I'm not sure where it goes wrong.  Here's what seems to be the relevant
> snippet in the server logs:
>
> [2014-05-22 19:17:28,891] INFO  .apache.spark.SparkContext []
> [akka://JobServer/user/context-
> supervisor/666d021a-spark.jobserver.WordCountExample] - Added JAR
> /tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar at
> http://172.17.42.1:37978/jars/test-2014-05-22T18:44:09.254+02:00.jar with
> timestamp 1400779048891
> [2014-05-22 19:17:28,891] INFO  util.ContextURLClassLoader []
> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
> - Added URL
> file:/tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar
> to ContextURLClassLoader
> [2014-05-22 19:17:28,891] INFO  spark.jobserver.JarUtils$ []
> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
> - Loading object spark.jobserver.WordCountExample$ using loader
> spark.jobserver.util.ContextURLClassLoader@5deae1b7
> [2014-05-22 19:17:28,892] INFO  spark.jobserver.JarUtils$ []
> [akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
> - Loading class spark.jobserver.WordCountExample using loader
> spark.jobserver.util.ContextURLClassLoader@5deae1b7
>
> * all OK until here and then ...*
>
> [2014-05-22 19:17:28,892] INFO  ocalContextSupervisorActor []
> [akka://JobServer/user/context-supervisor] - Shutting down context
> 666d021a-spark.jobserver.WordCountExample
>
> Any ideas? Something silly I might be doing?  btw, I'm running in dev mode
> using sbt and default config (local).
>
> -kr, Gerard.
>
>
> [1] https://github.com/ooyala/spark-jobserver
>


Spark Job Server first steps

2014-05-22 Thread Gerard Maas
Hi,

I'm starting to explore the Spark Job Server contributed by Ooyala [1],
running from the master branch.

I started by developing and submitting a simple job and the JAR check gave
me errors on a seemingly good jar. I disabled the fingerprint checking on
the jar and  I could submit it, but when I tried to submit the job, it
could not find its classpath. Therefore I decided to take a couple of
steps backwards and go through the example in the docs.

Using the (Hello)WordCount example, upload is OK and the jar is in the UI
as well, but when I submit the job, I get the same classpathNotFound error
as before:

19:07 $ curl -d ""
'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "ERROR",
  "result": "classPath spark.jobserver.WordCountExample not found"
}

I'm not sure where it goes wrong.  Here's what seems to be the relevant
snippet in the server logs:

[2014-05-22 19:17:28,891] INFO  .apache.spark.SparkContext []
[akka://JobServer/user/context-
supervisor/666d021a-spark.jobserver.WordCountExample] - Added JAR
/tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar at
http://172.17.42.1:37978/jars/test-2014-05-22T18:44:09.254+02:00.jar with
timestamp 1400779048891
[2014-05-22 19:17:28,891] INFO  util.ContextURLClassLoader []
[akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
- Added URL
file:/tmp/spark-jobserver/filedao/data/test-2014-05-22T18:44:09.254+02:00.jar
to ContextURLClassLoader
[2014-05-22 19:17:28,891] INFO  spark.jobserver.JarUtils$ []
[akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
- Loading object spark.jobserver.WordCountExample$ using loader
spark.jobserver.util.ContextURLClassLoader@5deae1b7
[2014-05-22 19:17:28,892] INFO  spark.jobserver.JarUtils$ []
[akka://JobServer/user/context-supervisor/666d021a-spark.jobserver.WordCountExample]
- Loading class spark.jobserver.WordCountExample using loader
spark.jobserver.util.ContextURLClassLoader@5deae1b7

* all OK until here and then ...*

[2014-05-22 19:17:28,892] INFO  ocalContextSupervisorActor []
[akka://JobServer/user/context-supervisor] - Shutting down context
666d021a-spark.jobserver.WordCountExample

Any ideas? Something silly I might be doing?  btw, I'm running in dev mode
using sbt and default config (local).

-kr, Gerard.


[1] https://github.com/ooyala/spark-jobserver


Spark Streaming with Kafka | Check if DStream is Empty | HDFS Write

2014-05-22 Thread Anish Sneh
Hi All

I am using Spark Streaming with Kafka. I receive messages and, after minor 
processing, write them to HDFS; as of now I am using the saveAsTextFiles() / 
saveAsHadoopFiles() Java methods.


- Is there some default way of writing a stream to Hadoop, like the HDFS sink 
concept in Flume? I mean, is there some configurable way of writing from Spark 
Streaming after processing the DStream?

- How can I check if a DStream is empty so that I can skip the HDFS write when 
no messages are present (I am polling the Kafka topic every 1 sec)? Sometimes 
it writes empty files to HDFS because no messages are available.

 
Please suggest.
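
On the second question, one common pattern is to guard the write inside foreachRDD; a sketch, assuming `stream` is the Kafka DStream and with an illustrative output path (take(1) is used because RDD.isEmpty is not available in these Spark versions):

stream.foreachRDD { (rdd, time) =>
  // Only create an output directory when the micro-batch actually has data
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile("hdfs:///data/kafka/batch-" + time.milliseconds)
  }
}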

TIA

-- 
Anish Sneh
"Experience is the best teacher."
+91-99718-55883
http://in.linkedin.com/in/anishsneh

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
I am not 100% sure of the functionality in Catalyst; probably the easiest
way to see what it supports is to look at SqlParser.scala in Git. Straight
away I can see "LIKE", "RLIKE" and "REGEXP", so clearly some of the basics
are in there.

As the saying goes ... *"Use the source, Luke!"*  :o)
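
For instance, building on the keywords visible in SqlParser.scala and the candy_crush_interaction table registered in the example quoted below, a quick sketch; plain LIKE/RLIKE matching is expressible, but Lucene-style fuzzy matching such as it~0.5 is not:

// Prefix match on the language tag; similarity scoring would need a UDF or a
// pre-computed similarity column instead.
sql("SELECT * FROM candy_crush_interaction WHERE language LIKE 'it%'")
  .collect()
  .foreach(println)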


*Michael Cutler*
Founder, CTO




On 22 May 2014 09:06, Flavio Pompermaier  wrote:

> Is there a way to query fields by similarity (like Lucene or using a
> similarity metric) to be able to query something like WHERE language LIKE
> "it~0.5" ?
>
> Best,
> Flavio
>
>
> On Thu, May 22, 2014 at 8:56 AM, Michael Cutler  wrote:
>
>> Hi Nick,
>>
>> Here is an illustrated example which extracts certain fields from
>> Facebook messages, each one is a JSON object and they are serialised into
>> files with one complete JSON object per line. Example of one such message:
>> CandyCrush.json 
>>
>> You need to define a case class which has all the fields you'll be able
>> to query later in SQL, e.g.
>>
>> case class CandyCrushInteraction(id: String, user: String, level: Int,
>> gender: String, language: String)
>>
>> The basic objective is to use Spark to convert the file from RDD[String] -
>> - parse JSON - - > RDD[JValue] - - extract fields - - > RDD[
>> CandyCrushInteraction]
>>
>> // Produces a RDD[String]
>> val lines = sc.textFile("facebook-2014-05-19.json")
>>
>> // Process the messages
>> val interactions = lines.map(line => {
>>   // Parse the JSON, returns RDD[JValue]
>>   parse(line)
>> }).filter(json => {
>>   // Filter out only 'Candy Crush Saga' Facebook App activity
>>   (json \ "facebook" \ "application").extract[String] == "Candy Crush Saga"
>> }).map(json => {
>>   // Extract fields we want, we use compact() because they may not exist
>>   val id = compact(json \ "facebook" \ "id")
>>   val user = compact(json \ "facebook" \ "author" \ "hash")
>>   val gender = compact(json \ "demographic" \ "gender")
>>   val language = compact(json \ "language" \ "tag")
>>   // Extract the 'level' using a RegEx or default to zero
>>   var level = 0;
>>   pattern.findAllIn( compact(json \ "interaction" \ "title") ).matchData.foreach(m => {
>>     level = m.group(1).toInt
>>   })
>>   // Return an RDD[CandyCrushInteraction]
>>   ( CandyCrushInteraction(id, user, level, gender, language) )
>> })
>>
>> Now you can register the RDD[CandyCrushInteraction] as a table and query
>> it in SQL.
>>
>> interactions.registerAsTable("candy_crush_interaction")
>>
>> // Game level by Gender
>> sql("SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level) FROM candy_crush_interaction WHERE level > 0 GROUP BY gender").collect().foreach(println)
>>
>> /* Returns:
>> ["male",14727,590,1,104.71705031574659]
>> ["female",15422,590,1,114.17202697445208]
>> ["mostly_male",2824,590,1,97.08852691218131]
>> ["mostly_female",1934,590,1,99.0517063081696]
>> ["unisex",2674,590,1,113.42071802543006]
>> [,11023,590,1,93.45677220357435]
>>  */
>>
>>
>> Full working example: 
>> CandyCrushSQL.scala
>>
>> MC
>>
>>
>> *Michael Cutler*
>> Founder, CTO
>>
>>
>>
>>
>> On 22 May 2014 04:43, Nicholas Chammas wrote:
>>
>>> That's a good idea.

Re: java.io.IOException: Failed to save output of task

2014-05-22 Thread Grega Kešpret
I have since resolved the issue. The problem was that multiple rdds were
trying to write to the same s3 bucket.

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com  |
@celtramobile


On Thu, May 22, 2014 at 8:18 AM, Grega Kešpret  wrote:

> Hello,
>
> my last reduce task in the job always fails with "java.io.IOException:
> Failed to save output of task" when using saveAsTextFile with s3 endpoint
> (all others are successful). Has anyone had similar problems?
>
> https://gist.github.com/gregakespret/813b540faca678413ad4
>
>
> -
>
> 14/05/21 21:44:45 ERROR SparkHadoopWriter: Error committing the output of
> task: attempt_201405212144__m_00_3432
> java.io.IOException: Failed to save output of task:
> attempt_201405212144__m_00_3432
> at
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
> at
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
> at
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
> at
> org.apache.hadoop.mapred.SparkHadoopWriter.commit(SparkHadoopWriter.scala:110)
> at 
> org.apache.spark.rdd.PairRDDFunctions.org
> $apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:731)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:734)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:734)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
>
> Grega
> --
> *Grega Kešpret*
> Analytics engineer
>
> Celtra — Rich Media Mobile Advertising
> celtra.com  | 
> @celtramobile
>


Fwd: Spark Streaming: Flume stream not found

2014-05-22 Thread Andy Konwinski
I'm forwarding this email along which contains a question from a Spark user
Adrien (CC'd) who can't successfully get any emails through to the Apache
mailing lists.

Please reply-all when responding to include Adrien. See below for his
question.
-- Forwarded message --
From: "Adrien Legrand" 
Date: May 22, 2014 1:06 AM
Subject: Re: Post validation
To: "Andy Konwinski" 
Cc:

Thanks, it would be nice ! Here is the question:

Title: Spark Streaming: Flume stream not found

Message:
Hi everyone,

I am currently trying to process a flume (avro) stream with spark streaming
on a yarn cluster. Everything is fine when I try to launch my code locally.
 To do so, I use the following args :

master = "local"
host, port = the machine I'm sending the stream to with flume (I triple
checked the concordance between flume / spark).

val ssc = new StreamingContext(master, "FlumeEventCount",
batchInterval,
System.getenv("SPARK_HOME"),
StreamingContext.jarOfClass(this.getClass))

val stream = FlumeUtils.createStream(ssc, host, port.toInt)

But when I try to launch the same job parallelized (replacing "local" by
"yarn-standalone"), the jar is launched (I can see some print I used to
debug the code) but it shows the expected output (from the data processing)
only 1 time out of 5 or 10.
Here is the complete command line:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client --jar
/home/www/loganalysis-1.0-SNAPSHOT-jar-with-dependencies.jar --class
com.loganalysis.Computation --args yarn-standalone --args
receiver.priv.fr --args  --num-workers 6 --master-memory 4g
--worker-memory 2g
--worker-cores 1

For no apparent reason, sometimes the processing is done. My first guess
was that, since I use a big jar with all dependencies in it, the other
machines don't have those dependencies and thus can't do the processing.
That's why I tried to add the jar I run with -addJars, but the result was
the same.


2014-05-22 7:18 GMT+02:00 Andy Konwinski :

> One other idea occurred to me. You can try subscribing to the nabble
> mailing lists and send your emails to those. They will relay emails to the
> apache lists. Not sure if it'll help or not.
>
> I'm really sorry to hear you're having problems with the lists.
>
> If you forward me your questions I can relay them to the mailing list for
> you.
>
>
>
> On Wed, May 21, 2014 at 1:56 AM, Adrien Legrand wrote:
>
>> Hello Andy,
>> Thank you for your answer.
>> I used the mailing list's form to submit both of the subjects. Is it
>> still possible that the problem you've talked about may be the cause ?
>> I am also directly checking my mails in gmail, I don't use any mailing
>> client like outlook.
>> The only thing I'm thinking about the problem here is my company's proxy,
>> but it seems really unlikely...
>>
>> Thank you,
>>
>> Adrien
>>
>>
>>
>> 2014-05-20 18:36 GMT+02:00 Andy Konwinski :
>>
>> Hi Adrien,
>>>
>>> I'm not positive but I suspect something about how you at up your email
>>> client (or forwarding or something) may be triggering spam filters.
>>>
>>> In my gmail a note pops up in the email from you that says: "this
>>> message may not have been sent by legrand...@gmail.com".
>>>
>>> Other than that I'm not sure what else might be going wrong. You can try
>>> pinging the apache infra people for more help too.
>>> On May 20, 2014 2:42 AM,  wrote:
>>>
 Hello Andy,
 I posted 2 different topics (the important one is "Spark streaming:
 flume stream not found") on the spark user mailing list, but none of them
 was accepted.
 I registered myself before posting each subject and I think I didn't
 made any mistakes.
 After posting the last (and important) subject, I received the
 following mail:

 Hi. This is the qmail-send program at apache.org  .
 I'm afraid I wasn't able to deliver your message to the following
 addresses.
 This is a permanent error; I've given up. Sorry it didn't work out.

 Did I do something wrong ?

 Thank you,

 Adrien

 _
 Sent from http://apache-spark-user-list.1001560.n3.nabble.com


>>
>


How to turn off MetadataCleaner?

2014-05-22 Thread Adrian Mocanu
Hi
After using Spark's TestSuiteBase to run some tests, I've noticed that at the 
end, after finishing all tests, the cleaner is still running and periodically 
outputs the following:
INFO  o.apache.spark.util.MetadataCleaner  - Ran metadata cleaner for 
SHUFFLE_BLOCK_MANAGER

I use the testOperation method and I've changed it so that it keeps a reference 
to ssc after running setupStreams. I then use that reference to shut things down, 
but the cleaner remains up.

How to shut down all of spark, including cleaner?

Here is how I changed the testOperation method:

  def testOperation[U: ClassTag, V: ClassTag](
      input: Seq[Seq[U]],
      operation: DStream[U] => DStream[V],
      expectedOutput: Seq[Seq[V]],
      numBatches: Int,
      useSet: Boolean
    ) {
    val numBatches_ = if (numBatches > 0) numBatches else expectedOutput.size
    val ssc = setupStreams[U, V](input, operation)
    val output = runStreams[V](ssc, numBatches_, expectedOutput.size)
    verifyOutput[V](output, expectedOutput, useSet)
    ssc.awaitTermination(500)
    ssc.stop(true)
  }
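
For what it's worth, a sketch of the teardown I would try, roughly what Spark's own streaming test suites do between tests as far as I recall (treat the details as assumptions for your version):

// Stop streaming together with the underlying SparkContext, then clear the
// driver properties so the next test starts from a clean slate.
ssc.stop(stopSparkContext = true)
System.clearProperty("spark.driver.port")
System.clearProperty("spark.hostPort")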

-Adrian



Akka disassociation on Java SE Embedded

2014-05-22 Thread Chanwit Kaewkasi
Hi all,

On an ARM cluster, I have been testing a wordcount program with JRE 7
and everything is OK. But when changing to the embedded version of
Java SE (Oracle's eJRE), the same program cannot complete all
computing stages.

It fails with many Akka disassociation errors.

- I've been trying to increase Akka's timeouts but am still stuck. I am
not sure what the right way to do so is. (I suspect that GC pausing
the world is causing this.)

- Another question: how can I properly turn on Akka's logging to see
the root cause of this disassociation problem (in case my guess about
GC is wrong)?
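
For reference, a sketch of the Akka-related settings that exist around Spark 0.9/1.0 and are usually the first knobs to try for both questions (the exact property names, units and defaults are worth double-checking against your version's configuration docs):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.timeout", "300")              // connection timeout
  .set("spark.akka.heartbeat.pauses", "600")     // tolerate long GC pauses
  .set("spark.akka.heartbeat.interval", "1000")
  .set("spark.akka.logLifecycleEvents", "true")  // surface remoting lifecycle events in the logs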

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


Re: logging in pyspark

2014-05-22 Thread Shivani Rao
I am having trouble adding logging to the class that does serialization and
deserialization. Where is the code for org.apache.spark.Logging located?
And is it serializable?
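
org.apache.spark.Logging is a trait in Spark core's org.apache.spark package, and it keeps its logger transient/lazy for exactly this reason. Regarding the "small serializable object that provides a logger" workaround discussed below, a minimal sketch (names are illustrative):

import org.apache.log4j.Logger

object TaskLogging extends Serializable {
  // @transient + lazy: the Logger is created on each JVM instead of being
  // shipped inside the task closure.
  @transient lazy val log: Logger = Logger.getLogger("task-logging")
}

// Hypothetical usage inside a task:
// myrdd.map { x => TaskLogging.log.info("processing " + x); x }.count()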


On Mon, May 12, 2014 at 10:02 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Ah, yes, that is correct. You need a serializable object one way or the
> other.
>
> An alternate suggestion would be to use a combination of 
> RDD.sample() and
>  collect() to take a look at some small amount of data and just log it
> from the driver. That's pretty awkward as well, but will spare you having
> to make some kind of serializable logger function.
>
>
> On Wed, May 7, 2014 at 9:32 AM, Diana Carroll wrote:
>
>> foreach vs. map isn't the issue.  Both require serializing the called
>> function, so the pickle error would still apply, yes?
>>
>> And at the moment, I'm just testing.  Definitely wouldn't want to log
>> something for each element, but may want to detect something and log for
>> SOME elements.
>>
>> So my question is: how are other people doing logging from distributed
>> tasks, given the serialization issues?
>>
>> The same issue actually exists in Scala, too.  I could work around it by
>> creating a small serializable object that provides a logger, but it seems
>> kind of kludgy to me, so I'm wondering if other people are logging from
>> tasks, and if so, how?
>>
>> Diana
>>
>>
>> On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I think you're looking for 
>>> RDD.foreach()
>>> .
>>>
>>> According to the programming 
>>> guide
>>> :
>>>
>>> Run a function func on each element of the dataset. This is usually done
 for side effects such as updating an accumulator variable (see below) or
 interacting with external storage systems.
>>>
>>>
>>> Do you really want to log something for each element of your RDD?
>>>
>>> Nick
>>>
>>>
>>> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll wrote:
>>>
 What should I do if I want to log something as part of a task?

 This is what I tried.  To set up a logger, I followed the advice here:
 http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off

 logger = logging.getLogger("py4j")
 logger.setLevel(logging.INFO)
 logger.addHandler(logging.StreamHandler())

 This works fine when I call it from my driver (ie pyspark):
 logger.info("this works fine")

 But I want to try logging within a distributed task so I did this:

 def logTestMap(a):
  logger.info("test")
 return a

 myrdd.map(logTestMap).count()

 and got:
 PicklingError: Can't pickle 'lock' object

 So it's trying to serialize my function and can't because of a lock
 object used in logger, presumably for thread-safeness.  But then...how
 would I do it?  Or is this just a really bad idea?

 Thanks
 Diana

>>>
>>>
>>
>


-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA


Re: ETL and workflow management on Spark

2014-05-22 Thread Mayur Rustagi
Hi,
We are in process of migrating Pig on spark. What is your currrent Spark
setup?
Version & cluster management that you use?
Also what is the datasize you are working with right now.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Thu, May 22, 2014 at 8:19 PM, William Kang wrote:

> Hi,
> We are moving into adopting the full stack of Spark. So far, we have used
> Shark to do some ETL work, which is not bad but is not perfect either. We
> ended up writing UDFs, UDGFs and UDAFs that could be avoided if we could use Pig.
>
> Do you have any suggestions with the ETL solution in Spark stack?
>
> And did any one have a working work flow management solution with Spark?
>
> Many thanks.
>
>
> Cao
>


Re: GraphX partition problem

2014-05-22 Thread Ankur Dave
The fix will be included in Spark 1.0, but if you just want to apply the
fix to 0.9.1, here's a hotfixed version of 0.9.1 that only includes PR
#367: https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions.
You can clone and build this.

Ankur 


On Thu, May 22, 2014 at 4:53 AM, Zhicharevich, Alex
wrote:

>  Hi,
>
>
>
> I’m running a simple connected components code using GraphX (version 0.9.1)
>
>
>
> My input comes from a HDFS text file partitioned to 400 parts. When I run
> the code on a single part or a small number of files (like 20) the code
> runs fine. As soon as I’m trying to read more files (more than 30) I’m
> getting an error and the job fails.
>
> From looking at the logs I see the following exception
>
> java.util.NoSuchElementException: End of stream
>
>at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
>
>at
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
>
>at
> org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
>
>at
> org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
>
>at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)
>
>
>
> From searching the web, I see it’s a known issue with GraphX
>
> Here : https://github.com/apache/spark/pull/367
>
> And here : https://github.com/apache/spark/pull/497
>
>
>
> Are there some stable releases that include this fix? Should I clone the
> git repo and build it myself? How would you advise me to deal with this
> issue
>
>
>
> Thanks,
>
> Alex
>
>
>
>
>
>
>


controlling the time in spark-streaming

2014-05-22 Thread Ian Holsman
Hi.

I'm writing a pilot project, and plan on using Spark Streaming for it.

To start with I have a dump of some access logs with their own timestamps,
and am using the textFileStream and some old files to test it with.

One of the issues I've come across is simulating the windows. I would like to
use the timestamp from the access logs as the 'system time' instead of the
real clock time.

I googled a bit and found the 'manual' clock which appears to be used for
testing the job scheduler.. but I'm not sure what my next steps should be.

I'm guessing I'll need to do something like

1. use the textFileStream to create a 'DStream'
2. have some kind of DStream that runs on top of that that creates the RDDs
based on the timestamps Instead of the system time
3. the rest of my mappers.

Is this correct? or do I need to create my own 'textFileStream' to
initially create the RDDs and modify the system clock inside of it.

I'm not too concerned about out-of-order messages, going backwards in time,
or being 100% in sync across workers.. as this is more for
'development'/prototyping.

Are there better ways of achieving this? I would assume that controlling
the windows RDD buckets would be a common use case.

TIA
Ian

-- 
Ian Holsman
i...@holsman.com.au
PH: + 61-3-9028 8133 / +1-(425) 998-7083


Shark resilience to unusable slaves

2014-05-22 Thread Yana Kadiyska
Hi, I am running into a pretty concerning issue with Shark (granted I'm
running v. 0.8.1).

I have a Spark slave node that has run out of disk space. When I try to
start Shark it attempts to deploy the application to a directory on that
node, fails and eventually gives up  (I see a "Master Removed our
application" message in the shark server log).

Is Spark supposed to be able to ignore a slave if something goes wrong for
it (I realize that the slave probably appears "alive" enough)? I restarted
the Spark master in hopes that it would detect that the slave is suffering
but it doesn't seem to be the case.

Any thoughts appreciated -- we'll monitor disk space but I'm a little
worried that the cluster is not functional on account of a single slave.


Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
Yeah, actually for that I used Codahale directly, with my own stuff using
the Akka system from within Spark itself.

So the workers send messages back to a bunch of actors on the driver which
are using Codahale metrics.
This way I can collect what/how an executor does/did, but I can also
aggregate all executors' metrics at once (via dedicated aggregation-purposed
Codahale metrics).

However, I didn't have time to dig enough into Spark to see whether I could reuse
the SparkListener system itself -- which is kind-of doing the same thing,
but w/o Akka AFAICT => where I can see that TaskMetrics are collected by
task within the context/granularity of a Stage. Then aggregation looks like
it is done in a built-in (Queued) Bus. So I'll let someone else report how
this could be extended, but my gut feeling is that it won't be straightforward.
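For what it's worth, a minimal sketch of the listener-only route (no Akka), assuming the
Spark 1.0 class names in org.apache.spark.scheduler (the event classes are named slightly
differently in 0.9) and that the UI code keeps a reference to the listener and polls its
counters:

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Keeps simple counters that a UI thread can poll for a rough notion of progress.
class ProgressListener extends SparkListener {
  val finishedTasks  = new AtomicInteger(0)
  val finishedStages = new AtomicInteger(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    finishedTasks.incrementAndGet()
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    finishedStages.incrementAndGet()
  }
}

// Registration, assuming an existing SparkContext named sc:
// val progress = new ProgressListener
// sc.addSparkListener(progress)

Mapping stages back to the action that triggered them is the part that would still need the
kind of extension discussed above.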

hth (with respect to my limited knowledge of these internals ^^)

cheers,


  aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]




On Thu, May 22, 2014 at 5:02 PM, Pierre B <
pierre.borckm...@realimpactanalytics.com> wrote:

> Hi Andy!
>
> Yes Spark UI provides a lot of interesting informations for debugging
> purposes.
>
> Here I’m trying to integrate a simple progress monitoring in my app ui.
>
> I’m typically running a few “jobs” (or rather actions), and I’d like to be
> able to display the progress of each of those in my ui.
>
> I don’t really see how i could do that using SparkListener for the moment …
>
> Thanks for your help!
>
> Cheers!
>
>
>
>
>   *Pierre Borckmans*
> Software team
>
> *Real**Impact* Analytics *| *Brussels Office
>  www.realimpactanalytics.com *| *[hidden email]
>
> *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans
>
>
>
>
>
>
> On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] <[hidden email]> wrote:
>
> SparkListener offers good stuffs.
> But I also completed it with another metrics stuffs on my own that use
> Akka to aggregate metrics from anywhere I'd like to collect them (without
> any deps on ganglia yet on Codahale).
> However, this was useful to gather some custom metrics (from within the
> tasks then) not really to collect overall monitoring information about the
> spark thingies themselves.
> For that Spark UI offers already a pretty good insight no?
>
> Cheers,
>
>  aℕdy ℙetrella
> about.me/noootsab
> [image: aℕdy ℙetrella on about.me]
>
> 
>
>
> On Thu, May 22, 2014 at 4:51 PM, Pierre B <[hidden email]> wrote:
>
>> Is there a simple way to monitor the overall progress of an action using
>> SparkListener or anything else?
>>
>> I see that one can name an RDD... Could that be used to determine which
>> action triggered a stage, ... ?
>>
>>
>> Thanks
>>
>> Pierre
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>
>
>
>
> --
> View this message in context: Re: Use SparkListener to get overall progress of an action
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Hi Andy!

Yes, Spark UI provides a lot of interesting information for debugging purposes.

Here I’m trying to integrate simple progress monitoring in my app UI.

I’m typically running a few “jobs” (or rather actions), and I’d like to be able
to display the progress of each of those in my UI.

I don’t really see how I could do that using SparkListener for the moment …

Thanks for your help!

Cheers!




Pierre Borckmans
Software team

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] 
 wrote:

> SparkListener offers good stuffs.
> But I also completed it with another metrics stuffs on my own that use Akka 
> to aggregate metrics from anywhere I'd like to collect them (without any deps 
> on ganglia yet on Codahale).
> However, this was useful to gather some custom metrics (from within the tasks 
> then) not really to collect overall monitoring information about the spark 
> thingies themselves.
> For that Spark UI offers already a pretty good insight no?
> 
> Cheers,
> 
> aℕdy ℙetrella
> about.me/noootsab
> 
> 
> 
> 
> On Thu, May 22, 2014 at 4:51 PM, Pierre B <[hidden email]> wrote:
> Is there a simple way to monitor the overall progress of an action using
> SparkListener or anything else?
> 
> I see that one can name an RDD... Could that be used to determine which
> action triggered a stage, ... ?
> 
> 
> Thanks
> 
> Pierre
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
> 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256p6259.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread andy petrella
SparkListener offers good stuff.
But I also complemented it with other metrics stuff of my own that uses Akka
to aggregate metrics from anywhere I'd like to collect them (without any
deps on Ganglia, just on Codahale).
However, this was useful for gathering some custom metrics (from within the
tasks, then), not really for collecting overall monitoring information about the
Spark thingies themselves.
For that, Spark UI already offers a pretty good insight, no?

Cheers,

 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]




On Thu, May 22, 2014 at 4:51 PM, Pierre B <
pierre.borckm...@realimpactanalytics.com> wrote:

> Is there a simple way to monitor the overall progress of an action using
> SparkListener or anything else?
>
> I see that one can name an RDD... Could that be used to determine which
> action triggered a stage, ... ?
>
>
> Thanks
>
> Pierre
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: ETL and workflow management on Spark

2014-05-22 Thread Derek Schoettle

unsubscribe



From:   William Kang 
To: user@spark.apache.org
Date:   05/22/2014 10:50 AM
Subject:ETL and workflow management on Spark



Hi,
We are moving into adopting the full stack of Spark. So far, we have used
Shark to do some ETL work, which is not bad but is not perfect either. We
ended up writing UDFs, UDGFs, and UDAFs that could have been avoided if we could use Pig.

Do you have any suggestions for an ETL solution in the Spark stack?

And does anyone have a working workflow management solution with Spark?

Many thanks.


Cao

Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Is there a simple way to monitor the overall progress of an action using
SparkListener or anything else?

I see that one can name an RDD... Could that be used to determine which
action triggered a stage, ... ?


Thanks

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Computing cosine similiarity using pyspark

2014-05-22 Thread jamal sasha
Hi,
  I have a bunch of vectors like
[0.1234,-0.231,0.23131]
and so on,

and I want to compute cosine similarity and Pearson correlation using
PySpark.
How do I do this?
Any ideas?
Thanks


ETL and workflow management on Spark

2014-05-22 Thread William Kang
Hi,
We are moving into adopting the full stack of Spark. So far, we have used
Shark to do some ETL work, which is not bad but is not perfect either. We
ended up writing UDFs, UDGFs, and UDAFs that could have been avoided if we could use Pig.

Do you have any suggestions for an ETL solution in the Spark stack?

And does anyone have a working workflow management solution with Spark?

Many thanks.


Cao


spark setting maximum available memory

2014-05-22 Thread İbrahim Rıza HALLAÇ
In my situation each slave has 8 GB of memory. I want to use the maximum memory
that I can: .set("spark.executor.memory", "?g"). How can I determine the amount
of memory I should set? It fails when I set it to 8 GB.
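For illustration, a minimal sketch, assuming standalone mode and that each 8 GB slave also
needs headroom for the OS and the Spark daemons (the app name and the 6g figure are
assumptions, not rules):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MemoryExample")              // placeholder name
  .set("spark.executor.memory", "6g")       // leave roughly 1-2 GB per node for OS + daemons
val sc = new SparkContext(conf)

It is also worth checking SPARK_WORKER_MEMORY in spark-env.sh, since an executor cannot get
more than the worker offers (the default is total memory minus 1 GB, which would explain why
asking for the full 8 GB fails).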
 

Re: Ignoring S3 0 files exception

2014-05-22 Thread Laurent T
Ok, digging a bit into Spark I think I got it:

sc.newAPIHadoopFile("s3n://missingPattern/*",
    EmptiableTextInputFormat.class, LongWritable.class, Text.class,
    sc.hadoopConfiguration())
  .map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> arg0) throws Exception {
      return arg0._2.toString();
    }
  })
  .count();

And the EmptiableTextInputFormat:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.InvalidInputException;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class EmptiableTextInputFormat extends TextInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext arg0) throws IOException {
    try {
      return super.getSplits(arg0);
    } catch (InvalidInputException e) {
      return Collections.<InputSplit>emptyList();
    }
  }
}
Thanks !



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ignoring-S3-0-files-exception-tp6101p6252.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

reading task failed 4 times, for unknown reason

2014-05-22 Thread Kostakis, Orestis
We have an instance of Spark running on top of Mesos and GlusterFS. Due to some 
fixes of bugs that we also came across, we installed the latest versions: 
1.0.0-rc9 (spark-1.0.0-bin-2.0.5-alpha, java 1.6.0_27), Mesos 0.18.1. Since 
then, moderate sized tasks (10-20GB) cannot complete.

I notice on the Mesos UI that for a failed task and a consequently killed 
context, many tasks appear to keep on running. 20 minutes later some (data 
reading) tasks are still in the 'Running' state.

Furthermore, regarding the specific error I get (note: we never call 'collect'
explicitly, only count(), and the execution hasn't necessarily reached that point
yet):
: An error occurred while calling o127.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
4.0:212 failed 4 times, most recent failure: TID 594 on host slave3.domain.com 
failed for unknown reason
I notice that task 4.0.212 is still  running on at least another slave.
537 task 4.0:212RUNNING   2014-05-21T17:12:19+0300  
 2014-05-21T17:12:19+0300   slave1.domain.comSandbox

And also from stderr:
14/05/21 17:12:19 INFO Executor: Running task ID 537
[...]
14/05/21 17:12:24 INFO Executor: Serialized size of result for 537 is 1100
14/05/21 17:12:24 INFO Executor: Sending result for 537 directly to driver
14/05/21 17:12:24 INFO Executor: Finished task ID 537

At the same time, INFO messages like this one appear:
14/05/22 15:06:40 INFO TaskSetManager: Ignorning task-finished event for TID 
621 because task 157 has already completed successfully

Additional errors include:
14/05/22 15:06:37 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask 
completion from 20140516-155535-170164746-5050-22001-5

and more importantly:
W0522 15:06:33.621423 12899 sched.cpp:901] Attempting to launch task 559 with 
an unknown offer 20140516-155535-170164746-5050-22001-114535


An assumption based on the above is that certain tasks complete and some part 
of the system is not notified about it. So the task gets rescheduled and after 
4 tries the context exits.

Thank you in advance!
-Orestis



Drop shark Cache

2014-05-22 Thread vinay Bajaj
Hi

Is there any way or command by which we can wipe/drop whole cache of shark
in one go.

Thanks
Vinay Bajaj


GraphX partition problem

2014-05-22 Thread Zhicharevich, Alex
Hi,

I'm running a simple connected components code using GraphX (version 0.9.1)

My input comes from a HDFS text file partitioned to 400 parts. When I run the 
code on a single part or a small number of files (like 20) the code runs fine. 
As soon as I'm trying to read more files (more than 30) I'm getting an error 
and the job fails.
From looking at the logs I see the following exception
java.util.NoSuchElementException: End of stream
   at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
   at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
   at 
org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
   at 
org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)

From searching the web, I see it's a known issue with GraphX
Here : https://github.com/apache/spark/pull/367
And here : https://github.com/apache/spark/pull/497

Are there some stable releases that include this fix? Should I clone the git 
repo and build it myself? How would you advise me to deal with this issue

Thanks,
Alex





Re: how to set task number?

2014-05-22 Thread qingyang li
My aim in setting the task number is to increase the query speed, and I have
also found that "mapPartitionsWithIndex at Operator.scala:333"
is costing a lot of time. So my other question is:
how can I tune mapPartitionsWithIndex to bring that time down?
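For illustration, a rough sketch of how to inspect and control partition counts on the RDD
side (the path and the target of 40 partitions are placeholders; the number of tasks in a
stage equals the number of partitions of the RDD it runs over):

val rdd = sc.textFile("tachyon://master:19998/path/to/table", 40)  // minPartitions hint at load time
println(rdd.partitions.size)             // shows how many partitions this RDD has
val repartitioned = rdd.repartition(40)  // forces ~40 tasks for later stages (at the cost of a shuffle)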




2014-05-22 18:09 GMT+08:00 qingyang li :

> i have added  SPARK_JAVA_OPTS+="-Dspark.
> default.parallelism=40 "  in shark-env.sh,
> but i find there are only10 tasks on the cluster and 2 tasks each machine.
>
>
> 2014-05-22 18:07 GMT+08:00 qingyang li :
>
> i have added  SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "  in
>> shark-env.sh
>>
>>
>> 2014-05-22 17:50 GMT+08:00 qingyang li :
>>
>> i am using tachyon as storage system and using to shark to query a table
>>> which is a bigtable, i have 5 machines as a spark cluster, there are 4
>>> cores on each machine .
>>> My question is:
>>> 1. how to set task number on each core?
>>> 2. where to see how many partitions of one RDD?
>>>
>>
>>
>


Re: how to set task number?

2014-05-22 Thread qingyang li
I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in shark-env.sh,
but I find there are only 10 tasks on the cluster and 2 tasks on each machine.


2014-05-22 18:07 GMT+08:00 qingyang li :

> i have added  SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "  in
> shark-env.sh
>
>
> 2014-05-22 17:50 GMT+08:00 qingyang li :
>
> i am using tachyon as storage system and using to shark to query a table
>> which is a bigtable, i have 5 machines as a spark cluster, there are 4
>> cores on each machine .
>> My question is:
>> 1. how to set task number on each core?
>> 2. where to see how many partitions of one RDD?
>>
>
>


Re: how to set task number?

2014-05-22 Thread qingyang li
I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 " in
shark-env.sh.


2014-05-22 17:50 GMT+08:00 qingyang li :

> i am using tachyon as storage system and using to shark to query a table
> which is a bigtable, i have 5 machines as a spark cluster, there are 4
> cores on each machine .
> My question is:
> 1. how to set task number on each core?
> 2. where to see how many partitions of one RDD?
>


Re: Ignoring S3 0 files exception

2014-05-22 Thread Laurent Thoulon
Hi Mayur, 

Thanks for your help. 
I'm not sure I understand what parameters I must give to
newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F],
kClass: Class[K], vClass: Class[V], conf: Configuration): JavaPairRDD[K, V]

It seems it returns a JavaPairRDD, but I currently use sc.textFile and it
returns just lines from the files.
I'm not sure... how is this going to work? What does the fClass stand for?
Why don't I receive just a set of lines?
Also, were you thinking of a specific configuration option to ignore 0 files?

Thanks 
Regards, 
Laurent T 


- Mail original -

De: "Mayur Rustagi"  
À: "laurent thoulon"  
Envoyé: Mercredi 21 Mai 2014 13:51:46 
Objet: Re: Ignoring S3 0 files exception 


You can try the new Hadoop API (newAPIHadoopFile) in the Spark context. You should be able
to configure the loader to ignore 0 files.




Regards 
Mayur 



Mayur Rustagi 
Ph: +1 (760) 203 3257 
http://www.sigmoidanalytics.com 

@mayur_rustagi 




On Wed, May 21, 2014 at 3:36 PM, Laurent T < laurent.thou...@ldmobile.net > 
wrote: 


No one has any idea? It's really troublesome; it seems like I have no way to
catch errors while an action is being processed and just ignore them. Here's a
bit more detail on what I'm doing:

JavaRDD<String> a = sc.textFile("s3n://"+missingFilenamePattern);
JavaRDD<String> b = sc.textFile("s3n://"+existingFilenamePattern);

JavaRDD<String> aPlusB = a.union(b);

aPlusB.reduceByKey(MyReducer); // <-- This throws the error

I'd like to ignore the exception caused by a and process b without trouble. Thanks

View this message in context: Re: Ignoring S3 0 files exception 


Sent from the Apache Spark User List mailing list archive at Nabble.com. 






how to set task number?

2014-05-22 Thread qingyang li
I am using Tachyon as the storage system and Shark to query a table
which is a big table. I have 5 machines as a Spark cluster, with 4
cores on each machine.
My question is:
1. How do I set the task number on each core?
2. Where can I see how many partitions an RDD has?


Re: SparkContext#stop

2014-05-22 Thread Piotr Kołaczkowski
No exceptions in any logs. No errors in stdout or stderr.


2014-05-22 11:21 GMT+02:00 Andrew Or :

> You should always call sc.stop(), so it cleans up state and does not fill
> up your disk over time. The strange behavior you observe is mostly benign,
> as it only occurs after you have supposedly finished all of your work with
> the SparkContext. I am not aware of a bug in Spark that causes this
> behavior.
>
> What are you doing in your application? Do you see any exceptions in the
> logs? Have you looked at the worker logs? You can browse through these on
> the worker web UI on http://:8081
>
> Andrew
>



-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404


Re: SparkContext#stop

2014-05-22 Thread Andrew Or
You should always call sc.stop(), so it cleans up state and does not fill
up your disk over time. The strange behavior you observe is mostly benign,
as it only occurs after you have supposedly finished all of your work with
the SparkContext. I am not aware of a bug in Spark that causes this
behavior.
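For illustration, a minimal sketch of that pattern, with the stop in a finally block so it
runs exactly once even when a job throws:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("StopExample"))  // placeholder name
try {
  // ... the real jobs go here; a trivial stand-in:
  sc.parallelize(1 to 1000).count()
} finally {
  sc.stop()   // cleans up state exactly once
}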

What are you doing in your application? Do you see any exceptions in the
logs? Have you looked at the worker logs? You can browse through these on
the worker web UI on http://:8081

Andrew


Workers disconnected from master sometimes and never reconnect back

2014-05-22 Thread Piotr Kołaczkowski
Hi,

Another problem we observed is that on a very heavily loaded cluster, if the
worker fails to respond to the heartbeat within 60 seconds, it gets
disconnected permanently from the master and never connects back again. It
is very easy to reproduce - just setup a spark standalone cluster on a
single machine, suspend it for a while and after waking up the cluster
doesn't work anymore because all workers are lost.

Is there any way to mitigate this?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404


SparkContext#stop

2014-05-22 Thread Piotr Kołaczkowski
Hi,

We observed strange behaviour of Spark 0.9.0 when using sc.stop().

We have a bunch of applications that perform some jobs and then issue
sc.stop() at the end of main. Most of the time, everything works as
desired, but sometimes the applications get marked as "FAILED" by the
master and all remote workers get killed:

Executor Summary
> ExecutorID Worker Cores Memory State Logs
> 17 worker-20140520224948-10.240.75.212-39131 2 4096 KILLED stdout stderr
> 11 worker-20140520224947-10.240.121.104-40995 2 4096 KILLED stdout stderr
> 14 worker-20140520224948-10.240.10.39-57360 2 4096 KILLED stdout stderr
> 13 worker-20140520224855-10.240.124.170-41538 2 4096 KILLED stdout stderr
> 16 worker-20140520224802-10.240.110.72-51637 2 4096 KILLED stdout stderr
> 10 worker-20140520224948-10.240.146.198-53600 2 4096 KILLED stdout stderr
> 18 worker-20140520224948-10.240.109.20-49695 2 4096 KILLED stdout stderr
> 12 worker-20140520224950-10.240.238.138-50737 2 4096 KILLED stdout stderr
> 15 worker-20140520224947-10.240.255.168-57993 2 4096 KILLED stdout stderr


There are no errors in logs nor stdout / stderr, except the message from
the master:

INFO [Thread-31] 2014-05-21 15:41:35,814 ProcessUtil.java (line 36)
> SparkMaster: 14/05/21 15:41:35 ERROR DseSparkMaster: Application
> TestRunner: count with ID app-20140521141832-0006 failed 10 times, removing
> it


Tail of logs from the application:

14/05/21 17:52:56.318 INFO SparkContext: Job finished: count at
> SchedulerThroughputTest.scala:32, took 6.429144266 s
> results: 18.316,8.017,7.032,6.836,6.882,6.416,6.413,6.592,6.299,6.435
> 14/05/21 17:52:59.543 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
> 14/05/21 17:52:59.544 INFO SparkDeploySchedulerBackend: Asking each
> executor to shut down
> 14/05/21 17:53:00.607 INFO MapOutputTrackerMasterActor:
> MapOutputTrackerActor stopped!
> 14/05/21 17:53:00.661 INFO ConnectionManager: Selector thread was
> interrupted!
> 14/05/21 17:53:00.663 INFO ConnectionManager: ConnectionManager stopped
> 14/05/21 17:53:00.664 INFO MemoryStore: MemoryStore cleared
> 14/05/21 17:53:00.664 INFO BlockManager: BlockManager stopped
> 14/05/21 17:53:00.665 INFO BlockManagerMasterActor: Stopping
> BlockManagerMaster
> 14/05/21 17:53:00.666 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/05/21 17:53:00.669 INFO RemoteActorRefProvider$RemotingTerminator:
> Shutting down remote daemon.
> 14/05/21 17:53:00.670 INFO SparkContext: Successfully stopped SparkContext
> 14/05/21 17:53:00.672 INFO RemoteActorRefProvider$RemotingTerminator:
> Remote daemon shut down; proceeding with flushing remote transports.


Now if we do not call sc.stop() at the end of the application,
everything works fine, and Spark reports FINISHED every single time.

So should we call sc.stop() and the observed behaviour is a Spark bug, or
is it our bug and we shouldn't ever call sc.stop() at the end of main?

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404


Spark Streaming Error: SparkException: Error sending message to BlockManagerMaster

2014-05-22 Thread Sourav Chandra
Hi,

I am running a Spark streaming application. I have faced an uncaught
exception after which my worker stops processing any further messages.

I am using *spark 0.9.0*

Could you please let me know what could be the cause of this and how to
overcome this issue?

[ERROR] [05/22/2014 03:23:26.250] [spark-akka.actor.default-dispatcher-18]
[ActorSystem(spark)] exception while executing timer task
org.apache.spark.SparkException: Error sending message to
BlockManagerMaster [message = HeartBeat(BlockManagerId(,
10.29.29.226, 36655, 0))]
at
org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:176)
 at
org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
at org.apache.spark.storage.BlockManager.org
$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
 at
org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
 at
akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
at
akka.actor.LightArrayRevolverScheduler$TaskHolder.run(Scheduler.scala:464)
 at
akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:281)
at
akka.actor.LightArrayRevolverScheduler$$anonfun$close$1.apply(Scheduler.scala:280)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at akka.actor.LightArrayRevolverScheduler.close(Scheduler.scala:279)
at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:630)
 at
akka.actor.ActorSystemImpl$$anonfun$_start$1.apply$mcV$sp(ActorSystem.scala:582)
at akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
 at
akka.actor.ActorSystemImpl$$anonfun$_start$1.apply(ActorSystem.scala:582)
at akka.actor.ActorSystemImpl$$anon$3.run(ActorSystem.scala:596)
 at
akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.runNext$1(ActorSystem.scala:750)
at
akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply$mcV$sp(ActorSystem.scala:753)
 at
akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
at
akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$run$1.apply(ActorSystem.scala:746)
 at akka.util.ReentrantGuard.withGuard(LockUtil.scala:15)
at
akka.actor.ActorSystemImpl$TerminationCallbacks.run(ActorSystem.scala:746)
 at
akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
at
akka.actor.ActorSystemImpl$$anonfun$terminationCallbacks$1.apply(ActorSystem.scala:593)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
 at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
 at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
 at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException:
Recipient[Actor[akka://spark/user/BlockManagerMaster#1305432112]] had
already been terminated.
 at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at
org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161)
 ... 39 more

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

"Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Using Spark to analyze complex JSON

2014-05-22 Thread Flavio Pompermaier
Is there a way to query fields by similarity (like Lucene or using a
similarity metric) to be able to query something like WHERE language LIKE
"it~0.5" ?

Best,
Flavio
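For what it's worth, as far as I know there is no fuzzy LIKE in Spark SQL 1.0, so one hedged
sketch is to pre-filter the RDD with your own similarity function before registering the
table. Here interactions and its case class come from the example quoted below, similarity
is only a stand-in metric (a real one might wrap a Levenshtein distance), and registering
the filtered RDD relies on the same SQLContext implicits as that example:

// Stand-in metric: crude positional character overlap, for illustration only.
def similarity(a: String, b: String): Double = {
  val n = math.max(a.length, b.length)
  if (n == 0) 1.0
  else a.zip(b).count { case (x, y) => x == y }.toDouble / n
}

val italianish = interactions.filter(i => similarity(i.language, "it") >= 0.5)
italianish.registerAsTable("candy_crush_interaction_it")  // hypothetical table name

The fuzzy matching then happens in plain Scala rather than in the SQL layer.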


On Thu, May 22, 2014 at 8:56 AM, Michael Cutler  wrote:

> Hi Nick,
>
> Here is an illustrated example which extracts certain fields from Facebook
> messages, each one is a JSON object and they are serialised into files with
> one complete JSON object per line. Example of one such message:
> CandyCrush.json 
>
> You need to define a case class which has all the fields you'll be able to
> query later in SQL, e.g.
>
> case class CandyCrushInteraction(id: String, user: String, level: Int,
> gender: String, language: String)
>
> The basic objective is to use Spark to convert the file from RDD[String] -
> - parse JSON - - > RDD[JValue] - - extract fields - - > RDD[
> CandyCrushInteraction]
>
> // Produces a RDD[String]
> val lines = sc.textFile("facebook-2014-05-19.json")
>
>
> // Process the messages
>
> val interactions = lines.map(line => {
>
>
>   // Parse the JSON, returns RDD[JValue]
>   parse(line)
>
>
> }).filter(json => {
>   // Filter out only 'Candy Crush Saga' Facebook App activity
>
>   (json \ "facebook" \ "application").extract[String] == "Candy Crush 
> Saga"
>
>
> }).map(json => {
>   // Extract fields we want, we use compact() because they may not exist
>
>   val id = compact(json \ "facebook" \ "id")
>
>
>   val user = compact(json \ "facebook" \ "author" \ "hash")
>
>
>   val gender = compact(json \ "demographic" \ "gender")
>
>
>   val language = compact(json \ "language" \ "tag")
>
>
>   // Extract the 'level' using a RegEx or default to zero
>
>   var level = 0;
>   pattern.findAllIn( compact(json \ "interaction" \ "title") 
> ).matchData.foreach(m => {
>
>
> level = m.group(1).toInt
>
>
>   })
>   // Return an RDD[CandyCrushInteraction]
>   ( CandyCrushInteraction(id, user, level, gender, language) )
>
>
> })
>
>
> Now you can register the RDD[CandyCrushInteraction] as a table and query
> it in SQL.
>
> interactions.registerAsTable("candy_crush_interaction")
>
>
> // Game level by Gender
>
> sql("SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level) FROM 
> candy_crush_interaction WHERE level > 0 GROUP BY 
> gender").collect().foreach(println)
>
>
> /* Returns:
> ["male",14727,590,1,104.71705031574659]
> ["female",15422,590,1,114.17202697445208]
> ["mostly_male",2824,590,1,97.08852691218131]
> ["mostly_female",1934,590,1,99.0517063081696]
>
>
> ["unisex",2674,590,1,113.42071802543006]
> [,11023,590,1,93.45677220357435]
>  */
>
>
> Full working example: 
> CandyCrushSQL.scala
>
> MC
>
>
> *Michael Cutler*
> Founder, CTO
>
>
> * Mobile: +44 789 990 7847 Email:   mich...@tumra.com 
> Web: tumra.com
>  *
> *Visit us at our offices in Chiswick Park *
> *Registered in England & Wales, 07916412. VAT No. 130595328*
>
>
> This email and any files transmitted with it are confidential and may also
> be privileged. It is intended only for the person to whom it is addressed.
> If you have received this email in error, please inform the sender 
> immediately.
> If you are not the intended recipient you must not use, disclose, copy,
> print, distribute or rely on this email.
>
>
> On 22 May 2014 04:43, Nicholas Chammas  wrote:
>
>> That's a good idea. So you're saying create a SchemaRDD by applying a
>> function that deserializes the JSON and transforms it into a relational
>> structure, right?
>>
>> The end goal for my team would be to expose some JDBC endpoint for
>> analysts to query from, so once Shark is updated to use Spark SQL that
>> would become possible without having to resort to using Hive at all.
>>
>>
>> On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer wrote:
>>
>>> Hi,
>>>
>>> as far as I understand, if you create an RDD with a relational
>>> structure from your JSON, you should be able to do much of that
>>> already today. For example, take lift-json's deserializer and do
>>> something like
>>>
>>>   val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
>>> json.extractOpt[MyCaseClass])
>>>
>>> then I guess you can use Spark SQL on that. (Something like your
>>> likes[2] query won't work, though, I guess.)
>>>
>>> Regards
>>> Tobias
>>>
>>>
>>> On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
>>>  wrote:
>>> > Looking forward to that update!
>>> >
>>> > Given a table of JSON objects like this one:
>>> >
>>> > {
>>> >"name": "Nick",
>>> >"location": {
>>> >   "x": 241.6,
>>> >   "y": -22.5
>>> >},
>>> >"likes": ["ice cream", "dogs", "Vanilla Ice"]
>>> > }
>>> >
>>> > It would be SUPER COOL if we could query that table in a

Re: Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Nick Pentreath
Hi

In my opinion, running HBase for immutable data is generally overkill in
particular if you are using Shark anyway to cache and analyse the data and
provide the speed.

HBase is designed for random-access data patterns and high throughput R/W
activities. If you are only ever writing immutable logs, then that is what
HDFS is designed for.

Having said that, if you replace HBase you will need to come up with a
reliable way to put data into HDFS (a log aggregator like Flume or message
bus like Kafka perhaps, etc), so the pain of doing that may not be worth it
given you already know HBase.


On Thu, May 22, 2014 at 9:33 AM, Limbeck, Philip  wrote:

>  HI!
>
>
>
> We are currently using HBase as our primary data store of different
> event-like data. On-top of that, we use Shark to aggregate this data and
> keep it
> in memory for fast data access.  Since we use no specific HBase
> functionality whatsoever except Putting data into it, a discussion
> came up on having to set up an additional set of components on top of HDFS
> instead of just writing to HDFS directly.
>
>  Is there any overview regarding implications of doing that ? I mean
> except things like taking care of file structure and the like. What is the
> true
>
> advantage of Spark on HBase in favor of Spark on HDFS?
>
>
>
> Best
>
> Philip
>
> Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben
> Firmenbuchnummer/Commercial Register No. 275184h
> Firmenbuchgericht/Commercial Register Court: Landesgericht St. Poelten
>
> This email (including any attachments) may contain information which is
> privileged, confidential, or protected. If you are not the intended
> recipient, note that any disclosure, copying, distribution, or use of the
> contents of this message and attached files is prohibited. If you have
> received this email in error, please notify the sender and delete this
> email and any attached files.
>
>


Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Limbeck, Philip
HI!

We are currently using HBase as our primary data store of different event-like 
data. On-top of that, we use Shark to aggregate this data and keep it
in memory for fast data access.  Since we use no specific HBase functionality 
whatsoever except Putting data into it, a discussion
came up on having to set up an additional set of components on top of HDFS 
instead of just writing to HDFS directly.

Is there any overview regarding implications of doing that ? I mean except 
things like taking care of file structure and the like. What is the true
advantage of Spark on HBase in favor of Spark on HDFS?

Best
Philip

Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben 
Firmenbuchnummer/Commercial Register No. 275184h Firmenbuchgericht/Commercial 
Register Court: Landesgericht St. Poelten

This email (including any attachments) may contain information which is 
privileged, confidential, or protected. If you are not the intended recipient, 
note that any disclosure, copying, distribution, or use of the contents of this 
message and attached files is prohibited. If you have received this email in 
error, please notify the sender and delete this email and any attached files.



Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-22 Thread Kevin Markey

  
  
Update:  Partly user error.  But still getting FS closed error.

Yes, we are running plain vanilla Hadoop 2.3.0. But it probably doesn't matter.

1. Tried Colin McCabe's suggestion to patch with pull 850
(https://issues.apache.org/jira/browse/SPARK-1898).  No effect.

2. When testing Colin's patch, realized that the master is set in two places.
In spark-submit it is set using "--master yarn-cluster". But it is also set
via the command line in my application (for SparkContext initialization) --
which I did not modify specifically for testing 1.0. When modifying my scripts
to use --master, I removed the user option --sm, which had the effect of
setting the master to "local"!!  But local mode running inside two Yarn
containers!!!  (This is almost too funny to imagine!)  (Each of the
two container logs is identical, looking like a local mode log.)

3. I corrected this option to also specify "yarn-cluster".  Now ASM
reports that the application is RUNNING...

14/05/22 00:09:08 INFO yarn.Client: Application report from ASM:

     application identifier: application_1400738053238_0002
     appId: 2
     clientToAMToken: null
     appDiagnostics: 
     appMasterHost: 192.168.1.21
     appQueue: default
     appMasterRpcPort: 0
     appStartTime: 1400738864762
     yarnAppState: RUNNING
     distributedFinalState: UNDEFINED
     appTrackingUrl:
  http://Sleepycat:8088/proxy/application_1400738053238_0002/
     appUser: hduser

And it is reported as a SUCCESS, despite a Filesystem closed
IOException when the AM attempts to clean up the staging directory. 


(Now the two container logs are different.  One is that of a driver,
the other that of an executor.)

4. But there is still the Filesystem closed error.  This is reported
only in the driver's/AM's stderr log.  It does not affect the final
success of the job.  

I'm not sure whether my app's opening and closing the FS has any
bearing.  (It never has before.)  Before the application concludes,
it saves several text files to HDFS by using Hadoop utilities to (1)
get an instance of the filesystem, (2) an output stream to a path in
HDFS, (3) write to it, (4) close the stream, and (5) close the
filesystem -- for each file to be written!  But this is done
by the driver thread, not by the application master.  (Does the FS
object owned by ApplicationMaster interact with any arbitrary FS
instance in the driver?)  Furthermore, it opens and closes the
filesystem at least 3 times and as many as hundreds of times, and
has in the past without side-effects.

We've had dozens of arguments about whether to close such FS
instances.  Sometimes we have problems if we close them.  Sometimes
when we don't!
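For what it's worth, a hedged sketch of one way to keep those per-file writes from touching
the cached FileSystem instance (assuming the Hadoop 2.x API; the path is a placeholder, and
this may or may not be related to the AM's error):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
// newInstance returns a private, uncached FileSystem, so closing it cannot
// invalidate the cached instance that the ApplicationMaster may still hold.
val fs = FileSystem.newInstance(hadoopConf)
try {
  val out = fs.create(new Path("/tmp/results/part-00000"))  // placeholder path
  try {
    out.write("some output\n".getBytes("UTF-8"))
  } finally {
    out.close()   // always close the stream
  }
} finally {
  fs.close()      // closes only this private instance
}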

I shall experiment with the FileSystem handling when it's
convenient.

5. Finally, I don't know if pull 850 had any effect.  I've not
rolled it back to retest with the correct SparkContext master
setting.

Thanks for your feedback.

Kevin


On 05/21/2014 11:54 PM, Tathagata Das wrote:

Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP(?) ?
We may be able to reproduce this in that case.

TD
  
On Wed, May 21, 2014 at 8:35 PM, Tom Graves wrote:

It sounds like something is closing the hdfs filesystem before everyone is really
done with it. The filesystem gets cached and is shared so if someone closes it
while other threads are still using it you run into this error. Is your application
closing the filesystem? Are you using the event logging feature? Could you share
the options you are running with?
  
  
  
Yarn will retry the application depending on how the Application Master attempt
fails (this is a configurable setting as to how many times it retries). That is
probably the second driver you are referring to. But they shouldn't have overlapped
as far as both being up at the same time. Is that the case you are seeing?
Generally you want to look at why the first application attempt fails.
  
  

  
  Tom
  


 

Spark Streaming on Mesos, various questions

2014-05-22 Thread Tobias Pfeiffer
Hi,

with the hints from Gerard I was able to get my locally working Spark
code running on Mesos. Thanks!

Basically, on my local dev machine, I use "sbt assembly" to create a
fat jar (which is actually not so fat since I use "... % 'provided'"
in my sbt file for the Spark dependencies), upload it to my cluster
and run it using
  java -cp myApplicationCode.jar:spark-assembly-1.0.0-SNAPSHOT.jar
mypackage.MainClass
I can see in my Mesos master web interface how the tasks are added and
distributed to the slaves and in the driver program I can see the
final results, that is very nice.

Now, as the next step, I wanted to get Spark Streaming running. That
worked out by now, but I have various questions. I'd be happy if
someone could help me out with some answers.

1. I wrongly assumed that when using ssc.socketTextStream(), the
driver would connect to the specified server. It does not; apparently
one of the slaves does ;-) Does that mean that before any DStream
processing can be done, all the received data needs to be sent to the
other slaves? What about the extreme case dstream.filter(x => false);
would all the data be transferred to other hosts, just to be discarded
there?

2. How can I reduce the logging? It seems like for every chunk
received from the socketTextStream, I get a line "INFO
BlockManagerInfo: Added input-0-1400739888200 in memory on ...",
that's very noisy. Also, when the foreachRDD() is processed every N
seconds, I get a lot of output.

3. In my (non-production) cluster, I have six slaves, two of which
have 2G of RAM, the other four just 512M. So far, I have not seen
Mesos ever give a job to one of the four low-mem machines. Is 512M
just not enough for *any* task, or is there a rationale like "they are
not cool enough to play with the Big Guys" built into Mesos?

4. I don't have any HDFS or shared disk space. What does this mean for
Spark Streaming's default storage level MEMORY_AND_DISK_SER_2?

5. My prototype example for Spark Streaming is a simple word count:
  val wordCounts = ssc.socketTextStream(...).flatMap(_.split("
")).map((_, 1)).reduceByKey(_ + _)
  wordCounts.print()
However, (with a batchDuration of five seconds) this only works
correctly if I run the application in Mesos "coarse mode". In the
default "fine-grained mode", I will always receive 0 as word count
(that is, a wrong result), and a lot of warnings like
  W0522 06:57:23.578400 12824 sched.cpp:901] Attempting to launch task
7 with an unknown offer 20140520-102159-2154735808-5050-1108-7891
Can anyone explain this behavior?
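For reference, a tiny sketch of pinning coarse-grained mode from the application while the
fine-grained behaviour is investigated (the master URL and app name are placeholders;
spark.mesos.coarse is the documented property for the run mode):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://master-host:5050")   // placeholder master URL
  .setAppName("StreamingWordCount")
  .set("spark.mesos.coarse", "true")       // run in coarse-grained mode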

Thanks,
Tobias