Getting started : Spark on YARN issue

2014-06-19 Thread Praveen Seluka
I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN + HDFS) in EC2. Then, I compiled Spark using Maven with the hadoop 2.2 profile. Now I am trying to run the example Spark job (in yarn-cluster mode) from my *local machine*. I have set up the HADOOP_CONF_DIR environment variable correct

Re: Getting started : Spark on YARN issue

2014-06-19 Thread Praveen Seluka
within your cluster to use public hostnames. Let me know if that does the job. Andrew 2014-06-19 6:04 GMT-07:00 Praveen Seluka: > I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN + HDFS) in EC2.

Re: Number of executors change during job running

2014-07-10 Thread Praveen Seluka
If I understand correctly, you cannot change the number of executors at runtime (correct me if I am wrong) - it's defined when we start the application and fixed. Do you mean the number of tasks? On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das wrote: > Can you try setting the number-of-partitio
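A minimal sketch of the point above, for the Spark-on-YARN versions current at the time of this thread: the executor count is fixed when the application starts, typically via configuration set before the SparkContext is created. The app name and the count of 10 are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.instances", "10")   // fixed for the lifetime of the application
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
// Changing spark.executor.instances after this point has no effect;
// only the number of tasks per stage varies with how the data is partitioned.
```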

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Praveen Seluka
Here are my answers. But I am just getting started with Spark Streaming - so please correct me if I am wrong. 1) Yes 2) Receivers will run on executors. It's actually a job that's submitted where the # of tasks equals the # of receivers. An executor can actually run more than one task at the same time. Hence you
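A small sketch of that receiver behaviour: each input stream below creates one receiver, and each receiver occupies one long-running task slot on some executor. Hostnames and ports are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReceiverSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Two receivers -> a receiver "job" with two long-running tasks,
// which may land on the same executor or on different executors.
val s1 = ssc.socketTextStream("host1", 9999)
val s2 = ssc.socketTextStream("host2", 9999)

s1.union(s2).count().print()
ssc.start()
ssc.awaitTermination()
```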

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Praveen Seluka
If you want to make Twitter* classes available in your shell, I believe you could do the following: 1. Change the parent pom module ordering - move external/twitter before assembly. 2. In assembly/pom.xml, add the external/twitter dependency - this will package twitter* into the assembly jar. Now when spa
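Assuming the assembly rebuild described above, something along these lines should then work from spark-shell; the OAuth credentials are placeholders supplied via twitter4j system properties.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

System.setProperty("twitter4j.oauth.consumerKey", "...")
System.setProperty("twitter4j.oauth.consumerSecret", "...")
System.setProperty("twitter4j.oauth.accessToken", "...")
System.setProperty("twitter4j.oauth.accessTokenSecret", "...")

val ssc = new StreamingContext(sc, Seconds(10))     // sc is the shell's SparkContext
val tweets = TwitterUtils.createStream(ssc, None)   // None -> read auth from system properties
tweets.map(_.getText).print()
ssc.start()
```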

Re: persistence state of an RDD

2014-07-15 Thread Praveen Seluka
Nathan, you are looking for SparkContext.getRDDStorageInfo, which returns information on how much is cached. On Tue, Jul 15, 2014 at 8:01 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote: > Is there a way of determining programmatically the cache state of an RDD? > Not its storage lev
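A small sketch of checking cache state programmatically with getRDDStorageInfo (a developer API); the fields used below (id, name, numPartitions, numCachedPartitions, memSize) are from the RDDInfo objects it returns.

```scala
val rdd = sc.parallelize(1 to 1000000).cache()
rdd.count()   // materialize the cache

val info = sc.getRDDStorageInfo.find(_.id == rdd.id)
info.foreach { i =>
  println(s"${i.name}: ${i.numCachedPartitions}/${i.numPartitions} " +
    s"partitions cached, ${i.memSize} bytes in memory")
}
```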

Re: API to add/remove containers inside an application

2014-09-04 Thread Praveen Seluka
+user On Thu, Sep 4, 2014 at 10:53 PM, Praveen Seluka wrote: > Spark on Yarn has static allocation of resources. https://issues.apache.org/jira/browse/SPARK-3174 - This JIRA by Sandy is about adding and removing executors dynamically based on load. Even before doing this, c
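For context, an API in this spirit landed in Spark somewhat after this thread (around the 1.2 timeframe) as developer methods on SparkContext; the sketch below uses that later API and is not something available at the time of the discussion.

```scala
// Ask the cluster manager (YARN) for two more executors, then release one.
// The executor id string is a placeholder.
sc.requestExecutors(2)
sc.killExecutors(Seq("executor-id-1"))
```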

Re: API to add/remove containers inside an application

2014-09-04 Thread Praveen Seluka
Mailed our list - will send it to Spark Dev. On Fri, Sep 5, 2014 at 11:28 AM, Rajat Gupta wrote: > +1 on this. First step to more automated autoscaling of spark application master... > On Fri, Sep 5, 2014 at 12:56 AM, Praveen Seluka wrote: >> +user

Yarn Over-allocating Containers

2014-09-11 Thread praveen seluka
Hi all, I am seeing a strange issue in Spark on YARN (stable). Let me know if this is known, or if I am missing something, as it looks very fundamental. Launch a Spark job with 2 containers: addContainerRequest is called twice and then allocate is called on the AMRMClient. This gets 2 containers allocated. Fine as of now
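A rough sketch of the AMRMClient pattern being described, including the detail that matters for over-allocation: outstanding container requests stay registered with YARN until they are explicitly removed, so they should be dropped once containers arrive. Resource sizes and priority are illustrative, and AM registration is elided.

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
// (init/start with a YarnConfiguration and registerApplicationMaster elided)

val capability = Resource.newInstance(1024, 1)   // 1 GB, 1 vcore
val priority   = Priority.newInstance(1)
val requests   = (1 to 2).map(_ => new ContainerRequest(capability, null, null, priority))
requests.foreach(amClient.addContainerRequest)

val response  = amClient.allocate(0.1f)
val allocated = response.getAllocatedContainers.asScala
// If matching requests are not removed here, later allocate() heartbeats keep
// re-asserting them and YARN can hand back more containers than were asked for.
requests.take(allocated.size).foreach(amClient.removeContainerRequest)
```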

executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
Can someone explain the motivation behind passing the executorAdded event to DAGScheduler? *DAGScheduler* does *submitWaitingStages* when the *executorAdded* method is called by *TaskSchedulerImpl*. I see some issue in the code below, from *TaskSchedulerImpl.scala*: if (!executorsByHost.contains(o.host))
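A self-contained toy paraphrase (not the actual Spark source) of the resourceOffers bookkeeping being questioned: executorAdded fires only for the first executor seen on a host, so a second executor on the same host does not trigger it (and hence no submitWaitingStages via DAGScheduler).

```scala
import scala.collection.mutable

case class Offer(executorId: String, host: String)

val executorsByHost = mutable.HashMap[String, mutable.HashSet[String]]()

def executorAdded(execId: String, host: String): Unit =
  println(s"executorAdded($execId, $host)")   // stands in for the DAGScheduler notification

def resourceOffers(offers: Seq[Offer]): Unit =
  for (o <- offers) {
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = mutable.HashSet[String]()
      executorAdded(o.executorId, o.host)     // only the first executor per host reaches here
    }
    executorsByHost(o.host) += o.executorId
  }

resourceOffers(Seq(Offer("exec-1", "hostA"), Offer("exec-2", "hostA")))
// prints executorAdded(exec-1, hostA) once; exec-2 on the same host is silent
```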

Re: executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
Some corrections. On Fri, Sep 26, 2014 at 5:32 PM, praveen seluka wrote: > Can someone explain the motivation behind passing the executorAdded event to DAGScheduler? *DAGScheduler* does *submitWaitingStages* when the *executorAdded* method is called by *TaskSchedulerImpl*. I see some iss

Re: executorAdded event to DAGScheduler

2014-09-26 Thread praveen seluka
not sure if it will create an issue when you have multiple workers on the same host, as submitWaitingStages is called everywhere and I never tried such a deployment mode. Best, -- Nan Zhu On Friday, September 26, 2014 at 8:02 AM, praveen seluka wrote:

Re: In Java how can I create an RDD with a large number of elements

2014-12-08 Thread praveen seluka
Steve, something like this will do, I think: sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 10) - the above will launch 1000 tasks (maps), with each task creating 10^5 numbers (a total of 100 million elements). On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis wrote: > assume I don't care about
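A sketch matching the stated intent (1000 tasks, 10^5 numbers each, 100 million elements in total); the snippet above appears truncated, so the "1 to 100000" range here is an assumption made to reach 10^5 per task.

```scala
// 1000 partitions -> 1000 map tasks; each emits the numbers 1 to 100000.
val bigRdd = sc.parallelize(1 to 1000, 1000).flatMap(_ => 1 to 100000)
println(bigRdd.count())   // 100000000
```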