Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker gets which task?
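A plain-Python sketch (not Spark code) of the idea the question touches on: the driver-side scheduler asks each partition for its preferred hosts (for a HadoopRDD, the HDFS block locations) and tries to place each task on a worker holding that data, falling back to any free worker. All names here are illustrative.

```python
# Locality-aware task placement, conceptually: prefer a worker that is
# local to the partition's data; otherwise run the task anywhere.
partitions = {0: ["host-a"], 1: ["host-b"], 2: ["host-a"]}  # preferred hosts
free_workers = ["host-a", "host-b", "host-c"]

assignment = {}
for part, preferred in partitions.items():
    local = [w for w in free_workers if w in preferred]
    # data-local placement if possible, otherwise "ANY"
    assignment[part] = local[0] if local else free_workers[0]

print(assignment)  # {0: 'host-a', 1: 'host-b', 2: 'host-a'}
```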

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?
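A minimal plain-Python model (not Spark code) of the distinction being asked about: caching materializes an RDD but keeps its full lineage, while checkpointing writes the data to reliable storage and truncates the lineage. The `MockRDD` class and its members are illustrative, not Spark's API.

```python
# Cache keeps the lineage (so a lost cached partition can be recomputed);
# checkpoint persists the data externally and forgets the parents.
class MockRDD:
    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent

    def cache(self):
        # analogue of cache/persist: data kept, lineage retained
        return self

    def checkpoint(self, stable_storage):
        # analogue of checkpoint: write to reliable storage, drop lineage
        stable_storage[id(self)] = list(self.data)
        self.parent = None
        return self

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

storage = {}
base = MockRDD([1, 2, 3])

cached = MockRDD([2, 4, 6], parent=base).cache()
print(cached.lineage_depth())   # 1 -- lineage preserved

cp = MockRDD([2, 4, 6], parent=base).checkpoint(storage)
print(cp.lineage_depth())       # 0 -- lineage truncated
```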

Re: Resilient nature of RDD

2014-04-03 Thread David Thomas
…but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster, making it fast. …

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how an RDD is resilient? If one of the partitions is lost, who is responsible for recreating that partition - is it the driver program?
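A plain-Python sketch (illustrative, not Spark internals) of lineage-based recovery: each derived partition is defined by its parent partition plus the function that produced it, so a lost partition can be recomputed from that recipe rather than restored from a replica.

```python
# Lineage-based recovery: recompute only the missing partition from
# its parent data and the recorded transformation.
parent_partitions = {0: [1, 2], 1: [3, 4]}
f = lambda x: x * 10  # the recorded transformation

# derived RDD, computed once; then the machine holding partition 1 fails
derived = {i: [f(x) for x in part] for i, part in parent_partitions.items()}
del derived[1]

# recovery: re-apply the lineage for just the lost partition
derived[1] = [f(x) for x in parent_partitions[1]]
print(derived[1])  # [30, 40]
```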

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see the 'Application Detail UI' page (at master:4040) for completed applications? Currently I can see that page only for running applications; I would like to see the various numbers for an application after it has completed.

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
…(quoting the original question) How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes, i.e. effectively create an RDD of 100 elements. …

Replicating RDD elements

2014-03-27 Thread David Thomas
How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes, i.e. effectively create an RDD of 100 elements.
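A plain-Python analogue of one possible approach: a flatMap that emits n copies of each element (in Spark Scala, something like `rdd.flatMap(e => Seq.fill(n)(e))`, typically followed by a repartition to spread the copies; the variable names here are illustrative, and Spark alone does not guarantee exactly one copy per node without a custom partitioner).

```python
# flatMap-style replication: one input element fanned out n times.
n_nodes = 100
rdd = ["item"]

replicated = [e for e in rdd for _ in range(n_nodes)]
print(len(replicated))  # 100
```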

Round Robin Partitioner

2014-03-13 Thread David Thomas
Is it possible to partition the RDD elements in a round-robin fashion? Say I have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure exactly one element gets mapped to each node in the cluster.
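A sketch of the partitioner contract in plain Python (in Spark you would extend `org.apache.spark.Partitioner` and override `numPartitions` and `getPartition`; this standalone class only mirrors that shape). It assumes keys are sequential integers, e.g. obtained via `zipWithIndex`.

```python
# Round-robin placement: key i goes to partition i mod numPartitions,
# so with 5 keys and 5 partitions each partition gets exactly one element.
class RoundRobinPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        # assumes integer keys (e.g. from zipWithIndex)
        return key % self.num_partitions

p = RoundRobinPartitioner(5)
print([p.get_partition(k) for k in range(5)])  # [0, 1, 2, 3, 4]
```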

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
…the Spark runtime/scheduler traverses the DAG starting from that RDD and triggers evaluation of any parent RDDs it needs that aren't computed and cached yet. Any future operations build on the same DAG as long as you use the same RDD objects and, if you used cache…

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
…should be lazy, but apparently uses an RDD.count call in its implementation: https://spark-project.atlassian.net/browse/SPARK-1021). (quoting David Thomas, March 11, 2014 at 9:49 PM) For example, is distinct() transformation lazy? When I see the Spark source code, distin…

Are all transformations lazy?

2014-03-11 Thread David Thomas
For example, is the distinct() transformation lazy? When I see the Spark source code, distinct applies a map → reduceByKey → map chain to the RDD elements. Why is this lazy? Won't the functions be applied immediately to the elements of the RDD when I call someRDD.distinct? /** * Return a new RDD…
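A plain-Python sketch of why composing map → reduceByKey → map can still be lazy: chaining the steps only builds a description of the computation, and nothing runs until an action consumes it. Here Python generators stand in for lazy transformations; the function names are illustrative, not Spark's.

```python
# Composing generator stages mimics chaining lazy transformations:
# no element is touched until the final "action" iterates the pipeline.
calls = []  # records which elements have actually been processed

def mapped(xs):          # analogue of the map step
    for x in xs:
        calls.append(x)
        yield (x, None)

def reduced(pairs):      # analogue of the reduceByKey step (dedup)
    seen = {}
    for k, _ in pairs:
        seen[k] = None
    for k in seen:
        yield k

data = [1, 2, 2, 3]
pipeline = reduced(mapped(data))  # nothing has executed yet
print(calls)              # [] -- still lazy

result = sorted(pipeline)         # the "action" forces evaluation
print(result)             # [1, 2, 3]
```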

Block

2014-03-10 Thread David Thomas
What are the concepts of Block and BlockManager in Spark? How is a Block related to a Partition of an RDD?

Custom RDD

2014-03-10 Thread David Thomas
Is there any guide available on creating a custom RDD?
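Sketched in plain Python (not Spark code), the core of a custom RDD is two overrides: how the data is split into partitions, and how one partition's iterator is produced. The names below deliberately mirror Spark's `getPartitions`/`compute` members, but this class is only a standalone illustration of the contract.

```python
# A minimal "RDD-like" class: getPartitions defines the split,
# compute produces the elements of a single partition.
class MyRangeRDD:
    def __init__(self, n, num_partitions):
        self.n = n
        self.num_partitions = num_partitions

    def get_partitions(self):
        # split [0, n) into num_partitions contiguous index ranges
        step = -(-self.n // self.num_partitions)  # ceiling division
        return [range(i, min(i + step, self.n))
                for i in range(0, self.n, step)]

    def compute(self, partition):
        # return an iterator over one partition's elements
        return iter(partition)

rdd = MyRangeRDD(10, 3)
parts = rdd.get_partitions()
print(len(parts))                   # 3
print(list(rdd.compute(parts[0])))  # [0, 1, 2, 3]
```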

Help with groupByKey

2014-03-02 Thread David Thomas
I have an RDD of (K, Array[V]) pairs. For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2))). How can I do a groupByKey such that I get back an RDD of (K, Array[V]) pairs with the arrays merged per key? Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
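A plain-Python analogue of one way to do this, concatenating the arrays per key (in Spark Scala this corresponds to something like `rdd.reduceByKey(_ ++ _)`, which is usually preferred over groupByKey followed by flattening; the variable names here are illustrative).

```python
# Merge (K, Array[V]) pairs by concatenating arrays with the same key.
pairs = [("key1", [1, 2, 3]), ("key2", [3, 2, 4]), ("key1", [4, 3, 2])]

merged = {}
for k, vs in pairs:
    merged.setdefault(k, []).extend(vs)

print(merged["key1"])  # [1, 2, 3, 4, 3, 2]
print(merged["key2"])  # [3, 2, 4]
```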

Where does println output go?

2014-03-01 Thread David Thomas
I have this code: rdd.foreach(p => { print(p) }). Where can I see this output? Currently I'm running my Spark program on a cluster. When I run the jar using sbt run, I see only INFO logs on the console. Where should I look to see the application's sysouts?
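A conceptual plain-Python sketch of what is going on: the function inside foreach runs on the executors, so its print output lands in each worker's stdout log rather than on the driver console, whereas collecting first brings the data to the driver, where print is visible. The structures below are illustrative stand-ins, not Spark objects.

```python
# foreach runs on executors (output goes to worker stdout logs);
# collect() first ships the data back to the driver.
driver_console = []
executor_logs = {0: [], 1: []}          # one stdout log per executor

rdd_partitions = {0: ["a", "b"], 1: ["c"]}

# rdd.foreach(print): each executor writes to its own stdout log
for exec_id, part in rdd_partitions.items():
    for elem in part:
        executor_logs[exec_id].append(elem)

# rdd.collect().foreach(print): data gathered on the driver first
collected = [e for part in rdd_partitions.values() for e in part]
driver_console.extend(collected)

print(driver_console)  # ['a', 'b', 'c']
```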