During a Spark stage, how are tasks split among the workers? Specifically,
for a HadoopRDD, who determines which worker gets which task?
What is the difference between checkpointing and caching an RDD?
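For context, a minimal sketch of how the two APIs are invoked (the app name and checkpoint directory are hypothetical). Caching keeps computed partitions in memory but retains the lineage; checkpointing writes the RDD to reliable storage and truncates the lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-vs-cache"))
sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.cache()       // keep computed partitions in memory; lineage is retained
rdd.checkpoint()  // on the next action, save to reliable storage and truncate lineage
rdd.count()       // the action triggers the computation, caching, and checkpointing
```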
> …but the re-computation will occur on an executor. So if several partitions are
> lost, e.g. due to a few machines failing, the re-computation can be striped
> across the cluster, making it fast.
>
> On Wed, Apr 2, 2014 at 11:27 AM, David Thomas wrote:
Can someone explain how an RDD is resilient? If one of the partitions is lost,
who is responsible for recreating that partition - is it the driver program?
Is there a way to see the 'Application Detail UI' page (at master:4040) for
completed applications? Currently, I can see that page only for running
applications; I would like to see the various numbers for an application after
it has completed.
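One hedged possibility, assuming a Spark release that supports event logging: record the application's events so the UI data for a finished application can be reconstructed afterwards (e.g. by a history server). The log directory here is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch: enable event logging so a finished application's UI data survives.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")  // hypothetical directory
```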
> On Fri, Mar 28, 2014 at 9:24 AM, David Thomas wrote:
How can we replicate RDD elements? Say I have 1 element and 100 nodes in
the cluster. I need to replicate this one item on all the nodes i.e.
effectively create an RDD of 100 elements.
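One way to sketch this, assuming the single element is already on the driver: build the 100-element RDD directly, with one partition per element (names and counts below are illustrative):

```scala
val elem = "the-one-item"  // hypothetical element
val numNodes = 100
// Seq.fill repeats the element; numNodes partitions spread the copies out.
val replicated = sc.parallelize(Seq.fill(numNodes)(elem), numNodes)
```

Note that Spark does not guarantee one partition per physical node; the scheduler decides where partitions run, so this spreads the copies across partitions rather than pinning one to each machine.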
Is it possible to partition the RDD elements in a round-robin fashion? Say I
have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure
each element gets mapped to a different node in the cluster.
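A sketch of one approach: key each element by its index, then use a custom Partitioner that assigns partition index mod numPartitions (the class name is made up, and zipWithIndex assumes a Spark version that provides it). A partitioner controls which partition an element lands in, not which physical node runs it — that is still up to the scheduler:

```scala
import org.apache.spark.Partitioner

// Hypothetical round-robin partitioner: key i goes to partition i % numParts.
class RoundRobinPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int =
    (key.asInstanceOf[Long] % numParts).toInt
}

val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }  // (index, element)
val roundRobin = indexed.partitionBy(new RoundRobinPartitioner(5)).values
```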
> Spark runtime/scheduler traverses the DAG starting from
> that RDD and triggers evaluation of any parent RDDs it needs that
> aren't computed and cached yet.
>
> Any future operations build on the same DAG as long as you use the same
> RDD objects and, if you used cache, …
>
> (… should be lazy, but
> apparently uses an RDD.count call in its implementation:
> https://spark-project.atlassian.net/browse/SPARK-1021)
>
> David Thomas
> March 11, 2014 at 9:49 PM
For example, is the distinct() transformation lazy?

When I look at the Spark source code, distinct applies a map -> reduceByKey ->
map chain to the RDD elements. Why is this lazy? Won't the function be
applied immediately to the elements of the RDD when I call someRDD.distinct?
/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int): RDD[T] =
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
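The chain is lazy because map and reduceByKey are themselves transformations: calling distinct only composes them into the DAG, and nothing executes until an action runs. A sketch:

```scala
val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val distinctData = data.distinct()  // lazy: only records map -> reduceByKey -> map in the DAG
val result = distinctData.collect() // the action triggers the actual computation
```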
What is the concept of a Block and the BlockManager in Spark? How is a Block
related to a Partition of an RDD?
Is there any guide available on creating a custom RDD?
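There may not be an official guide, but a minimal custom RDD only needs to override getPartitions and compute. A sketch, with made-up class and parameter names (each partition here just yields a range of ints):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class SimplePartition(val idx: Int) extends Partition {
  override def index: Int = idx
}

// Hypothetical RDD: numParts partitions, each yielding perPart consecutive ints.
class SimpleRangeRDD(sc: SparkContext, numParts: Int, perPart: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDD dependencies

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => new SimplePartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[SimplePartition]
    (p.idx * perPart until (p.idx + 1) * perPart).iterator
  }
}
```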
I have an RDD of (K, Array[V]) pairs.
For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2)))
How can I do a groupByKey such that I get back an RDD of (K, Array[V]) pairs?
Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
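One sketch: since the values are already arrays, reduceByKey with array concatenation produces the merged form directly, and avoids materializing the grouped intermediate values that groupByKey would build:

```scala
val pairs = sc.parallelize(Seq(
  ("key1", Array(1, 2, 3)),
  ("key2", Array(3, 2, 4)),
  ("key1", Array(4, 3, 2))
))
// Concatenate the arrays per key: (key1, Array(1,2,3,4,3,2)), (key2, Array(3,2,4))
val merged = pairs.reduceByKey(_ ++ _)
```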
So I have this code:

rdd.foreach(p => {
  print(p)
})
Where can I see this output? Currently I'm running my Spark program on a
cluster. When I run the jar using sbt run, I see only INFO logs on the
console. Where should I look to see the application's sysout output?
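foreach runs inside the executors, so the print output goes to each worker's stdout (typically in log files under the worker's work directory on the cluster machines), not to the driver console. To see the values on the driver instead, one sketch, safe only for RDDs small enough to fit in driver memory:

```scala
// Bring the data back to the driver first; println then prints on the sbt console.
rdd.collect().foreach(println)
```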