Re: Partitioning Dataset and Using Reduce in Apache Spark

raghav0110.cs Wed, 11 Mar 2015 23:13:46 -0700


 Thank you very much for your detailed response, it was very informative and 
cleared up some of my misconceptions. After your explanation, I understand that 
the distribution of the data and parallelism is all meant to be an abstraction 
to the developer.



In your response you say “When you call reduce and similar methods, each 
partition can be reduced in parallel. Then the results of that can be 
transferred across the network and reduced to the final result”. By similar 
methods do you mean all actions within spark? Does transfer of data from worker 
nodes to driver nodes happen only when an action is performed?


I am assuming that in Spark, you typically have a set of transformations 
followed by some sort of action. The RDD is partitioned and sent to different 
worker nodes(assuming this a cluster setup), the transformations are applied to 
the RDD partitions at the various worker nodes, and then when an action is 
performed, you perform the action on the worker nodes and then aggregate the 
partial results at the driver and then perform another reduction at the driver 
to obtain the overall results. I would also assume that deciding whether the 
action should be done on a worker node, depends on the type of action. For 
example, performing reduce at the worker node makes sense, while it doesn't 
make sense to save the file at the worker node.  Does that sound correct, or am 
I misinterpreting something?






Thanks,

Raghav





From: Daniel Siegmann
Sent: ‎Thursday‎, ‎March‎ ‎5‎, ‎2015 ‎2‎:‎01‎ ‎PM
To: Raghav Shankar
Cc: user@spark.apache.org








An RDD is a Resilient Distributed Data set. The partitioning and distribution 
of the data happens in the background. You'll occasionally need to concern 
yourself with it (especially to get good performance), but from an API 
perspective it's mostly invisible (some methods do allow you to specify a 
number of partitions).


When you call sc.textFile(myPath) or similar, you get an RDD. That RDD will be 
composed of a bunch of partitions, but you don't really need to worry about 
that. The partitioning will be based on how the data is stored. When you call a 
method that causes a shuffle (such as reduce), the data is repartitioned into a 
number of partitions based on your default parallelism setting (which IIRC is 
based on your number of cores if you haven't set it explicitly).

When you call reduce and similar methods, each partition can be reduced in 
parallel. Then the results of that can be transferred across the network and 
reduced to the final result. You supply the function and Spark handles the 
parallel execution of that function.

I hope this helps clear up your misconceptions. You might also want to 
familiarize yourself with the collections API in Java 8 (or Scala, or Python, 
or pretty much any other language with lambda expressions), since RDDs are 
meant to have an API that feels similar.



On Thu, Mar 5, 2015 at 9:45 AM, raggy <raghav0110...@gmail.com> wrote:

I am trying to use Apache spark to load up a file, and distribute the file to
several nodes in my cluster and then aggregate the results and obtain them.
I don't quite understand how to do this.

From my understanding the reduce action enables Spark to combine the results
from different nodes and aggregate them together. Am I understanding this
correctly?

From a programming perspective, I don't understand how I would code this
reduce function.

How exactly do I partition the main dataset into N pieces and ask them to be
parallel processed by using a list of transformations?

reduce is supposed to take in two elements and a function for combining
them. Are these 2 elements supposed to be RDDs from the context of Spark or
can they be any type of element? Also, if you have N different partitions
running parallel, how would reduce aggregate all their results into one
final result(since the reduce function aggregates only 2 elements)?

Also, I don't understand this example. The example from the spark website
uses reduce, but I don't see the data being processed in parallel. So, what
is the point of the reduce? If I could get a detailed explanation of the
loop in this example, I think that would clear up most of my questions.

class ComputeGradient extends Function<DataPoint, Vector> {
  private Vector w;
  ComputeGradient(Vector w) { this.w = w; }
  public Vector call(DataPoint p) {
    return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
  }
}

JavaRDD<DataPoint> points = spark.textFile(...).map(new
ParsePoint()).cache();
Vector w = Vector.random(D); // current separating plane
for (int i = 0; i < ITERATIONS; i++) {
  Vector gradient = points.map(new ComputeGradient(w)).reduce(new
AddVectors());
  w = w.subtract(gradient);
}
System.out.println("Final separating plane: " + w);

Also, I have been trying to find the source code for reduce from the Apache
Spark Github, but the source is pretty huge and I haven't been able to
pinpoint it. Could someone please direct me towards which file I could find
it in?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-Dataset-and-Using-Reduce-in-Apache-Spark-tp21933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Partitioning Dataset and Using Reduce in Apache Spark

Reply via email to