I am trying to submit a Spark application from the command line, using the spark-submit command. I initially set up my Spark application in Eclipse and have been making changes there. I recently obtained my own version of the Spark source code and added a new method to RDD.scala.
For a research project, I tried sorting the elements in an RDD, using two different approaches.
In the first approach, I applied mapPartitions() to the RDD, so that it would sort the contents of the RDD and provide a result RDD that contains the sorted list as the only record in …
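
The per-partition sort idea can be sketched without Spark, using plain Scala collections to stand in for an RDD's partitions (all names and data here are illustrative, not Spark API):

```scala
// Plain-Scala sketch of the mapPartitions approach: each "partition" is a Seq,
// sorted locally, and the sorted runs are then merged into one sorted list.
object PartitionSortSketch {
  // stand-ins for an RDD's partitions
  val partitions: Seq[Seq[Int]] = Seq(Seq(5, 1, 4), Seq(3, 9), Seq(2, 8, 7))

  // what a per-partition sort would produce: one sorted list per partition,
  // i.e. the sorted list is the only record in that partition
  val sortedRuns: Seq[Seq[Int]] = partitions.map(_.sorted)

  // merge two sorted runs (the step a final reduce would perform)
  def mergeSorted(a: Seq[Int], b: Seq[Int]): Seq[Int] =
    (a ++ b).sorted // simple but correct; a linear two-pointer merge would also work

  val fullySorted: Seq[Int] = sortedRuns.reduce(mergeSorted)
}
```

In actual Spark, the same shape would be roughly `rdd.mapPartitions(it => Iterator(it.toSeq.sorted))`, followed by a merge of the per-partition runs.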
I am trying to implement top-k in Scala within Apache Spark. I am aware that Spark has a top() action, but top() uses reduce(). Instead, I would like to use treeReduce(), so that I can compare the performance of reduce() and treeReduce().
The main issue I have is that I cannot use these two lines: …
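
Since the two lines in question are not shown here, a generic sketch of top-k built on an associative merge (the operation that either reduce() or treeReduce() would apply) may still be useful. This is plain Scala with illustrative names and data, not Spark API:

```scala
object TopKSketch {
  // Keep only the k largest elements, in descending order.
  def topK(k: Int)(xs: Seq[Int]): Seq[Int] = xs.sortBy(-_).take(k)

  // Merge two partial top-k lists. Because this operation is associative,
  // it could be handed to either reduce() or treeReduce() in Spark.
  def merge(k: Int)(a: Seq[Int], b: Seq[Int]): Seq[Int] = topK(k)(a ++ b)

  // stand-ins for an RDD's partitions
  val partitions: Seq[Seq[Int]] = Seq(Seq(4, 17, 3), Seq(25, 1), Seq(9, 12, 6))
  val k = 3

  // per-partition top-k first, then merge the partial results
  val result: Seq[Int] = partitions.map(topK(k)).reduce(merge(k))
}
```

The key point for the reduce()-vs-treeReduce() comparison is that only the combine step changes; the per-partition top-k stays the same.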
I am trying to understand what the treeReduce function for an RDD does, and how it differs from the normal reduce function. My current understanding is that treeReduce tries to split the reduce into multiple steps: we do a partial reduce on different nodes, and then a final reduce is done …
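
That mental model can be simulated in plain Scala: partial results are combined pairwise, level by level, so the number of combine rounds grows logarithmically rather than linearly. This is an illustrative sketch, not Spark's actual implementation:

```scala
object TreeReduceSketch {
  // Pairwise, level-by-level reduction: group the partials in pairs, reduce
  // each pair, and repeat until one value remains.
  def treeReduce[A](xs: Seq[A])(op: (A, A) => A): A = {
    require(xs.nonEmpty, "treeReduce of an empty collection")
    if (xs.size == 1) xs.head
    else treeReduce(xs.grouped(2).map(g => g.reduce(op)).toSeq)(op)
  }

  val partials = Seq(1, 2, 3, 4, 5) // e.g. per-partition sums
  val total = treeReduce(partials)(_ + _)
}
```

Spark's `treeReduce(f, depth)` similarly performs partial aggregation on the executors in rounds before the final combine reaches the driver; the `depth` parameter controls the number of levels.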
For a class project, I am trying to get two Spark applications to communicate with each other by passing an RDD object created in one application to another Spark application. The first application is developed in Scala; it creates an RDD and sends it to the second application over the …
I am a PhD student working on a research project related to Apache Spark. I am trying to modify some of the Spark source code so that, instead of sending the final result RDDs from the worker nodes to a master node, I can send them to some different node. In order to do this, …
I am trying to debug a Spark application on a cluster with a master and several worker nodes. I have been successful at setting up the master and worker nodes using the Spark standalone cluster manager. I downloaded the Spark folder with binaries and use the following commands to set up the workers: …
I am a PhD student trying to understand the internals of Spark, so that I can make some modifications to it. I am trying to understand how the aggregation of the distributed datasets (over the network) onto the driver node works. I would very much appreciate it if someone could point me towards …
I am trying to use Apache Spark to load a file, distribute it to several nodes in my cluster, and then aggregate and obtain the results. I don't quite understand how to do this.
From my understanding, the reduce action enables Spark to combine the results from different nodes and …
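
The combine step can be illustrated in plain Scala: each node produces a partial result from its slice of the file (here, a line count), and reduce folds the partials together with an associative operation. Names and numbers are made up for illustration:

```scala
object DriverAggregationSketch {
  // Partial results, one per node, as if each worker counted the lines
  // in its slice of the input file.
  val lineCountsPerNode: Seq[Long] = Seq(120L, 98L, 77L)

  // The driver-side combine: fold the partials with an associative op,
  // which is what a reduce action does with per-node results.
  val totalLines: Long = lineCountsPerNode.reduce(_ + _)
}
```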