Submitting Spark Applications using Spark Submit

2015-06-16 Thread raggy
I am trying to submit a Spark application from the command line using the spark-submit command. I initially set up my Spark application in Eclipse and have been making changes there. I recently obtained my own copy of the Spark source code and added a new method to RDD.scala.

Different Sorting RDD methods in Apache Spark

2015-06-09 Thread raggy
For a research project, I tried sorting the elements in an RDD. I did this using two different approaches. In the first method, I applied a mapPartitions() function on the RDD, so that it would sort the contents of the RDD, and provide a result RDD that contains the sorted list as the only record in
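
The first approach described above can be illustrated with a small plain-Python sketch. It does not use Spark: the nested lists stand in for RDD partitions, `sort_partition` is a hypothetical helper that mimics what a function passed to mapPartitions() might do, and the final k-way merge is an assumption about how the per-partition results could be combined, not something stated in the post.

```python
import heapq

def sort_partition(partition):
    """Mimic a mapPartitions() step: return a single record, the sorted list."""
    return [sorted(partition)]

# Hypothetical data: each inner list stands in for one RDD partition.
partitions = [[9, 2, 7], [4, 1], [8, 3, 6, 5]]

# After this step every "partition" holds exactly one record: its sorted list.
sorted_partitions = [sort_partition(p) for p in partitions]

# Assumed follow-up step: merge the already-sorted lists into one global order.
merged = list(heapq.merge(*(records[0] for records in sorted_partitions)))
```

Sorting inside each partition gives only local order; some merge or range-partitioning step is still needed before the data is globally sorted.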

Implementing top() using treeReduce()

2015-06-09 Thread raggy
I am trying to implement top-k in Scala within Apache Spark. I am aware that Spark has a top() action, but top() uses reduce(). Instead, I would like to use treeReduce(), so that I can compare the performance of reduce() and treeReduce(). The main issue I have is that I cannot use these 2 lines
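
As a rough illustration of the idea (not the poster's code, and not Spark's implementation of top()), the sketch below computes top-k in plain Python by taking a local top-k per partition and then merging the partial results pairwise, the way a tree-style reduce would. The names `merge_top_k` and `top_k_tree`, and the sample data, are hypothetical.

```python
import heapq

def merge_top_k(left, right, k):
    """Combine two partial top-k lists into one; order of arguments does not matter."""
    return heapq.nlargest(k, left + right)

def top_k_tree(partitions, k):
    """Local top-k per partition, then pairwise merges until one list remains."""
    partials = [heapq.nlargest(k, p) for p in partitions]
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            pair = partials[i:i + 2]
            # An unpaired partial at the end of a level is carried up unchanged.
            merged.append(merge_top_k(pair[0], pair[1], k) if len(pair) == 2 else pair[0])
        partials = merged
    return partials[0] if partials else []

# Hypothetical data: four partitions, asking for the three largest values.
result = top_k_tree([[5, 1, 9], [7, 3], [8, 2, 6], [4]], 3)
```

Because the merge function is associative and commutative, it could in principle be passed to either a flat or a tree-shaped reduce and produce the same answer.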

TreeReduce Functionality in Spark

2015-06-03 Thread raggy
I am trying to understand what the treeReduce function for an RDD does, and how it is different from the normal reduce function. My current understanding is that treeReduce tries to split up the reduce into multiple steps. We do a partial reduce on different nodes, and then a final reduce is done
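
The difference described above can be sketched in plain Python without Spark. This is a simplified model, not Spark's implementation: the nested lists stand in for partitions, and `flat_reduce`, `tree_reduce`, and the `fan_in` parameter are hypothetical names chosen for illustration.

```python
from functools import reduce
from operator import add

def flat_reduce(partitions, f):
    """Reduce each partition locally, then combine all partials in one final step."""
    partials = [reduce(f, p) for p in partitions if p]
    if not partials:
        raise ValueError("cannot reduce an empty dataset")
    return reduce(f, partials)

def tree_reduce(partitions, f, fan_in=2):
    """Reduce each partition locally, then merge partials level by level."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    partials = [reduce(f, p) for p in partitions if p]
    if not partials:
        raise ValueError("cannot reduce an empty dataset")
    while len(partials) > 1:
        # Each level combines groups of up to fan_in partial results.
        partials = [reduce(f, partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

# Hypothetical data: four partitions holding the numbers 1 to 10.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
flat_total = flat_reduce(partitions, add)
tree_total = tree_reduce(partitions, add)
```

Both functions return the same value for an associative, commutative function; what changes is how many partial results have to be combined in any single step.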

Sending RDD object over the network

2015-04-05 Thread raggy
For a class project, I am trying to have two Spark applications communicate with each other by passing an RDD object created in one application to the other Spark application. The first application is developed in Scala and creates an RDD and sends it to the 2nd application over the

Task result in Spark Worker Node

2015-03-29 Thread raggy
I am a PhD student working on a research project related to Apache Spark. I am trying to modify some of the Spark source code so that, instead of sending the final result RDD from the worker nodes to a master node, the final result RDDs are sent to a different node. In order to do this,

Launching Spark Cluster Application through IDE

2015-03-19 Thread raggy
I am trying to debug a Spark application on a cluster using a master and several worker nodes. I have successfully set up the master node and worker nodes using the Spark standalone cluster manager. I downloaded the Spark folder with binaries and use the following commands to set up worker

Aggregation of distributed datasets

2015-03-13 Thread raggy
I am a PhD student trying to understand the internals of Spark, so that I can make some modifications to it. I am trying to understand how the aggregation of the distributed datasets (through the network) onto the driver node works. I would very much appreciate it if someone could point me towards

Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread raggy
I am trying to use Apache Spark to load a file, distribute it to several nodes in my cluster, and then aggregate the results and obtain them. I don't quite understand how to do this. From my understanding, the reduce action enables Spark to combine the results from different nodes and