Re: Global sequential access of elements in RDD

2015-02-27 Thread Imran Rashid
Why would you want to use spark to sequentially process your entire data
set?  The entire purpose is to let you do distributed processing -- which
means letting partitions get processed simultaneously by different cores /
nodes.

That being said, occasionally in a bigger pipeline with a lot of
distributed operations, you might need to do one segment in a completely
sequential manner.  You have a few options -- just be aware that with all
of them, you are working *around* the idea of an RDD, so make sure you have
a really good reason.

1) rdd.toLocalIterator.  Still pulls all of the data to the driver, just
like rdd.collect(), but it's slightly more scalable since it won't store
*all* of the data in memory on the driver (it does still store all of the
data in one partition in memory, though.)
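For instance, applied to the snippet from the original question, option 1 might look like this (a minimal sketch; the master and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalIteratorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LocalIteratorExample")
    val sc = new SparkContext(conf)
    val rdd = sc.
      parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
      sortBy(x => x, true)
    // toLocalIterator fetches one partition at a time, so the driver only
    // ever holds a single partition in memory while iterating in sorted order.
    rdd.toLocalIterator.foreach(println)
    sc.stop()
  }
}
```

Because sortBy range-partitions the data, iterating partition 0, then 1, etc. yields the globally sorted sequence on the driver.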

2) write the rdd to some external data storage (e.g. hdfs), and then read
the data sequentially off of hdfs on your driver.  Still needs to pull all
of the data to the driver, but you can get it to avoid pulling an entire
partition into memory and make it streaming.

3) create a number of rdds that consist of just one partition of your
original rdd, and then execute actions on them sequentially:

val originalRDD = ... // this should be cached to make sure you don't recompute it
(0 until originalRDD.partitions.size).foreach { partitionIdx =>
  val prunedRDD = new PartitionPruningRDD(originalRDD, x => x == partitionIdx)
  prunedRDD.runSomeActionHere()
}

note that PartitionPruningRDD is a developer api, however.  This will run
your action on one partition at a time, and ideally the tasks will be
scheduled on the same node where the partitions have been cached, so you
don't need to move the data around.  But again, because you're turning it
into a sequential program, most of your cluster is sitting idle, and you're
not really leveraging spark ...


imran

On Fri, Feb 27, 2015 at 1:38 AM, Wush Wu w...@bridgewell.com wrote:

 Dear all,

 I want to implement some sequential algorithm on RDD.

 For example:

 val conf = new SparkConf()
 conf.setMaster("local[2]").
   setAppName("SequentialSuite")
 val sc = new SparkContext(conf)
 val rdd = sc.
   parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
   sortBy(x => x, true)
 rdd.foreach(println)

 I want to see the ordered numbers on my screen, but it shows unordered
 integers. The two partitions execute the println simultaneously.

 How do I make the RDD execute a function in a globally sequential order?

 Best,
 Wush



Re: Global sequential access of elements in RDD

2015-02-27 Thread Wush Wu
Thanks for your reply.

But your code snippet uses `collect`, which is not feasible for me.
My algorithm involves a large amount of data and I do not want to transfer
it all to the driver.

Wush

2015-02-27 16:27 GMT+08:00 Yanbo Liang yblia...@gmail.com:

 Actually, sortBy will return an ordered RDD.
 Your output showing unordered integers may be due to foreach.

 You can reference the following code snippet; it will return the ordered
 integers [1,1,1,2,2,3,4,5,7,8,9]

 val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
   sortBy(x => x, true)
 println(rdd.collect().mkString(","))



 2015-02-27 15:38 GMT+08:00 Wush Wu w...@bridgewell.com:

 Dear all,

 I want to implement some sequential algorithm on RDD.

 For example:

 val conf = new SparkConf()
 conf.setMaster("local[2]").
   setAppName("SequentialSuite")
 val sc = new SparkContext(conf)
 val rdd = sc.
   parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
   sortBy(x => x, true)
 rdd.foreach(println)

 I want to see the ordered numbers on my screen, but it shows unordered
 integers. The two partitions execute the println simultaneously.

 How do I make the RDD execute a function in a globally sequential order?

 Best,
 Wush





Global sequential access of elements in RDD

2015-02-26 Thread Wush Wu
Dear all,

I want to implement some sequential algorithm on RDD.

For example:

val conf = new SparkConf()
conf.setMaster("local[2]").
  setAppName("SequentialSuite")
val sc = new SparkContext(conf)
val rdd = sc.
  parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
  sortBy(x => x, true)
rdd.foreach(println)

I want to see the ordered numbers on my screen, but it shows unordered
integers. The two partitions execute the println simultaneously.

How do I make the RDD execute a function in a globally sequential order?

Best,
Wush