Re: taking n rows from an RDD starting from an index

2015-09-02 Thread Hemant Bhanawat
I think rdd.toLocalIterator is what you want. But it will keep one partition's data in memory. On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote:
> Hi all,
>
> I have a large set of data which would not fit into memory. So, I want
> to take n rows from
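A minimal sketch of this suggestion, assuming a local SparkContext and using the start/count values from the original question (the variable names are hypothetical, not from the mail):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("take-range").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100000, 8)

    // toLocalIterator streams the RDD to the driver one partition at a time,
    // so only a single partition's data has to fit in driver memory at once.
    val start = 1001  // index to start from
    val n     = 1000  // number of rows to take
    val slice = rdd.toLocalIterator.slice(start, start + n).toArray

The driver still iterates over all rows before index `start`, but it never materializes more than one partition at a time.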

Re: taking n rows from an RDD starting from an index

2015-09-02 Thread Juan Rodríguez Hortalá
Hi,

Maybe you could use zipWithIndex and filter to skip the first elements. For example, starting from:

scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10),
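Continuing that example, the filter step might look like the sketch below (the start/n values are illustrative):

    val start = 5L
    val n     = 10L
    val window = sc.parallelize(100 to 120, 4)
      .zipWithIndex                                            // pair each element with its Long index
      .filter { case (_, i) => i >= start && i < start + n }   // keep indices [start, start + n)
      .map { case (v, _) => v }                                // drop the index again

    window.collect()  // Array(105, 106, ..., 114)

Note that zipWithIndex itself triggers a Spark job when the RDD has more than one partition, since it must first count the elements per partition to compute the index offsets.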

Re: taking n rows from an RDD starting from an index

2015-09-02 Thread Niranda Perera
Hi all, thank you for your responses. After taking a look at the implementation of rdd.collect(), I thought of using the rdd.runJob(...) method:

for (int i = 0; i < dataFrame.rdd().partitions().length; i++) {
  dataFrame.sqlContext().sparkContext().runJob(data.rdd(), some
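The quoted code is cut off, but a per-partition runJob loop could look roughly like the sketch below (in Scala; takeRange is a hypothetical helper, and it assumes a Spark version where the runJob(rdd, func, partitions) overload without the deprecated allowLocal flag is available):

    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Pull rows [start, start + n) to the driver by running one job per
    // partition, so at most one partition is materialized at a time.
    def takeRange[T: ClassTag](rdd: RDD[T], start: Long, n: Long): Array[T] = {
      val sc   = rdd.sparkContext
      val out  = new ArrayBuffer[T]()
      var seen = 0L   // rows passed over in earlier partitions
      var p    = 0
      while (p < rdd.partitions.length && out.length < n) {
        // runJob evaluates the function on just the listed partitions.
        val Array(part) = sc.runJob(rdd, (it: Iterator[T]) => it.toArray, Seq(p))
        val from = math.max(0L, start - seen).toInt
        if (from < part.length) {
          val take = math.min(part.length - from, n - out.length).toInt
          out ++= part.slice(from, from + take)
        }
        seen += part.length
        p += 1
      }
      out.toArray
    }

This still deserializes each visited partition in full on the driver before slicing it, so the zipWithIndex/filter approach above may be preferable when the slice should stay distributed.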

taking n rows from an RDD starting from an index

2015-09-01 Thread Niranda Perera
Hi all, I have a large set of data which would not fit into memory. So, I want to take n rows from the RDD starting at a particular index. For example, take 1000 rows starting from index 1001. I see that there is a take(num: Int): Array[T] method in the RDD, but it only returns
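For context, take(num) always returns the first num rows of the RDD; there is no built-in variant that starts at an arbitrary index, hence the workarounds in the replies above. An illustrative REPL session:

scala> sc.parallelize(100 to 120, 4).take(5)
res0: Array[Int] = Array(100, 101, 102, 103, 104)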