I think rdd.toLocalIterator is what you want. But it will keep one partition's data in-memory.
On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera <niranda.per...@gmail.com> wrote: > Hi all, > > I have a large set of data which would not fit into the memory. So, I wan > to take n number of data from the RDD given a particular index. for an > example, take 1000 rows starting from the index 1001. > > I see that there is a take(num: Int): Array[T] method in the RDD, but it > only returns the 'first n number of rows'. > > the simplest use case of this, requirement is, say, I write a custom > relation provider with a custom relation extending the InsertableRelation. > > say I submit this query, > "insert into table abc select * from xyz sort by x asc" > > in my custom relation, I have implemented the def insert(data: DataFrame, > overwrite: Boolean): Unit > method. here, since the data is large, I can not call methods such as > DataFrame.collect(). Instead, I could do, DataFrame.foreachpartition(...). > As you could see, the resultant DF from the "select * from xyz sort by x > asc" is sorted, and if I sun, foreachpartition on that DF and implement the > insert method, this sorted order would be affected, since the inserting > operation would be done in parallel in each partition. > > in order to handle this, my initial idea was to take rows from the RDD in > batches and do the insert operation, and for that I was looking for a > method to take n number of rows starting from a given index. > > is there any better way to handle this, in RDDs? > > your assistance in this regard is highly appreciated. > > cheers > > -- > Niranda > @n1r44 <https://twitter.com/N1R44> > https://pythagoreanscript.wordpress.com/ >