Hi all, I have a large data set which would not fit into memory. So, I want to take n rows from the RDD starting at a given index. For example, take 1000 rows starting from index 1001.
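To make the requirement concrete, here is a rough sketch of the kind of operation I am after. It is only an illustration built on RDD.zipWithIndex(): `takeRange` is a hypothetical helper name (not an existing RDD method), the index is assumed to be zero-based, and the local master and sample data are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object TakeRange {
  // Hypothetical helper: returns up to `count` elements starting at the
  // zero-based position `start`, in the RDD's existing order.
  def takeRange[T: ClassTag](rdd: RDD[T], start: Long, count: Int): Array[T] =
    rdd.zipWithIndex()                                   // pairs each element with its position
      .filter { case (_, idx) => idx >= start && idx < start + count }
      .map { case (row, _) => row }                      // drop the index again
      .collect()                                         // only `count` rows reach the driver

  def main(args: Array[String]): Unit = {
    // Placeholder local context and sample data, for illustration only.
    val conf = new SparkConf().setAppName("take-range").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 5000, 4)
      val batch = takeRange(rdd, start = 1001L, count = 1000)
      println(s"fetched ${batch.length} rows")
    } finally {
      sc.stop()
    }
  }
}
```

As far as I can tell, every call like this scans the whole RDD again, so fetching many batches this way would probably be expensive unless the RDD is cached.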
I see that there is a take(num: Int): Array[T] method on RDD, but it only returns the first n rows.

The simplest use case for this requirement is as follows. Say I write a custom relation provider with a custom relation extending InsertableRelation, and I submit this query:

"insert into table abc select * from xyz sort by x asc"

In my custom relation, I have implemented the def insert(data: DataFrame, overwrite: Boolean): Unit method. Since the data is large, I cannot call methods such as DataFrame.collect(). Instead, I could use DataFrame.foreachPartition(...). As you can see, the DataFrame resulting from "select * from xyz sort by x asc" is sorted, and if I run foreachPartition on that DataFrame to implement the insert method, the sorted order would not be preserved, since the inserts would run in parallel across partitions.

To handle this, my initial idea was to take rows from the RDD in batches and insert each batch in turn, and for that I was looking for a method that takes n rows starting from a given index.

Is there any better way to handle this with RDDs?

Your assistance in this regard is highly appreciated.

Cheers
--
Niranda
@n1r44 <https://twitter.com/N1R44>
https://pythagoreanscript.wordpress.com/