> Hi Jakob, I have a DataFrame with about 10 partitions. Based on the exact
> content of each partition I want to batch load some other data from a DB. I
> cannot operate in parallel due to resource constraints I have, hence I want
> to iterate sequentially over each partition and perform operations.


Ah I see. I think in that case your best option is to run several
jobs, selecting a different subset of your dataframe for each job and
running them one after the other. One way to do that would be to get
the underlying RDD, tag every element with its partition's index, and
then filter and iterate over the elements of each partition in turn. E.g.:

val withPartitionIndex = df.rdd.mapPartitionsWithIndex((idx, it) =>
  it.map(elem => (idx, elem)))

val n = withPartitionIndex.getNumPartitions
for (i <- 0 until n) {
  withPartitionIndex.filter { case (idx, _) => idx == i }.foreach { case (_, elem) =>
    // do something with elem
  }
}
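
If you go this route, it is probably also worth calling
withPartitionIndex.cache() before the loop, since each filter pass
otherwise recomputes the whole RDD from the DataFrame; that assumes
the indexed data fits in cluster memory.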

It's not the best use case for Spark though, and it will probably be a
performance bottleneck.
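
Another option to consider, if the per-partition work really has to run
on the driver, is to collapse each partition into a single element and
pull them back one at a time with toLocalIterator. A rough sketch (the
names are just illustrative, and it assumes each partition's contents
fit in driver memory):

val perPartition = df.rdd.mapPartitionsWithIndex((idx, it) =>
  Iterator((idx, it.toArray)))

// toLocalIterator fetches one partition at a time to the driver,
// so only a single partition is materialized locally at any point
perPartition.toLocalIterator.foreach { case (idx, rows) =>
  // do the DB batch load for this partition's rows here
}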

On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky <ja...@odersky.com> wrote:
> Hi Sujeet,
>
> going sequentially over all parallel, distributed data seems like a
> counter-productive thing to do. What are you trying to accomplish?
>
> regards,
> --Jakob
>
> On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog <sujeet....@gmail.com> wrote:
>> Hi,
>> Is there a way to iterate over a DataFrame with n partitions sequentially,
>>
>>
>> Thanks,
>> Sujeet
>>
