Why not do that with Spark SQL, so the work runs on the executors rather than as a sequential filter on the driver?

    SELECT * FROM A LEFT JOIN B ON A.fk = B.fk WHERE B.pk IS NULL LIMIT k

If you were sorting just so you could iterate in order, this might save you a couple of sorts too.

https://richardstartin.com

> On 5 Jan 2017, at 10:40, Rohit Verma <rohit.ve...@rokittech.com> wrote:
>
> Hi all,
>
> I am aware that collect will return a list aggregated on the driver; this will
> cause an OOM when the list is too big.
> Is toLocalIterator safe to use with a very big list? I want to access all
> values one by one.
>
> Basically the goal is to compare two sorted RDDs (A and B) to find the top k
> entries missing from B but present in A.
>
> Rohit
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
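
For anyone who wants to check what the left join with the IS NULL filter returns before running it on a cluster, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Spark SQL, which is an assumption made purely so the example runs anywhere. The helper name top_k_missing and the sample rows are hypothetical. The ORDER BY is also an addition: the query in the reply has no ordering, so "top k" is only well defined once you pick one.

```python
import sqlite3


def top_k_missing(a_fks, b_rows, k):
    """Return up to k fk values present in A with no matching row in B.

    a_fks  : iterable of fk values for table A (hypothetical sample data)
    b_rows : iterable of (pk, fk) tuples for table B (hypothetical sample data)
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE A (fk INTEGER)")
        conn.execute("CREATE TABLE B (pk INTEGER, fk INTEGER)")
        conn.executemany("INSERT INTO A (fk) VALUES (?)", [(fk,) for fk in a_fks])
        conn.executemany("INSERT INTO B (pk, fk) VALUES (?, ?)", list(b_rows))
        # Rows of A that find no partner in B come back with B.pk set to NULL,
        # so the WHERE clause keeps exactly the entries missing from B.
        cur = conn.execute(
            "SELECT A.fk FROM A LEFT JOIN B ON A.fk = B.fk "
            "WHERE B.pk IS NULL ORDER BY A.fk LIMIT ?",
            (k,),
        )
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    # A holds fk 1..6, B only has rows for fk 2 and 4.
    print(top_k_missing([1, 2, 3, 4, 5, 6], [(10, 2), (11, 4)], 3))
```

In Spark the same statement would be run through the SQL interface after registering A and B as tables or views, and the join and filter would then be executed by the executors rather than by iterating on the driver.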