Why not do that with Spark SQL, so the work runs on the executors, rather than 
as a sequential filter on the driver?

SELECT * FROM A LEFT JOIN B ON A.fk = B.fk WHERE B.pk IS NULL LIMIT k

If you were sorting just so you could iterate in order, this might save you a 
couple of sorts too.
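
The LEFT JOIN plus IS NULL pattern in the query above is an anti-join, and it 
can be sanity-checked on toy data before running it on a cluster. The sketch 
below uses Python's built-in sqlite3 purely as a stand-in SQL engine, not 
Spark. The table contents, the extra `val` column, the helper name 
`top_k_missing` and k = 2 are all made up for illustration. It also adds an 
ORDER BY, on the assumption that "top k" means the k smallest keys, since 
LIMIT without ORDER BY returns an arbitrary k rows.

```python
import sqlite3


def top_k_missing(a_rows, b_rows, k):
    """Return the k smallest-fk rows of A that have no matching fk in B."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schemas: A(fk, val) and B(pk, fk).
    conn.execute("CREATE TABLE A (fk INTEGER, val TEXT)")
    conn.execute("CREATE TABLE B (pk INTEGER, fk INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?, ?)", a_rows)
    conn.executemany("INSERT INTO B VALUES (?, ?)", b_rows)
    rows = conn.execute(
        """
        SELECT A.fk, A.val
        FROM A LEFT JOIN B ON A.fk = B.fk
        WHERE B.pk IS NULL   -- unmatched rows only (assumes B.pk is never NULL)
        ORDER BY A.fk        -- assumption: "top k" means smallest fk first
        LIMIT ?
        """,
        (k,),
    ).fetchall()
    conn.close()
    return rows


a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
b = [(10, 2), (11, 4)]  # B covers fk 2 and 4, so 1, 3 and 5 are missing
print(top_k_missing(a, b, 2))
```

In Spark the same statement would go through spark.sql(...) once A and B are 
registered as temporary views, so the join and filter run on the executors 
and only the k result rows come back to the driver.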

https://richardstartin.com

> On 5 Jan 2017, at 10:40, Rohit Verma <rohit.ve...@rokittech.com> wrote:
> 
> Hi all,
> 
> I am aware that collect will return a list aggregated on the driver, which 
> will cause an OOM when the list is too big.
> Is toLocalIterator safe to use with a very big list? I want to access all 
> values one by one.
> 
> Basically, the goal is to compare two sorted RDDs (A and B) to find the top k 
> entries that are missing from B but present in A.
> 
> Rohit
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

