Instead of .map, you could try .mapPartitions and compare the performance.
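Here is a rough sketch of what I mean, assuming PySpark and assuming your compute step has some one-time setup that can be shared across the records of a partition. All the names below (make_expensive_helper, compute_partition, some_rules, run) are made up for illustration and are not from your code. If there is no per-record setup to amortize, mapPartitions may not change much, and it does not by itself prevent cached blocks from being evicted.

```python
from typing import Callable, Iterable, Iterator


def make_expensive_helper() -> Callable[[int], int]:
    # Hypothetical stand-in for costly one-time setup
    # (loading a model, opening a connection, building a lookup table).
    offset = 10
    return lambda x: x * x + offset


def compute_partition(records: Iterable[int]) -> Iterator[int]:
    # Runs once per partition: the setup cost is paid once here,
    # not once per record as it would be inside a plain .map function.
    helper = make_expensive_helper()
    for record in records:
        yield helper(record)


def some_rules(record: int) -> bool:
    # Hypothetical filter predicate: keep even values only.
    return record % 2 == 0


def run(rdd):
    # Assumes `rdd` is a PySpark RDD of ints; filter and mapPartitions
    # are both lazy, so nothing executes until an action is called.
    return rdd.filter(some_rules).mapPartitions(compute_partition)
```

You would call run(Data) and then trigger it with an action such as count() or collect(). The function passed to mapPartitions receives an iterator over the partition and must return an iterable, which is why compute_partition is written as a generator.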
Thanks
Best Regards
On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote:
> For a large dataset, I want to filter out something and then do the
> computing intensive work.
>
> What I am doing now:
>
> Data.filter(somerules)
For a large dataset, I want to filter out some records and then do the
compute-intensive work.
What I am doing now:
filtered = Data.filter(somerules).cache()
filtered.count()
filtered.map(timeintensivecompute)
But this sometimes takes an unusually long time because of cache misses and
recomputation.
So I changed to thi