I think the right answer is "don't do that," but if you really had to, you
could trigger a Dataset operation that does nothing per partition. I'd
presume that would be more reliable, because in practice the whole
partition has to be computed to make its elements available. Or go so far
as to loop over every element.
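
For example, a rough sketch in Scala (assuming an existing Dataset[Row]
named ds; this is an illustration, not a blessed API for forcing
evaluation):

    import org.apache.spark.sql.Row

    // Run a job that produces nothing per partition. One caveat: if the
    // function never consumes its iterator, a lazily computed partition
    // may not actually materialize every element.
    ds.foreachPartition { (_: Iterator[Row]) => () }

    // The thorough version: touch every element explicitly.
    ds.foreach { _ => () }

    // Same idea for warming the cache:
    ds.cache()
    ds.foreach { _ => () }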

On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows — exposing any
> problems, e.g. due to bad data — and to load data into cache to speed up
> subsequent operations.
>
> But as the optimizer gets smarter, I’m guessing it will eventually learn
> that it doesn’t have to do all that work to give the correct count. (This
> blog post
> <https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
> suggests that something like this is already happening.) This will change
> Spark’s practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
>
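
For reference, the pattern Nick describes is the usual (a sketch; df
stands in for any DataFrame or Dataset):

    df.persist()  // mark the data for caching
    df.count()    // an action: today this evaluates every row and fills the cache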
