We use DataFrame and RDD. Dataset not only has issues with predicate
pushdown, it also adds shuffles at times when it shouldn't, and there is
some overhead from the encoders themselves, because under the hood it is
still just Row objects.
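
To make the pushdown point concrete, here is a minimal sketch (the Parquet
path and the Event fields are hypothetical, just for illustration). The
Column-based filter shows up under PushedFilters in the physical plan from
explain(); the typed lambda is opaque to Catalyst and does not:

import org.apache.spark.sql.SparkSession

// Hypothetical input: a Parquet file with an integer column "x".
case class Event(x: Int, name: String)

object PushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("events.parquet")

    // Column-based filter: Catalyst understands the expression, so the
    // Parquet scan in explain() lists it under PushedFilters.
    df.filter($"x" > 1).explain()

    // Typed filter: the lambda is opaque bytecode, so every row must be
    // deserialized into an Event before filtering -- nothing is pushed down.
    df.as[Event].filter(_.x > 1).explain()

    spark.stop()
  }
}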


On Mon, Jun 18, 2018 at 5:00 PM, Valery Khamenya <khame...@gmail.com> wrote:

> Hi Spark gurus,
>
> I was surprised to read here:
> https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap
>
> that filters are not pushed down in typed Datasets and one should rather
> stick to Dataframes.
>
> But writing code for groupByKey + mapGroups is a headache with Dataframes
> compared to the typed Dataset. The latter mostly doesn't force you to write
> any Encoders (unless you try to write generic transformations on a
> parametrized Dataset[T]), nor does the typed Dataset force you into ugly
> Row parsing with getInt, getString, etc. -- like it is needed with
> Dataframes.
>
> So, what should the poor Spark user rely on by default, if the goal is to
> deliver a library of  data transformations -- Dataset or Dataframe?
>
> best regards
> --
> Valery
>
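
For concreteness, the contrast described above looks roughly like this (a
minimal sketch with hypothetical columns user/amount, not code from the
thread):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

case class Purchase(user: String, amount: Double)

// Typed route: encoders for the case class and the result tuple are
// derived implicitly, and fields are plain accessors.
def totalsTyped(ds: Dataset[Purchase]): Dataset[(String, Double)] = {
  import ds.sparkSession.implicits._
  ds.groupByKey(_.user)
    .mapGroups((user, rows) => (user, rows.map(_.amount).sum))
}

// DataFrame route through the same typed operators: every field has to be
// pulled out of the generic Row by hand.
def totalsUntyped(df: DataFrame): DataFrame = {
  import df.sparkSession.implicits._
  df.groupByKey((r: Row) => r.getString(r.fieldIndex("user")))
    .mapGroups((user, rows) =>
      (user, rows.map(r => r.getDouble(r.fieldIndex("amount"))).sum))
    .toDF("user", "total")
}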
