In Spark 2.0, Dataset and DataFrame are unified. Would this simplify your use case?
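A minimal sketch of what that unification buys you, assuming Spark 2.0 and a schema that is only known at runtime; the input source, column names, transformation, and the use of RowEncoder below are illustrative, not part of the original thread:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Assumed input source and column names, for illustration only.
val spark = SparkSession.builder().appName("dynamic-map").getOrCreate()
val df: DataFrame = spark.read.json("input.json")

// In Spark 2.0 a DataFrame is just Dataset[Row], so map returns another
// Dataset as long as an Encoder for the output rows is supplied. With a
// schema known only at runtime, RowEncoder (a catalyst package, but widely
// used) builds that encoder from a StructType instead of a bean class.
val outSchema = StructType(Seq(StructField("joined", StringType)))
val mapped = df.map(row => Row(row.mkString("|")))(RowEncoder(outSchema))

mapped.show()
```

So the chaining you describe works without dropping down to an RDD, at the cost of constructing the output encoder yourself from the dynamic schema.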
On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <mar...@attivio.com> wrote:

> Hi,
>
> I'm exposing a custom source to the Spark environment, and I have a question
> about the best way to approach this problem.
>
> I created a custom relation for my source and it creates a DataFrame<Row>.
> My custom source knows the data types, which are *dynamic*, so this seemed
> to be the appropriate return type. This works fine.
>
> The next step I want to take is to expose some custom mapping functions
> (written in Java). But when I look at the APIs, the map method for
> DataFrame returns an RDD (not a DataFrame). Should I use
> SqlContext.createDataFrame on the result? Does this result in additional
> processing overhead? The Dataset type seems to be closer to what I'm
> looking for: its map method returns a Dataset, so chaining transformations
> together is a natural exercise.
>
> But to create a Dataset from a DataFrame, it appears that I have to
> provide the types of each field in the Row in the DataFrame.as[...]
> method. I would think the DataFrame would be able to do this
> automatically, since it already has all the types.
>
> This leads me to wonder how I should be approaching this effort. As all
> the fields and types are dynamic, I cannot use beans as my type when
> passing data around. Any advice would be appreciated.
>
> Thanks,
> Martin
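For the 1.x APIs the quoted message is working against, a hedged sketch of the createDataFrame round trip it asks about; the helper name and signature are hypothetical, and the mapping function is whatever your Java/Scala transformation happens to be:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Hypothetical helper for the Spark 1.x path described above: map over the
// rows (which leaves the DataFrame world and yields an RDD[Row]) and then
// rebuild a DataFrame by re-attaching the schema carried alongside, since
// the field types are only known at runtime.
def mapRows(sqlContext: SQLContext, df: DataFrame, outSchema: StructType)
           (fn: Row => Row): DataFrame = {
  val mapped: RDD[Row] = df.rdd.map(fn)
  // createDataFrame does not eagerly materialize the data, but the mapped
  // step is opaque to the optimizer, so push-downs stop at this boundary.
  sqlContext.createDataFrame(mapped, outSchema)
}
```

As far as I know, the round trip itself is cheap since nothing is computed eagerly; the main overhead to be aware of is that Catalyst cannot see inside the mapping function, so optimizations do not cross that step.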