Hi Martin,

Since your schema is dynamic, how would you use Datasets? Would you know
ahead of time the row type T in a Dataset[T]?

One option is to start with DataFrames at the beginning of your data
pipeline, figure out the field types there, and then switch over to
RDDs or Datasets for the next stage of the pipeline.
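
Roughly (untested, and "my.custom.source" is just a placeholder for
your relation provider):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row

  val df = sqlContext.read.format("my.custom.source").load()
  val schema = df.schema                     // names/types discovered at runtime
  val rows: RDD[Row] = df.rdd                // untyped rows for the next stage
  val mapped: RDD[Row] = rows.map(identity)  // your Java mappers would go here
  val df2 = sqlContext.createDataFrame(mapped, schema)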

Also, I'm not sure what the custom Java mappers are doing - could you use
them as UDFs within a DataFrame?
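
If they operate field by field, something like this might work (again
untested; normalize is a stand-in for one of your functions):

  import org.apache.spark.sql.functions.udf

  val normalize = udf { (s: String) => s.trim.toLowerCase }
  val out = df.withColumn("fieldA", normalize(df("fieldA")))

That keeps everything in DataFrame land, so the dynamic schema never
has to be spelled out. For functions that must stay in Java, there is
also sqlContext.udf().register(...) with the UDF1/UDF2 interfaces.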

Xinh

On Fri, Jun 24, 2016 at 11:42 AM, Martin Serrano <mar...@attivio.com> wrote:

> Indeed.  But I'm dealing with 1.6 for now, unfortunately.
>
>
> On 06/24/2016 02:30 PM, Ted Yu wrote:
>
> In Spark 2.0, Dataset and DataFrame are unified (DataFrame is just an
> alias for Dataset[Row]).
>
> Would this simplify your use case?
>
> On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano <mar...@attivio.com>
> wrote:
>
>> Hi,
>>
>> I'm exposing a custom source to the Spark environment.  I have a question
>> about the best way to approach this problem.
>>
>> I created a custom relation for my source, and it produces a
>> DataFrame<Row>.  My custom source knows the data types, which are
>> *dynamic*, so this seemed to be the appropriate return type.  This works
>> fine.
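>>
>> (For context, the relation is shaped roughly like this -- the names
>> and bodies below are made up for illustration:
>>
>>   import org.apache.spark.rdd.RDD
>>   import org.apache.spark.sql.{Row, SQLContext}
>>   import org.apache.spark.sql.sources.{BaseRelation, TableScan}
>>   import org.apache.spark.sql.types._
>>
>>   class MyRelation(ctx: SQLContext) extends BaseRelation with TableScan {
>>     override def sqlContext: SQLContext = ctx
>>     // the real schema is computed at runtime by inspecting the source
>>     override def schema: StructType =
>>       StructType(Seq(StructField("f0", StringType)))
>>     override def buildScan(): RDD[Row] =
>>       ctx.sparkContext.parallelize(Seq(Row("x")))
>>   }
>>
>> so the field types are only known once schema has been built.)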
>>
>> The next step I want to take is to expose some custom mapping functions
>> (written in Java).  But when I look at the APIs, the map method on
>> DataFrame returns an RDD, not a DataFrame.  (Should I use
>> SQLContext.createDataFrame on the result? -- and does that add
>> processing overhead?)  The Dataset type seems closer to what I'm
>> looking for: its map method returns another Dataset, so chaining
>> calls together is a natural exercise.
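>>
>> That is, something like this (transform() stands in for one of my
>> Java mappers):
>>
>>   val mapped = df.map(row => transform(row))       // RDD[Row] in 1.6
>>   val df2 = sqlContext.createDataFrame(mapped, df.schema)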
>>
>> But to create a Dataset from a DataFrame, it appears that I have to
>> provide the type of each field in the Row to the DataFrame.as[...]
>> method.  I would have thought the DataFrame could do this
>> automatically, since it already has all the types.
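>>
>> For example (the tuple type is just for illustration, and assumes
>> import sqlContext.implicits._ is in scope):
>>
>>   val ds = df.as[(String, Int)]   // field types must be spelled out here
>>
>> even though df.schema already carries exactly this information.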
>>
>> This leads me to wonder how I should be approaching this effort.  As all
>> the fields and types are dynamic, I cannot use beans as my type when
>> passing data around.  Any advice would be appreciated.
>>
>> Thanks,
>> Martin
>>
