Re: DataFrame versus Dataset creation and usage

2016-06-28 Thread Martin Serrano
Xinh,

Thanks for the clarification.  I'm new to Spark and trying to navigate the 
different APIs.  I was just following some examples and retrofitting them, but 
I see now I should stick with plain RDDs until my schema is known (at the end 
of the data pipeline).
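
In case it helps anyone who lands on this thread later, the shape I'm going 
with is roughly this (an untested sketch; finishPipeline and the variable 
names are mine, not from any API):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.StructType

    // Carry the rows and the (dynamically discovered) schema side by side
    // through the pipeline, and only build a DataFrame once the schema is
    // final.
    def finishPipeline(sqlContext: SQLContext,
                       rows: RDD[Row],
                       schema: StructType): DataFrame =
      sqlContext.createDataFrame(rows, schema)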

Thanks again!

On 06/24/2016 04:57 PM, Xinh Huynh wrote:
Hi Martin,

Since your schema is dynamic, how would you use Datasets? Would you know ahead 
of time the row type T in a Dataset[T]?

One option is to start with DataFrames at the beginning of your data pipeline, 
figure out the field types, and then switch over to RDDs or Datasets in the 
next stage of the pipeline.
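
Roughly like this (a sketch, not tested; com.example.mysource is a made-up 
source name):

    // Stage 1: let the custom relation produce a DataFrame and read the
    // schema it discovered.
    val df = sqlContext.read.format("com.example.mysource").load()
    val schema = df.schema   // StructType with the inferred field types

    // Stage 2: drop to RDD[Row] for the schema-free middle of the pipeline.
    val rows = df.rdd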

Also, I'm not sure what the custom Java mappers are doing - could you use them 
as UDFs within a DataFrame?
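
For example, something along these lines (a sketch; MyJavaMapper is a stand-in 
for your mapper, and I'm assuming it maps a String to a String):

    import org.apache.spark.sql.functions.{callUDF, col}

    // Register the existing Java mapper as a UDF and apply it column-wise,
    // so you never leave the DataFrame API.
    sqlContext.udf.register("myMapper", (s: String) => MyJavaMapper.apply(s))
    val mapped = df.withColumn("out", callUDF("myMapper", col("in")))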

Xinh

On Fri, Jun 24, 2016 at 11:42 AM, Martin Serrano wrote:
Indeed.  But I'm dealing with 1.6 for now unfortunately.


On 06/24/2016 02:30 PM, Ted Yu wrote:
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case?
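
(In the 2.0 Scala API, DataFrame is literally an alias for Dataset[Row], so 
map stays in Dataset land. A quick illustration, assuming a SparkSession named 
spark and an input file people.json:)

    import spark.implicits._

    val df = spark.read.json("people.json")   // DataFrame == Dataset[Row]
    val names = df.map(_.getString(0))        // Dataset[String], not an RDD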

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano wrote:
Hi,

I'm exposing a custom source to the Spark environment.  I have a question about 
the best way to approach this problem.

I created a custom relation for my source, and it produces a DataFrame.  My 
custom source knows the data types, which are dynamic, so DataFrame seemed to 
be the appropriate return type.  This works fine.

The next step I want to take is to expose some custom mapping functions 
(written in Java).  But when I look at the APIs, the map method on DataFrame 
returns an RDD, not a DataFrame.  (Should I use SQLContext.createDataFrame on 
the result?  Does that add processing overhead?)  The Dataset type seems 
closer to what I'm looking for: its map method returns a Dataset, so chaining 
calls together is natural.
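
Concretely, the only path I can see in 1.6 is something like this (a sketch; 
transform and newSchema are my own placeholders):

    // DataFrame.map in 1.6 returns an RDD[Row], so the schema is lost...
    val transformed: RDD[Row] = df.map(row => transform(row))

    // ...and has to be reattached by hand:
    val df2 = sqlContext.createDataFrame(transformed, newSchema)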

But to create a Dataset from a DataFrame, it appears that I have to provide 
the type of each field of the Row in the DataFrame.as[...] method.  I would 
think the DataFrame could do this automatically, since it already has all the 
types.
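
That is, the API seems to want something like this, which only works when the 
schema is fixed up front:

    case class Person(name: String, age: Int)  // impossible when fields are dynamic
    import sqlContext.implicits._              // supplies the Encoder[Person]
    val people = df.as[Person]                 // Dataset[Person]; its map returns a Dataset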

This leads me to wonder how I should be approaching this effort.  As all the 
fields and types are dynamic, I cannot use beans as my type when passing data 
around.  Any advice would be appreciated.

Thanks,
Martin