Re: Dataset - reduceByKey

2016-06-07 Thread Jacek Laskowski
Hi Bryan,

What about groupBy [1] and agg [2]? What about UserDefinedAggregateFunction [3]?

[1] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset
[2] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset
[3] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction
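
As a rough sketch of [1] and [2] (Spark 2.0-era API; the Dataset, column names,
and values below are invented for illustration, and a SparkSession `spark` is
assumed to be in scope, as in spark-shell):

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

// A stand-in for a pair RDD: a Dataset of (key, value) tuples.
val pairs = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// groupBy returns a RelationalGroupedDataset; agg folds each group,
// which plays the role of reduceByKey(_ + _) for a per-key sum.
val summed = pairs.groupBy($"_1").agg(sum($"_2"))
```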

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 8:32 PM, Bryan Jeffrey  wrote:
> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
>
> Regards,
>
> Bryan Jeffrey
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All,

Thank you for the replies.  It seems as though the Dataset API is still far
behind the RDD API.  This is unfortunate as the Dataset API potentially
provides a number of performance benefits.  I will move to using it in a
more limited set of cases for the moment.

Thank you!

Bryan Jeffrey

On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher 
wrote:

> There certainly are some gaps between the richness of the RDD API and the
> Dataset API. I'm also migrating from RDD to Dataset and ran into
> reduceByKey and join scenarios.
>
> In the spark-dev list, one person was discussing reduceByKey being
> sub-optimal at the moment and it spawned this JIRA
> https://issues.apache.org/jira/browse/SPARK-15598. But you might be able
> to get by with groupBy().reduce() for now, check performance though.
>
> As for join, the approach would be using the joinWith function on Dataset.
> Although the API isn't as sugary as it was for RDD IMO, something which
> I've been discussing in a separate thread as well. I can't find a weblink
> for it but the thread subject is "Dataset Outer Join vs RDD Outer Join".
>
> On Tue, Jun 7, 2016 at 2:40 PM, Bryan Jeffrey 
> wrote:
>
>> It would also be nice if there was a better example of joining two
>> Datasets. I am looking at the documentation here:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
>> a little bit sparse - is there a better documentation source?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
>> wrote:
>>
>>> Hello.
>>>
>>> I am looking at the option of moving RDD based operations to Dataset
>>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>>> What would the equivalent be in the Dataset interface - I do not see a
>>> simple reduceByKey replacement.
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>>
>>
>
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
>


Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.

In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment, and it spawned this JIRA:
https://issues.apache.org/jira/browse/SPARK-15598. But you might be able to
get by with groupBy().reduce() for now; check performance, though.
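
A hedged sketch of that workaround with the typed API (method names as of
roughly Spark 2.0; the data is illustrative and `spark.implicits._` is assumed
imported):

```scala
// A Dataset of (key, value) pairs, standing in for a pair RDD.
val pairs = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// groupByKey + reduceGroups is the closest typed analogue of reduceByKey.
// Note the nested tuple in the result, and that it may not get the
// map-side combine reduceByKey does (hence SPARK-15598 above).
val reduced = pairs
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map { case (key, (_, value)) => (key, value) }
```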

As for join, the approach would be using the joinWith function on Dataset,
although the API isn't as sugary as it was for RDD, IMO (something I've been
discussing in a separate thread as well). I can't find a web link for it,
but the thread subject is "Dataset Outer Join vs RDD Outer Join".
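
For reference, a minimal joinWith sketch (the case classes and column names
are invented for illustration; `spark.implicits._` is assumed imported):

```scala
case class User(id: Long, name: String)
case class Order(userId: Long, total: Double)

val users  = Seq(User(1, "ann"), User(2, "bob")).toDS()
val orders = Seq(Order(1, 9.99)).toDS()

// Unlike a DataFrame join, joinWith keeps both sides as typed objects:
// the result is a Dataset[(User, Order)] of matched pairs.
val joined =
  users.joinWith(orders, users.col("id") === orders.col("userId"), "inner")
```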

On Tue, Jun 7, 2016 at 2:40 PM, Bryan Jeffrey 
wrote:

> It would also be nice if there was a better example of joining two
> Datasets. I am looking at the documentation here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
> a little bit sparse - is there a better documentation source?
>
> Regards,
>
> Bryan Jeffrey
>
> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn


Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
It seems you can see the docs for 2.0 for now:
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.0-SNAPSHOT-2016_06_07_07_01-1e2c931-docs/

// maropu

On Tue, Jun 7, 2016 at 11:40 AM, Bryan Jeffrey 
wrote:

> It would also be nice if there was a better example of joining two
> Datasets. I am looking at the documentation here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
> a little bit sparse - is there a better documentation source?
>
> Regards,
>
> Bryan Jeffrey
>
> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two
Datasets. I am looking at the documentation here:
http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a
little bit sparse - is there a better documentation source?

Regards,

Bryan Jeffrey

On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
wrote:

> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
>
> Regards,
>
> Bryan Jeffrey
>
>


Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello.

I am looking at the option of moving RDD based operations to Dataset based
operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
would the equivalent be in the Dataset interface - I do not see a simple
reduceByKey replacement.

Regards,

Bryan Jeffrey


Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset; mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out-of-memory errors:

http://pastebin.com/70M5d0Bn

Any ideas how I can fix that?

Thanks.




Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane-

http://spark.apache.org/docs/latest/tuning.html has excellent information that
may be helpful. In particular, increasing the number of tasks may help, as
well as confirming that you don't have more data than you expect landing on a
single key.

Also, if you are using Spark < 1.2.0, setting spark.shuffle.manager=sort was a
huge help for many of our shuffle-heavy workloads (this is the default in
1.2.0 now).
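
As an illustrative sketch of both suggestions (the input path, partition
count, and app name are placeholders, and the shuffle setting only matters on
pre-1.2.0 builds):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduce-tuning")
  // Sort-based shuffle; already the default from Spark 1.2.0 onward.
  .set("spark.shuffle.manager", "sort")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///path/to/input")
  .map(line => (line, 1))

// An explicit, higher partition count spreads reduce-side state across
// more, smaller tasks, which can avoid per-task heap exhaustion.
val counts = pairs.reduceByKey(_ + _, 400)
```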

Cheers,

Sean


On Jan 22, 2015, at 3:15 PM, Kane Kim kane.ist...@gmail.com wrote:

I'm trying to process a large dataset; mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out-of-memory errors:

http://pastebin.com/70M5d0Bn

Any ideas how I can fix that?

Thanks.
