Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-15 Thread Carlos J. Gil Bellosta
Thanks for your reply.

I think the problem was that SparkR tried to serialize the whole
environment, and the large dataframe was part of it. So every worker
received its slice / partition (which is very small) plus the whole
thing!

So I deleted the large dataframe and the list before parallelizing, and
the cluster ran without memory issues.
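
For reference, a minimal sketch of what that fix can look like, assuming the pre-Spark-1.4 SparkR RDD API (parallelize, lapply, collect); sc, df, id, foo and the slice count are placeholders:

pieces <- split(df, df$id)       # small per-id data frames
rm(df); gc()                     # drop the large data frame so it is not shipped to every worker
rdd <- parallelize(sc, pieces, 2000)
rm(pieces); gc()                 # the local list is no longer needed once it lives in the RDD
res <- collect(lapply(rdd, foo))
out <- do.call(rbind, res)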

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com

2014-08-15 3:53 GMT+02:00 Shivaram Venkataraman :
> Could you try increasing the number of slices with the large data set?
> SparkR assumes that each slice (or partition, in Spark terminology) can fit
> in the memory of a single machine. Also, is the error happening when you do
> the map function, or does it happen when you combine the results?
>
> Thanks
> Shivaram
>
>
> On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta
>  wrote:
>>
>> Hello,
>>
>> I am having problems trying to apply the split-apply-combine strategy
>> for dataframes using SparkR.
>>
>> I have a largish dataframe and I would like to achieve something similar
>> to what
>>
>> ddply(df, .(id), foo)
>>
>> would do, only using SparkR as the computing engine. My df has a few
>> million records and I would like to split it by "id" and operate on
>> the pieces. These pieces are quite small in size: just a few hundred
>> records.
>>
>> I do something along the following lines:
>>
>> 1) Use split to transform df into a list of dfs.
>> 2) parallelize the resulting list as an RDD (using a few thousand slices)
>> 3) map my function on the pieces using Spark.
>> 4) recombine the results (do.call, rbind, etc.)
>>
>> My cluster works and I can perform medium-sized batch jobs.
>>
>> However, it fails with my full df: I get a heap-space out-of-memory
>> error. This is odd, as the slices are very small.
>>
>> Should I send smaller batches to my cluster? Is there any recommended
>> general approach to this kind of split-apply-combine problem?
>>
>> Best,
>>
>> Carlos J. Gil Bellosta
>> http://www.datanalytics.com
>>
>>
>




Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set?
SparkR assumes that each slice (or partition, in Spark terminology) can fit
in the memory of a single machine. Also, is the error happening when you do
the map function, or does it happen when you combine the results?
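
A minimal illustration of that first suggestion, assuming the pre-Spark-1.4 SparkR parallelize API; sc and df_list are placeholder names and 4000 is an arbitrary slice count:

rdd <- parallelize(sc, df_list, 4000)   # third argument: number of slices, so each partition stays small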

Thanks
Shivaram


On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta <
gilbello...@gmail.com> wrote:

> Hello,
>
> I am having problems trying to apply the split-apply-combine strategy
> for dataframes using SparkR.
>
> I have a largish dataframe and I would like to achieve something similar
> to what
>
> ddply(df, .(id), foo)
>
> would do, only using SparkR as the computing engine. My df has a few
> million records and I would like to split it by "id" and operate on
> the pieces. These pieces are quite small in size: just a few hundred
> records.
>
> I do something along the following lines:
>
> 1) Use split to transform df into a list of dfs.
> 2) parallelize the resulting list as an RDD (using a few thousand slices)
> 3) map my function on the pieces using Spark.
> 4) recombine the results (do.call, rbind, etc.)
>
> My cluster works and I can perform medium-sized batch jobs.
>
> However, it fails with my full df: I get a heap-space out-of-memory
> error. This is odd, as the slices are very small.
>
> Should I send smaller batches to my cluster? Is there any recommended
> general approach to this kind of split-apply-combine problem?
>
> Best,
>
> Carlos J. Gil Bellosta
> http://www.datanalytics.com
>
>
>


SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Carlos J. Gil Bellosta
Hello,

I am having problems trying to apply the split-apply-combine strategy
for dataframes using SparkR.

I have a largish dataframe and I would like to achieve something similar to what

ddply(df, .(id), foo)

would do, only using SparkR as the computing engine. My df has a few
million records and I would like to split it by "id" and operate on
the pieces. These pieces are quite small in size: just a few hundred
records.
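
For context, a roughly equivalent single-machine version in base R, assuming foo returns a data frame for each piece:

out <- do.call(rbind, lapply(split(df, df$id), foo))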

I do something along the following lines:

1) Use split to transform df into a list of dfs.
2) parallelize the resulting list as an RDD (using a few thousand slices)
3) map my function on the pieces using Spark.
4) recombine the results (do.call, rbind, etc.)
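
Putting the four steps together, a minimal sketch assuming the 2014 SparkR RDD API (sparkR.init, parallelize, lapply, collect); all names and the slice count are placeholders:

library(SparkR)
sc <- sparkR.init()                   # master / cluster configuration omitted
pieces <- split(df, df$id)            # 1) list of small per-id data frames
rdd <- parallelize(sc, pieces, 2000)  # 2) an RDD with a few thousand slices
res <- collect(lapply(rdd, foo))      # 3) apply foo to each piece on the cluster and collect the results
out <- do.call(rbind, res)            # 4) recombine into a single data frame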

My cluster works and I can perform medium-sized batch jobs.

However, it fails with my full df: I get a heap-space out-of-memory
error. This is odd, as the slices are very small.

Should I send smaller batches to my cluster? Is there any recommended
general approach to this kind of split-apply-combine problem?

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
