Re: SparkR: split, apply, combine strategy for dataframes?
Thanks for your reply. I think the problem was that SparkR tried to serialize the whole environment, and the large dataframe was part of it. So every worker received its slice / partition (which is very small) plus the whole thing! I therefore deleted the large dataframe and the list before parallelizing, and the cluster then ran without memory issues.

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
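P.S. For the archives, this is roughly what the working version looks like (a sketch; foo and the slice count stand in for my actual code):

    pieces <- split(df, df$id)

    # Drop the large dataframe from the driver environment before anything
    # is serialized, so the workers receive only their own slice.
    rm(df)

    rdd <- parallelize(sc, pieces, numSlices = 2000)
    rm(pieces); gc()  # the list can go too once it has been parallelized

    # The mapped function must only touch its argument, not the deleted objects.
    results <- lapply(rdd, function(piece) foo(piece))
    out <- do.call(rbind, collect(results))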
Re: SparkR: split, apply, combine strategy for dataframes?
Could you try increasing the number of slices with the large data set? SparkR assumes that each slice (or partition, in Spark terminology) can fit in the memory of a single machine. Also, is the error happening when you do the map function, or does it happen when you combine the results?

Thanks
Shivaram
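For instance (a sketch; parallelize(sc, coll, numSlices) is the signature in the current SparkR package, and the numbers are only illustrative):

    # For a fixed amount of data, more slices mean smaller partitions,
    # and each partition must fit in a single machine's memory.
    rdd <- parallelize(sc, pieces, numSlices = 8000)  # instead of, say, 2000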
SparkR: split, apply, combine strategy for dataframes?
Hello,

I am having problems trying to apply the split-apply-combine strategy to dataframes using SparkR.

I have a largish dataframe and I would like to achieve something similar to what

    ddply(df, .(id), foo)

would do, only using SparkR as the computing engine. My df has a few million records and I would like to split it by "id" and operate on the pieces. These pieces are quite small: just a few hundred records each.

I do something along the following lines (sketched in code at the end of this message):

1) Use split to transform df into a list of dataframes.
2) parallelize the resulting list as an RDD (using a few thousand slices).
3) map my function over the pieces using Spark.
4) Recombine the results (do.call, rbind, etc.).

My cluster works and I can run medium-sized batch jobs. However, it fails with my full df: I get a heap-space out-of-memory error. It is odd, as the slices are very small.

Should I send smaller batches to my cluster? Is there a recommended general approach to this kind of split-apply-combine problem?

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com
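In code, roughly (a sketch assuming the SparkR-pkg API: sparkR.init, parallelize, lapply, collect; foo, the master URL, and the slice count are placeholders):

    library(SparkR)
    sc <- sparkR.init(master = "local")  # placeholder; in practice, the cluster's master URL

    # 1) Split the dataframe into a list of per-id dataframes.
    pieces <- split(df, df$id)

    # 2) Parallelize the list as an RDD with a few thousand slices.
    rdd <- parallelize(sc, pieces, numSlices = 2000)

    # 3) Map the per-group function over the pieces on the cluster.
    results <- lapply(rdd, function(piece) foo(piece))

    # 4) Collect and recombine the results on the driver.
    out <- do.call(rbind, collect(results))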