It works on a smaller dataset of 100 rows. I could probably find the size at
which it fails using binary search, but that would not help me, because I need
to work with 2B rows.

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, April 27, 2015 6:58 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Scalability of group by


Hi

Can you test on a smaller dataset to identify whether it is a cluster issue or
a scaling issue in Spark?
On 28 Apr 2015 11:30, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
Hi,

I am running a group by on a dataset of 2B rows of RDD[Row[id, time, value]]
in Spark 1.3 as follows:
“select id, time, first(value) from data group by id, time”
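
For reference, a minimal sketch of how such a query might be issued in Spark
1.3 (assuming an existing SparkContext sc and a DataFrame df with columns id,
time, value; the names df and "data" are illustrative):

  import org.apache.spark.sql.SQLContext

  // Register the 2B-row DataFrame as a temp table and run the query.
  val sqlContext = new SQLContext(sc)
  df.registerTempTable("data")
  val grouped = sqlContext.sql(
    "select id, time, first(value) from data group by id, time")
  grouped.count()  // an action, to force the shuffle to actually run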

My cluster has 8 nodes with 16GB RAM and one worker per node. Each executor is
allocated 5GB of memory. However, all executors are lost during the query
execution and I get “ExecutorLostFailure”.
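
Executor loss during a large shuffle is often the executor being killed for
running out of memory; a sketch of the Spark 1.x settings worth inspecting
(values illustrative, matching the 5GB allocation above):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.memory", "5g")          // the current per-executor allocation
    .set("spark.shuffle.memoryFraction", "0.4")  // default 0.2 in Spark 1.x; more room for shuffle buffers
    .set("spark.shuffle.spill", "true")          // default true; lets aggregation spill to disk instead of OOM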

Could you suggest what might be the reason for this? Could it be that “group
by” is implemented as RDD.groupBy, so that it holds the grouped values in
memory? What is the workaround?
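
If the aggregation is indeed buffering whole groups the way RDD.groupBy does,
one common workaround is reduceByKey, which combines values map-side and keeps
only one value per key. A sketch (rows is an illustrative
RDD[(Long, Long, Double)] of (id, time, value) tuples; note this yields an
arbitrary element per group, since row order is not guaranteed):

  // Keep a single value per (id, time) key without materializing whole groups.
  val firstPerKey = rows
    .map { case (id, time, value) => ((id, time), value) }
    .reduceByKey((a, _) => a)  // map-side combine; memory per key stays constant
  firstPerKey.count()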

Best regards, Alexander
