Try increasing the partition count; that will give each partition less data.

Yong
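Yong's suggestion maps to one Spark SQL setting: spark.sql.shuffle.partitions, whose default of 200 matches the 200 tasks mentioned below. A minimal sketch; the value 400 is only illustrative and should be tuned to the data volume:

```sql
-- Raise the shuffle partition count so each GROUP BY partition holds
-- less data. 400 is an illustrative value, not a recommendation.
SET spark.sql.shuffle.partitions=400;
```

Note that this only helps if the data is spread across many keys; if a single group-by key dominates, its records still land in one partition.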
Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 00:32:47 +0300
CC: user@spark.apache.org
To: java8...@hotmail.com

Yes, it's possible. I use S3 as the data source, and my external tables are partitioned. The task shown is 193/200: the job has 2 stages, and this is task 193 of 200 in stage 2 because of spark.sql.shuffle.partitions. How can I avoid this situation? This is my query:

select userid,
       concat_ws(' ', collect_list(concat_ws(' ',
           if(productname is not NULL, lower(productname), ''),
           lower(regexp_replace(regexp_replace(substr(productcategory, 2, length(productcategory) - 2), '\"', ''), ",", ' '))))) inputlist
from landing
where dt = '2015-9'
  and userid != ''
  and userid is not NULL
  and pagetype = 'productDetail'
group by userid

On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com> wrote:

Based on your description, your job shouldn't have any shuffle then, as you just apply a regex and concatenation to the column; yet one partition has 4.3M records to read, versus fewer than 1M for the other partitions. Is that possible? It depends on the source of your data. If there is a shuffle in your query (more than 2 stages generated by the query, which is my guess at what is happening), then it simply means that one partition has far more data than the rest of the partitions.

Yong

From: yu...@useinsider.com
Subject: Java Heap Space Error
Date: Wed, 23 Sep 2015 23:07:17 +0300
To: user@spark.apache.org

What can cause the issue in the attached picture? I'm running an SQL query that applies a regex to strings and concatenates them. Because of this task, my job gives a java heap space error.

<Screen Shot 2015-09-23 at 23.03.18.png>
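The skew Yong describes (one partition reading 4.3M records while the rest read under 1M) is what hash partitioning does to a hot group-by key: every record for that userid lands in the same shuffle partition, and collect_list then builds the whole group in one executor's heap. Below is a minimal, Spark-free Python sketch of the effect and of key salting as one common mitigation; the record counts, the toy byte-sum hash, and the salt-bucket count are all illustrative assumptions, not anything from the thread:

```python
from collections import Counter

NUM_PARTITIONS = 4  # stands in for spark.sql.shuffle.partitions

def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Toy stand-in for a hash partitioner: same key -> same partition.
    return sum(key.encode()) % num_partitions

# One hypothetical hot user contributes most of the records.
records = [("hot_user", i) for i in range(1000)] + \
          [(f"user_{i}", i) for i in range(100)]

# Plain group-by key: all 1000 "hot_user" records hash to ONE partition.
plain = Counter(partition_of(uid) for uid, _ in records)

# Salted key: append a small deterministic suffix so the hot key's records
# spread over several partitions; a second aggregation pass would then
# merge the partial results per original key.
SALT_BUCKETS = 4
salted = Counter(partition_of(f"{uid}#{i % SALT_BUCKETS}")
                 for uid, i in records)

print("plain  :", dict(plain))   # one partition holds >= 1000 records
print("salted :", dict(salted))  # hot records split across partitions
```

The trade-off is that salting turns one aggregation into two (partial aggregate per salted key, then merge per real key), so it only pays off when one key genuinely dominates.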