This is interesting. So you mean that query as "select userid from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid" works in your cluster? In this case, do you also see this one task with way more data than the rest, as it happened when you use regex and concatenation? It is hard to believe that just add "regex" and "concatenation" will make the distribution more equally across partitions. In your query, the distribution in the partitions simply depends on the Hash partitioner of "userid". Can you show us the query after you add "regex" and "concatenation"? Yong
Subject: Re: Java Heap Space Error From: yu...@useinsider.com Date: Thu, 24 Sep 2015 15:34:48 +0300 CC: user@spark.apache.org To: jingyu.zh...@news.com.au; java8...@hotmail.com @JingyuYes, it works without regex and concatenation as the query below: So, what we can understand from this? Because when i do like that, shuffle read sizes are equally distributed between partitions. val usersInputDF = sqlContext.sql(s""" | select userid from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid """.stripMargin) @java8964 I tried with sql.shuffle.partitions = 10000 but no luck. It’s again one of the partitions shuffle size is huge and the others are very small. ——So how can i balance this shuffle read size between partitions? On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:Is you sql works if do not runs a regex on strings and concatenates them, I mean just Select the stuff without String operations? On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com> wrote: Try to increase partitions count, that will make each partition has less data. Yong Subject: Re: Java Heap Space Error From: yu...@useinsider.com Date: Thu, 24 Sep 2015 00:32:47 +0300 CC: user@spark.apache.org To: java8...@hotmail.com Yes, it’s possible. I use S3 as data source. My external tables has partitioned. Belowed task is 193/200. Job has 2 stages and its 193. task of 200 in 2.stage because of sql.shuffle.partitions. How can i avoid this situation, this is my query: select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname is not NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",' '))))) inputlist from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com> wrote: Based on your description, you job shouldn't have any shuffle then, as you just apply regex and concatenation on the column, but there is one partition having 4.3M records to be read, vs less than 1M records for other partitions. Is that possible? It depends on what is the source of your data. If there is shuffle in your query (More than 2 stages generated by your query, and this is my guess of what happening), then it simple means that one partition having way more data than the rest of partitions. Yong From: yu...@useinsider.com Subject: Java Heap Space Error Date: Wed, 23 Sep 2015 23:07:17 +0300 To: user@spark.apache.org What can cause this issue in the attached picture? I’m running and sql query which runs a regex on strings and concatenates them. Because of this task, my job gives java heap space error. <Screen Shot 2015-09-23 at 23.03.18.png> This message and its attachments may contain legally privileged or confidential information. It is intended solely for the named addressee. If you are not the addressee indicated in this message or responsible for delivery of the message to the addressee, you may not copy or deliver this message or its attachments to anyone. Rather, you should permanently delete this message and its attachments and kindly notify the sender by reply e-mail. Any content of this message and its attachments which does not relate to the official business of the sending company must be taken not to have been sent or endorsed by that company or any of its related entities. No warranty is made that the e-mail or attachments are free from computer virus or other defect.