This is interesting.
So you mean that a query such as 
"select userid from landing where dt='2015-9' and userid != '' and userid is 
not null and userid is not NULL and pagetype = 'productDetail' group by userid"
works in your cluster?
In this case, do you still see one task with far more data than the rest, 
as happened when you used regex and concatenation?
It is hard to believe that just adding "regex" and "concatenation" would change 
how evenly the data is distributed across partitions. In your query, the 
distribution across partitions depends simply on the hash partitioner applied 
to "userid".
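
To illustrate the point about the hash partitioner, here is a minimal sketch in
plain Scala (no Spark), which can be pasted into a Scala REPL or spark-shell.
The data, the "hotUser" key and the helper names are all hypothetical. It
assumes the usual scheme of hashCode modulo the partition count. One possible
reading, not confirmed in this thread, is that a plain "group by userid" can
collapse rows to one record per userid before the shuffle, while an aggregate
that collects every value has to move all rows of a userid into one partition.

```scala
// Sketch only: plain Scala, no Spark. The data and names are hypothetical.

// Map a key to a partition the way a hash partitioner typically does:
// hashCode modulo the partition count, shifted to be non-negative.
def partitionOf(key: String, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Count how many records land in each partition.
def recordsPerPartition(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
  keys.groupBy(k => partitionOf(k, numPartitions)).map { case (p, ks) => (p, ks.size) }

// Made-up input: 1000 users with 10 rows each, plus one "hot" userid with 50000 rows.
val rows: Seq[String] =
  (1 to 1000).flatMap(i => Seq.fill(10)(s"user$i")) ++ Seq.fill(50000)("hotUser")

// Case 1: every row is shuffled by userid, so all "hotUser" rows share one partition.
val allRows = recordsPerPartition(rows, 200)

// Case 2: rows are collapsed to one record per userid before the shuffle.
val preAggregated = recordsPerPartition(rows.distinct, 200)

println(s"largest partition, all rows shuffled: ${allRows.values.max}")
println(s"largest partition, one record per userid: ${preAggregated.values.max}")
```

In the first case the largest partition holds at least the 50000 rows of the
hot key; in the second no partition can hold more than the 1001 distinct keys.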
Can you show us the query after you add "regex" and "concatenation"?
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 15:34:48 +0300
CC: user@spark.apache.org
To: jingyu.zh...@news.com.au; java8...@hotmail.com

@Jingyu
Yes, it works without regex and concatenation, as in the query below.
So, what can we understand from this? When I run it like that, shuffle read 
sizes are equally distributed between partitions.
val usersInputDF = sqlContext.sql(s"""
       |  select userid from landing
       |  where dt='2015-9' and userid != '' and userid is not null
       |  and userid is not NULL and pagetype = 'productDetail'
       |  group by userid
       """.stripMargin)
@java8964
I tried spark.sql.shuffle.partitions = 10000 but had no luck. Again, the 
shuffle size of one partition is huge and the others are very small.
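
A minimal sketch of why raising the partition count alone may not help, in
plain Scala with no Spark involved. It assumes keys are placed by hashCode
modulo the partition count and that the skew comes from a single heavy userid.
Both are assumptions; the thread does not confirm them. The data and the helper
names are hypothetical.

```scala
// Sketch only: plain Scala, no Spark. The data and names are hypothetical.

// hashCode modulo the partition count, shifted to be non-negative.
def partitionOf(key: String, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Size of the largest partition when every row is shuffled by its key.
def largestPartition(keys: Seq[String], numPartitions: Int): Int =
  keys.groupBy(k => partitionOf(k, numPartitions)).values.map(_.size).max

// Made-up input: 1000 users with 10 rows each, plus one heavy userid.
val hotRows = 50000
val rows: Seq[String] =
  (1 to 1000).flatMap(i => Seq.fill(10)(s"user$i")) ++ Seq.fill(hotRows)("hotUser")

val with200 = largestPartition(rows, 200)
val with10000 = largestPartition(rows, 10000)

println(s"largest partition with 200 partitions:   $with200")
println(s"largest partition with 10000 partitions: $with10000")
```

All rows of one key hash to the same partition, so the largest partition never
drops below the row count of the heaviest key, whatever the partition count.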

So how can I balance the shuffle read size between partitions?

On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
Does your SQL work if you do not run a regex on the strings and concatenate 
them? I mean, if you just select the columns without any string operations?

On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com> wrote:

Try increasing the partition count; that will give each partition less data.
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 00:32:47 +0300
CC: user@spark.apache.org
To: java8...@hotmail.com

Yes, it’s possible. I use S3 as the data source. My external tables are 
partitioned. The task below is 193/200. The job has 2 stages, and this is task 
193 of 200 in the second stage; there are 200 tasks because of 
spark.sql.shuffle.partitions. 
How can I avoid this situation? This is my query:
select userid,
       concat_ws(' ',collect_list(concat_ws(' ',
         if(productname is not NULL,lower(productname),''),
         lower(regexp_replace(
           regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),
           \",\",' '))))) inputlist
from landing
where dt='2015-9' and userid != '' and userid is not null and userid is not NULL
  and pagetype = 'productDetail'
group by userid

On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com> wrote:
Based on your description, your job shouldn't have any shuffle, as you just 
apply regex and concatenation on the column, but there is one partition with 
4.3M records to be read, versus fewer than 1M records for the other partitions.
Is that possible? It depends on the source of your data.
If there is a shuffle in your query (more than 2 stages generated by your 
query, and this is my guess of what is happening), then it simply means that 
one partition has far more data than the rest of the partitions.
Yong
From: yu...@useinsider.com
Subject: Java Heap Space Error
Date: Wed, 23 Sep 2015 23:07:17 +0300
To: user@spark.apache.org

What can cause the issue in the attached picture? I’m running an SQL query 
which runs a regex on strings and concatenates them. Because of this task, my 
job fails with a Java heap space error.
<Screen Shot 2015-09-23 at 23.03.18.png>
                                          




