@Jingyu
Yes, it works without regex and concatenation as the query below:

So, what we can understand from this? Because when i do like that, shuffle read 
sizes are equally distributed between partitions.

val usersInputDF = sqlContext.sql(
s"""
         |  select userid from landing where dt='2015-9' and userid != '' and 
userid is not null and userid is not NULL and pagetype = 'productDetail' group 
by userid

       """.stripMargin)

@java8964

I tried with sql.shuffle.partitions = 10000 but no luck. It’s again one of the 
partitions shuffle size is huge and the others are very small.


——
So how can i balance this shuffle read size between partitions?


> On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
> 
> Is you sql works if do not runs a regex on strings and concatenates them, I 
> mean just Select the stuff without String operations?
> 
> On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> Try to increase partitions count, that will make each partition has less data.
> 
> Yong
> 
> Subject: Re: Java Heap Space Error
> From: yu...@useinsider.com <mailto:yu...@useinsider.com>
> Date: Thu, 24 Sep 2015 00:32:47 +0300
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> To: java8...@hotmail.com <mailto:java8...@hotmail.com>
> 
> 
> Yes, it’s possible. I use S3 as data source. My external tables has 
> partitioned. Belowed task is 193/200. Job has 2 stages and its 193. task of 
> 200 in 2.stage because of sql.shuffle.partitions. 
> 
> How can i avoid this situation, this is my query:
> 
> select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname is not 
> NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
>  '))))) inputlist from landing where dt='2015-9' and userid != '' and userid 
> is not null and userid is not NULL and pagetype = 'productDetail' group by 
> userid
> 
> On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> 
> Based on your description, you job shouldn't have any shuffle then, as you 
> just apply regex and concatenation on the column, but there is one partition 
> having 4.3M records to be read, vs less than 1M records for other partitions.
> 
> Is that possible? It depends on what is the source of your data.
> 
> If there is shuffle in your query (More than 2 stages generated by your 
> query, and this is my guess of what happening), then it simple means that one 
> partition having way more data than the rest of partitions.
> 
> Yong
> 
> From: yu...@useinsider.com <mailto:yu...@useinsider.com>
> Subject: Java Heap Space Error
> Date: Wed, 23 Sep 2015 23:07:17 +0300
> To: user@spark.apache.org <mailto:user@spark.apache.org>
> 
> What can cause this issue in the attached picture? I’m running and sql query 
> which runs a regex on strings and concatenates them. Because of this task, my 
> job gives java heap space error.
> 
> <Screen Shot 2015-09-23 at 23.03.18.png>
> 
> 
> 
> This message and its attachments may contain legally privileged or 
> confidential information. It is intended solely for the named addressee. If 
> you are not the addressee indicated in this message or responsible for 
> delivery of the message to the addressee, you may not copy or deliver this 
> message or its attachments to anyone. Rather, you should permanently delete 
> this message and its attachments and kindly notify the sender by reply 
> e-mail. Any content of this message and its attachments which does not relate 
> to the official business of the sending company must be taken not to have 
> been sent or endorsed by that company or any of its related entities. No 
> warranty is made that the e-mail or attachments are free from computer virus 
> or other defect.

Reply via email to