Hello,

It worked like a charm. Thank you very much.

Some userids were null, which is why so many records went to the userid 'null'. When I added 
a where clause, userid != 'null', it solved the problem.

> On 24 Sep 2015, at 22:43, java8964 <java8...@hotmail.com> wrote:
> 
> I can understand why your first query finishes without OOM while the new
> one fails with OOM.
> 
> In the new query, you are asking for a groupByKey/cogroup operation, which
> forces all the productName + productCategory values per user id to be sent
> to the same reducer. This can easily blow out the reducer's memory if one
> user id has a lot of productName and productCategory values.
> 
> Keep in mind that Spark on the reducer side still uses a hash to merge all
> the data coming from the different mappers, so the memory on the reduce side
> has to be able to hold all the productName + productCategory values for (at
> least) the most frequent user id. I also don't know why you want every
> productName and productCategory per user id (maybe a distinct would be
> enough?).
> 
> Imagine one user id shows up 1M times in your dataset, with 0.5M product
> names of 'A' and 0.5M of 'B'. Your query will push all 1M 'A' and 'B' values
> into the same reducer and ask Spark to merge them in a HashMap for that user
> id. This will cause OOM.
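A quick plain-Python sketch of the effect described above (made-up user ids and product names, not Spark itself): collecting every value per key grows with the number of rows for that key, while a distinct set stays bounded by the number of unique values.

```python
# Simulate collect_list vs. a distinct set for one hot key.
# "user1" appears 1000 times with only 2 distinct product names.
from collections import defaultdict

rows = [("user1", "A")] * 500 + [("user1", "B")] * 500 + [("user2", "C")]

collected = defaultdict(list)   # analogous to collect_list: keeps every row
distinct = defaultdict(set)     # analogous to collect_set: keeps uniques only

for userid, product in rows:
    collected[userid].append(product)   # grows with every occurrence
    distinct[userid].add(product)       # bounded by unique products

print(len(collected["user1"]))  # 1000 entries held in memory for one key
print(len(distinct["user1"]))   # 2
```

At 1M rows for one user id, the collected list holds 1M strings on a single reducer; the distinct set would hold only as many strings as there are unique products.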
> 
> Above all, you need to find out the max count per user id in your data.
> Aggregates can't be nested directly, so use a subquery:
> select max(cnt) from (select count(*) cnt from landing where ..... group by userid) t
> 
> Your memory has to hold that many productName and productCategory values,
> no matter how high you set the partition number (even if it equals the
> unique count of user ids), if consolidating all the productName and
> productCategory values together, without even removing duplicates, is
> really what you want.
> 
> Both queries should still push a similar record count per partition, but
> with very different data volumes.
> 
> Yong
> 
> Subject: Re: Java Heap Space Error
> From: yu...@useinsider.com
> Date: Thu, 24 Sep 2015 18:56:51 +0300
> CC: jingyu.zh...@news.com.au; user@spark.apache.org
> To: java8...@hotmail.com
> 
> Yes, right, the query you wrote worked in the same cluster. In that case the
> partitions were equally distributed, but when I use the regex and
> concatenations they are not, as I said before. The query with concatenation
> is below:
> 
> val usersInputDF = sqlContext.sql(
>   s"""
>      |  select userid,
>      |         concat_ws(' ', collect_list(concat_ws(' ',
>      |           if(productname is not NULL, lower(productname), ''),
>      |           lower(regexp_replace(regexp_replace(
>      |             substr(productcategory, 2, length(productcategory) - 2),
>      |             '\"', ''), \",\", ' '))))) inputlist
>      |  from landing
>      |  where dt='${dateUtil.getYear}-${dateUtil.getMonth}' and day >= '${day}'
>      |    and userid != '' and userid is not null
>      |    and pagetype = 'productDetail'
>      |  group by userid
>    """.stripMargin)
> 
> 
> On 24 Sep 2015, at 16:52, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> 
> This is interesting.
> 
> So you mean that this query:
> 
> "select userid from landing where dt='2015-9' and userid != '' and userid is
> not null and pagetype = 'productDetail' group by userid"
> 
> works in your cluster?
> 
> In this case, do you also see one task with way more data than the rest, as
> happened when you used the regex and concatenation?
> 
> It is hard to believe that just adding "regex" and "concatenation" would
> change how evenly the data is distributed across partitions. In your query,
> the distribution across partitions simply depends on the hash partitioner
> applied to "userid".
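To make the point concrete, here is a plain-Python sketch of hash partitioning (the user id is made up; Spark's actual partitioner uses the key's hashCode, for which Python's built-in hash stands in here). The partition is a function of the grouping key alone, so transforming the *values* with regex or concat cannot move a key to a different partition.

```python
# Hash partitioning: partition = hash(key) % numPartitions.
num_partitions = 4

def partition_for(userid: str) -> int:
    # Python's % always returns a non-negative result for a positive
    # divisor, so this lands in 0..num_partitions-1 even for negative hashes.
    return hash(userid) % num_partitions

key = "some-user-id"
p1 = partition_for(key)   # before adding regex/concat to the query
p2 = partition_for(key)   # after: same key, so necessarily same partition
print(p1 == p2)  # True
```

Whatever skew appears, it comes from how many rows share a key (or from extra stages the query plan introduces), not from the value-level string operations.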
> 
> Can you show us the query after you add "regex" and "concatenation"?
> 
> Yong
> 
> Subject: Re: Java Heap Space Error
> From: yu...@useinsider.com <mailto:yu...@useinsider.com>
> Date: Thu, 24 Sep 2015 15:34:48 +0300
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> To: jingyu.zh...@news.com.au <mailto:jingyu.zh...@news.com.au>; 
> java8...@hotmail.com <mailto:java8...@hotmail.com>
> 
> @Jingyu
> Yes, it works without the regex and concatenation, as in the query below.
> 
> So what can we understand from this? Because when I run it like that, the
> shuffle read sizes are equally distributed between partitions.
> 
> val usersInputDF = sqlContext.sql(
>   s"""
>      |  select userid from landing
>      |  where dt='2015-9' and userid != '' and userid is not null
>      |    and pagetype = 'productDetail'
>      |  group by userid
>    """.stripMargin)
> 
> @java8964
> 
> I tried with sql.shuffle.partitions = 10000 but no luck. Again, one
> partition's shuffle size is huge and the others are very small.
> 
> 
> ——
> So how can i balance this shuffle read size between partitions?
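One general way to balance a single skewed key is key salting, a standard skew remedy (it is not something prescribed elsewhere in this thread). The hot key is spread over N sub-keys, aggregated per sub-key, and the partial results are merged in a second pass. A plain-Python sketch with made-up data:

```python
# Key salting: split one hot key across SALTS sub-keys, then merge.
import random
from collections import defaultdict

SALTS = 8
rows = [("hot-user", f"p{i % 3}") for i in range(1000)] + [("cold-user", "x")]

# Stage 1: aggregate on (userid, salt); the hot key now spans up to 8 reducers.
stage1 = defaultdict(set)
for userid, product in rows:
    stage1[(userid, random.randrange(SALTS))].add(product)

# Stage 2: strip the salt and merge the partial sets per real key.
final = defaultdict(set)
for (userid, _salt), products in stage1.items():
    final[userid] |= products

print(sorted(final["hot-user"]))  # ['p0', 'p1', 'p2']
```

In Spark SQL this corresponds to grouping on (userid, salt) first and then on userid; it only helps when the per-key merged result (e.g. a distinct set) is itself small.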
> 
> 
> On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au 
> <mailto:jingyu.zh...@news.com.au>> wrote:
> 
> Does your SQL work if it does not run a regex on the strings and concatenate
> them, I mean just select the columns without string operations?
> 
> On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> Try to increase the partition count; that will make each partition hold less data.
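Note that raising the partition count only helps when the data is spread over many keys; all rows for one key still hash to a single partition, so the largest partition can never shrink below the largest key's row count. A plain-Python sketch with made-up user ids:

```python
# One hot key keeps the biggest partition large no matter the partition count.
from collections import Counter

rows = ["hot-user"] * 900 + [f"user{i}" for i in range(100)]

def biggest_partition(num_partitions: int) -> int:
    # Count how many rows hash into each partition, return the max.
    sizes = Counter(hash(k) % num_partitions for k in rows)
    return max(sizes.values())

print(biggest_partition(10) >= 900)     # True: hot key dominates
print(biggest_partition(10000) >= 900)  # True: still one partition holds it
```

This is consistent with the observation later in the thread that sql.shuffle.partitions = 10000 left one huge partition.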
> 
> Yong
> 
> Subject: Re: Java Heap Space Error
> From: yu...@useinsider.com <mailto:yu...@useinsider.com>
> Date: Thu, 24 Sep 2015 00:32:47 +0300
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> To: java8...@hotmail.com <mailto:java8...@hotmail.com>
> 
> 
> Yes, it's possible. I use S3 as the data source, and my external tables are
> partitioned. The task below is 193/200: the job has 2 stages, and it is task
> 193 of 200 in the 2nd stage because of sql.shuffle.partitions.
> 
> How can I avoid this situation? This is my query:
> 
> select userid,
>        concat_ws(' ', collect_list(concat_ws(' ',
>          if(productname is not NULL, lower(productname), ''),
>          lower(regexp_replace(regexp_replace(
>            substr(productcategory, 2, length(productcategory) - 2),
>            '\"', ''), \",\", ' '))))) inputlist
> from landing
> where dt='2015-9' and userid != '' and userid is not null
>   and pagetype = 'productDetail'
> group by userid
> 
> On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> 
> Based on your description, your job shouldn't have any shuffle, since you
> just apply a regex and concatenation to the column; yet there is one
> partition with 4.3M records to read, versus less than 1M records for the
> other partitions.
> 
> Is that possible? It depends on the source of your data.
> 
> If there is a shuffle in your query (more than 2 stages generated by the
> query, and this is my guess at what is happening), then it simply means one
> partition has far more data than the rest.
> 
> Yong
> 
> From: yu...@useinsider.com <mailto:yu...@useinsider.com>
> Subject: Java Heap Space Error
> Date: Wed, 23 Sep 2015 23:07:17 +0300
> To: user@spark.apache.org <mailto:user@spark.apache.org>
> 
> What could cause the issue in the attached picture? I'm running an SQL query
> which runs a regex on strings and concatenates them. Because of this task,
> my job gives a java heap space error.
> 
> <Screen Shot 2015-09-23 at 23.03.18.png>
> 
> 
> 
> 
