But how do I get count(distinct userId) grouped by company from count(distinct userId) grouped by company+x? count(userId) is different from count(distinct userId).
On 2018-07-21 00:49:58, Xiaomeng Wan <shawn...@gmail.com> wrote:

Try divide and conquer: create a column x for the first character of userid, and group by company+x. If the groups are still too large, try the first two characters.

On 17 July 2018 at 02:25, 崔苗 <cuim...@danale.com> wrote:

30G of user data. How do I get the distinct user count after creating a composite key based on company and userid?

On 2018-07-13 18:24:52, Jean Georges Perrin <j...@jgp.net> wrote:

Just thinking out loud… repartition by key? Create a composite key based on company and userid? How big is your dataset?

On Jul 13, 2018, at 06:20, 崔苗 <cuim...@danale.com> wrote:

Hi, when I count(distinct userId) by company, I hit data skew and the task takes too long. How do I count distinct by key on skewed data in Spark SQL? Thanks for any reply.
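[Editor's note] A minimal sketch of the divide-and-conquer approach in Spark/Scala, assuming a DataFrame with columns `company` and `userId` read from a placeholder path (column names and path are assumptions, not from the original thread). Regarding the follow-up question: userIds that start with different characters can never be equal, so the per-(company, x) distinct counts cover disjoint sets and can simply be summed to recover the exact count(distinct userId) per company.

// Sketch only: two-stage distinct count to spread a skewed company key.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, countDistinct, sum}

val spark = SparkSession.builder().appName("skewed-distinct-count").getOrCreate()

val df = spark.read.parquet("/path/to/user/data")  // placeholder input

// Stage 1: derive a salt column x from the first character of userId, so each
// heavily skewed company group is split into many smaller (company, x) groups.
val salted = df.withColumn("x", substring(col("userId"), 1, 1))

val partial = salted
  .groupBy("company", "x")
  .agg(countDistinct("userId").as("cnt"))

// Stage 2: userIds with different first characters are necessarily distinct,
// so the partial counts are over disjoint sets; summing them gives the exact
// count(distinct userId) per company.
val result = partial
  .groupBy("company")
  .agg(sum("cnt").as("distinct_users"))

result.show()

If a single (company, x) group is still too large, the salt can be widened to the first two characters, e.g. substring(col("userId"), 1, 2); the disjointness argument, and therefore the final sum, still holds.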