Re: Re: Re: spark sql data skew

2018-07-23 Thread Gourav Sengupta
https://docs.databricks.com/spark/latest/spark-sql/skew-join.html
The above might help, in case you are using a join.

On Mon, Jul 23, 2018 at 4:49 AM, 崔苗 wrote:
> but how to get count(distinct userId) group by company from count(distinct
> userId) group by company+x?
> count(userId) is
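On the quoted question of how to combine the per-(company, x) results: because the salt column x is derived from userId itself (its first character), the distinct userIds in different (company, x) groups are disjoint, so the partial distinct counts can simply be summed per company. A minimal pure-Python sketch of this two-phase aggregation (the data and names are illustrative, not from the thread):

```python
from collections import defaultdict

def distinct_users_per_company(records):
    """Two-phase distinct count. Phase 1 groups by (company, salt) where
    salt is the first character of userId; phase 2 sums the per-group
    distinct counts. The sum is exact because the salt is a function of
    userId, so the groups partition each company's users disjointly."""
    # Phase 1: distinct userIds per (company, salt) group
    groups = defaultdict(set)
    for company, user_id in records:
        salt = user_id[:1]  # first character of userId
        groups[(company, salt)].add(user_id)

    # Phase 2: partial distinct counts are additive across salts
    totals = defaultdict(int)
    for (company, _salt), users in groups.items():
        totals[company] += len(users)
    return dict(totals)

records = [
    ("acme", "alice"), ("acme", "bob"), ("acme", "alice"),
    ("beta", "bob"), ("beta", "carol"),
]
print(distinct_users_per_company(records))  # {'acme': 2, 'beta': 2}
```

Note that the same "bob" counted under both companies is intentional: the distinct count is per company, and the salt only has to be consistent within a company's group.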

Re: Re: spark sql data skew

2018-07-20 Thread Xiaomeng Wan
try divide and conquer: create a column x for the first character of userid, and
group by company+x. if the groups are still too large, try the first two characters.

On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G user data, how to get distinct users count after creating a composite
> key based on company and userid?
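The salting step above can be sketched in plain Python (Spark-independent; the company names and userids are illustrative): deriving x from the first character of userid splits one oversized company group into several smaller, disjoint groups, which is what spreads the skew across keys.

```python
from collections import defaultdict

# Simulated skewed input: "bigco" has many users, "small" has one
# (all names here are made up for illustration).
rows = [("bigco", u) for u in ("alice", "amy", "bob", "bart", "carol", "alice")]
rows += [("small", "dave")]

# Add a salt column x = first character of userid, then group by (company, x).
groups = defaultdict(set)
for company, userid in rows:
    x = userid[0]  # if groups are still too large, widen to userid[:2]
    groups[(company, x)].add(userid)

# The big "bigco" group is now split into smaller disjoint groups:
for key in sorted(groups):
    print(key, sorted(groups[key]))
# ('bigco', 'a') ['alice', 'amy']
# ('bigco', 'b') ['bart', 'bob']
# ('bigco', 'c') ['carol']
# ('small', 'd') ['dave']
```

Since each (company, x) group holds a disjoint subset of that company's userids, no group grows beyond the original, and the per-group distinct counts can later be summed per company without double counting.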