https://docs.databricks.com/spark/latest/spark-sql/skew-join.html
The above might help, in case you are using a join.

On Mon, Jul 23, 2018 at 4:49 AM, 崔苗 <cuim...@danale.com> wrote:

> But how to get count(distinct userId) group by company from count(distinct
> userId) group by company+x? count(userId) is different from count(distinct
> userId).
>
> On 2018-07-21 00:49:58, Xiaomeng Wan <shawn...@gmail.com> wrote:
>
>> Try divide and conquer: create a column x for the first character of
>> userId, and group by company+x. If that is still too large, try the first
>> two characters.
>>
>> On 17 July 2018 at 02:25, 崔苗 <cuim...@danale.com> wrote:
>>
>>> 30 GB of user data. How do we get the distinct user count after creating
>>> a composite key based on company and userId?
>>>
>>> On 2018-07-13 18:24:52, Jean Georges Perrin <j...@jgp.net> wrote:
>>>
>>>> Just thinking out loud… repartition by key? Create a composite key
>>>> based on company and userId?
>>>>
>>>> How big is your dataset?
>>>>
>>>> On Jul 13, 2018, at 06:20, 崔苗 <cuim...@danale.com> wrote:
>>>>
>>>>> Hi,
>>>>> When I want to count(distinct userId) by company, I hit data skew and
>>>>> the task takes too long. How can I count distinct by key on skewed
>>>>> data in Spark SQL?
>>>>>
>>>>> Thanks for any reply.
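Regarding the question left open in the thread (how to recover count(distinct userId) per company from counts grouped by company+x): because each userId falls into exactly one prefix bucket x, the per-bucket distinct sets within a company are disjoint, so the per-bucket distinct counts can simply be summed per company. A minimal plain-Python sketch of that two-phase aggregation, using made-up toy records rather than Spark, just to show the arithmetic:

```python
from collections import defaultdict

# Toy (company, userId) records with duplicates; stand-in for the 30 GB dataset.
records = [
    ("acme", "alice"), ("acme", "alice"), ("acme", "bob"),
    ("acme", "carol"), ("beta", "alice"), ("beta", "dave"),
    ("beta", "dave"),
]

# Phase 1: distinct userIds per (company, first character of userId).
# Each userId lands in exactly one prefix bucket, so a company's buckets
# partition its distinct userIds into disjoint sets.
bucket_distinct = defaultdict(set)
for company, user_id in records:
    bucket_distinct[(company, user_id[0])].add(user_id)

# Phase 2: sum the disjoint bucket counts to recover
# count(distinct userId) per company.
per_company = defaultdict(int)
for (company, _prefix), users in bucket_distinct.items():
    per_company[company] += len(users)

print(dict(per_company))  # → {'acme': 3, 'beta': 2}
```

In Spark SQL this corresponds to an inner `GROUP BY company, substr(userId, 1, 1)` computing `COUNT(DISTINCT userId)`, wrapped in an outer `GROUP BY company` that takes `SUM` of the inner counts; the inner aggregation splits a skewed company across many smaller groups. (A plain `COUNT(userId)` would not work here, as the thread notes, but summing *distinct* counts over disjoint buckets does.)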