https://docs.databricks.com/spark/latest/spark-sql/skew-join.html
The above might help, in case you are using a join.
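If the skew does come from a join, a common alternative in open-source Spark
is to salt the skewed key. A rough sketch, where fact, dim, key and the
fan-out N are all made-up names for illustration:

    import org.apache.spark.sql.functions._

    val N = 16  // salt fan-out; tune to the degree of skew
    // Scatter the skewed side across N sub-keys...
    val saltedFact = fact.withColumn("salt", (rand() * N).cast("int"))
    // ...and replicate the small side once per sub-key so every salt matches.
    val saltedDim = dim.withColumn("salt",
      explode(array((0 until N).map(lit): _*)))
    val joined = saltedFact.join(saltedDim, Seq("key", "salt"))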
On Mon, Jul 23, 2018 at 4:49 AM, 崔苗 wrote:
But how do we get count(distinct userId) group by company from count(distinct
userId) group by company+x? count(userId) is different from count(distinct
userId).
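Since x is derived from userId, each distinct userId lands in exactly one
(company, x) bucket, so the per-bucket distinct counts can simply be summed
per company. A minimal sketch in Spark Scala, assuming bucketed and cnt are
the (placeholder) output of the company+x aggregation:

    import org.apache.spark.sql.functions._

    // bucketed has one row per (company, x) with the per-bucket
    // distinct count in cnt (placeholder names).
    val perCompany = bucketed
      .groupBy("company")
      .agg(sum("cnt").as("distinct_users"))  // exact: x partitions userIds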
On 2018-07-21 00:49:58, Xiaomeng Wan wrote:
Try divide and conquer: create a column x for the first character of userid,
and group by company+x. If the groups are still too large, try the first two
characters.
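A minimal sketch of that idea in Spark Scala (df is a placeholder dataframe
with company and userId columns):

    import org.apache.spark.sql.functions._

    // Split each company's users into buckets by the first character of
    // userId, so no single group holds all of a skewed company's rows.
    val bucketed = df
      .withColumn("x", substring(col("userId"), 1, 1))  // 1-based: first char
      .groupBy("company", "x")
      .agg(countDistinct("userId").as("cnt"))

Widening to substring(col("userId"), 1, 2) gives the two-character variant.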
On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G of user data; how do we get the distinct user count after creating a
> composite key based on company and userid?
Just thinking out loud… repartition by key? create a composite key based on
company and userid?
How big is your dataset?
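A rough sketch of the composite-key idea in Spark Scala (df is a placeholder):

    import org.apache.spark.sql.functions._

    // Build a composite key and deduplicate on it; repartitioning on the
    // composite key spreads a hot company's rows across many partitions.
    val distinctPairs = df
      .withColumn("ck", concat_ws("|", col("company"), col("userId")))
      .repartition(col("ck"))
      .dropDuplicates("ck")
    val perCompany = distinctPairs.groupBy("company").count()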
> On Jul 13, 2018, at 06:20, 崔苗 wrote:
>
> Hi,
> when I want to count(distinct userId) by company, I run into data skew and
> the task takes too long. How can I count distinct userId in this case?
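For reference, a minimal version of the query that hits the skew (df is a
placeholder dataframe):

    import org.apache.spark.sql.functions._

    // All rows for a given company land on one task in the final
    // aggregation, so one very large company stalls the whole stage.
    val perCompany = df.groupBy("company")
      .agg(countDistinct("userId").as("users"))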