I tried a new way by using JOIN

select user_id from data a
left join (select user_id from data where url like ‘%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL

It’s faster and seem that Spark rather optimize for JOIN than sub query.


Regards,
Chanh


> On Feb 21, 2017, at 4:56 PM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Hi everyone,
> 
> I am working on a dataset like this
> user_id         url 
> 1                  lao.com/buy <http://lao.com/buy>
> 2                  bao.com/sell <http://bao.com/sell>
> 2                  cao.com/market <http://cao.com/market>
> 1                  lao.com/sell <http://lao.com/sell>
> 3                  vui.com/sell <http://vui.com/sell>
> 
> I have to find all user_id with url not contain sell. Which means I need to 
> query all user_id contains sell and put it into a set then do another query 
> to find all user_id not in that set.
> SELECT user_id 
> FROM data
> WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;
> 
> My data is about 20 million records and it’s growing. When I tried in 
> zeppelin I need to set spark.sql.crossJoin.enabled = true
> Then I ran the query and the driver got extremely high CPU percentage and the 
> process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
> 
> I am using Spark 2.0.2 and my data store in HDFS with parquet format.
> 
> Any advices for me in this situation?
> 
> Thank you in advance!.
> 
> Regards,
> Chanh

Reply via email to