I tried a new way by using JOIN select user_id from data a left join (select user_id from data where url like ‘%sell%') b on a.user_id = b.user_id where b.user_id is NULL
It’s faster and seem that Spark rather optimize for JOIN than sub query. Regards, Chanh > On Feb 21, 2017, at 4:56 PM, Chanh Le <giaosu...@gmail.com> wrote: > > Hi everyone, > > I am working on a dataset like this > user_id url > 1 lao.com/buy <http://lao.com/buy> > 2 bao.com/sell <http://bao.com/sell> > 2 cao.com/market <http://cao.com/market> > 1 lao.com/sell <http://lao.com/sell> > 3 vui.com/sell <http://vui.com/sell> > > I have to find all user_id with url not contain sell. Which means I need to > query all user_id contains sell and put it into a set then do another query > to find all user_id not in that set. > SELECT user_id > FROM data > WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’; > > My data is about 20 million records and it’s growing. When I tried in > zeppelin I need to set spark.sql.crossJoin.enabled = true > Then I ran the query and the driver got extremely high CPU percentage and the > process get stuck and I need to kill it. > I am running at client mode that submit to a Mesos cluster. > > I am using Spark 2.0.2 and my data store in HDFS with parquet format. > > Any advices for me in this situation? > > Thank you in advance!. > > Regards, > Chanh