The first thing I would do is add DISTINCT to both the inner and outer queries.

On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le <giaosu...@gmail.com> wrote:
> Hi everyone,
>
> I am working on a dataset like this:
>
> *user_id url*
> 1 lao.com/buy
> 2 bao.com/sell
> 2 cao.com/market
> 1 lao.com/sell
> 3 vui.com/sell
>
> I have to find every *user_id* whose *url* does not contain *sell*. That
> means I need to query all *user_id* values with a url containing *sell*,
> put them into a set, and then run another query to find all *user_id*
> values not in that set.
>
> *SELECT user_id FROM data WHERE user_id NOT IN (SELECT user_id FROM data
> WHERE url LIKE '%sell%');*
>
> My data is about *20 million records and it's growing*. When I tried this
> in Zeppelin I needed to *set spark.sql.crossJoin.enabled = true*.
> Then I ran the query, and the driver hit extremely high CPU usage; the
> process got stuck and I had to kill it.
> I am running in client mode, submitting to a Mesos cluster.
>
> I am using *Spark 2.0.2* and my data is stored in *HDFS* in *Parquet
> format*.
>
> Any advice for me in this situation?
>
> Thank you in advance!
>
> Regards,
> Chanh

--
Best Regards,
Ayan Guha
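
P.S. To make the DISTINCT suggestion at the top of this reply concrete, here is a minimal sketch of the rewritten query. It runs against Python's built-in sqlite3 rather than Spark, purely so that it is self-contained, so it only illustrates what the query returns and says nothing about how Spark 2.0.2 will plan or execute it. The table name `data` and the sample rows come from the question; the helper name `users_without_sell` and the extra row for user 4 are hypothetical, added because every user in the original sample has at least one *sell* URL.

```python
import sqlite3

# Sample rows from the question, plus one hypothetical user (4) who never
# visits a "sell" URL, so that the result is not empty.
ROWS = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
    (4, "dao.com/buy"),  # hypothetical row, not in the original data
]

# DISTINCT on both the inner and the outer query, as suggested in the reply.
NOT_IN_QUERY = """
SELECT DISTINCT user_id
FROM data
WHERE user_id NOT IN (
    SELECT DISTINCT user_id
    FROM data
    WHERE url LIKE '%sell%'
)
ORDER BY user_id
"""


def users_without_sell(rows):
    """Return the distinct user_ids that have no url containing 'sell'."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE data (user_id INTEGER, url TEXT)")
        conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(NOT_IN_QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(users_without_sell(ROWS))
```

The SQL text itself is plain standard SQL, so it should be possible to pass the same statement to Spark SQL once the Parquet data is registered as a table or temporary view named `data`; I have not verified that against Spark 2.0.2.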