The first thing I would do is add DISTINCT to both the inner and outer queries.
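
As a rough sketch of what I mean, here is the query with DISTINCT at both levels, next to a NOT EXISTS variant. The NOT EXISTS form is my own addition and not something tested on your cluster. My understanding is that NOT IN has to account for NULLs, which can push Spark towards a nested-loop style plan, so an anti-join style query may behave better, but please check with EXPLAIN. The snippet uses Python's built-in sqlite3 only as a stand-in so it runs anywhere. The URLs are made up because the real ones are not shown, and user_id is assumed to be non-null.

```python
import sqlite3

# Hypothetical toy rows: the URLs in the original post are not shown,
# so these values are invented purely for illustration.
rows = [
    (1, "http://example.com/buy"),
    (2, "http://example.com/sell"),
    (2, "http://example.com/buy"),
    (1, "http://example.com/view"),
    (3, "http://example.com/view"),
]

# sqlite3 is only a stand-in engine here; on the cluster the same SQL
# text would be passed to Spark SQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# The original NOT IN query with DISTINCT added to both levels,
# so each user_id is compared and returned only once.
distinct_not_in = """
    SELECT DISTINCT user_id FROM data
    WHERE user_id NOT IN (
        SELECT DISTINCT user_id FROM data WHERE url LIKE '%sell%'
    )
"""

# Alternative formulation (an assumption, not from the thread):
# a correlated NOT EXISTS, which returns the same rows when
# user_id is never NULL.
not_exists = """
    SELECT DISTINCT d.user_id FROM data d
    WHERE NOT EXISTS (
        SELECT 1 FROM data s
        WHERE s.user_id = d.user_id AND s.url LIKE '%sell%'
    )
"""

result_not_in = sorted(r[0] for r in conn.execute(distinct_not_in))
result_not_exists = sorted(r[0] for r in conn.execute(not_exists))
print(result_not_in, result_not_exists)
```

On this toy data both queries return users 1 and 3, since only user 2 has a url containing "sell".
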
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le <> wrote:

> Hi everyone,
> I am working on a dataset like this
> *user_id         url *
> 1    
> 2
> 2    
> 1
> 3    
> I have to find every *user_id* whose *url* does not contain *sell*. That
> means I need to query all *user_id* values whose *url* contains *sell*, put
> them into a set, then run another query to find all *user_id* values not in
> that set.
> *SELECT user_id FROM data WHERE user_id NOT IN (SELECT user_id FROM data
> WHERE url LIKE '%sell%');*
> My data is about *20 million records and it's growing*. When I tried it in
> Zeppelin I had to *set spark.sql.crossJoin.enabled = true*
> Then I ran the query, the driver's CPU usage went extremely high, and
> the process got stuck, so I had to kill it.
> I am running in client mode, submitting to a Mesos cluster.
> I am using *Spark 2.0.2* and my data is stored in *HDFS* in *Parquet format*.
> Any advice for me in this situation?
> Thank you in advance!
> Regards,
> Chanh
Best Regards,
Ayan Guha
