The first thing I would do is add DISTINCT to both the inner and outer queries.
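
A minimal sketch of that suggestion, run against the sample rows from the question. SQLite (through Python's sqlite3 module) is only a stand-in for Spark SQL here so the query logic can be checked locally. The rows for user 4 are hypothetical, added because every user in the original sample has a "sell" URL and the result would otherwise be empty. This checks correctness only; it says nothing about how the query behaves on 20 million rows in Spark.

```python
import sqlite3

# SQLite is an assumption made for local checking; the thread itself uses Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, url TEXT)")

rows = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
    # Hypothetical rows, not in the original sample: a user with no "sell" URL.
    (4, "hypothetical.example/buy"),
    (4, "hypothetical.example/market"),
]
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# DISTINCT in the inner query shrinks the exclusion set to one row per user;
# DISTINCT in the outer query returns each remaining user only once.
QUERY = """
SELECT DISTINCT user_id
FROM data
WHERE user_id NOT IN (
    SELECT DISTINCT user_id
    FROM data
    WHERE url LIKE '%sell%'
)
"""

result = sorted(row[0] for row in conn.execute(QUERY))
print(result)  # prints [4]
```

User 4 has two rows but appears once in the result because of the outer DISTINCT. Note that NOT IN returns no rows at all if the inner query yields a NULL user_id, so this assumes user_id is never NULL.
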
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le <giaosu...@gmail.com> wrote:

> Hi everyone,
>
> I am working on a dataset like this
> *user_id   url*
> 1         lao.com/buy
> 2         bao.com/sell
> 2         cao.com/market
> 1         lao.com/sell
> 3         vui.com/sell
>
> I have to find every *user_id* whose *url* never contains *sell*. That means
> I need to query every *user_id* whose url contains *sell*, put them into a
> set, and then run another query to find every *user_id* not in that set.
>
>
>
> *SELECT user_id FROM data WHERE user_id NOT IN (SELECT user_id FROM data
> WHERE url LIKE '%sell%');*
> My data is about *20 million records and it’s growing*. When I tried this in
> Zeppelin, I needed to *set spark.sql.crossJoin.enabled = true*.
> Then I ran the query; the driver hit extremely high CPU usage, the process
> got stuck, and I had to kill it.
> I am running in client mode, submitting to a Mesos cluster.
>
> I am using *Spark 2.0.2* and my data is stored in *HDFS* in *Parquet format*.
>
> Any advice for me in this situation?
>
> Thank you in advance!
>
> Regards,
> Chanh
>
-- 
Best Regards,
Ayan Guha
