Thank you YZ,
Now I understand why it causes high CPU usage on the driver side.

Thank you Ayan,
> The first thing I would do is add DISTINCT to both the inner and outer queries

I believe that would reduce the number of records to join.
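To make that concrete, here is a plain-Python sketch (not Spark) on the sample data from this thread; the variable names are mine, chosen for illustration:

```python
# Sample data from the thread: (user_id, url) pairs.
rows = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
]

# Outer query: SELECT user_id FROM data -> one row per record.
outer_all = [uid for uid, _ in rows]
# With DISTINCT: one row per user, so fewer values to check against the set.
outer_distinct = set(outer_all)

# Inner query: the same user_id can match once per 'sell' URL;
# DISTINCT deduplicates it before the NOT IN comparison.
inner_all = [uid for uid, url in rows if "sell" in url]
inner_distinct = set(inner_all)

print(len(outer_all), len(outer_distinct))  # 5 3
```

On 20 million records the shrinkage on each side should be much larger than in this toy sample.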

Regards,
Chanh
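P.S. For anyone following the thread, the difference Sidney pointed out between a per-row filter and the subquery can be shown in plain Python (not Spark) on the same sample data; names here are mine, for illustration only:

```python
# Sample data from the thread: (user_id, url) pairs.
rows = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
]

# Inner subquery: users having at least one URL containing 'sell'.
sellers = {uid for uid, url in rows if "sell" in url}

# What the NOT IN query asks: users with NO 'sell' URL at all.
no_sell_users = {uid for uid, _ in rows} - sellers

# What a plain per-row "url not like '%sell%'" filter answers instead:
# users having SOME URL without 'sell' -- a different question.
rows_without_sell = {uid for uid, url in rows if "sell" not in url}

print(no_sell_users)      # set(): every user in this sample has a 'sell' URL
print(rows_without_sell)  # {1, 2}
```

That is why the per-row filter alone cannot answer the original question: it has no view of the user's other records.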


> On Feb 22, 2017, at 8:52 AM, Yong Zhang <java8...@hotmail.com> wrote:
> 
> If you read the source code of SparkStrategies
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
> 
> If there are no joining keys, join implementations are chosen with the 
> following precedence:
> BroadcastNestedLoopJoin: if one side of the join could be broadcasted
> CartesianProduct: for Inner join
> BroadcastNestedLoopJoin
> 
> So your case will use BroadcastNestedLoopJoin, as there are no joining keys.
> 
> In this case, if there are lots of userIds where url not like '%sell%', then 
> Spark has to retrieve them back to the driver (to be broadcast), which explains 
> the high CPU usage on the driver side. 
> 
> So if there are lots of userIds where url not like '%sell%', then you can just 
> try a left semi join, in which case Spark will use a sort-merge join, I guess.
> 
> Yong
> 
> From: Yong Zhang <java8...@hotmail.com <mailto:java8...@hotmail.com>>
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spark
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Sorry, didn't pay attention to the original requirement.
> 
> Did you try the left outer join, or left semi join?
> 
> What is the explain plan when you use "not in"? Does it lead to a 
> BroadcastNestedLoopJoin?
> 
> spark.sql("select user_id from data where user_id not in (select user_id from 
> data where url like '%sell%')").explain(true)
> 
> Yong
> 
> 
> From: Sidney Feiner <sidney.fei...@startapp.com 
> <mailto:sidney.fei...@startapp.com>>
> Sent: Tuesday, February 21, 2017 10:46 AM
> To: Yong Zhang; Chanh Le; user @spark
> Subject: RE: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Chanh wants to return user_ids that don't have any record with a url 
> containing "sell". Without a subquery/join, Spark can only filter per record, 
> without knowing about the rest of that user_id's records.
>  
> Sidney Feiner   /  SW Developer
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>  
>  
> From: Yong Zhang [mailto:java8...@hotmail.com <mailto:java8...@hotmail.com>] 
> Sent: Tuesday, February 21, 2017 4:10 PM
> To: Chanh Le <giaosu...@gmail.com <mailto:giaosu...@gmail.com>>; user @spark 
> <user@spark.apache.org <mailto:user@spark.apache.org>>
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Not sure if I misunderstand your question, but what's wrong with doing it this way?
>  
> scala> spark.version
> res6: String = 2.0.2
> scala> val df = Seq((1, "lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
> df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
>  
> scala> df.registerTempTable("data")
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
>  
> scala> spark.sql("select user_id from data where url not like '%sell%'").show
> +-------+
> |user_id|
> +-------+
> |      2|
> +-------+
>  
> Yong
>  
> From: Chanh Le <giaosu...@gmail.com <mailto:giaosu...@gmail.com>>
> Sent: Tuesday, February 21, 2017 4:56 AM
> To: user @spark
> Subject: How to query a query with not contain, not start_with, not end_with 
> condition effective?
>  
> Hi everyone, 
>  
> I am working on a dataset like this
> user_id         url 
> 1               lao.com/buy
> 2               bao.com/sell
> 2               cao.com/market
> 1               lao.com/sell
> 3               vui.com/sell
> 
> I have to find all user_ids whose urls do not contain sell. That means I need to 
> query all user_ids that contain sell, put them into a set, and then run another 
> query to find all user_ids not in that set.
> SELECT user_id 
> FROM data
> WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%');
> 
> My data is about 20 million records and it's growing. When I tried this in 
> Zeppelin, I needed to set spark.sql.crossJoin.enabled = true.
> Then I ran the query; the driver hit extremely high CPU usage and the process 
> got stuck, so I had to kill it.
> I am running in client mode, submitting to a Mesos cluster.
>  
> I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.
>  
> Any advice for this situation?
>  
> Thank you in advance!
>  
> Regards,
> Chanh
