If you read the source code of SparkStrategies
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106

If there are no joining keys, join implementations are chosen with the following precedence:

* BroadcastNestedLoopJoin: if one side of the join can be broadcast
* CartesianProduct: for inner join
* BroadcastNestedLoopJoin

So your case will use BroadcastNestedLoopJoin, as there are no joining keys. In this case, if there are lots of userIds where url not like '%sell%', then Spark has to retrieve them back to the driver (to be broadcast), which explains the high CPU usage on the driver side.

So if there are lots of userIds where url not like '%sell%', you can try a left semi join instead, for which Spark will use a SortMerge join in this case, I guess.

Yong

________________________________
From: Yong Zhang <java8...@hotmail.com>
Sent: Tuesday, February 21, 2017 1:17 PM
To: Sidney Feiner; Chanh Le; user @spark
Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?

Sorry, I didn't pay attention to the original requirement. Did you try a left outer join, or a left semi join? What is the explain plan when you use "not in"? Is it leading to a BroadcastNestedLoopJoin?

spark.sql("select user_id from data where user_id not in (select user_id from data where url like '%sell%')").explain(true)

Yong

________________________________
From: Sidney Feiner <sidney.fei...@startapp.com>
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not end_with condition effective?

Chanh wants to return user_ids that don't have any record with a url containing "sell".
Without a subquery/join, it can only filter per record, without knowing about the rest of that user_id's records.

Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp

________________________________
From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le <giaosu...@gmail.com>; user @spark <user@spark.apache.org>
Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?

Not sure if I misunderstand your question, but what's wrong with doing it this way?

scala> spark.version
res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+-------+
|user_id|
+-------+
|      2|
+-------+

Yong

________________________________
From: Chanh Le <giaosu...@gmail.com>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with condition effective?

Hi everyone,

I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell". That means I need to query all user_ids that do contain "sell", put them into a set, then run another query to find all user_ids not in that set:

SELECT user_id FROM data
WHERE user_id not in (
  SELECT user_id FROM data WHERE url like '%sell%'
);

My data is about 20 million records and it's growing.
When I tried it in Zeppelin I needed to set spark.sql.crossJoin.enabled = true. Then I ran the query, the driver got an extremely high CPU percentage, the process got stuck, and I had to kill it.

I am running in client mode, submitting to a Mesos cluster. I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for me in this situation? Thank you in advance!

Regards,
Chanh
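[Editor's note] The distinction the thread turns on, that a per-record url filter is not the same as excluding every user who has any "sell" record, can be sketched with plain Scala collections (a hypothetical miniature of Chanh's dataset, not Spark itself; the Spark equivalent of the set-difference step would be a left anti or left semi join as Yong suggests):

```scala
// Miniature of Chanh's sample rows: (user_id, url)
val data = Seq(
  (1, "lao.com/buy"),
  (2, "bao.com/sell"),
  (2, "cao.com/market"),
  (1, "lao.com/sell"),
  (3, "vui.com/sell")
)

// Naive per-record filter (the first reply in the thread) -- wrong for this
// requirement: user 1 sneaks in via the "lao.com/buy" row
val perRecord = data.collect { case (id, url) if !url.contains("sell") => id }.toSet

// NOT IN / anti-join semantics: exclude every user with ANY "sell" url
val sellers = data.collect { case (id, url) if url.contains("sell") => id }.toSet
val answer  = data.map(_._1).toSet -- sellers

println(perRecord) // Set(1, 2)
println(answer)    // Set() -- in this sample, every user has a "sell" url
```

On a real DataFrame the same exclusion can be written without NOT IN, e.g. joining the distinct user_ids against the "sell" user_ids with join type "left_anti", which gives Spark an equi-join key to work with instead of a broadcast nested loop.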