Thank you YZ, now I understand why it causes high CPU usage on the driver side.
Thank you Ayan,

> First thing I would do is to add distinct, both inner and outer queries

I believe that would reduce the number of records to join.

Regards,
Chanh


Hi everyone,

I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_id whose urls do not contain "sell". That means I need to query all user_id that do contain "sell", put them into a set, and then run another query to find all user_id not in that set:

SELECT user_id
FROM data
WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%');

My data is about 20 million records and it's growing. When I tried this in Zeppelin I needed to set spark.sql.crossJoin.enabled = true. Then I ran the query, the driver hit extremely high CPU usage, and the process got stuck, so I had to kill it. I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for me in this situation?

Thank you in advance!

Regards,
Chanh

> On Feb 22, 2017, at 8:52 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
> If you read the source code of SparkStrategies:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
>
> If there are no joining keys, Join implementations are chosen with the following precedence:
>   - BroadcastNestedLoopJoin: if one side of the join can be broadcast
>   - CartesianProduct: for inner join
>   - BroadcastNestedLoopJoin
>
> So your case will use BroadcastNestedLoopJoin, as there are no joining keys.
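The requirement in Chanh's query (keep a user_id only if none of its urls contains "sell") can be sketched on plain Scala collections. This is only an illustration of the semantics, not Spark code; the `(4, "doc.com/read")` row is invented here so the result is non-empty, since every user in the original sample has at least one "sell" url.

```scala
// Plain-Scala sketch of what the NOT IN query computes (illustrative, not Spark).
object NotInSketch {
  // Sample rows from the thread, plus one invented user (4) with no "sell" url.
  val data = Seq(
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
    (4, "doc.com/read") // invented row, not in the original sample
  )

  // Inner query: user_ids with at least one url containing "sell".
  val sellers: Set[Int] =
    data.collect { case (id, url) if url.contains("sell") => id }.toSet

  // Outer query: all user_ids not in that set.
  val nonSellers: Set[Int] = data.map(_._1).toSet -- sellers

  def main(args: Array[String]): Unit = {
    println(sellers)    // Set(1, 2, 3)
    println(nonSellers) // Set(4)
  }
}
```

Note that on the original five rows alone the answer is empty: users 1, 2, and 3 each have at least one "sell" url.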
> In this case, if there are lots of user_ids where url not like '%sell%', then Spark has to retrieve them back to the driver (to be broadcast), which explains the high CPU usage on the driver side.
>
> So if there are lots of user_ids where url not like '%sell%', you can just try a left semi join, in which case Spark will use a sort-merge join, I guess.
>
> Yong
>
> From: Yong Zhang <java8...@hotmail.com>
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spark
> Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?
>
> Sorry, didn't pay attention to the original requirement.
>
> Did you try a left outer join, or a left semi join?
>
> What is the explain plan when you use "not in"? Is it leading to a BroadcastNestedLoopJoin?
>
> spark.sql("select user_id from data where user_id not in (select user_id from data where url like '%sell%')").explain(true)
>
> Yong
>
>
> From: Sidney Feiner <sidney.fei...@startapp.com>
> Sent: Tuesday, February 21, 2017 10:46 AM
> To: Yong Zhang; Chanh Le; user @spark
> Subject: RE: How to query a query with not contain, not start_with, not end_with condition effective?
>
> Chanh wants to return user_ids that don't have any record with a url containing "sell". Without a subquery/join, a filter can only act per record, without knowing about the rest of that user_id's records.
>
> Sidney Feiner / SW Developer
> M: +972.528197720 / Skype: sidney.feiner.startapp
>
> From: Yong Zhang [mailto:java8...@hotmail.com]
> Sent: Tuesday, February 21, 2017 4:10 PM
> To: Chanh Le <giaosu...@gmail.com>; user @spark <user@spark.apache.org>
> Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?
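A note on the suggestion above: for "users with no 'sell' url" the rewrite that matches the requirement is a left anti join, whereas a left semi join keeps the users that do match. The distinction can be sketched on plain Scala collections (illustrative only; user 4 is invented, and NOT IN's null handling is ignored):

```scala
// Semi- vs anti-join semantics on plain collections (illustrative, not Spark).
object JoinSemantics {
  // user_ids from the outer query, duplicates included; 4 is an invented user.
  val left = Seq(1, 2, 2, 1, 3, 4)
  // user_ids the "url like '%sell%'" subquery would return.
  val right = Set(1, 2, 3)

  // LEFT SEMI JOIN: keep left rows that HAVE a match on the right.
  val semi: Seq[Int] = left.filter(right.contains)

  // LEFT ANTI JOIN: keep left rows with NO match on the right --
  // this is what "user_id NOT IN (subquery)" asks for (null handling aside).
  val anti: Seq[Int] = left.filterNot(right.contains)

  def main(args: Array[String]): Unit = {
    println(semi.distinct) // users that do have a "sell" url
    println(anti.distinct) // users that do not
  }
}
```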
>
> Not sure if I misunderstand your question, but what's wrong with doing it this way?
>
> scala> spark.version
> res6: String = 2.0.2
>
> scala> val df = Seq((1, "lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
> df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
>
> scala> df.registerTempTable("data")
> warning: there was one deprecation warning; re-run with -deprecation for details
>
> scala> spark.sql("select user_id from data where url not like '%sell%'").show
> +-------+
> |user_id|
> +-------+
> |      2|
> +-------+
>
> Yong
>
> From: Chanh Le <giaosu...@gmail.com>
> Sent: Tuesday, February 21, 2017 4:56 AM
> To: user @spark
> Subject: How to query a query with not contain, not start_with, not end_with condition effective?
>
> Hi everyone,
>
> I am working on a dataset like this:
>
> user_id  url
> 1        lao.com/buy
> 2        bao.com/sell
> 2        cao.com/market
> 1        lao.com/sell
> 3        vui.com/sell
>
> I have to find all user_id whose urls do not contain "sell". That means I need to query all user_id that do contain "sell", put them into a set, and then run another query to find all user_id not in that set:
>
> SELECT user_id
> FROM data
> WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%');
>
> My data is about 20 million records and it's growing. When I tried this in Zeppelin I needed to set spark.sql.crossJoin.enabled = true. Then I ran the query, the driver hit extremely high CPU usage, and the process got stuck, so I had to kill it. I am running in client mode, submitting to a Mesos cluster.
>
> I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.
>
> Any advice for me in this situation?
>
> Thank you in advance!
>
> Regards,
> Chanh
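Sidney's objection to the row-level filter can be checked directly against the sample data: user 1 has both lao.com/buy and lao.com/sell, so a row-level `url not like '%sell%'` still returns user 1 (via the buy row), whereas the per-user NOT IN query excludes every user in this sample. A plain-Scala check (illustrative, not Spark):

```scala
// Per-row filter vs per-user exclusion, on the thread's sample data (plain Scala sketch).
object RowVsUser {
  val data = Seq(
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell")
  )

  // Row-level "url not like '%sell%'": user 1 slips through via lao.com/buy,
  // and user 2 via cao.com/market.
  val rowLevel: Set[Int] =
    data.collect { case (id, url) if !url.contains("sell") => id }.toSet

  // User-level "user_id not in (... like '%sell%')": nobody qualifies here,
  // since every user in the sample has at least one "sell" url.
  val userLevel: Set[Int] = {
    val sellers = data.collect { case (id, url) if url.contains("sell") => id }.toSet
    data.map(_._1).toSet -- sellers
  }

  def main(args: Array[String]): Unit = {
    println(rowLevel)  // Set(1, 2)
    println(userLevel) // Set()
  }
}
```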