Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Thank you YZ, now I understand why it causes high CPU usage on the driver side.
Thank you Ayan,
> First thing i would do is to add distinct, both inner and outer queries
I believe that would reduce the number of records to join.
Regards, Chanh
Hi everyone, I am working on a dataset like this user_id

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
If you read the source code of SparkStrategies:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
If there is no joining keys, Join implementations are chosen with the following precedence: *

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Sorry, I didn't pay attention to the original requirement. Did you try a left outer join or a left semi join? What is the explain plan when you use "not in"? Does it lead to a BroadcastNestedLoopJoin?
spark.sql("select user_id from data where user_id not in (select user_id from data where

RE: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Sidney Feiner
Chanh wants to return the user_ids that don't have any record with a url containing "sell". Without a subquery or join, the query can only filter record by record, without knowing about the rest of that user_id's records.
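Sidney's point can be checked on the five sample rows from the original post. The sketch below is not Spark: it assumes Python's built-in sqlite3 module as a stand-in SQL engine, with the table name `data` taken from the queries quoted in this thread. It illustrates only the semantics of the two filters, not Spark's query plan or performance.

```python
import sqlite3

# Sample rows from the original post.
# Assumption: SQLite stands in for Spark SQL purely to show the semantics.
rows = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# Per-record filter: keeps any user that has at least one non-"sell" row.
per_record = conn.execute(
    "SELECT DISTINCT user_id FROM data "
    "WHERE url NOT LIKE '%sell%' ORDER BY user_id"
).fetchall()

# Per-user filter: drops a user as soon as any of its rows contains "sell".
per_user = conn.execute(
    "SELECT DISTINCT user_id FROM data "
    "WHERE user_id NOT IN (SELECT user_id FROM data WHERE url LIKE '%sell%') "
    "ORDER BY user_id"
).fetchall()

print(per_record)  # [(1,), (2,)]
print(per_user)    # []
```

On this sample every user has at least one "sell" url, so the per-user result is empty, while the per-record filter still returns users 1 and 2.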

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong with doing it this way?
scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
scala>

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread ayan guha
First thing I would do is add distinct to both the inner and outer queries.
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le wrote:
> Hi everyone,
>
> I am working on a dataset like this
> *user_id url*
> 1 lao.com/buy
> 2 bao.com/sell
> 2

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
I tried a new way by using JOIN:
select user_id from data a left join (select user_id from data where url like '%sell%') b on a.user_id = b.user_id where b.user_id is NULL
It's faster, and it seems that Spark optimizes a JOIN better than a subquery.
Regards, Chanh
> On Feb 21, 2017, at 4:56 PM,
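The join-based rewrite can be tried on the sample data as a rough sketch. Assumptions: Python's built-in sqlite3 stands in for Spark SQL, so nothing here reflects Spark's planner or speed. User 4 is a hypothetical addition, not in the original post, so that the result is not empty. The selected column is qualified as `a.user_id` because SQLite rejects the unqualified name as ambiguous.

```python
import sqlite3

# The first five rows are from the original post; user 4 is hypothetical,
# added only so that at least one user has no "sell" url.
rows = [
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
    (4, "xyz.com/buy"),     # hypothetical
    (4, "xyz.com/market"),  # hypothetical
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# The left join from the thread, with user_id qualified as a.user_id:
# rows of `a` with no match in `b` belong to users without a "sell" url.
anti_join = conn.execute(
    "SELECT a.user_id FROM data a "
    "LEFT JOIN (SELECT user_id FROM data WHERE url LIKE '%sell%') b "
    "ON a.user_id = b.user_id "
    "WHERE b.user_id IS NULL"
).fetchall()

# Same query with DISTINCT on the inner and outer queries, as suggested
# elsewhere in the thread, so each user is returned once.
anti_join_distinct = conn.execute(
    "SELECT DISTINCT a.user_id FROM data a "
    "LEFT JOIN (SELECT DISTINCT user_id FROM data WHERE url LIKE '%sell%') b "
    "ON a.user_id = b.user_id "
    "WHERE b.user_id IS NULL"
).fetchall()

print(anti_join)           # [(4,), (4,)]
print(anti_join_distinct)  # [(4,)]
```

Without DISTINCT the query returns one row per input row of a qualifying user, which is why user 4 appears twice in the first result.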

How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Hi everyone, I am working on a dataset like this:
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell
I have to find all user_ids that have no url containing "sell".