ike '%sell%', then you can just
> try a left semi join, for which Spark will use a SortMerge join in this case, I guess.
>
> Yong
>
> From: Yong Zhang <java8...@hotmail.com>
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spar
___
From: Yong Zhang
Sent: Tuesday, February 21, 2017 1:17 PM
To: Sidney Feiner; Chanh Le; user @spark
Subject: Re: How to query a query with not contain, not start_with, not
end_with condition effective?
Sorry, I didn't pay attention to the original requirement.
Did you try the
_id from
data where url like '%sell%')").explain(true)
Yong
From: Sidney Feiner
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not
end_with co
From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark
Subject: Re: How to query a query with not contain, not start_with, not
end_with condition effective?
Not sure if I misunderstood your question, but what's wrong with doing it this way?
scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
scala> df.registerTempTabl
The first thing I would do is add DISTINCT to both the inner and the outer query.
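A minimal sketch of what that could look like, meant to be pasted into spark-shell. The sample rows, the view name `data`, and the NOT IN form of the query are assumptions made for illustration; the thread does not show the exact original query.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("distinct-not-in-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows in the shape discussed in the thread.
val df = Seq((1, "lao.com/buy"), (2, "bao.com/sell"), (2, "cao.com/market")).toDF("user_id", "url")
df.createOrReplaceTempView("data")

// DISTINCT on the inner query shrinks the set that NOT IN compares against;
// DISTINCT on the outer query returns each remaining user only once.
val result = spark.sql(
  """SELECT DISTINCT user_id
    |FROM data
    |WHERE user_id NOT IN (SELECT DISTINCT user_id FROM data WHERE url LIKE '%sell%')""".stripMargin)

result.show()
```

Calling result.explain(true) on the same query is one way to check how the plan changes with and without DISTINCT.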
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le wrote:
> Hi everyone,
>
> I am working on a dataset like this
> *user_id url *
> 1 lao.com/buy
> 2 bao.com/sell
> 2 cao.com/market
> 1 lao.
I tried a different approach using a JOIN:
select user_id from data a
left join (select user_id from data where url like '%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL
It's faster, and it seems that Spark optimizes JOINs better than subqueries.
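For reference, the same exclusion can be sketched with the DataFrame API using a left anti join, which keeps only the left-side rows that have no match on the right. This is a sketch rather than a tested recommendation: the sample rows and the names `sellers` and `nonSellers` are hypothetical, and it assumes a Spark version whose join accepts the "left_anti" join type (2.0 or later, as far as I know).

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows in the shape discussed in the thread.
val data = Seq((1, "lao.com/buy"), (2, "bao.com/sell"), (2, "cao.com/market")).toDF("user_id", "url")

// Users that have at least one URL containing "sell".
val sellers = data.filter($"url".like("%sell%")).select("user_id").distinct()

// A left anti join drops every user_id that appears in sellers,
// which matches the LEFT JOIN ... WHERE b.user_id IS NULL pattern.
val nonSellers = data.select("user_id").distinct().join(sellers, Seq("user_id"), "left_anti")

nonSellers.show()
```

Comparing nonSellers.explain(true) with the plan of the SQL version would show whether Spark picks the same join strategy for both.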
Regards,
Chanh
> On Feb 21, 2017, at 4:56 PM, Cha