Thank you YZ,
Now I understand why it causes high CPU usage on the driver side.
Thank you Ayan,
> First thing i would do is to add distinct, both inner and outer queries
I believe that would reduce the number of records to join.
Regards,
Chanh
If you read the source code of SparkStrategies
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
If there is no joining keys, Join implementations are chosen with the following
precedence:
*
Sorry, I didn't pay attention to the original requirement.
Did you try the left outer join, or left semi join?
What is the explain plan when you use "not in"? Does it lead to a
BroadcastNestedLoopJoin?
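One way to check is to print the plan directly. The sketch below assumes Spark 2.x on the classpath, a local SparkSession, and a temp view named data holding the rows from the original post. The NOT IN query is my reading of the one under discussion, and the object name ExplainNotIn is only for illustration. Which join operator shows up can depend on the Spark version and on whether user_id is nullable, so read the printed plan rather than assuming.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainNotIn {
  // Query shape assumed from the thread: users with no url containing "sell".
  val notInQuery: String =
    """select user_id from data
      |where user_id not in (select user_id from data where url like '%sell%')""".stripMargin

  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample rows taken from the original post.
    Seq(
      (1, "lao.com/buy"),
      (2, "bao.com/sell"),
      (2, "cao.com/market"),
      (1, "lao.com/sell"),
      (3, "vui.com/sell")
    ).toDF("user_id", "url").createOrReplaceTempView("data")

    val result = spark.sql(notInQuery)
    // Prints the parsed, analyzed, optimized and physical plans.
    result.explain(true)
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-not-in")
      .getOrCreate()
    try run(spark).show() finally spark.stop()
  }
}
```

In the physical plan, look for the join operator name and check whether it has any join keys.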
spark.sql("select user_id from data where user_id not in (select user_id from
data where
Chanh wants to return the user_ids that don't have any record with a url
containing "sell". Without a subquery or join, a filter can only look at one
record at a time, without knowing about the rest of that user_id's records.
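To make the difference concrete, here is a small sketch in plain Scala collections (no Spark involved). The rows are the ones from the original post plus one hypothetical user 4 that I added so the correct result is not empty. The names PerUserFilter, perRecord and perUser are only for illustration.

```scala
object PerUserFilter {
  // Rows from the original post, plus one hypothetical row for user 4.
  val records: Seq[(Int, String)] = Seq(
    (1, "lao.com/buy"),
    (2, "bao.com/sell"),
    (2, "cao.com/market"),
    (1, "lao.com/sell"),
    (3, "vui.com/sell"),
    (4, "example.com/buy") // hypothetical, not in the thread
  )

  // Per-record filter: keeps a user as soon as ONE of its rows lacks "sell".
  def perRecord(rows: Seq[(Int, String)]): Seq[Int] =
    rows.filterNot(_._2.contains("sell")).map(_._1).distinct

  // Per-user filter: collect users having ANY "sell" row, then exclude them.
  def perUser(rows: Seq[(Int, String)]): Seq[Int] = {
    val sellers = rows.filter(_._2.contains("sell")).map(_._1).toSet
    rows.map(_._1).distinct.filterNot(sellers.contains)
  }

  def main(args: Array[String]): Unit = {
    println(perRecord(records)) // users 1 and 2 slip through here
    println(perUser(records))   // only the hypothetical user 4 remains
  }
}
```

The per-record version wrongly returns users 1 and 2, because each of them has at least one row without "sell" even though they also have a "sell" row.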
Sidney Feiner / SW Developer
Not sure if I misunderstand your question, but what's wrong with doing it this way?
scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id",
"url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
scala>
The first thing I would do is add DISTINCT to both the inner and outer queries.
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le wrote:
> Hi everyone,
>
> I am working on a dataset like this
> *user_id url *
> 1 lao.com/buy
> 2 bao.com/sell
> 2
I tried a new way by using JOIN
select user_id from data a
left join (select user_id from data where url like '%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL
It's faster, and it seems that Spark optimizes a JOIN better than a subquery.
Regards,
Chanh
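The same idea can also be written with the DataFrame API as an anti join, which keeps only the left rows that have no match on the right. This is a sketch, not a benchmark: it assumes Spark 2.x (where, as far as I know, the "left_anti" join type is accepted), a DataFrame with columns user_id and url, and it de-duplicates both sides first. The object and method names are only for illustration, and the row for user 4 is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LeftAntiExample {
  // Returns the distinct user_ids that have no row whose url contains "sell".
  def usersWithoutSell(data: DataFrame): DataFrame = {
    val allUsers = data.select("user_id").distinct().as("a")
    val sellers = data
      .filter(col("url").contains("sell"))
      .select("user_id")
      .distinct()
      .as("b")
    // Anti join: keep rows of "a" with no matching user_id in "b".
    allUsers.join(sellers, col("a.user_id") === col("b.user_id"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-anti-example")
      .getOrCreate()
    import spark.implicits._

    // Rows from the original post, plus one hypothetical row for user 4.
    val data = Seq(
      (1, "lao.com/buy"),
      (2, "bao.com/sell"),
      (2, "cao.com/market"),
      (1, "lao.com/sell"),
      (3, "vui.com/sell"),
      (4, "example.com/buy") // hypothetical, not in the thread
    ).toDF("user_id", "url")

    try usersWithoutSell(data).show() finally spark.stop()
  }
}
```

Because the join has an equality key, Spark can pick a hash or sort-merge based join here. Calling explain() on the result shows what it actually chose.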
> On Feb 21, 2017, at 4:56 PM,
Hi everyone,
I am working on a dataset like this
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell
I have to find all user_ids that have no url containing "sell".