Hi,
I am having a hard time running an outer join on two Parquet datasets. The datasets are large, around 500 GB, with on the order of 1000 columns.
As per the limits the YARN administrator imposed on the queue, I can have a total of 20 vcores and 8 GB of memory per executor.
I specified memory overhead
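On YARN in the Spark 1.x/2.x era, the setting in question is spark.yarn.executor.memoryOverhead. A minimal sketch of a session configured under the stated queue limits (all values are illustrative, not a recommendation):

import org.apache.spark.sql.SparkSession

// Illustrative only: 6g heap + ~2g overhead stays under the 8 GB-per-executor cap.
val spark = SparkSession.builder()
  .appName("large-parquet-outer-join")
  .config("spark.executor.memory", "6g")
  .config("spark.yarn.executor.memoryOverhead", "2048") // in MiB for Spark 1.x/2.x
  .config("spark.sql.shuffle.partitions", "2000")       // many small shuffle partitions for a ~500 GB join
  .getOrCreate()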
Hi Gurdit Singh
Thanks. It is very helpful.
From: Gurdit Singh [mailto:gurdit.si...@bitwiseglobal.com]
Sent: February 22, 2017 13:31
To: Linyuxin ; Irving Duran ; Yong Zhang
Cc: Jacek Laskowski ; user
Hi, you can use the Spark SQL ANTLR grammar to pre-check your syntax.
https://github.com/apache/spark/blob/acf71c63cdde8dced8d108260cdd35e1cc992248/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
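For a programmatic pre-check that does not need a cluster, Spark's own parser can be invoked directly. A minimal sketch, assuming Spark 2.x with only the spark-catalyst artifact on the classpath (note that some DDL handled by the full SparkSqlParser will not parse here):

import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

def checkSyntax(sql: String): Option[String] =
  try {
    CatalystSqlParser.parsePlan(sql) // throws ParseException on invalid syntax
    None
  } catch {
    case e: ParseException => Some(e.getMessage)
  }

checkSyntax("select 1")  // None
checkSyntax("selectt 1") // Some(<parse error>)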
From: Linyuxin [mailto:linyu...@huawei.com]
Sent: Wednesday, February 22,
I am reading from a Kafka topic which has 8 partitions. My Spark app is given 40 executors (1 core per executor). After reading the data, I repartition the DStream by 500, map it, and save it to Cassandra.
However, I see that only 2 executors are being used per batch, even though I see 500 tasks.
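For context, a rough sketch of the pipeline being described (topic, broker, keyspace, and the parsing step are placeholders, and sc is an existing SparkContext; assumes the Kafka 0.8 direct stream and the spark-cassandra-connector):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

case class Record(key: String, value: String)

val ssc = new StreamingContext(sc, Seconds(10))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("my_topic"))

stream
  .repartition(500) // redistribute the 8 Kafka partitions across the 40 executors
  .map { case (_, v) => val a = v.split(","); Record(a(0), a(1)) }
  .saveToCassandra("my_keyspace", "my_table")

ssc.start()
ssc.awaitTermination()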
Thank you YZ,
Now I understand why it causes high CPU usage on the driver side.
Thank you Ayan,
> First thing I would do is to add distinct to both the inner and outer queries
I believe that would reduce the number of records to join.
Regards,
Chanh
Hi everyone,
I am working on a dataset like this
user_id
Actually, I want a standalone jar so I can check the syntax without a Spark execution environment.
From: Irving Duran [mailto:irving.du...@gmail.com]
Sent: February 21, 2017 23:29
To: Yong Zhang
Cc: Jacek Laskowski ; Linyuxin ; user
If you read the source code of SparkStrategies
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
If there are no joining keys, join implementations are chosen with the following precedence:
*
I'm afraid your requirement is not very clear. Can you post some example data and the output you are expecting?
On Wed, 22 Feb 2017 at 9:13 am, nimrodo
wrote:
> Hi all,
>
> I have a DStream that contains very long comma separated values. I want to
> convert
Hi all,
I have a DStream that contains very long comma-separated values. I want to convert this DStream to a DataFrame. I thought of using split on the RDD and toDF, however I can't get it to work.
Can anyone help me here?
Nimrod
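One common pattern is to do the conversion per batch inside foreachRDD. A minimal sketch, assuming Spark 2.x and a fixed column layout (the stream name lines, the column names, and the output path are placeholders):

import org.apache.spark.sql.SparkSession

lines.foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val df = rdd
    .map(_.split(",", -1))        // -1 keeps trailing empty fields
    .map(a => (a(0), a(1), a(2))) // adjust to the actual number of columns
    .toDF("col1", "col2", "col3")

  df.write.mode("append").parquet("/tmp/out") // or whatever the downstream sink is
}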
Sorry, I didn't pay attention to the original requirement.
Did you try a left outer join, or a left semi join?
What is the explain plan when you use "not in"? Is it leading to a BroadcastNestedLoopJoin?
spark.sql("select user_id from data where user_id not in (select user_id from
data where
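The query above is cut off; assuming it ends with the '%sell%' predicate that appears later in the thread, the physical plan can be inspected like this:

spark.sql("""
  select user_id from data
  where user_id not in (select user_id from data where url like '%sell%')
""").explain(true)
// Look for BroadcastNestedLoopJoin in the printed physical plan.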
Chanh wants to return user_ids that don't have any record with a url containing "sell". Without a subquery/join, it can only filter per record, without knowing about the rest of that user_id's records.
Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
You can also run it in the REPL and test whether you are getting the expected result.
Thank You,
Irving Duran
On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang wrote:
> You can always use the explain method to validate your DF or SQL before any action.
>
>
> Yong
>
>
>
Thanks Yan and Yong,
Yes, from Spark, I can access ORC files loaded to Hive tables.
Thanks.
From: 颜发才(Yan Facai) [mailto:facai@gmail.com]
Sent: Friday, February 17, 2017 6:59 PM
To: Yong Zhang
Cc: Begar, Veena ; smartzjp ;
Not sure if I misunderstand your question, but what's wrong with doing it this way?

scala> spark.version
res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala>
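The session is truncated at this point; judging from Sidney Feiner's reply elsewhere in this digest, the continuation was presumably a plain per-record filter, roughly:

scala> df.filter(!$"url".contains("sell")).show() // hypothetical reconstruction
// This filters individual rows only; it cannot exclude a user_id whose other rows contain "sell".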
You can always use the explain method to validate your DF or SQL before any action.
Yong
From: Jacek Laskowski
Sent: Tuesday, February 21, 2017 4:34 AM
To: Linyuxin
Cc: user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?
Hi,
Hi!
I'm trying to execute this code:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;

StructField datetime = new StructField("DateTime", DataTypes.DateType, true, Metadata.empty());
StructField tagname = new StructField("Tagname", DataTypes.StringType, true, Metadata.empty());
StructField value = new StructField("Value", DataTypes.DoubleType, true, Metadata.empty()); // completed per the pattern above; the original message is cut off here
First thing I would do is to add distinct to both the inner and outer queries
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le wrote:
> Hi everyone,
>
> I am working on a dataset like this
> user_id url
> 1 lao.com/buy
> 2 bao.com/sell
> 2
I tried a new way, using a JOIN:

select user_id from data a
left join (select user_id from data where url like '%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL

It's faster; it seems that Spark optimizes JOINs better than subqueries.
Regards,
Chanh
> On Feb 21, 2017, at 4:56 PM,
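Spark 2.x also supports a dedicated anti-join, which states the same intent more directly; a hedged sketch of the equivalent query:

spark.sql("""
  select distinct a.user_id
  from data a
  left anti join (select user_id from data where url like '%sell%') b
  on a.user_id = b.user_id
""").show()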
Hi everyone,
I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell".
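For reference, the same question can be answered without a subquery using a set difference in the DataFrame API; a minimal sketch, assuming the table above is loaded as df:

import org.apache.spark.sql.functions.col

val sellers = df.filter(col("url").contains("sell")).select("user_id").distinct()
val result  = df.select("user_id").distinct().except(sellers)
result.show() // for the sample above this is empty: every user has at least one "sell" url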
Hi,
Never heard of such a tool before. You could use ANTLR to parse the SQL (just as Spark SQL does while parsing queries). I think it's a one-hour project.
Jacek
On 21 Feb 2017 4:44 a.m., "Linyuxin" wrote:
Hi All,
Is there any tool/API to check the SQL syntax without
Hi,
Please send me a copy of a pom.xml, as I am getting a "no sources to compile" error. However much I try to set the source directory in pom.xml, it's not recognizing the source files from my src/main/scala, so please send me one (that includes Hive context and Spark core).
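A common cause is that Maven compiles only src/main/java by default, so Scala sources need the scala-maven-plugin wired in. A minimal, hedged pom.xml fragment (plugin version illustrative):

<build>
  <sourceDirectory>src/main/scala</sourceDirectory>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.2</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>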