Spark SQL: Join operation failure

2017-02-21 Thread jatinpreet
Hi, I am having a hard time running an outer join on two Parquet datasets. The datasets are large, ~500GB, with a large number of columns, on the order of 1,000. Per the administrator-imposed YARN queue limits, I can have a total of 20 vcores and 8GB of memory per executor. I specified memory overhead
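For context, a sketch of how those limits are usually expressed (property names per Spark on YARN of that era; the exact heap/overhead split here is an assumption):

    import org.apache.spark.SparkConf

    // Heap plus overhead must stay within the 8 GB-per-executor queue limit
    val conf = new SparkConf()
      .set("spark.executor.memory", "6g")                // JVM heap
      .set("spark.yarn.executor.memoryOverhead", "2048") // off-heap headroom, in MB
      .set("spark.executor.cores", "1")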

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
Hi Gurdit Singh, thanks. It is very helpful. From: Gurdit Singh [mailto:gurdit.si...@bitwiseglobal.com] Sent: February 22, 2017 13:31 To: Linyuxin; Irving Duran; Yong Zhang Cc: Jacek Laskowski; user

RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Gurdit Singh
Hi, you can use the Spark SQL ANTLR grammar to pre-check your syntax: https://github.com/apache/spark/blob/acf71c63cdde8dced8d108260cdd35e1cc992248/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 From: Linyuxin [mailto:linyu...@huawei.com] Sent: Wednesday, February 22,
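For reference, a minimal sketch of a syntax-only check against that grammar through Catalyst's parser (it needs only the spark-catalyst jar and no execution environment, which also fits Linyuxin's standalone-jar requirement below):

    import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

    // Parse to a logical plan; a ParseException means the SQL is syntactically invalid
    def checkSyntax(sql: String): Option[String] =
      try { CatalystSqlParser.parsePlan(sql); None }
      catch { case e: ParseException => Some(e.getMessage) }

    checkSyntax("SELECT 1")  // None -- parses fine
    checkSyntax("SELEC 1")   // Some(parse error message)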

Spark streaming app always uses 2 executors

2017-02-21 Thread satishl
I am reading from a Kafka topic which has 8 partitions. My Spark app is given 40 executors (1 core per executor). After reading the data, I repartition the DStream into 500 partitions, map it, and save it to Cassandra. However, I see that only 2 executors are being used per batch, even though I see 500 tasks
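For reference, a sketch of the pipeline described above (Spark 1.x direct-stream API with the spark-cassandra-connector; the broker, topic, keyspace, table, and column names are assumptions):

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-cassandra"), Seconds(30))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // assumption

    // With 8 Kafka partitions, the direct stream creates only 8 read tasks per batch;
    // repartition(500) only redistributes the subsequent map/save work
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))
    stream.repartition(500)
      .map { case (_, value) => (value, System.currentTimeMillis) }
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("value", "ts"))

    ssc.start()
    ssc.awaitTermination()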

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
Thank you YZ, now I understand why it causes high CPU usage on the driver side. Thank you Ayan. > First thing I would do is to add distinct, both inner and outer queries I believe that would reduce the number of records to join. Regards, Chanh Hi everyone, I am working on a dataset like this user_id

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
Actually, I want a standalone jar so that I can check the syntax without a Spark execution environment. From: Irving Duran [mailto:irving.du...@gmail.com] Sent: February 21, 2017 23:29 To: Yong Zhang Cc: Jacek Laskowski; Linyuxin; user

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Yong Zhang
If you read the source code of SparkStrategies (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106): if there are no joining keys, join implementations are chosen with the following precedence: *

Re: CSV DStream to Hive

2017-02-21 Thread ayan guha
I am afraid your requirement is not very clear. Can you post some example data and the output you are expecting? On Wed, 22 Feb 2017 at 9:13 am, nimrodo wrote: > Hi all, > > I have a DStream that contains very long comma-separated values. I want to > convert

CSV DStream to Hive

2017-02-21 Thread nimrodo
Hi all, I have a DStream that contains very long comma-separated values. I want to convert this DStream to a DataFrame. I thought of using split on the RDD and toDF; however, I can't get it to work. Can anyone help me here? Nimrod
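Not having seen the data, a minimal sketch of the split-and-toDF idea (the field count, column names, and table name are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    def saveToHive(lines: DStream[String]): Unit =
      lines.foreachRDD { rdd =>
        val spark = SparkSession.builder
          .config(rdd.sparkContext.getConf)
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        val df = rdd.map(_.split(",", -1))   // -1 keeps trailing empty fields
          .map(a => (a(0), a(1), a(2)))      // assumes at least 3 fields per line
          .toDF("col1", "col2", "col3")      // hypothetical column names

        df.write.mode("append").insertInto("my_hive_table") // hypothetical, pre-existing table
      }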

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Yong Zhang
Sorry, I didn't pay attention to the original requirement. Did you try a left outer join, or a left semi join? What is the explain plan when you use "not in"? Is it leading to a BroadcastNestedLoopJoin? spark.sql("select user_id from data where user_id not in (select user_id from data where
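To answer that empirically, a sketch (the subquery's WHERE clause is filled in from the rest of this thread, and "data" is assumed to be a registered temp view):

    spark.sql(
      """select user_id from data
        |where user_id not in (select user_id from data where url like '%sell%')""".stripMargin)
      .explain()
    // A BroadcastNestedLoopJoin in the printed physical plan would confirm that
    // the NOT IN subquery was planned without equi-join keys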

RE: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Sidney Feiner
Chanh wants to return user_ids that don't have any record with a url containing "sell". Without a subquery/join, a query can only filter per record, without knowing about the rest of that user_id's records. Sidney Feiner / SW Developer M: +972.528197720 / Skype: sidney.feiner.startapp

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Irving Duran
You can also run it in the REPL and test to see if you are getting the expected result. Thank You, Irving Duran On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang wrote: > You can always use the explain method to validate your DF or SQL, before any > action. > > > Yong > > >

RE: How to specify a default value for StructField?

2017-02-21 Thread Begar, Veena
Thanks Yan and Yong. Yes, from Spark I can access ORC files loaded into Hive tables. Thanks. From: 颜发才(Yan Facai) [mailto:facai@gmail.com] Sent: Friday, February 17, 2017 6:59 PM To: Yong Zhang Cc: Begar, Veena; smartzjp;

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong with doing it this way?

scala> spark.version
res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala>
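The archive truncates the session here; a guess at the kind of check it was heading toward (an assumption, not the original transcript):

    // Per-record filter on the toy df above -- note it only answers the question
    // because each user has a single row; see Sidney Feiner's reply in this thread
    df.filter(!$"url".contains("sell")).select("user_id").distinct().show()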

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Yong Zhang
You can always use the explain method to validate your DF or SQL before any action. Yong From: Jacek Laskowski Sent: Tuesday, February 21, 2017 4:34 AM To: Linyuxin Cc: user Subject: Re: [SparkSQL] pre-check syntax before running spark job? Hi,
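A sketch of that check (query and view name per this thread; both lines complete without launching a job):

    // spark.sql() parses and analyzes eagerly, and explain(true) prints the parsed,
    // analyzed, optimized, and physical plans -- still with no action executed
    val checked = spark.sql("select user_id from data where url like '%sell%'")
    checked.explain(true)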

Error when trying to filter

2017-02-21 Thread Marco Mans
Hi! I'm trying to execute this code:

StructField datetime = new StructField("DateTime", DataTypes.DateType, true, Metadata.empty());
StructField tagname = new StructField("Tagname", DataTypes.StringType, true, Metadata.empty());
StructField value = new StructField("Value", DataTypes.DoubleType,
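For reference, the same schema in Scala (only the three fields visible in the truncated snippet):

    import org.apache.spark.sql.types._

    // Equivalent of the Java StructFields above
    val schema = StructType(Seq(
      StructField("DateTime", DateType, nullable = true),
      StructField("Tagname", StringType, nullable = true),
      StructField("Value", DoubleType, nullable = true)))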

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread ayan guha
First thing I would do is to add distinct to both the inner and outer queries. On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le wrote: > Hi everyone, > > I am working on a dataset like this > *user_id url * > 1 lao.com/buy > 2 bao.com/sell > 2

Re: How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
I tried a new way, using a JOIN:

select user_id
from data a
left join (select user_id from data where url like '%sell%') b
  on a.user_id = b.user_id
where b.user_id is NULL

It's faster; it seems that Spark optimizes JOINs better than subqueries. Regards, Chanh > On Feb 21, 2017, at 4:56 PM,
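For reference, the same anti-join written with the DataFrame API (Spark >= 2.0; df standing in for the "data" table is an assumption):

    import org.apache.spark.sql.functions.col

    // "left_anti" keeps only left-side rows with no match on the right, i.e.
    // user_ids that never appear in a record whose url contains "sell"
    val sellers = df.filter(col("url").contains("sell")).select("user_id")
    val result = df.select("user_id").distinct().join(sellers, Seq("user_id"), "left_anti")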

How to write a query with not-contain, not-start_with, not-end_with conditions effectively?

2017-02-21 Thread Chanh Le
Hi everyone, I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell".

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Jacek Laskowski
Hi, I have never heard of such a tool before. You could use ANTLR to parse the SQL (just as Spark SQL does while parsing queries). I think it's a one-hour project. Jacek On 21 Feb 2017 4:44 a.m., "Linyuxin" wrote: Hi All, Is there any tool/api to check the sql syntax without

Please send me a pom.xml for Scala 2.10

2017-02-21 Thread nancy henry
Hi, please send me a copy of a working pom.xml, as I am getting a "no sources to compile" error. However much I try to set the source directory in pom.xml, it does not recognize the source files from my src/main/scala. So please send me one (that includes hive context and spark core).