Spark SQL : Join operation failure

2017-02-21 Thread jatinpreet
Hi,

I am having a hard time running an outer join operation on two Parquet
datasets. The datasets are large (~500GB) with a lot of columns, on the order of
1,000.

As per the limits imposed by the YARN administrator on the queue, I can have a total of 20
vcores and 8GB of memory per executor.

I specified memory overhead and increased the number of shuffle partitions, to no
avail. This is how I submitted the job with pyspark:

spark-submit --master yarn-cluster --executor-memory 5500m --num-executors 19 \
  --executor-cores 1 --conf spark.yarn.executor.memoryOverhead=2000 \
  --conf spark.sql.shuffle.partitions=2048 --driver-memory 7g --queue
./

The relevant code is, 

cm_go.registerTempTable("x")
ko.registerTempTable("y")
joined_df = sqlCtx.sql("select * from x FULL OUTER JOIN y ON field1=field2")
joined_df.write.save("/user/data/output")


I am getting errors like these:

ExecutorLostFailure (executor 5 exited caused by one of the running tasks)
Reason: Container marked as failed:
container_e36_1487531133522_0058_01_06 on host: dn2.bigdatalab.org. Exit
status: 52. Diagnostics: Exception from container-launch.
Container id: container_e36_1487531133522_0058_01_06
Exit code: 52
Stack trace: ExitCodeException exitCode=52: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 52

--

FetchFailed(null, shuffleId=0, mapId=-1, reduceId=508, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

)



I would appreciate it if someone could help me out with this.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-operation-failure-tp28414.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
Hi Gurdit Singh
Thanks. It is very helpful.

From: Gurdit Singh [mailto:gurdit.si...@bitwiseglobal.com]
Sent: February 22, 2017 13:31
To: Linyuxin; Irving Duran; Yong Zhang
Cc: Jacek Laskowski; user
Subject: RE: [SparkSQL] pre-check syntax before running spark job?

Hi, you can use the Spark SQL ANTLR grammar to pre-check your syntax.

https://github.com/apache/spark/blob/acf71c63cdde8dced8d108260cdd35e1cc992248/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4


From: Linyuxin [mailto:linyu...@huawei.com]
Sent: Wednesday, February 22, 2017 7:34 AM
To: Irving Duran <irving.du...@gmail.com>; Yong Zhang <java8...@hotmail.com>
Cc: Jacek Laskowski <ja...@japila.pl>; user <user@spark.apache.org>
Subject: RE: [SparkSQL] pre-check syntax before running spark job?

Actually, I want a standalone jar so that I can check the syntax without a Spark
execution environment.

From: Irving Duran [mailto:irving.du...@gmail.com]
Sent: February 21, 2017 23:29
To: Yong Zhang <java8...@hotmail.com>
Cc: Jacek Laskowski <ja...@japila.pl>; Linyuxin <linyu...@huawei.com>; user <user@spark.apache.org>
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

You can also run it on REPL and test to see if you are getting the expected 
result.


Thank You,

Irving Duran

On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:

You can always use explain method to validate your DF or SQL, before any action.



Yong


From: Jacek Laskowski mailto:ja...@japila.pl>>
Sent: Tuesday, February 21, 2017 4:34 AM
To: Linyuxin
Cc: user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

Hi,

Never heard about such a tool before. You could use Antlr to parse SQLs (just 
as Spark SQL does while parsing queries). I think it's a one-hour project.

Jacek

On 21 Feb 2017 4:44 a.m., "Linyuxin" 
mailto:linyu...@huawei.com>> wrote:
Hi All,
Is there any tool/api to check the sql syntax without running spark job 
actually?

Like the siddhiQL on storm here:
SiddhiManagerService. validateExecutionPlan
https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
it can validate the syntax before running the sql on storm

this is very useful for exposing sql string as a DSL of the platform.





RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Gurdit Singh
Hi, you can use the Spark SQL ANTLR grammar to pre-check your syntax.

https://github.com/apache/spark/blob/acf71c63cdde8dced8d108260cdd35e1cc992248/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
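
For instance, a minimal sketch of driving that grammar through the Catalyst parser (a rough illustration, assuming only the spark-catalyst artifact on the classpath, so no Spark execution environment is needed; the sample queries are made up):

import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Parse the statement into a logical plan without running anything;
// a ParseException is thrown on bad syntax and carries position information.
def checkSyntax(sql: String): Either[String, Unit] =
  Try(CatalystSqlParser.parsePlan(sql)) match {
    case Success(_) => Right(())
    case Failure(e) => Left(e.getMessage)
  }

checkSyntax("select user_id from data where url like '%sell%'")  // Right(())
checkSyntax("select user_id frm data")                           // Left(parse error)

Note this only checks syntax; resolving table and column names still needs an analyzer with a catalog.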


From: Linyuxin [mailto:linyu...@huawei.com]
Sent: Wednesday, February 22, 2017 7:34 AM
To: Irving Duran ; Yong Zhang 
Cc: Jacek Laskowski ; user 
Subject: RE: [SparkSQL] pre-check syntax before running spark job?

Actually, I want a standalone jar so that I can check the syntax without a Spark
execution environment.

From: Irving Duran [mailto:irving.du...@gmail.com]
Sent: February 21, 2017 23:29
To: Yong Zhang <java8...@hotmail.com>
Cc: Jacek Laskowski <ja...@japila.pl>; Linyuxin <linyu...@huawei.com>; user <user@spark.apache.org>
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

You can also run it on REPL and test to see if you are getting the expected 
result.


Thank You,

Irving Duran

On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:

You can always use explain method to validate your DF or SQL, before any action.



Yong


From: Jacek Laskowski mailto:ja...@japila.pl>>
Sent: Tuesday, February 21, 2017 4:34 AM
To: Linyuxin
Cc: user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

Hi,

Never heard about such a tool before. You could use Antlr to parse SQLs (just 
as Spark SQL does while parsing queries). I think it's a one-hour project.

Jacek

On 21 Feb 2017 4:44 a.m., "Linyuxin" 
mailto:linyu...@huawei.com>> wrote:
Hi All,
Is there any tool/api to check the sql syntax without running spark job 
actually?

Like the siddhiQL on storm here:
SiddhiManagerService. validateExecutionPlan
https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
it can validate the syntax before running the sql on storm

this is very useful for exposing sql string as a DSL of the platform.





Spark executors in streaming app always uses 2 executors

2017-02-21 Thread satishl
I am reading from a Kafka topic which has 8 partitions. My Spark app is given
40 executors (1 core per executor). After reading the data, I repartition the
DStream by 500, map it, and save it to Cassandra.
However, I see that only 2 executors are being used per batch. Even though I see
500 tasks for the stage, all of them are sequentially scheduled on the 2
executors that were picked. My Spark concepts are still forming and I am missing
something obvious.
I expected that 8 executors would be picked for reading data from the 8
partitions in Kafka, and that with the repartition this data would be
distributed across 40 executors and then saved to Cassandra.
How should I think about this?
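
For reference, a rough sketch of the pipeline described above (assuming the Kafka 0.8 direct-stream integration and the DataStax Cassandra connector; the broker, topic, keyspace, table and Record type are illustrative, not taken from the actual job):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._  // adds saveToCassandra on DStreams

case class Record(id: String, payload: String)   // placeholder row type

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-cassandra"), Seconds(30))

    // The direct stream creates one RDD partition per Kafka partition (8 here),
    // so only the executors that hold those 8 partitions do the reading.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092"),
      Set("my_topic"))

    stream
      .repartition(500)                                        // shuffle to spread the work out
      .map { case (_, value) => Record(value.take(8), value) } // placeholder mapping
      .saveToCassandra("my_keyspace", "my_table")

    ssc.start()
    ssc.awaitTermination()
  }
}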



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executors-in-streaming-app-always-uses-2-executors-tp28413.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Thank you YZ,
Now I understand why it causes high CPU usage on the driver side.

Thank you Ayan,
> First thing i would do is to add distinct, both inner and outer queries

I believe that would reduce the number of records to join.

Regards,
Chanh

Hi everyone,

I am working on a dataset like this
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id 
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh





> On Feb 22, 2017, at 8:52 AM, Yong Zhang  wrote:
> 
> If you read the source code of SparkStrategies
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
>  
> 
> 
> If there is no joining keys, Join implementations are chosen with the 
> following precedence:
> BroadcastNestedLoopJoin: if one side of the join could be broadcasted
> CartesianProduct: for Inner join
> BroadcastNestedLoopJoin
> 
> So your case will use BroadcastNestedLoopJoin, as there is no joining keys.
> 
> In this case, if there are lots of userId where url not like '%sell%', then 
> Spark has to retrieve them back to Driver (to be broadcast), that explains 
> why the high CPU usage on the driver side. 
> 
> So if there are lots of userId where url not like '%sell%', then you can just 
> try left semi join, which Spark will use SortMerge join in this case, I guess.
> 
> Yong
> 
> From: Yong Zhang mailto:java8...@hotmail.com>>
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spark
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Sorry, didn't pay attention to the originally requirement.
> 
> Did you try the left outer join, or left semi join?
> 
> What is the explain plan when you use "not in"? Is it leading to a 
> broadcastNestedLoopJoin?
> 
> spark.sql("select user_id from data where user_id not in (select user_id from 
> data where url like '%sell%')").explain(true)
> 
> Yong
> 
> 
> From: Sidney Feiner  >
> Sent: Tuesday, February 21, 2017 10:46 AM
> To: Yong Zhang; Chanh Le; user @spark
> Subject: RE: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Chanh wants to return user_id's that don't have any record with a url 
> containing "sell". Without a subquery/join, it can only filter per record 
> without knowing about the rest of the user_id's record
>  
> Sidney Feiner   /  SW Developer
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>  
>  
>  
> From: Yong Zhang [mailto:java8...@hotmail.com ] 
> Sent: Tuesday, February 21, 2017 4:10 PM
> To: Chanh Le mailto:giaosu...@gmail.com>>; user @spark 
> mailto:user@spark.apache.org>>
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Not sure if I misunderstand your question, but what's wrong doing it this way?
>  
> scala> spark.version
> res6: String = 2.0.2
> scala> val df = Seq((1,"lao.com/sell "), (2, 
> "lao.com/buy ")).toDF("user_id", "url")
> df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
>  
> scala> df.registerTempTable("data")
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
>  
> scala> spark.sql("select user_id from data where url not like '%sell%'").show
> +---+
> |user_id|
> +---+
> |  2|
> +---+
>  
> Yong
>  
> From: Chanh Le mailto:giaosu...@gmail.com>>
> Sent: Tuesday, February 21, 2017 4:56 AM
> To: user @spark
> Subject: How to query a query with not contain, not start_with, not end_with 
> condition effective?
>  
> Hi everyone, 
>  
> I am working on a dataset like this
> user_id url 
> 1  lao.com/buy 
> 2  bao.com/sell 
> 2  cao.com/market 

RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
Actually, I want a standalone jar so that I can check the syntax without a Spark
execution environment.

From: Irving Duran [mailto:irving.du...@gmail.com]
Sent: February 21, 2017 23:29
To: Yong Zhang
Cc: Jacek Laskowski; Linyuxin; user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

You can also run it on REPL and test to see if you are getting the expected 
result.


Thank You,

Irving Duran

On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:

You can always use explain method to validate your DF or SQL, before any action.



Yong


From: Jacek Laskowski mailto:ja...@japila.pl>>
Sent: Tuesday, February 21, 2017 4:34 AM
To: Linyuxin
Cc: user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

Hi,

Never heard about such a tool before. You could use Antlr to parse SQLs (just 
as Spark SQL does while parsing queries). I think it's a one-hour project.

Jacek

On 21 Feb 2017 4:44 a.m., "Linyuxin" 
mailto:linyu...@huawei.com>> wrote:
Hi All,
Is there any tool/api to check the sql syntax without running spark job 
actually?

Like the siddhiQL on storm here:
SiddhiManagerService. validateExecutionPlan
https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
it can validate the syntax before running the sql on storm

this is very useful for exposing sql string as a DSL of the platform.





Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
If you read the source code of SparkStrategies


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106


If there are no joining keys, join implementations are chosen with the following
precedence:

  *   BroadcastNestedLoopJoin: if one side of the join could be broadcasted
  *   CartesianProduct: for Inner join
  *   BroadcastNestedLoopJoin


So your case will use BroadcastNestedLoopJoin, as there are no joining keys.


In this case, if there are lots of userIds where url is not like '%sell%', then
Spark has to retrieve them back to the driver (to be broadcast), which explains
the high CPU usage on the driver side.

So if there are lots of userIds where url is not like '%sell%', then you can just
try a left semi join, for which Spark will use a sort-merge join in this case, I guess.


Yong
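
For reference, a sketch of the left-semi-join form mentioned above (reusing the thread's table and column names); the chosen join implementation can be checked with explain in the same way:

spark.sql(
  """select a.user_id from data a
    |left semi join (select user_id from data where url like '%sell%') b
    |  on a.user_id = b.user_id""".stripMargin
).explain(true)  // with an equi-join key, expect a sort-merge or hash join rather than BroadcastNestedLoopJoin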


From: Yong Zhang 
Sent: Tuesday, February 21, 2017 1:17 PM
To: Sidney Feiner; Chanh Le; user @spark
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?


Sorry, didn't pay attention to the originally requirement.


Did you try the left outer join, or left semi join?

What is the explain plan when you use "not in"? Is it leading to a 
broadcastNestedLoopJoin?


spark.sql("select user_id from data where user_id not in (select user_id from 
data where url like '%sell%')").explain(true)


Yong



From: Sidney Feiner 
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not 
end_with condition effective?


Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record



Sidney Feiner   /  SW Developer

M: +972.528197720  /  Skype: sidney.feiner.startapp






From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?



Not sure if I misunderstand your question, but what's wrong doing it this way?



scala> spark.version

res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")

df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]



scala> df.registerTempTable("data")

warning: there was one deprecation warning; re-run with -deprecation for details



scala> spark.sql("select user_id from data where url not like '%sell%'").show

+---+

|user_id|

+---+

|  2|

+---+



Yong





From: Chanh Le mailto:giaosu...@gmail.com>>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?



Hi everyone,



I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true

Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.

I am running at client mode that submit to a Mesos cluster.



I am using Spark 2.0.2 and my data store in HDFS with parquet format.



Any advices for me in this situation?



Thank you in advance!.



Regards,

Chanh


Re: CSV DStream to Hive

2017-02-21 Thread ayan guha
I am afraid your requirement is not very clear. Can you post some example
data and the output you are expecting?


On Wed, 22 Feb 2017 at 9:13 am, nimrodo 
wrote:

> Hi all,
>
> I have a DStream that contains very long comma separated values. I want to
> convert this DStream to a DataFrame. I thought of using split on the RDD
> and
> toDF however I can't get it to work.
>
> Can anyone help me here?
>
> Nimrod
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/CSV-DStream-to-Hive-tp28410.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
> --
Best Regards,
Ayan Guha


CSV DStream to Hive

2017-02-21 Thread nimrodo
Hi all,

I have a DStream that contains very long comma-separated values. I want to
convert this DStream to a DataFrame. I thought of using split on the RDD and
toDF, but I can't get it to work.

Can anyone help me here?

Nimrod
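
For what it's worth, a rough sketch of the split-and-toDF approach described above (assuming Spark 2.x with Hive support, a DStream[String] of comma-separated lines, and made-up column names and target table):

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.dstream.DStream

case class Reading(tag: String, value: Double, quality: Int)  // illustrative columns

def writeToHive(lines: DStream[String], spark: SparkSession): Unit = {
  import spark.implicits._
  lines.foreachRDD { rdd =>
    val df = rdd
      .map(_.split(","))                                  // split each CSV line
      .map(a => Reading(a(0), a(1).toDouble, a(2).toInt)) // pick and convert the fields you need
      .toDF()
    df.write.mode(SaveMode.Append).saveAsTable("my_hive_table")  // illustrative Hive table
  }
}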





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CSV-DStream-to-Hive-tp28410.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Sorry, I didn't pay attention to the original requirement.


Did you try the left outer join, or left semi join?

What is the explain plan when you use "not in"? Is it leading to a 
broadcastNestedLoopJoin?


spark.sql("select user_id from data where user_id not in (select user_id from 
data where url like '%sell%')").explain(true)


Yong



From: Sidney Feiner 
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not 
end_with condition effective?


Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record



Sidney Feiner   /  SW Developer

M: +972.528197720  /  Skype: sidney.feiner.startapp






From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?



Not sure if I misunderstand your question, but what's wrong doing it this way?



scala> spark.version

res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")

df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]



scala> df.registerTempTable("data")

warning: there was one deprecation warning; re-run with -deprecation for details



scala> spark.sql("select user_id from data where url not like '%sell%'").show

+---+

|user_id|

+---+

|  2|

+---+



Yong





From: Chanh Le mailto:giaosu...@gmail.com>>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?



Hi everyone,



I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true

Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.

I am running at client mode that submit to a Mesos cluster.



I am using Spark 2.0.2 and my data store in HDFS with parquet format.



Any advices for me in this situation?



Thank you in advance!.



Regards,

Chanh


RE: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Sidney Feiner
Chanh wants to return user_ids that don't have any record with a url
containing "sell". Without a subquery/join, the query can only filter record by
record, without knowing about the rest of that user_id's records.

Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp


From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?


Not sure if I misunderstand your question, but what's wrong doing it this way?


scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+---+
|user_id|
+---+
|  2|
+---+


Yong


From: Chanh Le mailto:giaosu...@gmail.com>>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?

Hi everyone,

I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like '%sell%';

My data is about 20 million records and it's growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh


Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Irving Duran
You can also run it on REPL and test to see if you are getting the expected
result.


Thank You,

Irving Duran

On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang  wrote:

> You can always use explain method to validate your DF or SQL, before any
> action.
>
>
> Yong
>
>
> --
> *From:* Jacek Laskowski 
> *Sent:* Tuesday, February 21, 2017 4:34 AM
> *To:* Linyuxin
> *Cc:* user
> *Subject:* Re: [SparkSQL] pre-check syntax before running spark job?
>
> Hi,
>
> Never heard about such a tool before. You could use Antlr to parse SQLs
> (just as Spark SQL does while parsing queries). I think it's a one-hour
> project.
>
> Jacek
>
> On 21 Feb 2017 4:44 a.m., "Linyuxin"  wrote:
>
> Hi All,
> Is there any tool/api to check the sql syntax without running spark job
> actually?
>
> Like the siddhiQL on storm here:
> SiddhiManagerService. validateExecutionPlan
> https://github.com/wso2/siddhi/blob/master/modules/siddhi-
> core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
> it can validate the syntax before running the sql on storm
>
> this is very useful for exposing sql string as a DSL of the platform.
>
>
>
>


RE: How to specify default value for StructField?

2017-02-21 Thread Begar, Veena
Thanks Yan and Yong,

Yes, from Spark, I can access ORC files loaded to Hive tables.

Thanks.
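
For reference, a rough sketch of that approach (assuming Spark 2.x with Hive support and a Hive table named orc_files_test defined over the ORC files; the table name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-orc-via-hive")
  .enableHiveSupport()
  .getOrCreate()

// Hive resolves each file against the table definition, so rows coming from
// files that lack the f2 column should come back with f2 = null.
val df = spark.table("orc_files_test").select("f1", "f2")
df.show()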

From: 颜发才(Yan Facai) [mailto:facai@gmail.com]
Sent: Friday, February 17, 2017 6:59 PM
To: Yong Zhang 
Cc: Begar, Veena ; smartzjp ; 
user@spark.apache.org
Subject: Re: How to specify default value for StructField?

I agree with Yong Zhang,
perhaps spark sql with hive could solve the problem:

http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables



On Thu, Feb 16, 2017 at 12:42 AM, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:

If it works under hive, do you try just create the DF from Hive table directly 
in Spark? That should work, right?



Yong


From: Begar, Veena mailto:veena.be...@hpe.com>>
Sent: Wednesday, February 15, 2017 10:16 AM
To: Yong Zhang; smartzjp; user@spark.apache.org

Subject: RE: How to specify default value for StructField?


Thanks Yong.



I know about the schema-merging option.

Using Hive we can read AVRO files having different schemas, and we can do the
same in Spark.

Similarly, we can read ORC files having different schemas in Hive. But we can't
do the same in Spark using a dataframe. How can we do it using a dataframe?



Thanks.

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 14, 2017 8:31 PM
To: Begar, Veena mailto:veena.be...@hpe.com>>; smartzjp 
mailto:zjp_j...@163.com>>; 
user@spark.apache.org
Subject: Re: How to specify default value for StructField?



You maybe are looking for something like "spark.sql.parquet.mergeSchema" for 
ORC. Unfortunately, I don't think it is available, unless someone tells me I am 
wrong.

You can create a JIRA to request this feature, but we all know that Parquet is 
the first citizen format [😊]



Yong





From: Begar, Veena mailto:veena.be...@hpe.com>>
Sent: Tuesday, February 14, 2017 10:37 AM
To: smartzjp; user@spark.apache.org
Subject: RE: How to specify default value for StructField?



Thanks, it didn't work, because the folder has files with 2 different schemas.
It fails with the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`f2`' given input 
columns: [f1];


-Original Message-
From: smartzjp [mailto:zjp_j...@163.com]
Sent: Tuesday, February 14, 2017 10:32 AM
To: Begar, Veena mailto:veena.be...@hpe.com>>; 
user@spark.apache.org
Subject: Re: How to specify default value for StructField?

You can try the below code.

val df = spark.read.format("orc").load("/user/hos/orc_files_test_together")
df.select("f1", "f2").show





On 2017/2/14 at 6:54 AM, "vbegar" <user-return-67879-zjp_jdev=163@spark.apache.org on behalf of veena.be...@hpe.com> wrote:

>Hello,
>
>I specified a StructType like this:
>
>*val mySchema = StructType(Array(StructField("f1", StringType,
>true),StructField("f2", StringType, true)))*
>
>I have many ORC files stored in HDFS location:*
>/user/hos/orc_files_test_together
>*
>
>These files use different schema : some of them have only f1 columns
>and other have both f1 and f2 columns.
>
>I read the data from these files to a dataframe:
>*val df =
>spark.read.format("orc").schema(mySchema).load("/user/hos/orc_files_tes
>t_together")*
>
>But, now when I give the following command to see the data, it fails:
>*df.show*
>
>The error message is like "f2" column doesn't exist.
>
>Since I have specified nullable attribute as true for f2 column, why it
>fails?
>
>Or, is there any way to specify a default value for StructField?
>
>Because, in AVRO schema, we can specify the default value in this way
>and can read AVRO files in a folder which have 2 different schemas
>(either only
>f1 column or both f1 and f2 columns):
>
>*{
>   "type": "record",
>   "name": "myrecord",
>   "fields":
>   [
>  {
> "name": "f1",
> "type": "string",
> "default": ""
>  },
>  {
> "name": "f2",
> "type": "string",
> "default": ""
>  }
>   ]
>}*
>
>Wondering why it doesn't work with ORC files.
>
>thanks.
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/How-to-specify-defa
>ult-value-for-StructField-tp28386.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>



Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong doing it this way?


scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+---+
|user_id|
+---+
|  2|
+---+


Yong



From: Chanh Le 
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?

Hi everyone,

I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh


Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Yong Zhang
You can always use explain method to validate your DF or SQL, before any action.


Yong
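
For example (a sketch; the table name is illustrative and spark is the usual SparkSession): spark.sql parses the statement eagerly and explain triggers analysis and planning, so syntax errors and unresolved columns surface before any action runs.

// Fails fast on bad syntax or unknown columns/tables, without launching a job.
spark.sql("select user_id from events where url like '%sell%'").explain(true)

// The same works for a DataFrame built with the API.
spark.table("events").select("user_id").explain(true)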



From: Jacek Laskowski 
Sent: Tuesday, February 21, 2017 4:34 AM
To: Linyuxin
Cc: user
Subject: Re: [SparkSQL] pre-check syntax before running spark job?

Hi,

Never heard about such a tool before. You could use Antlr to parse SQLs (just 
as Spark SQL does while parsing queries). I think it's a one-hour project.

Jacek

On 21 Feb 2017 4:44 a.m., "Linyuxin" 
mailto:linyu...@huawei.com>> wrote:
Hi All,
Is there any tool/api to check the sql syntax without running spark job 
actually?

Like the siddhiQL on storm here:
SiddhiManagerService. validateExecutionPlan
https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
it can validate the syntax before running the sql on storm

this is very useful for exposing sql string as a DSL of the platform.





Error when trying to filter

2017-02-21 Thread Marco Mans
Hi!

I'm trying to execute this code:

StructField datetime = new StructField("DateTime", DataTypes.DateType, true, Metadata.empty());
StructField tagname  = new StructField("Tagname", DataTypes.StringType, true, Metadata.empty());
StructField value    = new StructField("Value", DataTypes.DoubleType, false, Metadata.empty());
StructField quality  = new StructField("Quality", DataTypes.IntegerType, true, Metadata.empty());

StructType schema = new StructType(new StructField[]{datetime, tagname, value, quality});

Dataset<Row> allData = spark.read()
        .option("header", "true")
        .option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
        .option("comment", "-")
        .schema(schema)
        .csv("/ingest/davis/landing");

allData = allData.filter((Row value1) -> {
    // SOME COOL FILTER-CODE.
    return true;
});

allData.show();

I get this error on the executors:

java.lang.NoSuchMethodError: org.apache.commons.lang3.time.FastDateFormat.parse(Ljava/lang/String;)Ljava/util/Date;


I'm running spark 2.0.0.cloudera1


Does anyone know why this error occurs?


Regards,

Marco


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread ayan guha
The first thing I would do is to add DISTINCT to both the inner and outer queries.
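For example (a rough sketch, reusing the table and column names from the thread):

// DISTINCT on both sides shrinks the row sets before the (implicit) join is planned.
val result = spark.sql(
  """select distinct user_id from data
    |where user_id not in (
    |  select distinct user_id from data where url like '%sell%')""".stripMargin)
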
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le  wrote:

> Hi everyone,
>
> I am working on a dataset like this
> *user_id url *
> 1  lao.com/buy
> 2  bao.com/sell
> 2  cao.com/market
> 1   lao.com/sell
> 3  vui.com/sell
>
> I have to find all *user_id* with *url* not contain *sell*. Which means I
> need to query all *user_id* contains *sell* and put it into a set then do
> another query to find all *user_id* not in that set.
>
>
>
> SELECT user_id
> FROM data
> WHERE user_id not in (SELECT user_id FROM data WHERE url like '%sell%');
> My data is about *20 million records and it’s growing*. When I tried in
> zeppelin I need to *set spark.sql.crossJoin.enabled = true*
> Then I ran the query and the driver got extremely high CPU percentage and
> the process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
>
> I am using* Spark 2.0.2* and my data store in *HDFS* with *parquet format*
> .
>
> Any advices for me in this situation?
>
> Thank you in advance!.
>
> Regards,
> Chanh
>
-- 
Best Regards,
Ayan Guha


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
I tried a new way using a JOIN:

select user_id from data a
left join (select user_id from data where url like '%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL

It's faster; it seems that Spark optimizes the JOIN better than the subquery.
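
The same anti-join can also be expressed with the DataFrame API (a sketch; Spark 2.0+ accepts the "leftanti" join type, and the input path is illustrative):

import org.apache.spark.sql.functions.col

val data    = spark.read.parquet("/path/to/data")                 // illustrative path
val sellers = data.filter(col("url").like("%sell%")).select("user_id")

// Keep only the user_ids that have no matching row on the right-hand side.
val result = data.join(sellers, Seq("user_id"), "leftanti")
  .select("user_id")
  .distinct()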


Regards,
Chanh


> On Feb 21, 2017, at 4:56 PM, Chanh Le  wrote:
> 
> Hi everyone,
> 
> I am working on a dataset like this
> user_id url 
> 1  lao.com/buy 
> 2  bao.com/sell 
> 2  cao.com/market 
> 1  lao.com/sell 
> 3  vui.com/sell 
> 
> I have to find all user_id with url not contain sell. Which means I need to 
> query all user_id contains sell and put it into a set then do another query 
> to find all user_id not in that set.
> SELECT user_id 
> FROM data
> WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;
> 
> My data is about 20 million records and it’s growing. When I tried in 
> zeppelin I need to set spark.sql.crossJoin.enabled = true
> Then I ran the query and the driver got extremely high CPU percentage and the 
> process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
> 
> I am using Spark 2.0.2 and my data store in HDFS with parquet format.
> 
> Any advices for me in this situation?
> 
> Thank you in advance!.
> 
> Regards,
> Chanh



How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Hi everyone,

I am working on a dataset like this
user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell". That means I need to
query all user_ids that do contain "sell", put them into a set, and then do another
query to find all user_ids not in that set.

SELECT user_id
FROM data
WHERE user_id not in (SELECT user_id FROM data WHERE url like '%sell%');

My data is about 20 million records and it's growing. When I tried this in Zeppelin
I needed to set spark.sql.crossJoin.enabled = true.
Then I ran the query; the driver got extremely high CPU usage and the process got
stuck, so I had to kill it.
I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for me in this situation?

Thank you in advance!

Regards,
Chanh

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Jacek Laskowski
Hi,

Never heard about such a tool before. You could use Antlr to parse SQLs
(just as Spark SQL does while parsing queries). I think it's a one-hour
project.

Jacek

On 21 Feb 2017 4:44 a.m., "Linyuxin"  wrote:

Hi All,
Is there any tool/api to check the sql syntax without running spark job
actually?

Like the siddhiQL on storm here:
SiddhiManagerService.validateExecutionPlan
https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java
it can validate the syntax before running the sql on storm
it can validate the syntax before running the sql on storm

this is very useful for exposing sql string as a DSL of the platform.



please send me pom.xml for scala 2.10

2017-02-21 Thread nancy henry
Hi,

Please send me a copy of a pom.xml, as I am getting a "no sources to compile"
error however much I try to set the source directory in pom.xml.

It's not recognizing the source files from my src/main/scala.

So please send me one

(it should include hive context and spark core)