Re: Does Spark SQL support rollup like HQL

2015-12-29 Thread Davies Liu
Just sent out a PR[1] to support cube/rollup as function, it works
with both SQLContext and HiveContext.

https://github.com/apache/spark/pull/10522/files
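
For anyone who needs a workaround before that lands: a minimal sketch of the
DataFrame-side rollup that already exists in 1.4+/1.5.x and works with a plain
SQLContext (column names a/b/c follow the example in this thread; the exact
SQL-function syntax added by the PR is not shown here):

    import org.apache.spark.sql.functions.sum
    import sqlContext.implicits._

    // Equivalent of "SELECT a, b, sum(c) FROM t GROUP BY a, b WITH ROLLUP",
    // expressed through the DataFrame API instead of the simple SQL parser.
    val df = sqlContext.table("t")
    val rolledUp = df.rollup($"a", $"b").agg(sum($"c").as("sum_c"))
    rolledUp.show()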

On Tue, Dec 29, 2015 at 9:35 PM, Yi Zhang  wrote:
> Hi Hao,
>
> Thanks. I'll take a look at it.
>
>
> On Wednesday, December 30, 2015 12:47 PM, "Cheng, Hao" 
> wrote:
>
>
> Hi, currently, the Simple SQL Parser of SQLContext is quite weak, and
> doesn’t support the rollup, but you can check the code
>
> https://github.com/apache/spark/pull/5080/ , which aimed to add the support, in
> case you want to patch it into your own branch.
>
> In Spark 2.0, the simple SQL Parser will be replaced by HQL Parser, so it
> will not be the problem then.
>
> Hao
>
> From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
> Sent: Wednesday, December 30, 2015 11:41 AM
> To: User
> Subject: Does Spark SQL support rollup like HQL
>
> Hi guys,
>
> As we know, hiveContext supports rollup like this:
>
> hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup")
>
> I also know that DataFrame provides a rollup function to support it:
>
> dataframe.rollup($"a", $"b").agg(Map("c" -> "sum"))
>
> But in my scenario, I'd rather use SQL syntax in SQLContext to support
> rollup, the way HiveContext does. Any suggestion?
>
> Thanks.
>
> Regards,
> Yi Zhang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread vivek.meghanathan
Thank you mwy and Sun for your response. Yes, basic things are working for me
using this connector (we encountered the guava issue earlier, but resolved it by
properly excluding the old version).
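
For reference, a rough sketch of this kind of sbt exclusion (coordinates are
illustrative; adjust to your build tool and versions):

    // build.sbt sketch: pull in the connector but exclude its older Guava,
    // so only one Guava version ends up on the classpath.
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3" exclude("com.google.guava", "guava")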

The current issue is a strange one - we have a Kafka-Spark-Cassandra streaming
job in Spark. All the jobs work fine in our local cluster (local lab) on Spark
1.3.0. We are trying the same setup on Google Cloud Platform: on 1.3.0 the jobs
come up but do not seem to process the Kafka messages. Using Spark 1.5.2 with the
1.5.0-M3 connector and Kafka 0.8.2.2, I am able to make most of the jobs work,
but one of them processes only 1 out of 50 messages from Kafka.
We have written a test program (in Scala) to parse data from the Kafka stream -
it works fine and always receives the messages. It connects to Cassandra, gets
some data, and prints the input data up to that point. But when I enable the
join/map/reduce logic, it starts to miss messages (it no longer prints even the
lines that worked before). Please find the lines added below; once I add the
commented block to the job, it does not print any incoming data, whereas without
it everything prints fine. Any ideas how to debug this?

val cqlOfferSummaryRdd = ssc.sc.cql3Cassandra[OfferSummary](casOfferSummary)
  .map(summary => (summary.listingId,
    (summary.offerId, summary.redeem, summary.viewed, summary.reserved)))

/**
  // RDD -> ((listingId, searchId), Iterable(offerId, redeemCount, viewCount))
  val directListingOfferSummary = directViews
    .transform(Rdd => { Rdd.join(cqlOfferSummaryRdd, 10) })   // RDD -> (listingId, (Direct, (offerId, redeemCount, viewCount)))
    .map(rdd => ((rdd._1, rdd._2._1.searchId, rdd._2._2._1),
                 (rdd._2._2._2, rdd._2._2._3, rdd._2._2._4))) // RDD -> ((listingId, searchId, offerId), (redeemCount, viewCount))
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    .map(rdd => ((rdd._1._1, rdd._1._2),
                 (rdd._1._3, rdd._2._1, rdd._2._2, rdd._2._3)))
    .groupByKey(10)                                           // RDD -> ((listingId, searchId), Iterable(offerId, redeemCount, viewCount))
  directListingOfferSummary.print()
**/
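
(For debugging, a minimal sketch - assuming directViews is the key/value DStream
built from the Kafka input in this job - that logs the record count per batch
before and after the join, to show whether messages stop arriving or just stop
surviving the join:)

    directViews.foreachRDD { rdd =>
      println(s"direct views in batch: ${rdd.count()}")   // does data still arrive?
    }
    val joined = directViews.transform(rdd => rdd.join(cqlOfferSummaryRdd, 10))
    joined.foreachRDD { rdd =>
      println(s"records surviving the join: ${rdd.count()}")   // does the join drop them?
    }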

Regards,
Vivek

From: mwy [mailto:wenyao...@dewmobile.net]
Sent: 30 December 2015 08:27
To: fightf...@163.com; Vivek Meghanathan (WT01 - NEP) 
; user 
Subject: Re: Spark 1.5.2 compatible spark-cassandra-connector

2.10-1.5.0-M3 & spark 1.5.2 work for me. The jar is built by sbt-assembly.

Just for reference.

From: "fightf...@163.com" <fightf...@163.com>
Date: Wednesday, December 30, 2015 at 10:22
To: "vivek.meghanat...@wipro.com" <vivek.meghanat...@wipro.com>, user <user@spark.apache.org>
Subject: Re: Spark 1.5.2 compatible spark-cassandra-connector

Hi, Vivek M
I have tried the 1.5.x spark-cassandra-connector and indeed encountered some
classpath issues, mainly with the guava dependency.
I believe that can be solved by some Maven configuration, but I have not tried that yet.

Best,
Sun.


fightf...@163.com

From: vivek.meghanat...@wipro.com
Date: 2015-12-29 20:40
To: user@spark.apache.org
Subject: Spark 1.5.2 compatible spark-cassandra-connector
All,
What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only
find the latest connector version spark-cassandra-connector_2.10-1.5.0-M3, which
depends on Spark 1.5.1. Can we use the same for 1.5.2? Do any classpath issues
need to be handled, or any jars need to be excluded, while packaging the
application jar?

http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.5.0-M3/spark-cassandra-connector_2.10-1.5.0-M3.pom

Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com

Re: Does Spark SQL support rollup like HQL

2015-12-29 Thread Yi Zhang
Hi Hao,
Thanks. I'll take a look at it. 

On Wednesday, December 30, 2015 12:47 PM, "Cheng, Hao" 
 wrote:
 

Hi, currently, the Simple SQL Parser of SQLContext is quite weak, and doesn't
support the rollup, but you can check the code
https://github.com/apache/spark/pull/5080/ , which aimed to add the support, in
case you want to patch it into your own branch.

In Spark 2.0, the simple SQL Parser will be replaced by the HQL Parser, so it
will not be a problem then.

Hao

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Wednesday, December 30, 2015 11:41 AM
To: User
Subject: Does Spark SQL support rollup like HQL

Hi guys,

As we know, hiveContext supports rollup like this:

hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup")

I also know that DataFrame provides a rollup function to support it:

dataframe.rollup($"a", $"b").agg(Map("c" -> "sum"))

But in my scenario, I'd rather use SQL syntax in SQLContext to support rollup,
the way HiveContext does. Any suggestion?

Thanks.

Regards,
Yi Zhang

  

RE: Does Spark SQL support rollup like HQL

2015-12-29 Thread Cheng, Hao
Hi, currently, the Simple SQL Parser of SQLContext is quite weak, and doesn’t 
support the rollup, but you can check the code

https://github.com/apache/spark/pull/5080/ , which aimed to add the support, in
case you want to patch it into your own branch.

In Spark 2.0, the simple SQL Parser will be replaced by HQL Parser, so it will 
not be the problem then.

Hao

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Wednesday, December 30, 2015 11:41 AM
To: User
Subject: Does Spark SQL support rollup like HQL

Hi guys,

As we know, hiveContext supports rollup like this:

hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup")

I also know that DataFrame provides a rollup function to support it:

dataframe.rollup($"a", $"b").agg(Map("c" -> "sum"))

But in my scenario, I'd rather use SQL syntax in SQLContext to support rollup,
the way HiveContext does. Any suggestion?

Thanks.

Regards,
Yi Zhang


Does Spark SQL support rollup like HQL

2015-12-29 Thread Yi Zhang
Hi guys,
As we know, hiveContext supports rollup like this:
hiveContext.sql("select a, b, sum(c) from t group by a, b with rollup")

I also know that DataFrame provides a rollup function to support it:
dataframe.rollup($"a", $"b").agg(Map("c" -> "sum"))
But in my scenario, I'd rather use SQL syntax in SQLContext to support rollup,
the way HiveContext does. Any suggestion?
Thanks.
Regards,
Yi Zhang

Re: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-29 Thread Sourav Mazumder
Alternatively you can also try the ML library from System ML (
http://systemml.apache.org/) for covariance computation on Spark.
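
Within Spark itself, a rough sketch of the MLlib RowMatrix route in Scala
(assuming df is a DataFrame whose columns are all numeric; note the result is a
local nCols x nCols matrix, which gets heavy at 5k-20k columns):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Turn each DataFrame row into a dense vector of doubles,
    // then let MLlib compute the covariance matrix.
    val vectors = df.rdd.map(row =>
      Vectors.dense(row.toSeq.map(_.toString.toDouble).toArray))
    val mat = new RowMatrix(vectors)
    val covMatrix = mat.computeCovariance()   // local Matrix, nCols x nCols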

Regards,
Sourav

On Mon, Dec 28, 2015 at 11:29 PM, Sun, Rui  wrote:

> Spark does not support computing cov matrix  now. But there is a PR for
> it. Maybe you can try it:
> https://issues.apache.org/jira/browse/SPARK-11057
>
>
>
>
>
> *From:* zhangjp [mailto:592426...@qq.com]
> *Sent:* Tuesday, December 29, 2015 3:21 PM
> *To:* Felix Cheung; Andy Davidson; Yanbo Liang
> *Cc:* user
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
>
>
>
>
> Now I have a huge number of columns, about 5k-20k, so if I want to calculate a
> covariance matrix, which is the best or most common method?
>
>
>
> -- Original Message --
>
> *From:* "Felix Cheung";
>
> *Sent:* Tuesday, December 29, 2015, 12:45 PM
>
> *To:* "Andy Davidson"; "zhangjp"<
> 592426...@qq.com>; "Yanbo Liang";
>
> *Cc:* "user";
>
> *Subject:* Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
>
>
> Make sure you add the spark-csv package, as in this example, so that the
> source parameter in R's read.df will work:
>
>
>
>
> https://spark.apache.org/docs/latest/sparkr.html#from-data-sources
>
>
>
> _
> From: Andy Davidson 
> Sent: Monday, December 28, 2015 10:24 AM
> Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then
> calculate covariance
> To: zhangjp <592426...@qq.com>, Yanbo Liang 
> Cc: user 
>
> Hi Yanbo
>
>
>
> I use spark.csv to load my data set. I work with both Java and Python. I
> would recommend you print the first couple of rows and also print the
> schema to make sure your data is loaded as you expect. You might find the
> following code example helpful. You may need to programmatically set the
> schema depending on what your data looks like.
>
>
>
>
>
> public class LoadTidyDataFrame {
>
>     static DataFrame fromCSV(SQLContext sqlContext, String file) {
>         DataFrame df = sqlContext.read()
>             .format("com.databricks.spark.csv")
>             .option("inferSchema", "true")
>             .option("header", "true")
>             .load(file);
>
>         return df;
>     }
> }
>
>
>
>
>
>
>
> *From: *Yanbo Liang < yblia...@gmail.com>
> *Date: *Monday, December 28, 2015 at 2:30 AM
> *To: *zhangjp < 592426...@qq.com>
> *Cc: *"user @spark" < user@spark.apache.org>
> *Subject: *Re: how to use sparkR or spark MLlib load csv file on hdfs
> then calculate covariance
>
>
>
> Load csv file:
>
> df <- read.df(sqlContext, "file-path", source =
> "com.databricks.spark.csv", header = "true")
>
> Calculate covariance:
>
> cov <- cov(df, "col1", "col2")
>
>
>
> Cheers
>
> Yanbo
>
>
>
>
>
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
>
> hi all,
>
> I want to use sparkR or spark MLlib to load a csv file on hdfs and then
> calculate covariance. How do I do it?
>
> thanks.
>
>
>
>
>


RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
It's not released yet; you probably need to compile it yourself. In the
meantime, can you increase the partition number by setting
"spark.sql.shuffle.partitions" to a bigger value?
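
For example, a minimal way to bump it from the application (the value 400 is
illustrative; tune it for your data size):

    // Raise the number of partitions used for shuffles (joins, aggregations,
    // window functions) before running the query.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")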

More details about your cluster size, partition size, YARN/standalone, executor
resources, etc. would also help in understanding your problem.

From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:49 AM
To: Cheng, Hao
Subject: Re: Problem with WINDOW functions?

I use 1.5.2.

Where can I get 1.6? I do not see it on http://spark.apache.org/downloads.html

Thanks,
Vadim


On Tue, Dec 29, 2015 at 6:47 PM, Cheng, Hao 
mailto:hao.ch...@intel.com>> wrote:
Which version are you using? Have you tried the 1.6?

From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:17 AM

To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

When I allocate 200g to the executor, it is able to make better progress;
I see 189 tasks executed instead of 169 previously.
But eventually it fails with the same error.

On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao 
mailto:hao.ch...@intel.com>> wrote:
Is there any improvement if you set a bigger memory for executors?

-Original Message-
From: va...@percona.com 
[mailto:va...@percona.com] On Behalf Of Vadim 
Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao 
mailto:hao.ch...@intel.com>> wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org 
> For
> additional commands, e-mail: 
> user-h...@spark.apache.org
>




Re: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread mwy
2.10-1.5.0-M3 & spark 1.5.2 work for me. The jar is built by sbt-assembly.

Just for reference.

From: "fightf...@163.com"
Date: Wednesday, December 30, 2015 at 10:22
To: "vivek.meghanat...@wipro.com", user
Subject: Re: Spark 1.5.2 compatible spark-cassandra-connector

Hi, Vivek M
I have tried the 1.5.x spark-cassandra-connector and indeed encountered some
classpath issues, mainly with the guava dependency.
I believe that can be solved by some Maven configuration, but I have not tried
that yet.

Best,
Sun.


fightf...@163.com
>  
> From: vivek.meghanat...@wipro.com
> Date: 2015-12-29 20:40
> To: user@spark.apache.org
> Subject: Spark 1.5.2 compatible spark-cassandra-connector
> All,
> What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only
> find the latest connector version spark-cassandra-connector_2.10-1.5.0-M3,
> which depends on Spark 1.5.1. Can we use the same for 1.5.2? Do any
> classpath issues need to be handled, or any jars need to be excluded, while
> packaging the application jar?
>  
> http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2
> .10/1.5.0-M3/spark-cassandra-connector_2.10-1.5.0-M3.pom
>  
> Regards,
> Vivek M
> The information contained in this electronic message and any attachments to
> this message are intended for the exclusive use of the addressee(s) and may
> contain proprietary, confidential or privileged information. If you are not
> the intended recipient, you should not disseminate, distribute or copy this
> e-mail. Please notify the sender immediately and destroy all copies of this
> message and any attachments. WARNING: Computer viruses can be transmitted via
> email. The recipient should check this email and any attachments for the
> presence of viruses. The company accepts no liability for any damage caused by
> any virus transmitted by this email. www.wipro.com




RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Which version are you using? Have you tried the 1.6?

From: Vadim Tkachenko [mailto:apache...@gmail.com]
Sent: Wednesday, December 30, 2015 10:17 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

When I allocate 200g to the executor, it is able to make better progress;
I see 189 tasks executed instead of 169 previously.
But eventually it fails with the same error.

On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao 
mailto:hao.ch...@intel.com>> wrote:
Is there any improvement if you set a bigger memory for executors?

-Original Message-
From: va...@percona.com 
[mailto:va...@percona.com] On Behalf Of Vadim 
Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao 
mailto:hao.ch...@intel.com>> wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org 
> For
> additional commands, e-mail: 
> user-h...@spark.apache.org
>



Re: Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread fightf...@163.com
Hi, Vivek M
I have tried the 1.5.x spark-cassandra-connector and indeed encountered some
classpath issues, mainly with the guava dependency.
I believe that can be solved by some Maven configuration, but I have not tried that yet.

Best,
Sun.



fightf...@163.com
 
From: vivek.meghanat...@wipro.com
Date: 2015-12-29 20:40
To: user@spark.apache.org
Subject: Spark 1.5.2 compatible spark-cassandra-connector
All,
What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only
find the latest connector version spark-cassandra-connector_2.10-1.5.0-M3, which
depends on Spark 1.5.1. Can we use the same for 1.5.2? Do any classpath issues
need to be handled, or any jars need to be excluded, while packaging the
application jar?
 
http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.5.0-M3/spark-cassandra-connector_2.10-1.5.0-M3.pom
 
Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com 


Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Divya Gehlot
Hello Community Users,
I was able to resolve the issue.
The issue was the input data format: by default Excel writes dates as
2001/01/09, whereas Spark SQL expects the 2001-01-09 format.
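
(If reformatting the source file is not an option, a minimal sketch of an
alternative - read the column as a string and convert it in Spark. This assumes
a HiveContext, so the unix_timestamp/from_unixtime/to_date UDFs are available,
and it reuses the column names and path from the session below:)

    val raw = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/tmp/TestDivya/carsdate.csv")

    // Parse the Excel-style "yyyy/MM/dd" strings into a proper date column.
    val withDate = raw.selectExpr(
      "to_date(from_unixtime(unix_timestamp(year, 'yyyy/MM/dd'))) AS year",
      "make", "model", "comment", "blank")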

Here is the sample code below


SQL context available as sqlContext.

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
15/12/29 04:29:39 WARN SparkConf: The configuration key
'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark
1.3 and and may be removed in the future. Please use the new key
'spark.yarn.am.waitTime' instead.
15/12/29 04:29:39 INFO HiveContext: Initializing execution hive, version
0.13.1
hiveContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@7312f6d8

scala> import org.apache.spark.sql.types.{StructType, StructField,
StringType, IntegerType,FloatType ,LongType ,TimestampType ,DateType };
import org.apache.spark.sql.types.{StructType, StructField, StringType,
IntegerType, FloatType, LongType, TimestampType, DateType}

scala> val customSchema = StructType(Seq(StructField("year", DateType,
true),StructField("make", StringType, true),StructField("model",
StringType, true),StructField("comment", StringType,
true),StructField("blank", StringType, true)))
customSchema: org.apache.spark.sql.types.StructType =
StructType(StructField(year,DateType,true),
StructField(make,StringType,true), StructField(model,StringType,true),
StructField(comment,StringType,true), StructField(blank,StringType,true))

scala> val df =
hiveContext.read.format("com.databricks.spark.csv").option("header",
"true").schema(customSchema).load("/tmp/TestDivya/carsdate.csv")
15/12/29 04:30:27 INFO HiveContext: Initializing HiveMetastoreConnection
version 0.13.1 using Spark classes.
df: org.apache.spark.sql.DataFrame = [year: date, make: string, model:
string, comment: string, blank: string]

scala> df.printSchema()
root
 |-- year: date (nullable = true)
 |-- make: string (nullable = true)
 |-- model: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- blank: string (nullable = true)


scala> val selectedData = df.select("year", "model")
selectedData: org.apache.spark.sql.DataFrame = [year: date, model: string]

scala> selectedData.show()
15/12/29 04:31:20 INFO MemoryStore: ensureFreeSpace(216384) called with
curMem=0, maxMem=278302556
15/12/29 04:31:20 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 211.3 KB, free 265.2 MB)

15/12/29 04:31:24 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have
all completed, from pool
15/12/29 04:31:24 INFO DAGScheduler: ResultStage 2 (show at :35)
finished in 0.051 s
15/12/29 04:31:24 INFO DAGScheduler: Job 2 finished: show at :35,
took 0.063356 s
+----------+-----+
|      year|model|
+----------+-----+
|2001-01-01|    S|
|2010-12-10|     |
|2009-01-11| E350|
|2008-01-01| Volt|
+----------+-----+

On 30 December 2015 at 00:42, Annabel Melongo 
wrote:

> Divya,
>
> From reading the post, it appears that you resolved this issue. Great job!
>
> I would recommend putting the solution here as well so that it helps
> another developer down the line.
>
> Thanks
>
>
> On Monday, December 28, 2015 8:56 PM, Divya Gehlot <
> divya.htco...@gmail.com> wrote:
>
>
> Hi,
> Link to schema issue
> 
> Please let me know if have issue in viewing the above link
>
> On 28 December 2015 at 23:00, Annabel Melongo 
> wrote:
>
> Divya,
>
> Why don't you share how you create the dataframe using the schema as
> stated in 1)
>
>
> On Monday, December 28, 2015 4:42 AM, Divya Gehlot <
> divya.htco...@gmail.com> wrote:
>
>
> Hi,
> I have an input data set which is a CSV file with date columns.
> My output will also be a CSV file, and I will use this output CSV file for
> Hive table creation.
> I have a few queries:
> 1. I tried using a custom schema with Timestamp, but it returns an empty
> result set when querying the dataframes.
> 2. Can I use the String datatype in Spark for the date column and, while
> creating the table, define it as date type? Partitioning of my Hive table
> will be on the date column.
>
> Would really appreciate it if you could share some sample code for timestamps
> in a DataFrame that can also be used while creating the Hive table.
>
>
>
> Thanks,
> Divya
>
>
>
>
>
>


Re: Problem with WINDOW functions?

2015-12-29 Thread Vadim Tkachenko
When I allocate 200g to the executor, it is able to make better progress;
I see 189 tasks executed instead of 169 previously.
But eventually it fails with the same error.

On Tue, Dec 29, 2015 at 5:58 PM, Cheng, Hao  wrote:

> Is there any improvement if you set a bigger memory for executors?
>
> -Original Message-
> From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim
> Tkachenko
> Sent: Wednesday, December 30, 2015 9:51 AM
> To: Cheng, Hao
> Cc: user@spark.apache.org
> Subject: Re: Problem with WINDOW functions?
>
> Hi,
>
> I am getting the same error with write.parquet("/path/to/file") :
>  WARN HeartbeatReceiver: Removing executor 0 with no recent
> heartbeats: 160714 ms exceeds timeout 12 ms
> 15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
> 10.10.7.167: Executor heartbeat timed out after 160714 ms
>
>
> On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao  wrote:
> > Can you try to write the result into another file instead? Let's see if
> there any issue in the executors side .
> >
> > sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> > ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> > 20").sort($"day",$"rank").write.parquet("/path/to/file")
> >
> > -Original Message-
> > From: vadimtk [mailto:apache...@gmail.com]
> > Sent: Wednesday, December 30, 2015 9:29 AM
> > To: user@spark.apache.org
> > Subject: Problem with WINDOW functions?
> >
> > Hi,
> >
> > I can't successfully execute a query with WINDOW function.
> >
> > The statements are following:
> >
> > val orcFile =
> > sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> > ject)='EN'")
> > orcFile.registerTempTable("d1")
> >  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day
> > ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> > 20").sort($"day",$"rank").collect().foreach(println)
> >
> > with default
> > spark.driver.memory
> >
> > I am getting java.lang.OutOfMemoryError: Java heap space.
> > The same if I set spark.driver.memory=10g.
> >
> > When I set spark.driver.memory=45g (the box has 256GB of RAM) the
> execution fails with a different error:
> >
> > 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no
> > recent
> > heartbeats: 129324 ms exceeds timeout 12 ms
> >
> > And I see that GC takes a lot of time.
> >
> > What is a proper way to execute statements above?
> >
> > I see the similar problems reported
> > http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> > etchfailedexception
> > http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> > count-action-in-a-big-table
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> > W-functions-tp25833.html Sent from the Apache Spark User List mailing
> > list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
> > additional commands, e-mail: user-h...@spark.apache.org
> >
>


RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Is there any improvement if you set a bigger memory for executors?

-Original Message-
From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim Tkachenko
Sent: Wednesday, December 30, 2015 9:51 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Problem with WINDOW functions?

Hi,

I am getting the same error with write.parquet("/path/to/file") :
 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 160714 ms exceeds timeout 12 ms
15/12/30 01:49:05 ERROR TaskSchedulerImpl: Lost executor 0 on
10.10.7.167: Executor heartbeat timed out after 160714 ms


On Tue, Dec 29, 2015 at 5:35 PM, Cheng, Hao  wrote:
> Can you try to write the result into another file instead? Let's see if there 
> any issue in the executors side .
>
> sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day 
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").write.parquet("/path/to/file")
>
> -Original Message-
> From: vadimtk [mailto:apache...@gmail.com]
> Sent: Wednesday, December 30, 2015 9:29 AM
> To: user@spark.apache.org
> Subject: Problem with WINDOW functions?
>
> Hi,
>
> I can't successfully execute a query with WINDOW function.
>
> The statements are following:
>
> val orcFile =
> sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(pro
> ject)='EN'")
> orcFile.registerTempTable("d1")
>  sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day 
> ORDER BY pageviews DESC) as rank FROM d1").filter("rank <=
> 20").sort($"day",$"rank").collect().foreach(println)
>
> with default
> spark.driver.memory
>
> I am getting java.lang.OutOfMemoryError: Java heap space.
> The same if I set spark.driver.memory=10g.
>
> When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
> fails with a different error:
>
> 15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no 
> recent
> heartbeats: 129324 ms exceeds timeout 12 ms
>
> And I see that GC takes a lot of time.
>
> What is a proper way to execute statements above?
>
> I see the similar problems reported
> http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-f
> etchfailedexception 
> http://stackoverflow.com/questions/32544478/spark-memory-settings-for-
> count-action-in-a-big-table
>
>
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDO
> W-functions-tp25833.html Sent from the Apache Spark User List mailing 
> list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
>


Re: Problem with WINDOW functions?

2015-12-29 Thread Chris Fregly
on quick glance, it appears that you're calling collect() in there, which brings
a huge amount of data down to the single Driver.  this is why, when you
allocated more memory to the Driver, a different error emerges - most definitely
related to stop-the-world GC causing the node to become unresponsive.

in general, collect() is bad and should only be used on small datasets for 
debugging or sanity-check purposes.  anything serious should be done within the 
executors running on worker nodes - not on the Driver node.

think of the Driver as simply a coordinator - coordinating and allocating
tasks to workers with the help of the cluster resource manager (i.e. YARN,
Spark Standalone, Mesos, etc).

very little memory should be allocated to the Driver.  just enough to 
coordinate and a little extra for Driver-side debugging, but that's it.  leave 
the rest up to the cluster nodes.
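
a minimal sketch of that pattern for the query in this thread (the output path
is illustrative): let the executors write the full result in parallel, and only
pull a small sample back to the driver for a sanity check.

    val ranked = sqlContext.sql(
      """SELECT day, page, dense_rank() OVER (PARTITION BY day ORDER BY pageviews DESC) AS rank
        |FROM d1""".stripMargin).filter("rank <= 20")

    ranked.write.parquet("/path/to/file")   // heavy lifting stays on the executors
    ranked.take(20).foreach(println)        // only a tiny sample reaches the driver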

> On Dec 29, 2015, at 8:28 PM, vadimtk  wrote:
> 
> table
> 
> 
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Can you try to write the result into another file instead? Let's see if there is
any issue on the executors' side.

sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY 
pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").write.parquet("/path/to/file")

-Original Message-
From: vadimtk [mailto:apache...@gmail.com] 
Sent: Wednesday, December 30, 2015 9:29 AM
To: user@spark.apache.org
Subject: Problem with WINDOW functions?

Hi,

I can't successfully execute a query with WINDOW function.

The statements are following:

val orcFile =
sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(project)='EN'")
orcFile.registerTempTable("d1")
 sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER BY 
pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").collect().foreach(println)

with default
spark.driver.memory 

I am getting java.lang.OutOfMemoryError: Java heap space.
The same if I set spark.driver.memory=10g.

When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution 
fails with a different error:

15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 129324 ms exceeds timeout 12 ms

And I see that GC takes a lot of time.

What is a proper way to execute statements above?

I see the similar problems reported
http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-fetchfailedexception
http://stackoverflow.com/questions/32544478/spark-memory-settings-for-count-action-in-a-big-table









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDOW-functions-tp25833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Problem with WINDOW functions?

2015-12-29 Thread vadimtk
Hi,

I can't successfully execute a query with WINDOW function.

The statements are following:

val orcFile =
sqlContext.read.parquet("/data/flash/spark/dat14sn").filter("upper(project)='EN'")
orcFile.registerTempTable("d1")
 sqlContext.sql("SELECT day,page,dense_rank() OVER (PARTITION BY day ORDER
BY pageviews DESC) as rank FROM d1").filter("rank <=
20").sort($"day",$"rank").collect().foreach(println)

with default
spark.driver.memory 

I am getting java.lang.OutOfMemoryError: Java heap space.
The same if I set spark.driver.memory=10g.

When I set spark.driver.memory=45g (the box has 256GB of RAM) the execution
fails with a different error:

15/12/29 23:03:19 WARN HeartbeatReceiver: Removing executor 0 with no recent
heartbeats: 129324 ms exceeds timeout 12 ms

And I see that GC takes a lot of time.

What is a proper way to execute statements above?

I see the similar problems reported
http://stackoverflow.com/questions/32196859/org-apache-spark-shuffle-fetchfailedexception
http://stackoverflow.com/questions/32544478/spark-memory-settings-for-count-action-in-a-big-table









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-WINDOW-functions-tp25833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-29 Thread Davies Liu
Hi Andy,  

Could you change the logging level to INFO and post some logs here? There will
be some logging about the memory usage of a task when it OOMs.

In 1.6, the memory for a task is: (heap size - 300MB) * 0.75 / number of tasks.
Is it possible that the heap is too small?
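
(As a back-of-the-envelope check against the 1024M heap and local[2] master
mentioned later in this thread - a rough sketch of that formula, not an exact
accounting:)

    val heapMB  = 1024                          // -Xmx from the unit test
    val tasks   = 2                             // master = "local[2]"
    val perTask = (heapMB - 300) * 0.75 / tasks // ~271 MB per task
    // A single 33554432-byte (32 MB) page fits, but there is little headroom
    // once several operators each try to acquire pages of that size.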

Davies  

--  
Davies Liu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, December 29, 2015, at 4:28 PM, Andy Davidson wrote:

> Hi Michael
>  
> https://github.com/apache/spark/archive/v1.6.0.tar.gz
>  
> With both 1.6.0 and 1.5.2 my unit test works when I call repartition(1) before
> saving output. Coalesce still fails.
>  
> Coalesce(1) spark-1.5.2
> Caused by:
> java.io.IOException: Unable to acquire 33554432 bytes of memory
>  
>  
> Coalesce(1) spark-1.6.0
>  
> Caused by:  
> java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
>  
> Hope this helps
>  
> Andy
>  
> From: Michael Armbrust <mich...@databricks.com>
> Date: Monday, December 28, 2015 at 2:41 PM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: trouble understanding data frame memory usage
> "java.io.IOException: Unable to acquire memory"
>  
> > Unfortunately in 1.5 we didn't force operators to spill when ran out of 
> > memory so there is not a lot you can do. It would be awesome if you could 
> > test with 1.6 and see if things are any better?
> >  
> > On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson 
> > mailto:a...@santacruzintegration.com)> 
> > wrote:
> > > I am using spark 1.5.1. I am running into some memory problems with a 
> > > java unit test. Yes I could fix it by setting –Xmx (its set to 1024M) how 
> > > ever I want to better understand what is going on so I can write better 
> > > code in the future. The test runs on a Mac, master="Local[2]"
> > >  
> > > I have a java unit test that starts by reading a 672K ascii file. I my 
> > > output data file is 152K. Its seems strange that such a small amount of 
> > > data would cause an out of memory exception. I am running a pretty 
> > > standard machine learning process
> > >  
> > > Load data
> > > create a ML pipeline
> > > transform the data
> > > Train a model
> > > Make predictions
> > > Join the predictions back to my original data set
> > > Coalesce(1), I only have a small amount of data and want to save it in a 
> > > single file
> > > Save final results back to disk
> > >  
> > >  
> > > Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to 
> > > acquire memory”
> > >  
> > > To try and figure out what is going I put log messages in to count the 
> > > number of partitions
> > >  
> > > Turns out I have 20 input files, each one winds up in a separate 
> > > partition. Okay so after loading I call coalesce(1) and check to make 
> > > sure I only have a single partition.
> > >  
> > > The total number of observations is 1998.
> > >  
> > > After calling step 7 I count the number of partitions and discovered I 
> > > have 224 partitions!. Surprising given I called Coalesce(1) before I did 
> > > anything with the data. My data set should easily fit in memory. When I 
> > > save them to disk I get 202 files created with 162 of them being empty!
> > >  
> > > In general I am not explicitly using cache.
> > >  
> > > Some of the data frames get registered as tables. I find it easier to use 
> > > sql.
> > >  
> > > Some of the data frames get converted back to RDDs. I find it easier to 
> > > create RDD this way
> > >  
> > > I put calls to unpersist(true). In several places
> > >  
> > >  
> > > private void memoryCheck(String name) {
> > >  
> > >  
> > > Runtime rt = Runtime.getRuntime();
> > >  
> > >  
> > > logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {} df.size: {}",  
> > >  
> > >  
> > > name,  
> > >  
> > >  
> > > String.format("%,d", rt.totalMemory()),  
> > >  
> > >  
> > > String.format("%,d", rt.freeMemory()));
> > >  
> > >  
> > > }
> > >  
> > >  
> > >  
> > > Any idea how I can get a better understanding of what is going on? My 
> > > goal is to learn to write better spark code.
> > >  
> > > Kind regards
> > >  
> > > Andy
> > >  
> > > Memory usages at various points in my unit test
> > >  
> > > name: rawInput totalMemory: 447,741,952 freeMemory: 233,203,184  
> > >  
> > >  
> > > name: naiveBayesModel totalMemory: 509,083,648 freeMemory: 403,504,128  
> > >  
> > >  
> > > name: lpRDD totalMemory: 509,083,648 freeMemory: 402,288,104  
> > >  
> > >  
> > > name: results totalMemory: 509,083,648 freeMemory: 368,011,008  
> > >  
> > >  
> > >  
> > >  
> > >  
> > >  
> > > DataFrame exploreDF = results.select(results.col("id"),  
> > >  
> > >  
> > > results.col("label"),  
> > >  
> > >  
> > > results.col("binomialLabel"),  
> > >  
> > >  
> > > results.col("labelIndex"),  
> > >  
> > >  
> > > results.col("prediction"),  
> > >  
> > >  
> > > results.col("words"));
> > >  
> > >  
> > > exploreD

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no
equivalent for data frames. Anyone know a good way to do this?

I thought I could just convert to RDD, do the zip, and then convert back,
but ...

   1. I don't see a way (outside developer API) to convert RDD[Row]
   directly back to DataFrame. Is there really no way to do this?
   2. I don't see any way to modify Row objects or create new rows with
   additional columns. In other words, no way to convert RDD[(Row, Row)] to
   RDD[Row]

It seems the only way to get what I want is to extract out the data into a
case class and then convert back to a data frame. Did I miss something?
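
(For reference, a minimal sketch of one route that stays on public APIs,
assuming df is the DataFrame to index: drop to the RDD, zip, and rebuild the
DataFrame with sqlContext.createDataFrame(rowRDD, schema), extending the schema
by hand.)

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Zip an existing DataFrame df with an index and keep the result a DataFrame.
    val indexedRows = df.rdd.zipWithIndex.map {
      case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
    }
    val indexedSchema = StructType(
      df.schema.fields :+ StructField("row_index", LongType, nullable = false))
    val indexedDF = sqlContext.createDataFrame(indexedRows, indexedSchema)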


Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-29 Thread Andy Davidson
Hi Michael

https://github.com/apache/spark/archive/v1.6.0.tar.gz

With both 1.6.0 and 1.5.2 my unit test works when I call repartition(1) before
saving output. Coalesce still fails.

Coalesce(1) spark-1.5.2
   Caused by:

java.io.IOException: Unable to acquire 33554432 bytes of memory


Coalesce(1) spark-1.6.0

   Caused by:

java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory,
got 0


Hope this helps

Andy

From:  Michael Armbrust 
Date:  Monday, December 28, 2015 at 2:41 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble understanding data frame memory usage
"java.io.IOException: Unable to acquire memory"

> Unfortunately in 1.5 we didn't force operators to spill when ran out of memory
> so there is not a lot you can do.  It would be awesome if you could test with
> 1.6 and see if things are any better?
> 
> On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson 
> wrote:
>> I am using spark 1.5.1. I am running into some memory problems with a java
>> unit test. Yes I could fix it by setting -Xmx (it's set to 1024M), however I
>> want to better understand what is going on so I can write better code in the
>> future. The test runs on a Mac, master="Local[2]"
>> 
>> I have a java unit test that starts by reading a 672K ascii file. My output
>> data file is 152K. It seems strange that such a small amount of data would
>> cause an out of memory exception. I am running a pretty standard machine
>> learning process
>> 
>> 1. Load data
>> 2. create a ML pipeline
>> 3. transform the data
>> 4. Train a model
>> 5. Make predictions
>> 6. Join the predictions back to my original data set
>> 7. Coalesce(1), I only have a small amount of data and want to save it in a
>> single file
>> 8. Save final results back to disk
>> 
>> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
>> acquire memory”
>> 
>> To try and figure out what is going I put log messages in to count the number
>> of partitions
>> 
>> Turns out I have 20 input files, each one winds up in a separate partition.
>> Okay so after loading I call coalesce(1) and check to make sure I only have a
>> single partition.
>> 
>> The total number of observations is 1998.
>> 
>> After calling step 7 I count the number of partitions and discovered I have
>> 224 partitions!. Surprising given I called Coalesce(1) before I did anything
>> with the data. My data set should easily fit in memory. When I save them to
>> disk I get 202 files created with 162 of them being empty!
>> 
>> In general I am not explicitly using cache.
>> 
>> Some of the data frames get registered as tables. I find it easier to use
>> sql.
>> 
>> Some of the data frames get converted back to RDDs. I find it easier to
>> create RDD this way
>> 
>> I put calls to unpersist(true). In several places
>> 
>>private void memoryCheck(String name) {
>> 
>> Runtime rt = Runtime.getRuntime();
>> 
>> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {} df.size:
>> {}", 
>> 
>> name,
>> 
>> String.format("%,d", rt.totalMemory()),
>> 
>> String.format("%,d", rt.freeMemory()));
>> 
>> }
>> 
>> 
>> Any idea how I can get a better understanding of what is going on? My goal is
>> to learn to write better spark code.
>> 
>> Kind regards
>> 
>> Andy
>> 
>> Memory usages at various points in my unit test
>> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>> 
>> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:   403,504,128
>> 
>> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>> 
>> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>> 
>> 
>> 
>>DataFrame exploreDF = results.select(results.col("id"),
>> 
>> results.col("label"),
>> 
>> results.col("binomialLabel"),
>> 
>> results.col("labelIndex"),
>> 
>> results.col("prediction"),
>> 
>> results.col("words"));
>> 
>> exploreDF.show(10);
>> 
>> 
>> 
>> Yes I realize its strange to switch styles how ever this should not cause
>> memory problems
>> 
>> 
>> 
>> final String exploreTable = "exploreTable";
>> 
>> exploreDF.registerTempTable(exploreTable);
>> 
>> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>> 
>> String stmt = String.format(fmt, exploreTable);
>> 
>> 
>> 
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>> 
>> 
>> 
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>> 
>> 
>> 
>> exploreDF.unpersist(true); does not resolve memory issue
>> 
>> 
>> 
> 




Re: Can't submit job to stand alone cluster

2015-12-29 Thread Daniel Valdivia
That makes things more clear! Thanks

Issue resolved

Sent from my iPhone

> On Dec 29, 2015, at 2:43 PM, Annabel Melongo  
> wrote:
> 
> Thanks Andrew for this awesome explanation 
> 
> 
> On Tuesday, December 29, 2015 5:30 PM, Andrew Or  
> wrote:
> 
> 
> Let me clarify a few things for everyone:
> 
> There are three cluster managers: standalone, YARN, and Mesos. Each cluster 
> manager can run in two deploy modes, client or cluster. In client mode, the 
> driver runs on the machine that submitted the application (the client). In 
> cluster mode, the driver runs on one of the worker machines in the cluster.
> 
> When I say "standalone cluster mode" I am referring to the standalone cluster 
> manager running in cluster deploy mode.
> 
> Here's how the resources are distributed in each mode (omitting Mesos):
> 
> Standalone / YARN client mode. The driver runs on the client machine (i.e. 
> machine that ran Spark submit) so it should already have access to the jars. 
> The executors then pull the jars from an HTTP server started in the driver.
> 
> Standalone cluster mode. Spark submit does not upload your jars to the 
> cluster, so all the resources you need must already be on all of the worker 
> machines. The executors, however, actually just pull the jars from the driver 
> as in client mode instead of finding it in their own local file systems.
> 
> YARN cluster mode. Spark submit does upload your jars to the cluster. In 
> particular, it puts the jars in HDFS so your driver can just read from there. 
> As in other deployments, the executors pull the jars from the driver.
> 
> When the docs say "If your application is launched through Spark submit, then 
> the application jar is automatically distributed to all worker nodes," it is 
> actually saying that your executors get their jars from the driver. This is 
> true whether you're running in client mode or cluster mode.
> 
> If the docs are unclear (and they seem to be), then we should update them. I 
> have filed SPARK-12565 to track this.
> 
> Please let me know if there's anything else I can help clarify.
> 
> Cheers,
> -Andrew
> 
> 
> 
> 
> 2015-12-29 13:07 GMT-08:00 Annabel Melongo :
> Andrew,
> 
> Now I see where the confusion lays. Standalone cluster mode, your link, is 
> nothing but a combination of client-mode and standalone mode, my link, 
> without YARN.
> 
> But I'm confused by this paragraph in your link:
> 
> If your application is launched through Spark submit, then the 
> application jar is automatically distributed to all worker nodes. For any 
> additional jars that your
>   application depends on, you should specify them through the --jars 
> flag using comma as a delimiter (e.g. --jars jar1,jar2).
> 
> That can't be true; this is only the case when Spark runs on top of YARN. 
> Please correct me, if I'm wrong.
> 
> Thanks
>   
> 
> 
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or  
> wrote:
> 
> 
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
> 
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo :
> Greg,
> 
> Can you please send me a doc describing the standalone cluster mode? 
> Honestly, I never heard about it.
> 
> The three different modes I've listed appear in the last paragraph of this
> doc: Running Spark Applications (on www.cloudera.com)
> 
> 
> 
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or  
> wrote:
> 
> 
> The confusion here is the expression "standalone cluster mode". Either it's 
> stand-alone or it's cluster mode but it can't be both.
> 
> @Annabel That's not true. There is a standalone cluster mode where driver 
> runs on one of the workers instead of on the client machine. What you're 
> describing is standalone client mode.
> 
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
> Greg,
> 
> The confusion here is the expression "standalone cluster mode". Either it's 
> stand-alone or it's cluster mode but it can't be both.
> 
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine; use 
> --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines; 
> additionally driver runs as a thread in ApplicationMaster; use --jars option 
> with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver is 
> NOT a thread in ApplicationMaster; use --packages to submit a jar
> 
> 
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
> wrote:
> 
> 
> Hi Greg,
> 
> It's actually intentional for standalone cluster mode to not upload jars. One 
> of the reasons why YARN takes at least 10 seconds before running any simple 
> application is because there's a lot of random overhead (e.g. puttin

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Thanks Andrew for this awesome explanation  

On Tuesday, December 29, 2015 5:30 PM, Andrew Or  
wrote:
 

 Let me clarify a few things for everyone:
There are three cluster managers: standalone, YARN, and Mesos. Each cluster 
manager can run in two deploy modes, client or cluster. In client mode, the 
driver runs on the machine that submitted the application (the client). In 
cluster mode, the driver runs on one of the worker machines in the cluster.
When I say "standalone cluster mode" I am referring to the standalone cluster 
manager running in cluster deploy mode.
Here's how the resources are distributed in each mode (omitting Mesos):

Standalone / YARN client mode. The driver runs on the client machine (i.e. 
machine that ran Spark submit) so it should already have access to the jars. 
The executors then pull the jars from an HTTP server started in the driver.
Standalone cluster mode. Spark submit does not upload your jars to the cluster, 
so all the resources you need must already be on all of the worker machines. 
The executors, however, actually just pull the jars from the driver as in 
client mode instead of finding it in their own local file systems.
YARN cluster mode. Spark submit does upload your jars to the cluster. In 
particular, it puts the jars in HDFS so your driver can just read from there. 
As in other deployments, the executors pull the jars from the driver.

When the docs say "If your application is launched through Spark submit, then 
the application jar is automatically distributed to all worker nodes," it is 
actually saying that your executors get their jars from the driver. This is 
true whether you're running in client mode or cluster mode.
If the docs are unclear (and they seem to be), then we should update them. I 
have filed SPARK-12565 to track this.
Please let me know if there's anything else I can help clarify.
Cheers,
-Andrew



2015-12-29 13:07 GMT-08:00 Annabel Melongo :

Andrew,
Now I see where the confusion lays. Standalone cluster mode, your link, is 
nothing but a combination of client-mode and standalone mode, my link, without 
YARN.
But I'm confused by this paragraph in your link:
        If your application is launched through Spark submit, then the 
application jar is automatically distributed to all worker nodes. For any 
additional jars that your          application depends on, you should specify 
them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
That can't be true; this is only the case when Spark runs on top of YARN. 
Please correct me, if I'm wrong.
Thanks   

On Tuesday, December 29, 2015 2:54 PM, Andrew Or  
wrote:
 

 
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo :

Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes, I've listed appear in the last paragraph of this 
doc: Running Spark Applications
(link preview: Running Spark Applications on www.cloudera.com)


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or  
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo :

Greg,
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.
 With this in mind, here's how jars are uploaded:
    1. Spark Stand-alone mode: client and driver run on the same machine; use --packages option to submit a jar
    2. Yarn Cluster-mode: client and driver run on separate machines; additionally driver runs as a thread in ApplicationMaster; use --jars option with a globally visible path to said jar
    3. Yarn Client-mode: client and driver run on the same machine. driver is NOT a thread in ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill :



On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to 

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Let me clarify a few things for everyone:

There are three *cluster managers*: standalone, YARN, and Mesos. Each
cluster manager can run in two *deploy modes*, client or cluster. In client
mode, the driver runs on the machine that submitted the application (the
client). In cluster mode, the driver runs on one of the worker machines in
the cluster.

When I say "standalone cluster mode" I am referring to the standalone
cluster manager running in cluster deploy mode.

Here's how the resources are distributed in each mode (omitting Mesos):

*Standalone / YARN client mode. *The driver runs on the client machine
(i.e. machine that ran Spark submit) so it should already have access to
the jars. The executors then pull the jars from an HTTP server started in
the driver.

*Standalone cluster mode. *Spark submit does *not* upload your jars to the
cluster, so all the resources you need must already be on all of the worker
machines. The executors, however, actually just pull the jars from the
driver as in client mode instead of finding it in their own local file
systems.

*YARN cluster mode. *Spark submit *does* upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read from
there. As in other deployments, the executors pull the jars from the driver.


When the docs say "If your application is launched through Spark submit,
then the application jar is automatically distributed to all worker nodes," it
is actually saying that your executors get their jars from the driver. This
is true whether you're running in client mode or cluster mode.

If the docs are unclear (and they seem to be), then we should update them.
I have filed SPARK-12565 
to track this.

Please let me know if there's anything else I can help clarify.

Cheers,
-Andrew
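
To make the distinction concrete, here is a minimal sketch using the programmatic SparkLauncher API (available since Spark 1.4); the master URL, jar path, and class name below are made-up placeholders, and the same point applies to an equivalent spark-submit invocation with --deploy-mode cluster. The key is where the application jar has to live in each mode.

import org.apache.spark.launcher.SparkLauncher

// Standalone *cluster* mode: the driver may be started on any worker, so the
// jar must already be visible from every worker (e.g. an hdfs:// path, or a
// file:// path that exists on all machines). In client mode a plain local
// path on the submitting machine is enough.
val process = new SparkLauncher()
  .setMaster("spark://my-master:7077")        // placeholder master URL
  .setDeployMode("cluster")
  .setAppResource("hdfs:///apps/my-app.jar")  // must be globally visible
  .setMainClass("com.example.MyApp")          // placeholder class name
  .launch()                                   // returns a java.lang.Process

process.waitFor()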




2015-12-29 13:07 GMT-08:00 Annabel Melongo :

> Andrew,
>
> Now I see where the confusion lays. Standalone cluster mode, your link, is
> nothing but a combination of client-mode and standalone mode, my link,
> without YARN.
>
> But I'm confused by this paragraph in your link:
>
> If your application is launched through Spark submit, then the
> application jar is automatically distributed to all worker nodes. For any
> additional jars that your
>   application depends on, you should specify them through the
> --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
>
> That can't be true; this is only the case when Spark runs on top of YARN.
> Please correct me, if I'm wrong.
>
> Thanks
>
>
>
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or 
> wrote:
>
>
>
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
>
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes, I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> 
>
>
>
>
>
>
> (link preview: Running Spark Applications on www.cloudera.com)
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g. putting jars in HDFS).

Re: Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Ted Yu
Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ?

In the snippet you posted, I don't see registration of this Executor.

Cheers

On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett 
wrote:

> We're seeing an "Executor is not registered" error on a Spark (1.6.0rc4,
> mesos-0.26) cluster.  It seems as if the logic in
> MesosExternalShuffleService.scala isn't working for some reason (new in 1.6
> I believe).
>
> spark application sees this:
> ...
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered executor
> NettyRpcEndpointRef(null) (ip-10-1-201-165.ec2.internal:37660) with ID
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new total is 1)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 53854)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1,
> ip-10-1-201-132.ec2.internal, 12793)
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new total is 2)
> ...
> 15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in
> memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB, free: 13.8GB)
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-202-121.ec2.internal:59734
> 15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 1 is 1671814 bytes
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-201-165.ec2.internal:37660
> ...
> 15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage 1.0
> (TID 2191, ip-10-1-200-232.ec2.internal, partition 63,PROCESS_LOCAL, 2171
> bytes)
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID
> 2149, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4,
> ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
> ...
> 15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage 0
> (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48) and
> ShuffleMapStage 1 (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50) due to
> fetch failure
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0 (TID
> 2140, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6, reduceId=12,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
>
> shuffle service itself (on the executor's IP) sees:
> 15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received
> registration request from app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014
> (remote address /10.1.200.165:37889).  (that's the driver IP)
> 15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown /
> 10.1.201.165:52562 disconnected. (executor IP)
> 15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler: Application
> a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected (address was /
> 10.1.200.165:37889). (driver IP again)
> 15/12/29 18:58:07 ERROR TransportRequestHandler: Error while invoking
> RpcHandler#receive() on RPC id 6244044000322436908
> java.lang.RuntimeException: Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
> at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
> At first I wondered if reducing spark.shuffle.io.numConnectionsPerPeer back down to 1 (from 4) would help.

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Andrew,
Now I see where the confusion lays. Standalone cluster mode, your link, is 
nothing but a combination of client-mode and standalone mode, my link, without 
YARN.
But I'm confused by this paragraph in your link:
        If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
That can't be true; this is only the case when Spark runs on top of YARN. 
Please correct me, if I'm wrong.
Thanks   

On Tuesday, December 29, 2015 2:54 PM, Andrew Or  
wrote:
 

 
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo :

Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes, I've listed appear in the last paragraph of this 
doc: Running Spark Applications
(link preview: Running Spark Applications on www.cloudera.com)


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or  
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo :

Greg,
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.
 With this in mind, here's how jars are uploaded:
    1. Spark Stand-alone mode: client and driver run on the same machine; use --packages option to submit a jar
    2. Yarn Cluster-mode: client and driver run on separate machines; additionally driver runs as a thread in ApplicationMaster; use --jars option with a globally visible path to said jar
    3. Yarn Client-mode: client and driver run on the same machine. driver is NOT a thread in ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill :



On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





   



   



  

Executor deregistered after 2mins (mesos, 1.6.0-rc4)

2015-12-29 Thread Adrian Bridgett
We're seeing an "Executor is not registered" error on a Spark (1.6.0rc4, 
mesos-0.26) cluster.  It seems as if the logic in 
MesosExternalShuffleService.scala isn't working for some reason (new in 
1.6 I believe).


spark application sees this:
...
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully 
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external 
shuffle service.
15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully 
registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external 
shuffle service.
15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered executor 
NettyRpcEndpointRef(null) (ip-10-1-201-165.ec2.internal:37660) with ID 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new total is 1)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block 
manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM, 
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6, 
ip-10-1-201-165.ec2.internal, 53854)
15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block 
manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM, 
BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1, 
ip-10-1-201-132.ec2.internal, 12793)
15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor 
f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new total is 2)

...
15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in 
memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB, free: 13.8GB)
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 1 to ip-10-1-202-121.ec2.internal:59734
15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output statuses 
for shuffle 1 is 1671814 bytes
15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 1 to ip-10-1-201-165.ec2.internal:37660

...
15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage 1.0 
(TID 2191, ip-10-1-200-232.ec2.internal, partition 63,PROCESS_LOCAL, 
2171 bytes)
15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID 
2149, ip-10-1-200-232.ec2.internal): 
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, 
ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21, 
message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)

...
15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 
(reduceByKey at 
/home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48) and 
ShuffleMapStage 1 (reduceByKey at 
/home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50) due to 
fetch failure
15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0 (TID 
2140, ip-10-1-200-232.ec2.internal): 
FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6, 
ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6, reduceId=12, 
message=
org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)



shuffle service itself (on the executor's IP) sees:
15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received 
registration request from app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 
(remote address /10.1.200.165:37889).  (that's the driver IP)
15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown 
/10.1.201.165:52562 disconnected. (executor IP)
15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler: Application 
a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected (address was 
/10.1.200.165:37889). (driver IP again)
15/12/29 18:58:07 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() on RPC id 6244044000322436908
java.lang.RuntimeException: Executor is not registered 
(appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, 
execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)


At first I wondered if reducing spark.shuffle.io.numConnectionsPerPeer 
back down to 1 (from 4) would help - maybe it wasn't keeping track of 
the number of connections.However now with the extra debug I wonder 
if the driver is disconnecting after 2mins and the mesos shuffle service 
takes this as a sign that the whole app has finished and tidies up?


I couldn't increase the time f
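
One hedged observation on the two-minute mark: it coincides with the default of spark.network.timeout, which is 120s and which several of the lower-level io timeouts fall back to. Whether that particular timeout is what closes the driver's connection to the shuffle service here is an assumption, not a confirmed diagnosis. A minimal sketch of raising it when building the application's SparkConf, with an arbitrary illustrative value:

import org.apache.spark.SparkConf

// Assumption: the driver <-> shuffle-service connection is being dropped by the
// generic network timeout; 600s is an arbitrary example value.
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")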

Re: Task hang problem

2015-12-29 Thread Darren Govoni

  

  
  
here's executor trace.

Thread 58: Executor task launch worker-3 (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketInputStream.read(SocketInputStream.java:122)
java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
java.io.BufferedInputStream.read(BufferedInputStream.java:254)
java.io.DataInputStream.readInt(DataInputStream.java:387)
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)

Thread 42: BROADCAST_VARS cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)

Thread 54: driver-heartbeater (TIMED_WAITING)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Thread 3: Finalizer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

Thread 25: ForkJoinPool-3-worker-15 (WAITING)
sun.misc.Unsafe.park(Native Method)
scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thread 35: Hashed wheel timer #2 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)

Thread 68: Idle Worker Monitor for /usr/bin/python2.7 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:229)

Thread 1: main (WAITING)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:819)
akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:788

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo :

> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes, I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> 
>
>
>
>
>
>
> (link preview: Running Spark Applications on www.cloudera.com)
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the workers nodes.
> >
> >I might have understood wrong, but I though the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes, I've listed appear in the last paragraph of this 
doc: Running Spark Applications
(link preview: Running Spark Applications on www.cloudera.com)


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or  
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo :

Greg,
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.
 With this in mind, here's how jars are uploaded:
    1. Spark Stand-alone mode: client and driver run on the same machine; use --packages option to submit a jar
    2. Yarn Cluster-mode: client and driver run on separate machines; additionally driver runs as a thread in ApplicationMaster; use --jars option with a globally visible path to said jar
    3. Yarn Client-mode: client and driver run on the same machine. driver is NOT a thread in ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill :



On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





   



  

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
>
> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.


Actually, it just happens to be backward compatible because we didn't
change the shuffle file formats. This may not necessarily be the case
moving forward as Spark offers no such guarantees. Just thought it's worth
clarifying.
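
For readers following along, the settings under discussion look roughly like this when set programmatically; the executor counts are arbitrary examples, and the external shuffle service itself still has to be deployed on each NodeManager as described in the linked job-scheduling docs.

import org.apache.spark.SparkConf

// Dynamic allocation needs the external shuffle service so executors can be
// removed without losing the shuffle files they wrote.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")   // example values only
  .set("spark.dynamicAllocation.maxExecutors", "20")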

2015-12-27 22:34 GMT-08:00 Saisai Shao :

> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.
>
> Thanks
> Saisai
>
> On Mon, Dec 28, 2015 at 2:33 PM, 顾亮亮  wrote:
>
>> Is it possible to support both spark-1.5.1 and spark-1.6.0 on one yarn
>> cluster?
>>
>>
>>
>> *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com]
>> *Sent:* Monday, December 28, 2015 2:29 PM
>> *To:* Jeff Zhang
>> *Cc:* 顾亮亮; user@spark.apache.org; 刘骋昺
>> *Subject:* Re: Opening Dynamic Scaling Executors on Yarn
>>
>>
>>
>> Replace all the shuffle jars and restart the NodeManager is enough, no
>> need to restart NN.
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang  wrote:
>>
>> See
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮  wrote:
>>
>> Hi all,
>>
>>
>>
>> SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a
>> useful feature to save resources on yarn.
>>
>> We want to open this feature on our yarn cluster.
>>
>> I have a question about the version of shuffle service.
>>
>>
>>
>> I’m now using spark-1.5.1 (shuffle service).
>>
>> If I want to upgrade to spark-1.6.0, should I replace the shuffle service
>> jar and restart all the namenode on yarn ?
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Mars
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.


@Annabel That's not true. There *is* a standalone cluster mode where driver
runs on one of the workers instead of on the client machine. What you're
describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo :

> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the workers nodes.
> >
> >I might have understood wrong, but I though the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg,
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.
 With this in mind, here's how jars are uploaded:
    1. Spark Stand-alone mode: client and driver run on the same machine; use --packages option to submit a jar
    2. Yarn Cluster-mode: client and driver run on separate machines; additionally driver runs as a thread in ApplicationMaster; use --jars option with a globally visible path to said jar
    3. Yarn Client-mode: client and driver run on the same machine. driver is NOT a thread in ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill :



On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





  

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
bq. same case with sc.parallelize() or sc.makeRDD()

I think so.

On Tue, Dec 29, 2015 at 10:50 AM, Gokula Krishnan D 
wrote:

> Ted - Thanks for the updates. Then it's the same case with sc.parallelize()
> or sc.makeRDD(), right?
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu  wrote:
>
>> From RDD.scala :
>>
>>   def ++(other: RDD[T]): RDD[T] = withScope {
>> this.union(other)
>>
>> They should be the same.
>>
>> On Tue, Dec 29, 2015 at 10:41 AM, email2...@gmail.com <
>> email2...@gmail.com> wrote:
>>
>>> Hello All -
>>>
>>> tried couple of operations by using ++ and union on RDD's but realized
>>> that
>>> the end results are same. Do you know any differences?.
>>>
>>> val odd_partA  = List(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
>>> odd_partA: List[Int] = List(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3,
>>> 5,
>>> 7, 9, 11)
>>>
>>> val odd_partB  = List(1,3,13,15,9)
>>> odd_partB: List[Int] = List(1, 3, 13, 15, 9)
>>>
>>> val odd_partC  = List(15,9,1,3,13)
>>> odd_partC: List[Int] = List(15, 9, 1, 3, 13)
>>>
>>> val odd_partA_RDD = sc.parallelize(odd_partA)
>>> odd_partA_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9]
>>> at
>>> parallelize at :17
>>>
>>> val odd_partB_RDD = sc.parallelize(odd_partB)
>>> odd_partB_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10]
>>> at
>>> parallelize at :17
>>>
>>> val odd_partC_RDD = sc.parallelize(odd_partC)
>>> odd_partC_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11]
>>> at
>>> parallelize at :17
>>>
>>> val odd_PARTAB_pp = odd_partA_RDD ++(odd_partB_RDD)
>>> odd_PARTAB_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at
>>> $plus$plus at
>>> :23
>>>
>>> val odd_PARTAB_union = odd_partA_RDD.union(odd_partB_RDD)
>>> odd_PARTAB_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[13] at union
>>> at
>>> :23
>>>
>>> odd_PARTAB_pp.count
>>> res8: Long = 23
>>>
>>> odd_PARTAB_union.count
>>> res9: Long = 23
>>>
>>> val odd_PARTABC_pp = odd_partA_RDD ++(odd_partB_RDD) ++ (odd_partC_RDD)
>>> odd_PARTABC_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at
>>> $plus$plus
>>> at :27
>>>
>>> val odd_PARTABC_union =
>>> odd_partA_RDD.union(odd_partB_RDD).union(odd_partC_RDD)
>>> odd_PARTABC_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union
>>> at
>>> :27
>>>
>>> odd_PARTABC_pp.count
>>> res10: Long = 28
>>>
>>> odd_PARTABC_union.count
>>> res11: Long = 28
>>>
>>> Thanks
>>> Gokul
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-and-Union-of-a-RDD-tp25830.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark submit does automatically upload the jar to cluster?

2015-12-29 Thread jiml
And for more clarification on this:

For non-YARN installs this bug has been filed to make the Spark driver
upload jars   

The point of confusion, that I along with other newcomers commonly suffer
from is this. In non-YARN installs:

*The **driver** does NOT push your jars to the cluster. The **master**
in the cluster DOES push your jars to the **workers**. In theory.*

Thanks to an email response on the email list from Greg Hill for this
clarification, hope he doesn't mind me copying the relevant part here, since
I can't link to it:

" spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers."
That's funny I didn't delete that answer!

I think I have two accounts crossing, here was the answer:

I don't know if this is going to help, but I agree that some of the docs
would lead one to believe that the Spark driver  or master is going to
spread your jars around for you. But there's other docs that seem to
contradict this, esp related to EC2 clusters.

I wrote a Stack Overflow answer dealing with a similar situation, see if it
helps:

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774

Pay attention to this section about the spark-submit docs:

I must admit, as a limitation on this, it confuses me in the Spark docs that
for spark.executor.extraClassPath it says:

Users typically should not need to set this option

I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it sound like
the script handles moving your code around the cluster, but I think it only
moves the classpath around for you. For example, this line from "Launching
Applications with spark-submit" explicitly says you have to move the jars
yourself or make them "globally available":

application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-does-automatically-upload-the-jar-to-cluster-tp25762p25831.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Hi Greg,

It's actually intentional for standalone cluster mode to not upload jars.
One of the reasons why YARN takes at least 10 seconds before running any
simple application is because there's a lot of random overhead (e.g.
putting jars in HDFS). If this missing functionality is not documented
somewhere then we should add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I
have filed https://issues.apache.org/jira/browse/SPARK-12559.

-Andrew

2015-12-29 4:18 GMT-08:00 Greg Hill :

>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the workers nodes.
> >
> >I might have understood wrong, but I though the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: difference between ++ and Union of a RDD

2015-12-29 Thread Gokula Krishnan D
Ted - Thanks for the updates. Then it's the same case with sc.parallelize()
or sc.makeRDD(), right?

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu  wrote:

> From RDD.scala :
>
>   def ++(other: RDD[T]): RDD[T] = withScope {
> this.union(other)
>
> They should be the same.
>
> On Tue, Dec 29, 2015 at 10:41 AM, email2...@gmail.com  > wrote:
>
>> Hello All -
>>
>> tried couple of operations by using ++ and union on RDD's but realized
>> that
>> the end results are same. Do you know any differences?.
>>
>> val odd_partA  = List(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
>> odd_partA: List[Int] = List(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3, 5,
>> 7, 9, 11)
>>
>> val odd_partB  = List(1,3,13,15,9)
>> odd_partB: List[Int] = List(1, 3, 13, 15, 9)
>>
>> val odd_partC  = List(15,9,1,3,13)
>> odd_partC: List[Int] = List(15, 9, 1, 3, 13)
>>
>> val odd_partA_RDD = sc.parallelize(odd_partA)
>> odd_partA_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at
>> parallelize at :17
>>
>> val odd_partB_RDD = sc.parallelize(odd_partB)
>> odd_partB_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10]
>> at
>> parallelize at :17
>>
>> val odd_partC_RDD = sc.parallelize(odd_partC)
>> odd_partC_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11]
>> at
>> parallelize at :17
>>
>> val odd_PARTAB_pp = odd_partA_RDD ++(odd_partB_RDD)
>> odd_PARTAB_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at $plus$plus
>> at
>> :23
>>
>> val odd_PARTAB_union = odd_partA_RDD.union(odd_partB_RDD)
>> odd_PARTAB_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[13] at union at
>> :23
>>
>> odd_PARTAB_pp.count
>> res8: Long = 23
>>
>> odd_PARTAB_union.count
>> res9: Long = 23
>>
>> val odd_PARTABC_pp = odd_partA_RDD ++(odd_partB_RDD) ++ (odd_partC_RDD)
>> odd_PARTABC_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at $plus$plus
>> at :27
>>
>> val odd_PARTABC_union =
>> odd_partA_RDD.union(odd_partB_RDD).union(odd_partC_RDD)
>> odd_PARTABC_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union
>> at
>> :27
>>
>> odd_PARTABC_pp.count
>> res10: Long = 28
>>
>> odd_PARTABC_union.count
>> res11: Long = 28
>>
>> Thanks
>> Gokul
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-and-Union-of-a-RDD-tp25830.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Task hang problem

2015-12-29 Thread Ted Yu
Can you log onto 10.65.143.174, find task 31, and take a stack trace?

Thanks
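
For anyone unsure how to do that: the usual options are running jstack against the executor JVM's process id on that host, or using the "Thread Dump" link on the Executors tab of the application UI (available in recent Spark versions). A rough Scala illustration of the same information gathered from inside a JVM, shown only to make clear what a thread dump contains:

import scala.collection.JavaConverters._

// Print every live thread's name, state, and stack -- roughly what
// `jstack <pid>` reports for the process.
Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
  println(s"Thread ${t.getId}: ${t.getName} (${t.getState})")
  frames.foreach(frame => println("    at " + frame))
}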

On Tue, Dec 29, 2015 at 9:19 AM, Darren Govoni  wrote:

> Hi,
>   I've had this nagging problem where a task will hang and the entire job
> hangs. Using pyspark. Spark 1.5.1
>
> The job output looks like this, and hangs after the last task:
>
> ..
> 15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)
> 15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage 0.0
> (TID 15) in 11668 ms on 10.65.143.174 (29/32)
> 15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage 0.0
> (TID 23) in 11684 ms on 10.65.143.174 (30/32)
> 15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage 0.0
> (TID 7) in 11717 ms on 10.65.143.174 (31/32)
> {nothing here for a while, ~6mins}
>
>
> Here is the executor status, from UI.
>
> 31 31 0 RUNNING PROCESS_LOCAL 2 / 10.65.143.174 2015/12/29 17:00:28 6.8
> min 0 ms 0 ms 60 ms 0 ms 0 ms 0.0 B
> Here is executor 2 from 10.65.143.174. Never see task 31 get to the
> executor.any ideas?
>
> .
> 15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast
> variable 0
> 15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called with
> curMem=0, maxMem=2223023063
> 15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 5.8 KB, free 2.1 GB)
> 15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable 0
> took 208 ms
> 15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called with
> curMem=5979, maxMem=2223023063
> 15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 8.3 KB, free 2.1 GB)
> 15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747, init
> = 166, finish = 0
> 15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0 (TID
> 15). 967 bytes result sent to driver
> 15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735, init
> = 220, finish = 0
> 15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0 (TID
> 23). 967 bytes result sent to driver
> 15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812, init
> = 158, finish = 0
> 15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 967 bytes result sent to driver
> root@ip-10-65-143-174 2]$
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>


Re: difference between ++ and Union of a RDD

2015-12-29 Thread Ted Yu
>From RDD.scala :

  def ++(other: RDD[T]): RDD[T] = withScope {
this.union(other)

They should be the same.
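
A small spark-shell illustration of the equivalence (and of the fact that neither operator removes duplicates; distinct() is needed for set-style union), assuming sc is the usual SparkContext from the shell:

val a = sc.parallelize(List(1, 3, 5, 7))
val b = sc.parallelize(List(3, 5, 9))

a.union(b).count()            // 7 -- duplicates are kept
(a ++ b).count()              // 7 -- identical: ++ simply delegates to union
a.union(b).distinct().count() // 5 -- distinct() removes the repeated 3 and 5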

On Tue, Dec 29, 2015 at 10:41 AM, email2...@gmail.com 
wrote:

> Hello All -
>
> tried couple of operations by using ++ and union on RDD's but realized that
> the end results are same. Do you know any differences?.
>
> val odd_partA  = List(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
> odd_partA: List[Int] = List(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3, 5,
> 7, 9, 11)
>
> val odd_partB  = List(1,3,13,15,9)
> odd_partB: List[Int] = List(1, 3, 13, 15, 9)
>
> val odd_partC  = List(15,9,1,3,13)
> odd_partC: List[Int] = List(15, 9, 1, 3, 13)
>
> val odd_partA_RDD = sc.parallelize(odd_partA)
> odd_partA_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at
> parallelize at :17
>
> val odd_partB_RDD = sc.parallelize(odd_partB)
> odd_partB_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at
> parallelize at <console>:17
>
> val odd_partC_RDD = sc.parallelize(odd_partC)
> odd_partC_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at
> parallelize at <console>:17
>
> val odd_PARTAB_pp = odd_partA_RDD ++(odd_partB_RDD)
> odd_PARTAB_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at $plus$plus
> at
> <console>:23
>
> val odd_PARTAB_union = odd_partA_RDD.union(odd_partB_RDD)
> odd_PARTAB_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[13] at union at
> <console>:23
>
> odd_PARTAB_pp.count
> res8: Long = 23
>
> odd_PARTAB_union.count
> res9: Long = 23
>
> val odd_PARTABC_pp = odd_partA_RDD ++(odd_partB_RDD) ++ (odd_partC_RDD)
> odd_PARTABC_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at $plus$plus
> at <console>:27
>
> val odd_PARTABC_union =
> odd_partA_RDD.union(odd_partB_RDD).union(odd_partC_RDD)
> odd_PARTABC_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union at
> <console>:27
>
> odd_PARTABC_pp.count
> res10: Long = 28
>
> odd_PARTABC_union.count
> res11: Long = 28
>
> Thanks
> Gokul
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-and-Union-of-a-RDD-tp25830.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


difference between ++ and Union of a RDD

2015-12-29 Thread email2...@gmail.com
Hello All - 

I tried a couple of operations using ++ and union on RDDs, but realized that
the end results are the same. Do you know of any differences?

val odd_partA  = List(1,3,5,7,9,11,1,3,5,7,9,11,1,3,5,7,9,11)
odd_partA: List[Int] = List(1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11, 1, 3, 5,
7, 9, 11)

val odd_partB  = List(1,3,13,15,9)
odd_partB: List[Int] = List(1, 3, 13, 15, 9)

val odd_partC  = List(15,9,1,3,13)
odd_partC: List[Int] = List(15, 9, 1, 3, 13)

val odd_partA_RDD = sc.parallelize(odd_partA)
odd_partA_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at
parallelize at <console>:17

val odd_partB_RDD = sc.parallelize(odd_partB)
odd_partB_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at
parallelize at <console>:17

val odd_partC_RDD = sc.parallelize(odd_partC)
odd_partC_RDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at
parallelize at <console>:17

val odd_PARTAB_pp = odd_partA_RDD ++(odd_partB_RDD)
odd_PARTAB_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at $plus$plus at
<console>:23

val odd_PARTAB_union = odd_partA_RDD.union(odd_partB_RDD)
odd_PARTAB_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[13] at union at
<console>:23

odd_PARTAB_pp.count
res8: Long = 23

odd_PARTAB_union.count
res9: Long = 23

val odd_PARTABC_pp = odd_partA_RDD ++(odd_partB_RDD) ++ (odd_partC_RDD)
odd_PARTABC_pp: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at $plus$plus
at <console>:27

val odd_PARTABC_union =
odd_partA_RDD.union(odd_partB_RDD).union(odd_partC_RDD)
odd_PARTABC_union: org.apache.spark.rdd.RDD[Int] = UnionRDD[17] at union at
<console>:27

odd_PARTABC_pp.count
res10: Long = 28

odd_PARTABC_union.count
res11: Long = 28

Thanks
Gokul



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-and-Union-of-a-RDD-tp25830.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL Hive orc snappy table

2015-12-29 Thread Dawid Wysakowicz
Hi,

I have a table in Hive stored as ORC with compression = snappy. I try to
execute a query on that table and it fails (previously I ran the same query
on tables stored as ORC-zlib and as Parquet, so it is not a matter of the
query itself).

I managed to execute this query with Hive on Tez on those tables.

The exception I get is as follows:

15/12/29 17:16:46 WARN scheduler.DAGScheduler: Creating new stage failed
due to exception - job: 3
java.lang.RuntimeException: serious problem
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.getPartitions(MapPartitionsWithPreparationRDD.scala:40)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
at
org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
at org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:388)
at
org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:405)
at
org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:370)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:253)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:354)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:351)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:351)
at
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:363)
at
org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:266)
at
org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:300)
at
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:734)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1466)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.util.concurrent.ExecutionException:
java.lang.IndexOutOfBoundsException: Index: 0
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1016)
... 48 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0
at java.util.Collections$EmptyList.get(Collections.java:4454)
at
org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
at
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
at
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
at
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java
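
One way to narrow this down is to check whether the failure comes from the
metastore-driven OrcInputFormat split generation by reading the ORC files under
the table's location directly. This is a diagnostic sketch only; the warehouse
path and table names below are assumptions that have to be replaced with the
table's actual location:

// Diagnostic sketch: bypass the Hive metastore table and read the ORC files directly.
// The path is an assumption: use the table's actual LOCATION.
val direct = hiveContext.read.format("orc")
  .load("/apps/hive/warehouse/mydb.db/orc_snappy_table")
direct.printSchema()
direct.registerTempTable("orc_snappy_direct")
hiveContext.sql("SELECT COUNT(*) FROM orc_snappy_direct").show()

If this direct read works while querying the metastore table does not, the
problem is more likely in how splits are generated for that table's files than
in the query itself.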

Task hang problem

2015-12-29 Thread Darren Govoni


Hi,

  I've had this nagging problem where a task will hang and the
entire job hangs. Using pyspark. Spark 1.5.1



The job output looks like this, and hangs after the last task:



..

15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage
0.0 (TID 15) in 11668 ms on 10.65.143.174 (29/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage
0.0 (TID 23) in 11684 ms on 10.65.143.174 (30/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage
0.0 (TID 7) in 11717 ms on 10.65.143.174 (31/32)

{nothing here for a while, ~6mins}





Here is the executor status, from UI.

31 31 0 RUNNING PROCESS_LOCAL 2 / 10.65.143.174 2015/12/29 17:00:28 6.8
min 0 ms 0 ms 60 ms 0 ms 0 ms 0.0 B

Here is executor 2 from 10.65.143.174. Never see task 31 get to the
executor. Any ideas?



.

15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast
variable 0

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called
with curMem=0, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.8 KB, free 2.1 GB)

15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable
0 took 208 ms

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called
with curMem=5979, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 8.3 KB, free 2.1 GB)

15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747,
init = 166, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0
(TID 15). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735,
init = 220, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0
(TID 23). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812,
init = 158, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID
7). 967 bytes result sent to driver

root@ip-10-65-143-174 2]$ 


Sent from my Verizon Wireless 4G LTE smartphone

Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Annabel Melongo
Eugene,
The example I gave you was in Python. I used it on my end and it works fine. 
Sorry, I don't know Scala.
Thanks 

On Tuesday, December 29, 2015 5:24 AM, Eugene Morozov 
 wrote:
 

Annabel,
That might work in Scala, but I use Java. Three quotes just don't compile =) If
your example is in Scala, then, I believe, semicolon is not required.
--
Be well!
Jean Morozov
On Mon, Dec 28, 2015 at 8:49 PM, Annabel Melongo  
wrote:

Jean,
Try this: df.select("""select * from tmptable where x1 = '3.0'""").show();
Note: you have to use 3 double quotes as marked.

On Friday, December 25, 2015 11:30 AM, Eugene Morozov 
 wrote:
 

Thanks for the comments, although the issue is not in limit() predicate. It's
something with Spark being unable to resolve the expression.

I can do something like this, and it works as it's supposed to:
df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old-fashioned SQL style should also work. I have
df.registerTempTable("tmptable") and then
df.select("select * from tmptable where x1 = '3.0'").show();

org.apache.spark.sql.AnalysisException: cannot
resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5,
x3, x2;
 at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
 at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly
fine. I just wonder how to fix / work around that.
--
Be well!
Jean Morozov
On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman  wrote:

sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 
supported)

or use Dmitriy's solution. select() defines your projection when you've 
specified entire query
On 25 December 2015 at 15:42, Василец Дмитрий  wrote:

hello
you can try to use df.limit(5).show()
just trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov  
wrote:

Hello, I'm basically stuck as I have no idea where to look;
Following simple code, given that my Datasource is working gives me an
exception.

DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema();   <-- prints the schema perfectly fine!

df.show();  <-- Works perfectly fine (shows table with 20 lines)!
df.registerTempTable("table");
df.select("select * from table limit 5").show(); <-- gives weird exception

Exception is:
AnalysisException: cannot resolve 'select * from table limit 5' given input
columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a dataframe, but cannot select any specific columns
either "select * from table" or "select VER, CREATED from table".
I use spark 1.5.2. The same code perfectly works through Zeppelin 0.5.5.
Thanks.
--
Be well!
Jean Morozov


Re: ClassNotFoundException when executing spark jobs in standalone/cluster mode on Spark 1.5.2

2015-12-29 Thread Saiph Kappa
I found out that by commenting out this line in the application code:
sparkConf.set("spark.executor.extraJavaOptions", " -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:FreqInlineSize=300
-XX:MaxInlineSize=300 ")

the exception does not occur anymore.  Not entirely sure why, but
everything goes fine without that line.

Thanks!

On Tue, Dec 29, 2015 at 1:39 PM, Prem Spark  wrote:

> you need to make sure this class is accessible to all servers, since in
> cluster mode the driver can be on any of the worker nodes.
>
>
> On Fri, Dec 25, 2015 at 5:57 PM, Saiph Kappa 
> wrote:
>
>> Hi,
>>
>> I'm submitting a spark job like this:
>>
>> ~/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --class Benchmark --master
>>> spark://machine1:6066 --deploy-mode cluster --jars
>>> target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar
>>> /home/user/bench/target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar 1
>>> machine2  1000
>>>
>>
>> and in the driver stderr, I get the following exception:
>>
>>  WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 74,
>>> XXX.XXX.XX.XXX): java.lang.ClassNotFoundException: Benchmark$$anonfun$main$1
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:270)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>>> at
>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>>> at
>>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>> at
>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> at
>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>> at
>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>>> at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>
>> Note that everything works fine when using deploy-mode as 'client'.
>> This is the application that I'm trying to run:
>> https://github.com/tdas/spark-streaming-benchmark (this problem also
>> happens for non streaming applications)
>>
>> What can I do to sort this out?
>>
>> Thanks.
>>
>
>


Re: ClassNotFoundException when executing spark jobs in standalone/cluster mode on Spark 1.5.2

2015-12-29 Thread Prem Spark
You need to make sure this class is accessible to all servers, since in
cluster mode the driver can be on any of the worker nodes.


On Fri, Dec 25, 2015 at 5:57 PM, Saiph Kappa  wrote:

> Hi,
>
> I'm submitting a spark job like this:
>
> ~/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --class Benchmark --master
>> spark://machine1:6066 --deploy-mode cluster --jars
>> target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar
>> /home/user/bench/target/scala-2.10/benchmark-app_2.10-0.1-SNAPSHOT.jar 1
>> machine2  1000
>>
>
> and in the driver stderr, I get the following exception:
>
>  WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 74, XXX.XXX.XX.XXX):
>> java.lang.ClassNotFoundException: Benchmark$$anonfun$main$1
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>> at
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>> at
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>
> Note that everything works fine when using deploy-mode as 'client'.
> This is the application that I'm trying to run:
> https://github.com/tdas/spark-streaming-benchmark (this problem also
> happens for non streaming applications)
>
> What can I do to sort this out?
>
> Thanks.
>


Spark 1.5.2 compatible spark-cassandra-connector

2015-12-29 Thread vivek.meghanathan
All,
What is the compatible spark-cassandra-connector for Spark 1.5.2? I can only
find the latest connector version spark-cassandra-connector_2.10-1.5.0-M3, which
has a dependency on Spark 1.5.1. Can we use the same for 1.5.2? Do any classpath
issues need to be handled, or any jars need to be excluded, while packaging the
application jar?

http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.5.0-M3/spark-cassandra-connector_2.10-1.5.0-M3.pom

Regards,
Vivek M
The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Greg Hill


On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into, and that you also might, is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Eugene Morozov
Annabel,

That might work in Scala, but I use Java. Three quotes just don't compile =)
If your example is in Scala, then, I believe, semicolon is not required.

--
Be well!
Jean Morozov
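
For what it's worth, the underlying issue in this thread is that
DataFrame.select() / selectExpr() only take column expressions, while a full
SQL statement has to go through SQLContext.sql() against a registered temp
table. A minimal Scala sketch, assuming the same sqlc, filename, temp table
and columns as in the quoted code below (the Java API is analogous):

// Full SQL statements go through the SQLContext:
val df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource")
df.registerTempTable("tmptable")
sqlc.sql("SELECT * FROM tmptable WHERE x1 = '3.0'").show()

// select()/selectExpr() only take column expressions, not whole statements:
df.selectExpr("x1", "x2").show()
df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5)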

On Mon, Dec 28, 2015 at 8:49 PM, Annabel Melongo 
wrote:

> Jean,
>
> Try this:
>
> df.select("""select * from tmptable where x1 = '3.0'""").show();
>
>
> *Note: *you have to use 3 double quotes as marked
>
>
>
> On Friday, December 25, 2015 11:30 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>
> Thanks for the comments, although the issue is not in limit() predicate.
> It's something with spark being unable to resolve the expression.
>
> I can do something like this, and it works as it's supposed to:
>  df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);
>
> But I think the old-fashioned SQL style should also work. I have
> df.registerTempTable("tmptable") and then
>
> df.select("select * from tmptable where x1 = '3.0'").show();
>
> org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp
> where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
>
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca
>
>
> From the first statement I conclude that my custom datasource is perfectly
> fine.
> Just wonder how to fix / workaround that.
> --
> Be well!
> Jean Morozov
>
> On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman 
> wrote:
>
> sqlContext.sql("select * from table limit 5").show() (not sure if limit 5
> supported)
>
> or use Dmitriy's solution. select() defines your projection when you've
> specified entire query
>
> On 25 December 2015 at 15:42, Василец Дмитрий 
> wrote:
>
> hello
> you can try to use df.limit(5).show()
> just trick :)
>
> On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
> Hello, I'm basically stuck as I have no idea where to look;
>
> Following simple code, given that my Datasource is working gives me an
> exception.
>
> DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
> df.cache();
> df.printSchema();   <-- prints the schema perfectly fine!
>
> df.show();  <-- Works perfectly fine (shows table with 20 
> lines)!
> df.registerTempTable("table");
> df.select("select * from table limit 5").show(); <-- gives weird exception
>
> Exception is:
>
> AnalysisException: cannot resolve 'select * from table limit 5' given input 
> columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS
>
> I can do a collect on a dataframe, but cannot select any specific columns
> either "select * from table" or "select VER, CREATED from table".
>
> I use spark 1.5.2.
> The same code perfectly works through Zeppelin 0.5.5.
>
> Thanks.
> --
> Be well!
> Jean Morozov
>
>
>
>
>
>
>


Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Hyukjin Kwon
I see. As far as I know, the Spark CSV datasource does not support custom date
formats, only standard ones such as “2015-08-20 15:57:00”.

Internally this uses Timestamp.valueOf() and Date.valueOf() to parse them.
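
For reference, those JDK parsers only accept the JDBC escape formats; a small
illustration (plain JDK calls, not spark-csv code):

// Formats accepted by the parsers spark-csv delegates to:
java.sql.Timestamp.valueOf("2015-08-20 15:57:00")  // yyyy-[m]m-[d]d hh:mm:ss[.f...]
java.sql.Date.valueOf("2015-08-20")                // yyyy-[m]m-[d]d
// A value such as "25/11/2014" makes both throw IllegalArgumentException.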

For me, it looks like you can

   1.

   modify and build the library yourself to handle the custom date/time format
   (it does not require changing a lot of code, just here:
   https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TypeCast.scala#L62-L64
   ).
   2.

   read the files as an RDD first, parse the date part into an appropriate
   format and then make a DataFrame with the library (a sketch of this follows
   below).
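
A minimal sketch of the second option, parsing a dd/MM/yyyy column by hand and
building the DataFrame with createDataFrame instead of going back through
spark-csv. The path matches the example below, but the three-column schema is
an illustrative stand-in for the full file:

import java.text.SimpleDateFormat
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Illustrative schema: only three columns, the third holding dates like 25/11/2014.
val schema = StructType(Seq(
  StructField("COLUMN1", StringType, true),
  StructField("COLUMN2", StringType, true),
  StructField("COLUMN3", DateType, true)))

val raw = sc.textFile("/tmp/TestDivya/loandepo_10K.csv")
val header = raw.first()
val rows = raw.filter(_ != header).mapPartitions { it =>
  val fmt = new SimpleDateFormat("dd/MM/yyyy")          // the custom date format
  it.map { line =>
    val cols = line.split(",", -1)
    val date = if (cols(2).isEmpty) null
               else new java.sql.Date(fmt.parse(cols(2)).getTime)
    Row(cols(0), cols(1), date)
  }
}

val df = hiveContext.createDataFrame(rows, schema)
df.printSchema()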


2015-12-29 16:41 GMT+09:00 Divya Gehlot :

> Yes, I am using spark-csv only.
>
> below is the sample code for your reference
>
>
>1. 15/12/28 03:34:27 INFO SparkILoop: Created sql context (with Hive 
> support)..
>2. SQL context available as sqlContext.
>3.
>4. scala> import org.apache.spark.sql.hive.HiveContext
>5. import org.apache.spark.sql.hive.HiveContext
>6.
>7. scala> import org.apache.spark.sql.hive.orc._
>8. import org.apache.spark.sql.hive.orc._
>9.
>10. scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>11. 15/12/28 03:34:57 WARN SparkConf: The configuration key 
> 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 
> and and may be removed in the future. Please use the new key 
> 'spark.yarn.am.waitTime' instead.
>12. 15/12/28 03:34:57 INFO HiveContext: Initializing execution hive, 
> version 0.13.1
>13. hiveContext: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@3413fbe
>14.
>15. scala> import org.apache.spark.sql.types.{StructType, StructField, 
> StringType, IntegerType,FloatType ,LongType ,TimestampType,NullType };
>16. import org.apache.spark.sql.types.{StructType, StructField, 
> StringType, IntegerType, FloatType, LongType, TimestampType, NullType}
>17.
>18. scala> val loandepoSchema = StructType(Seq(
>19.  | StructField("COLUMN1", StringType, true),
>20.  | StructField("COLUMN2", StringType  , true),
>21.  | StructField("COLUMN3", TimestampType , true),
>22.  | StructField("COLUMN4", TimestampType , true),
>23.  | StructField("COLUMN5", StringType , true),
>24.  | StructField("COLUMN6", StringType, true),
>25.  | StructField("COLUMN7", IntegerType, true),
>26.  | StructField("COLUMN8", IntegerType, true),
>27.  | StructField("COLUMN9", StringType, true),
>28.  | StructField("COLUMN10", IntegerType, true),
>29.  | StructField("COLUMN11", IntegerType, true),
>30.  | StructField("COLUMN12", IntegerType, true),
>31.  | StructField("COLUMN13", StringType, true),
>32.  | StructField("COLUMN14", StringType, true),
>33.  | StructField("COLUMN15", StringType, true),
>34.  | StructField("COLUMN16", StringType, true),
>35.  | StructField("COLUMN17", StringType, true),
>36.  | StructField("COLUMN18", StringType, true),
>37.  | StructField("COLUMN19", StringType, true),
>38.  | StructField("COLUMN20", StringType, true),
>39.  | StructField("COLUMN21", StringType, true),
>40.  | StructField("COLUMN22", StringType, true)))
>41. loandepoSchema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(COLUMN1,StringType,true), 
> StructField(COLUMN2,StringType,true), 
> StructField(COLUMN3,TimestampType,true), 
> StructField(COLUMN4,TimestampType,true), 
> StructField(COLUMN5,StringType,true), StructField(COLUMN6,StringType,true), 
> StructField(COLUMN7,IntegerType,true), StructField(COLUMN8,IntegerType,true), 
> StructField(COLUMN9,StringType,true), StructField(COLUMN10,IntegerType,true), 
> StructField(COLUMN11,IntegerType,true), 
> StructField(COLUMN12,IntegerType,true), 
> StructField(COLUMN13,StringType,true), StructField(COLUMN14,StringType,true), 
> StructField(COLUMN15,StringType,true), StructField(COLUMN16,StringType,true), 
> StructField(COLUMN17,StringType,true), StructField(COLUMN18,StringType,true), 
> StructField(COLUMN19,Strin...
>42. scala> val lonadepodf = 
> hiveContext.read.format("com.databricks.spark.csv").option("header", 
> "true").schema(loandepoSchema).load("/tmp/TestDivya/loandepo_10K.csv")
>43. 15/12/28 03:37:52 INFO HiveContext: Initializing 
> HiveMetastoreConnection version 0.13.1 using Spark classes.
>44. lonadepodf: org.apache.spark.sql.DataFrame = [COLUMN1: string, 
> COLUMN2: string, COLUMN3: timestamp, COLUMN4: timestamp, COLUMN5: string, 
> COLUMN6: string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int, 
> COLUMN11: int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15: 
> string, COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19: 
> string, COLUMN20: string, COLUMN21: string, COLUMN22: string]
>45.
>46. scala> lonadepodf.select("COLUMN1").show(10)
>47. 15/1

Re: [Spakr1.4.1] StuctField for date column in CSV file while creating custom schema

2015-12-29 Thread Raghavendra Pandey
You can use the date type...
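
For instance, the date column can be declared with DateType in the custom
schema (the column names below are illustrative); note that a value like
25/11/2014 is not in the yyyy-MM-dd form spark-csv expects, so it may still
need to be parsed manually, as discussed in the previous thread:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("LOAN_DATE", DateType, true),    // hypothetical date column
  StructField("AMOUNT", IntegerType, true)))   // hypothetical numeric column
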
On Dec 29, 2015 9:02 AM, "Divya Gehlot"  wrote:

> Hi,
> I am a newbie to Spark,
> my apologies for such a naive question.
> I am using Spark 1.4.1 and writing code in Scala. I have input data as a
> CSV file which I am parsing using the spark-csv package. I am creating a
> custom schema to process the CSV file.
> Now my query is: which datatype, or rather which StructField, should I use
> for the date column of my CSV file?
> I am using HiveContext and have a requirement to create a Hive table after
> processing the CSV file.
> For example, my date column in the CSV file looks like
>
> 25/11/2014 20/9/2015 25/10/2015 31/10/2012 25/9/2013 25/11/2012 20/10/2013
> 25/10/2011
>
>