Re: Can't submit job to stand alone cluster

2015-12-30 Thread SparkUser

Sorry need to clarify:

When you say:

   When the docs say "If your application is launched through Spark
   submit, then the application jar is automatically distributed to all
   worker nodes," it is actually saying that your executors get their
   jars from the driver. This is true whether you're running in client
   mode or cluster mode.


Don't you mean the master, not the driver? I thought the whole source of 
confusion was that people expect the driver to distribute the jars, but 
they have to be visible on the file system local to the master?


I see a lot of people tripped up by this and a nice mail from Greg Hill 
to the list cleared this up for me but now I am confused again. I am a 
couple days away from having a way to test this myself, so I am just "in 
theory" right now.


   On 12/29/2015 05:18 AM, Greg Hill wrote:

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN, which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.



Thanks,

Jim


On 12/29/2015 04:36 PM, Daniel Valdivia wrote:

That makes things more clear! Thanks

Issue resolved

Sent from my iPhone

On Dec 29, 2015, at 2:43 PM, Annabel Melongo 
<melongo_anna...@yahoo.com> wrote:



Thanks Andrew for this awesome explanation :)


On Tuesday, December 29, 2015 5:30 PM, Andrew Or 
<and...@databricks.com> wrote:



Let me clarify a few things for everyone:

There are three *cluster managers*: standalone, YARN, and Mesos. Each 
cluster manager can run in two *deploy modes*, client or cluster. In 
client mode, the driver runs on the machine that submitted the 
application (the client). In cluster mode, the driver runs on one of 
the worker machines in the cluster.


When I say "standalone cluster mode" I am referring to the standalone 
cluster manager running in cluster deploy mode.


Here's how the resources are distributed in each mode (omitting Mesos):

Standalone / YARN client mode. The driver runs on the client
machine (i.e. the machine that ran Spark submit), so it should already
have access to the jars. The executors then pull the jars from an
HTTP server started in the driver.

Standalone cluster mode. Spark submit does not upload your
jars to the cluster, so all the resources you need must already
be on all of the worker machines. The executors, however,
still pull the jars from the driver as in client mode,
rather than finding them in their own local file systems.

YARN cluster mode. Spark submit does upload your jars to the
cluster. In particular, it puts the jars in HDFS so your driver
can just read from there. As in the other deployments, the executors
pull the jars from the driver.


When the docs say "If your application is launched through Spark 
submit, then the application jar is automatically distributed to all 
worker nodes," it is actually saying that your executors get their 
jars from the driver. This is true whether you're running in client 
mode or cluster mode.


If the docs are unclear (and they seem to be), then we should update 
them. I have filed SPARK-12565 to track this.


Please let me know if there's anything else I can help clarify.

Cheers,
-Andrew




2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:


Andrew,

Now I see where the confusion lies. Standalone cluster mode (your
link) is nothing but a combination of client mode and standalone
mode (my link), without YARN.

But I'm confused by this paragraph in your link:

If your application is launched through Spark submit, then the
application jar is automatically distributed to all worker nodes.
For any additional jars that your application depends on, you
should specify them through the --jars flag using comma as a
delimiter (e.g. --jars jar1,jar2).

That can't be true; this is only the case when Spark runs on top
of YARN. Please correct me if I'm wrong.

Thanks


On Tuesday, December 29, 2015 2:54 PM, Andrew Or
<and...@databricks.com> wrote:



http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo
<melongo_anna...@yahoo.com>:

Greg,

Can you please send me a doc describing the standalone
cluster mode? Honestly, I had never heard of it.

   

Re: Reply: trouble understanding data frame memory usage “java.io.IOException: Unable to acquire memory”

2015-12-30 Thread SparkUser
Sounds like you guys are on the right track. This is purely FYI, since I 
haven't seen it posted; I'm just responding to the line in the original 
post that your data structure should fit in memory.


Two more disclaimers, "FWIW" and "maybe this is not relevant or 
already covered." OK, here goes...


From 
http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks:


Sometimes, you will get an OutOfMemoryError not because your RDDs don’t 
fit in memory, but because the working set of one of your tasks, such as 
one of the reduce tasks in groupByKey, was too large. Spark’s shuffle 
operations (sortByKey, groupByKey, reduceByKey, join, etc) build 
a hash table within each task to perform the grouping, which can often 
be large. The simplest fix here is to increase the level of 
parallelism, so that each task’s input set is smaller. Spark can 
efficiently support tasks as short as 200 ms, because it reuses one 
executor JVM across many tasks and it has a low task launching cost, so 
you can safely increase the level of parallelism to more than the number 
of cores in your clusters.
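
For illustration, a minimal Java sketch of passing an explicit partition 
count to one of those shuffle operations (the method name and the number 
200 are made up for the example, not from the thread):

    import org.apache.spark.api.java.JavaPairRDD;

    // Raise the shuffle parallelism so each reduce task's working set is smaller.
    // reduceByKey, groupByKey, sortByKey, and join all accept an explicit
    // partition count; 200 is illustrative, tune it to your data and cluster.
    static JavaPairRDD<String, Integer> sumByKey(JavaPairRDD<String, Integer> pairs) {
        int numPartitions = 200;
        return pairs.reduceByKey((a, b) -> a + b, numPartitions);
    }

(For DataFrame shuffles the analogous knob is spark.sql.shuffle.partitions.)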


I would be curious if that helps at all. Sounds like an interesting 
problem you are working on.


Jim

On 12/29/2015 05:51 PM, Davies Liu wrote:

Hi Andy,

Could you change the logging level to INFO and post some logs here? There will be some 
logging about the memory usage of a task when an OOM occurs.

In 1.6, the memory for a task is: (HeapSize - 300M) * 0.75 / number of tasks. 
Is it possible that the heap is too small?
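
Plugging rough numbers into that formula, using the 1024M heap and local[2] 
master mentioned further down in the thread (a back-of-the-envelope sketch only):

    // Rough per-task execution memory under the 1.6 formula above.
    public class TaskMemoryEstimate {
        public static void main(String[] args) {
            long heapMb = 1024;        // -Xmx from the unit test below
            long reservedMb = 300;     // reserved by Spark, per the formula
            double fraction = 0.75;    // spark.memory.fraction default in 1.6
            int concurrentTasks = 2;   // master = local[2]
            double perTaskMb = (heapMb - reservedMb) * fraction / concurrentTasks;
            System.out.printf("~%.0f MB per task%n", perTaskMb);  // prints ~272 MB
        }
    }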

Davies

--
Davies Liu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, December 29, 2015, at 4:28 PM, Andy Davidson wrote:


Hi Michael
  
https://github.com/apache/spark/archive/v1.6.0.tar.gz
  
In both 1.6.0 and 1.5.2 my unit test works when I call repartition(1) before saving output. Coalesce(1) still fails.
  
Coalesce(1), spark-1.5.2:

Caused by: java.io.IOException: Unable to acquire 33554432 bytes of memory

Coalesce(1), spark-1.6.0:

Caused by: java.lang.OutOfMemoryError: Unable to acquire 28 bytes of memory, got 0
  
Hope this helps
  
Andy
  
From: Michael Armbrust <mich...@databricks.com>
Date: Monday, December 28, 2015 at 2:41 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: trouble understanding data frame memory usage 
“java.io.IOException: Unable to acquire memory”
  

Unfortunately, in 1.5 we didn't force operators to spill when they ran out of memory, 
so there is not a lot you can do. It would be awesome if you could test with 
1.6 and see if things are any better?
  
On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

I am using Spark 1.5.1. I am running into some memory problems with a Java unit test. Yes, 
I could fix it by setting -Xmx (it's set to 1024M), however I want to better understand 
what is going on so I can write better code in the future. The test runs on a Mac, 
master="local[2]".
  
I have a Java unit test that starts by reading a 672K ASCII file. My output data file is 152K. It seems strange that such a small amount of data would cause an out of memory exception. I am running a pretty standard machine learning process:
  
1. Load data
2. Create an ML pipeline
3. Transform the data
4. Train a model
5. Make predictions
6. Join the predictions back to my original data set
7. Coalesce(1); I only have a small amount of data and want to save it in a single file
8. Save final results back to disk
  
  
Step 7: I am unable to call coalesce(1); it fails with “java.io.IOException: Unable to acquire memory”.
  
To try to figure out what is going on, I put in log messages to count the number of partitions.
  
It turns out I have 20 input files, and each one winds up in a separate partition. OK, so after loading I call coalesce(1) and check to make sure I only have a single partition.
  
The total number of observations is 1998.
  
After step 7 I count the number of partitions and discover I have 224 partitions! That is surprising, given that I called coalesce(1) before I did anything with the data. My data set should easily fit in memory. When I save it to disk, 202 files are created, with 162 of them empty!
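
For what it's worth, a minimal sketch of that kind of partition check, 
together with the repartition(1) workaround reported earlier in the thread 
(the method name and output path are illustrative):

    import org.apache.spark.sql.DataFrame;

    // Count partitions and force a single output file (Spark 1.5/1.6 Java API).
    static void saveAsSingleFile(DataFrame df, String path) {
        int numPartitions = df.rdd().partitions().length;
        System.out.println("numPartitions before save: " + numPartitions);
        // coalesce(1) avoids a shuffle, so the reduced parallelism can propagate
        // into upstream stages; repartition(1) adds a shuffle and only collapses
        // the data at the very end, just before the write.
        df.repartition(1).write().json(path);
    }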
  
In general I am not explicitly using cache.

Some of the data frames get registered as tables; I find it easier to use SQL.

Some of the data frames get converted back to RDDs; I find it easier to create RDDs this way.

I put calls to unpersist(true) in several places.
  
  
private void memoryCheck(String name) {
    // Log the JVM's total and free heap at a named checkpoint in the test.
    Runtime rt = Runtime.getRuntime();
    logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}",
            name,
            String.format("%,d", rt.totalMemory()),
            String.format("%,d", rt.freeMemory()));
}
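
For context, a typical way to call such a helper between steps (the 
checkpoint labels are illustrative):

    memoryCheck("after load");
    memoryCheck("after transform");
    memoryCheck("after coalesce(1)");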
  
  
  
Any idea how I can get a better understanding of what is going on? My goal is to learn to write better Spark code.
  
Kind regards
  
Andy
  
Memory usages at various points in my unit test
  
name: ra

Re: map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread SparkUser

Wouldn't Amazon Elastic IP do this for you?

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html

On 12/28/2015 10:58 PM, Divya Gehlot wrote:


Hi,

I have an HDP 2.3.2 cluster installed in Amazon EC2.

I want to update the IP address of spark.driver.appUIAddress, which is 
currently mapped to the private IP of the EC2 instance.


I searched the Spark config in Ambari and could find the 
spark.driver.appUIAddress property.


Because of this private IP mapping, the Spark web UI page is not getting 
displayed.


Would really appreciate the help.

Thanks,

Divya






How to calculate percentiles with Spark?

2014-10-21 Thread sparkuser
Hi,

What would be the best way to get percentiles from a Spark RDD? I can see
that JavaDoubleRDD and MLlib's MultivariateStatisticalSummary provide
mean() but not percentiles.

Thank you!

Horace
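
One way to get a percentile by hand is to sort the RDD, zip it with indices, 
and take the element at the target rank; a rough Java sketch, assuming a 
JavaRDD<Double> of values (this is not a built-in API):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    // Nearest-rank p-th percentile (0 < p <= 100) via sort + zipWithIndex.
    // Every element is shuffled once, so this is fine for moderate data sizes;
    // for very large data, sampling or a histogram would be cheaper.
    static double percentile(JavaRDD<Double> data, double p) {
        long n = data.count();
        long rank = Math.max(0L, (long) Math.ceil(p / 100.0 * n) - 1);
        JavaPairRDD<Double, Long> indexed =
            data.sortBy(x -> x, true, data.partitions().size()).zipWithIndex();
        return indexed.filter(t -> t._2() == rank).first()._1();
    }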


