Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
Hi Patrick,

I think he means just write `/tmp/sparkserverlog` instead of
`file:/tmp/sparkserverlog`. However, I think both should work. What mode
are you running in, client mode (the default) or cluster mode? If the
latter, your driver runs on the cluster, so your event logs won't be on
the machine you ran spark-submit from. Also, are you running standalone,
YARN, or Mesos?

As Jeff commented above, if event logging is in fact enabled you should see
the log message from EventLoggingListener. If that log message is not
present in your driver logs, it's likely that the settings in your
spark-defaults.conf are not being picked up correctly.

-Andrew
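
For reference, a minimal sketch of the relevant settings when set from
application code rather than spark-defaults.conf; this assumes Spark 1.x and
a hypothetical local log directory:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: enable event logging for this application.
// The directory must already exist and be writable by the driver.
val conf = new SparkConf()
  .setAppName("EventLogExample")
  .set("spark.eventLog.enabled", "true")                // turn event logging on
  .set("spark.eventLog.dir", "file:/tmp/spark-events")  // where the logs go
val sc = new SparkContext(conf)

If these take effect, the EventLoggingListener log line Andrew mentions
should appear in the driver logs at startup.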

2016-03-03 19:57 GMT-08:00 PatrickYu :

> alvarobrandon wrote
> > Just write /tmp/sparkserverlog without the file part.
>
> I don't get your point. What do you mean by 'without the file part'?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-event-log-in-tmp-spark-events-tp26318p26394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Andrew Or
Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled =
true` then the workers will each start a shuffle service automatically. No
need to start the shuffle services yourself separately.

-Andrew
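
To illustrate, a minimal sketch of the application-side flags, assuming Spark
1.6; note that `spark.shuffle.service.enabled=true` must also be set in the
spark-defaults.conf (or environment) used when starting each Worker, since it
is the Worker that launches the service:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynamicAllocationExample")
  .set("spark.dynamicAllocation.enabled", "true")  // grow/shrink executors on demand
  .set("spark.shuffle.service.enabled", "true")    // executors fetch shuffle data via the service
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "24")
val sc = new SparkContext(conf)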

2016-03-08 11:21 GMT-08:00 Silvio Fiorito :

> There’s a script to start it up under sbin, start-shuffle-service.sh. Run
> that on each of your worker nodes.
>
>
>
>
>
>
>
> *From: *Yuval Itzchakov 
> *Sent: *Tuesday, March 8, 2016 2:17 PM
> *To: *Silvio Fiorito ;
> user@spark.apache.org
> *Subject: *Re: Using dynamic allocation and shuffle service in Standalone
> Mode
>
>
> Actually, I assumed that setting the flag in the spark job would turn on
> the shuffle service in the workers. I now understand that assumption was
> wrong.
>
> Is there any way to set the flag via the driver? Or must I manually set it
> via spark-env.sh on each worker?
>
>
> On Tue, Mar 8, 2016, 20:14 Silvio Fiorito 
> wrote:
>
>> You’ve started the external shuffle service on all worker nodes, correct?
>> Can you confirm they’re still running and haven’t exited?
>>
>>
>>
>>
>>
>>
>>
>> *From: *Yuval.Itzchakov 
>> *Sent: *Tuesday, March 8, 2016 12:41 PM
>> *To: *user@spark.apache.org
>> *Subject: *Using dynamic allocation and shuffle service in Standalone
>> Mode
>>
>>
>> Hi,
>> I'm using Spark 1.6.0, and according to the documentation, dynamic
>> allocation and the spark shuffle service should both be enabled.
>>
>> When I submit a spark job via the following:
>>
>> spark-submit \
>> --master  \
>> --deploy-mode cluster \
>> --executor-cores 3 \
>> --conf "spark.streaming.backpressure.enabled=true" \
>> --conf "spark.dynamicAllocation.enabled=true" \
>> --conf "spark.dynamicAllocation.minExecutors=2" \
>> --conf "spark.dynamicAllocation.maxExecutors=24" \
>> --conf "spark.shuffle.service.enabled=true" \
>> --conf "spark.executor.memory=8g" \
>> --conf "spark.driver.memory=10g" \
>> --class SparkJobRunner
>>
>> /opt/clicktale/entityCreator/com.clicktale.ai.entity-creator-assembly-0.0.2.jar
>>
>> I'm seeing error logs from the workers being unable to connect to the
>> shuffle service:
>>
>> 16/03/08 17:33:15 ERROR storage.BlockManager: Failed to connect to
>> external
>> shuffle server, will retry 2 more times after waiting 5 seconds...
>> java.io.IOException: Failed to connect to 
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>> at
>>
>> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>> at
>>
>> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>> at
>>
>> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
>> at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> at
>>
>> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
>> at
>> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
>> at org.apache.spark.executor.Executor.<init>(Executor.scala:85)
>> at
>>
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>> at
>>
>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>> at
>>
>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I verified all relevant ports are open. Has anyone else experienced such a
>> failure?
>>
>> Yuval.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
Hi Alex,

Yes, you can set `spark.cleaner.ttl`:
http://spark.apache.org/docs/1.6.0/configuration.html, but I would not
recommend it!

We are actually removing this property in Spark 2.0 because it has caused
problems for many users in the past. In particular, if you accidentally use
a variable that has been automatically cleaned, then you will run into
problems like shuffle fetch failures or broadcast variable not found etc,
which may fail your job.

Alternatively, Spark already automatically cleans up all variables that
have been garbage collected, including RDDs, shuffle dependencies,
broadcast variables and accumulators. This context-based cleaning has been
enabled by default for many versions now, so it should be reliable. The
only caveat is that it may not work as well in a shell environment, where
some variables may never go out of scope.

Please let me know if you have more questions,
-Andrew
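
For illustration, a sketch of the explicit alternative Andrew alludes to,
assuming Spark 1.x running locally:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("UnpersistExample").setMaster("local[2]"))

var cached: RDD[Int] = sc.parallelize(1 to 1000000).cache()
cached.count()  // materializes the cached blocks

// Option 1: free the cached blocks explicitly once you know they are unneeded.
cached.unpersist()

// Option 2: drop the reference; once it is garbage collected on the driver,
// the context-based cleaner removes the associated state automatically.
cached = null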


2016-01-13 11:36 GMT-08:00 Alexander Pivovarov :

> Is it possible to automatically unpersist RDDs which are not used for 24
> hours?
>


Re: Read Accumulator value while running

2016-01-13 Thread Andrew Or
Hi Kira,

As you suspected, accumulator values are only updated after the task
completes. We do send accumulator updates from the executors to the driver
on periodic heartbeats, but these only concern internal accumulators, not
the ones created by the user.

In short, I'm afraid there is not currently a way (in Spark 1.6 and before)
to access the accumulator values until after the tasks that updated them
have completed. This will change in Spark 2.0, the next version, however.

Please let me know if you have more questions.
-Andrew
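
A small sketch of the granularity described above, assuming Spark 1.x: the
driver-side value grows only as individual tasks finish, so polling it from
another thread shows coarse progress but never mid-task values:

import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(
  new SparkConf().setAppName("AccumulatorProgress").setMaster("local[4]"))
val processed = sc.accumulator(0L, "records processed")

// Run the long action asynchronously so the main thread is free to poll.
val job = Future {
  sc.parallelize(1 to 1000000, numSlices = 100).foreach(_ => processed += 1L)
}

while (!job.isCompleted) {
  // Updates arrive only as tasks complete, per the discussion above.
  println(s"processed so far: ${processed.value}")
  Thread.sleep(1000)
}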

2016-01-13 11:24 GMT-08:00 Daniel Imberman :

> Hi Kira,
>
> I'm having some trouble understanding your question. Could you please give
> a code example?
>
>
>
> From what I think you're asking there are two issues with what you're
> looking to do. (Please keep in mind I could be totally wrong on both of
> these assumptions, but this is what I've been led to believe.)
>
> 1. The contract of an accumulator is that you can't actually read the
> value as the function is performing because the values in the accumulator
> don't actually mean anything until they are reduced. If you were looking
> for progress in a local context, you could do mapPartitions and have a
> local accumulator per partition, but I don't think it's possible to get the
> actual accumulator value in the middle of the map job.
>
> 2. As far as performing ac2 while ac1 is "always running", I'm pretty sure
> that's not possible. Because of the way lazy evaluation works in Spark, the
> transformations have to be done serially. Having it any other way would
> actually be really bad because then you could have ac1 changing the data
> thereby making ac2's output unpredictable.
>
> That being said, with a more specific example it might be possible to help
> figure out a solution that accomplishes what you are trying to do.
>
> On Wed, Jan 13, 2016 at 5:43 AM Kira  wrote:
>
>> Hi,
>>
>> So I have an action on one RDD that is relatively long; let's call it ac1.
>> What I want to do is execute another action (ac2) on the same RDD to see
>> the evolution of the first one (ac1). To this end I want to use an
>> accumulator and read its value progressively, to see the changes on it (on
>> the fly) while ac1 is still running. My problem is that the accumulator is
>> only updated once ac1 has finished, which is not helpful for me :/ .
>>
>> I've seen here
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html
>> >
>> what may seem like a solution for me, but it doesn't work: "While Spark
>> already offers support for asynchronous reduce (collect data from workers,
>> while not interrupting execution of a parallel transformation) through
>> accumulator"
>>
>> Another post suggested using SparkListener to do that.
>>
>> Are these solutions correct? If yes, could you give me a simple example?
>> Are there other solutions?
>>
>> thank you.
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Read-Accumulator-value-while-running-tp25960.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Can't submit job to stand alone cluster

2015-12-30 Thread Andrew Or
Hi Jim,

Just to clarify further:

   - *Driver *is the process with SparkContext. A driver represents an
   application (e.g. spark-shell, SparkPi) so there is exactly one driver in
   each application.


   - *Executor *is the process that runs the tasks scheduled by the driver.
   There should be at least one executor in each application.


   - *Master *is the process that handles scheduling of *applications*. It
   decides where drivers and executors are launched and how many cores and how
   much memory to give to each application. This only exists in standalone
   mode.


   - *Worker *is the process that actually launches the executor and driver
   JVMs (the latter only in cluster mode). It talks to the Master to decide
   what to launch and how much memory to give to each process. This only
   exists in standalone mode.

It is actually the *driver*, not the Master, that distributes jars to
executors. The Master is largely unconcerned with individual requirements
from an application apart from cores / memory constraints. This is because
we still need to distribute jars to executors in YARN and Mesos modes, so
the common process, the driver, has to do it.

I thought the whole point of confusion is that people expect the driver to
> distribute jars but they have to be visible to the master on the file
> system local to the master?


Actually the requirement is that the jars have to be visible to the machine
running the *driver*, not the Master. In client mode, your jars have to be
visible to the machine running spark-submit. In cluster mode, your jars
have to be visible to all machines running a Worker, since the driver can
be launched on any of them.

The nice email from Greg is spot-on.

Does that make sense?

-Andrew
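
As a small illustration of the driver-distributes-jars point, a sketch
assuming Spark 1.x and a hypothetical jar path: a jar registered on the
driver is served to executors from the driver, whatever the cluster manager:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("AddJarExample").setMaster("local[2]"))

// The path needs to be visible to the driver (or be a globally visible
// URI such as hdfs://...); executors fetch it from the driver as needed.
sc.addJar("/opt/libs/extra-udfs.jar")  // hypothetical path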


2015-12-30 11:23 GMT-08:00 SparkUser :

> Sorry need to clarify:
>
> When you say:
>
> *When the docs say **"If your application is launched through Spark
> submit, then the application jar is automatically distributed to all worker
> nodes,"**it is actually saying that your executors get their jars from
> the driver. This is true whether you're running in client mode or cluster
> mode.*
>
>
> Don't you mean the master, not the driver? I thought the whole point of
> confusion is that people expect the driver to distribute jars but they have
> to be visible to the master on the file system local to the master?
>
> I see a lot of people tripped up by this and a nice mail from Greg Hill to
> the list cleared this up for me but now I am confused again. I am a couple
> days away from having a way to test this myself, so I am just "in theory"
> right now.
>
> On 12/29/2015 05:18 AM, Greg Hill wrote:
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
>
> Thanks,
>
> Jim
>
>
> On 12/29/2015 04:36 PM, Daniel Valdivia wrote:
>
> That makes things more clear! Thanks
>
> Issue resolved
>
> Sent from my iPhone
>
> On Dec 29, 2015, at 2:43 PM, Annabel Melongo < 
> melongo_anna...@yahoo.com> wrote:
>
> Thanks Andrew for this awesome explanation :)
>
>
> On Tuesday, December 29, 2015 5:30 PM, Andrew Or < 
> and...@databricks.com> wrote:
>
>
> Let me clarify a few things for everyone:
>
> There are three *cluster managers*: standalone, YARN, and Mesos. Each
> cluster manager can run in two *deploy modes*, client or cluster. In
> client mode, the driver runs on the machine that submitted the application
> (the client). In cluster mode, the driver runs on one of the worker
> machines in the cluster.
>
> When I say "standalone cluster mode" I am referring to the standalone
> cluster manager running in cluster deploy mode.
>
> Here's how the resources are distributed in each mode (omitting Mesos):
>
> *Standalone / YARN client mode. *The driver runs on the client machine
> (i.e. machine that ran Spark submit) so it should already have access to
> the jars. The executors then pull the jars from an HTTP server started in
> the driver.
>
> *Standalone cluster mode. *Spark submit does *not* upload your jars to
> the cluster, so all the resources you need must already be on all of the
> worker machines. The executors, however

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Let me clarify a few things for everyone:

There are three *cluster managers*: standalone, YARN, and Mesos. Each
cluster manager can run in two *deploy modes*, client or cluster. In client
mode, the driver runs on the machine that submitted the application (the
client). In cluster mode, the driver runs on one of the worker machines in
the cluster.

When I say "standalone cluster mode" I am referring to the standalone
cluster manager running in cluster deploy mode.

Here's how the resources are distributed in each mode (omitting Mesos):

*Standalone / YARN client mode. *The driver runs on the client machine
(i.e. machine that ran Spark submit) so it should already have access to
the jars. The executors then pull the jars from an HTTP server started in
the driver.

*Standalone cluster mode. *Spark submit does *not* upload your jars to the
cluster, so all the resources you need must already be on all of the worker
machines. The executors, however, actually just pull the jars from the
driver as in client mode instead of finding it in their own local file
systems.

*YARN cluster mode. *Spark submit *does* upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read from
there. As in other deployments, the executors pull the jars from the driver.


When the docs say "If your application is launched through Spark submit,
then the application jar is automatically distributed to all worker nodes," it
is actually saying that your executors get their jars from the driver. This
is true whether you're running in client mode or cluster mode.

If the docs are unclear (and they seem to be), then we should update them.
I have filed SPARK-12565 <https://issues.apache.org/jira/browse/SPARK-12565>
to track this.

Please let me know if there's anything else I can help clarify.

Cheers,
-Andrew




2015-12-29 13:07 GMT-08:00 Annabel Melongo :

> Andrew,
>
> Now I see where the confusion lies. Standalone cluster mode, your link, is
> nothing but a combination of client-mode and standalone mode, my link,
> without YARN.
>
> But I'm confused by this paragraph in your link:
>
> If your application is launched through Spark submit, then the
> application jar is automatically distributed to all worker nodes. For any
> additional jars that your
>   application depends on, you should specify them through the
> --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
>
> That can't be true; this is only the case when Spark runs on top of YARN.
> Please correct me, if I'm wrong.
>
> Thanks
>
>
>
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or 
> wrote:
>
>
>
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
>
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
>
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo :

> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes I've listed appear in the last paragraph of
> this doc: Running Spark Applications
> <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html>
>
>
>
>
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or 
> wrote:
>
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>
> @Annabel That's not true. There *is* a standalone cluster mode where
> driver runs on one of the workers instead of on the client machine. What
> you're describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo :
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode; however, it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into, which you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
>
>
>


Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
>
> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.


Actually, it just happens to be backward compatible because we didn't
change the shuffle file formats. This may not necessarily be the case
moving forward as Spark offers no such guarantees. Just thought it's worth
clarifying.

2015-12-27 22:34 GMT-08:00 Saisai Shao :

> External shuffle service is backward compatible, so if you deployed 1.6
> shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications.
>
> Thanks
> Saisai
>
> On Mon, Dec 28, 2015 at 2:33 PM, 顾亮亮  wrote:
>
>> Is it possible to support both spark-1.5.1 and spark-1.6.0 on one yarn
>> cluster?
>>
>>
>>
>> *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com]
>> *Sent:* Monday, December 28, 2015 2:29 PM
>> *To:* Jeff Zhang
>> *Cc:* 顾亮亮; user@spark.apache.org; 刘骋昺
>> *Subject:* Re: Opening Dynamic Scaling Executors on Yarn
>>
>>
>>
>> Replacing all the shuffle jars and restarting the NodeManagers is enough;
>> no need to restart the NN.
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang  wrote:
>>
>> See
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮  wrote:
>>
>> Hi all,
>>
>>
>>
>> SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a
>> useful feature to save resources on yarn.
>>
>> We want to enable this feature on our yarn cluster.
>>
>> I have a question about the version of shuffle service.
>>
>>
>>
>> I’m now using spark-1.5.1 (shuffle service).
>>
>> If I want to upgrade to spark-1.6.0, should I replace the shuffle service
>> jar and restart all the namenodes on yarn?
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Mars
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.


@Annabel That's not true. There *is* a standalone cluster mode where driver
runs on one of the workers instead of on the client machine. What you're
describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo :

> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either
> it's stand-alone or it's cluster mode but it can't be both.
>
>  With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine;
> use --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
> additionally driver runs as a thread in ApplicationMaster; use --jars
> option with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver
> is *NOT* a thread in ApplicationMaster; use --packages to submit a jar
>
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or 
> wrote:
>
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars.
> One of the reasons why YARN takes at least 10 seconds before running any
> simple application is because there's a lot of random overhead (e.g.
> putting jars in HDFS). If this missing functionality is not documented
> somewhere then we should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I
> have filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill :
>
>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode; however, it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into, which you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
Hi Greg,

It's actually intentional for standalone cluster mode to not upload jars.
One of the reasons why YARN takes at least 10 seconds before running any
simple application is because there's a lot of random overhead (e.g.
putting jars in HDFS). If this missing functionality is not documented
somewhere then we should add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I
have filed https://issues.apache.org/jira/browse/SPARK-12559.

-Andrew

2015-12-29 4:18 GMT-08:00 Greg Hill :

>
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in stand
> >alone mode; however, it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
>
> Another problem I ran into, which you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: which aws instance type for shuffle performance

2015-12-18 Thread Andrew Or
Hi Rastan,

Unless you're using off-heap memory or starting multiple executors per
machine, I would recommend the r3.2xlarge option, since you don't actually
want gigantic heaps (100GB is more than enough). I've personally run Spark
on a very large scale with r3.8xlarge instances, but I've been using
off-heap, so much of the memory was actually not used.

Yes, if a shuffle file exists locally Spark just reads from disk.

-Andrew

2015-12-15 23:11 GMT-08:00 Rastan Boroujerdi :

> I'm trying to determine whether I should be using 10 r3.8xlarge or 40
> r3.2xlarge. I'm mostly concerned with shuffle performance of the
> application.
>
> If I go with r3.8xlarge I will need to configure 4 worker instances per
> machine to keep the JVM size down. The worker instances will likely contend
> with each other for network and disk I/O if they are on the same machine.
> If I go with 40 r3.2xlarge I will be able to allocate a single worker
> instance per box, allowing each worker instance to have its own dedicated
> network and disk I/O.
>
> Since shuffle performance is heavily impacted by disk and network
> throughput, it seems like going with 40 r3.2xlarge would be the better
> configuration between the two. Is my analysis correct? Are there other
> tradeoffs that I'm not taking into account? Does spark bypass the network
> transfer and read straight from disk if worker instances are on the same
> machine?
>
> Thanks,
>
> Rastan
>


Re: imposed dynamic resource allocation

2015-12-18 Thread Andrew Or
Hi Antony,

The configuration to enable dynamic allocation is per-application.

If you only wish to enable this for one of your applications, just set
`spark.dynamicAllocation.enabled` to true for that application only. The
way it works under the hood is that application will start sending requests
to the AM asking for executors. If you did not enable this config, your
application will not make such requests.

-Andrew
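
A sketch of enabling it for a single application, assuming Spark 1.5 on YARN
with the YarnShuffleService already configured on the NodeManagers:

import org.apache.spark.{SparkConf, SparkContext}

// Only this application requests and releases executors dynamically;
// applications that leave the flag unset (default: false) keep their
// statically allocated executors.
val conf = new SparkConf()
  .setAppName("PerAppDynamicAllocation")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)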

2015-12-11 14:01 GMT-08:00 Antony Mayi :

> Hi,
>
> using spark 1.5.2 on yarn (client mode) and was trying to use the dynamic
> resource allocation, but it seems that once it is enabled by the first app,
> any following application is managed that way even if it explicitly
> disables it.
>
> example:
> 1) yarn configured with org.apache.spark.network.yarn.YarnShuffleService
> as spark_shuffle aux class
> 2) running first app that doesn't specify dynamic allocation / shuffle
> service - it runs as expected with static executors
> 3) running second application that enables spark.dynamicAllocation.enabled
> and spark.shuffle.service.enabled - it is dynamic as expected
> 4) running another app that doesn't enable, and even explicitly disables,
> dynamic
> allocation / shuffle service still the executors are being added/removed
> dynamically throughout the runtime.
> 5) restarting nodemanagers to reset this
>
> Is this known issue or have I missed something? Can the dynamic resource
> allocation be enabled per application?
>
> Thanks,
> Antony.
>


Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Andrew Or
Hi Roy,

I believe Spark just gets its application ID from YARN, so you can just do
`sc.applicationId`.

-Andrew
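
A minimal sketch, assuming an existing SparkContext `sc` (e.g. in
spark-shell) running under YARN:

// The ID YARN assigned to this application,
// e.g. "application_1450000000000_0001".
val appId = sc.applicationId
// It can then be used from a shell: yarn application -kill <appId>
println(s"YARN application ID: $appId")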

2015-12-18 0:14 GMT-08:00 Deepak Sharma :

> I have never tried this, but there are YARN client APIs that you can use in
> your spark program to get the application id.
> Here is the link to the yarn client java doc:
>
> http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html
> getApplications() is the method for your purpose here.
>
> Thanks
> Deepak
>
>
> On Fri, Dec 18, 2015 at 1:31 PM, Kyle Lin  wrote:
>
>> Hello there
>>
>> I have the same requirement.
>>
>> I submit a streaming job with yarn-cluster mode.
>>
>> If I want to shut down this endless YARN application, I have to find out
>> the application id by myself and use "yarn application -kill " to
>> kill the application.
>>
>> Therefore, if I can get the returned application id in my client program, it
>> will be easy for me to kill YARN application from my client program.
>>
>> Kyle
>>
>>
>>
>> 2015-06-24 13:02 GMT+08:00 canan chen :
>>
>>> I don't think there is YARN-related stuff to access in Spark.  Spark
>>> doesn't depend on YARN.
>>>
>>> BTW, why do you want the yarn application id ?
>>>
>>> On Mon, Jun 22, 2015 at 11:45 PM, roy  wrote:
>>>
 Hi,

   Is there a way to get Yarn application ID inside spark application,
 when
 running spark Job on YARN ?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Limit of application submission to cluster

2015-12-18 Thread Andrew Or
Hi Saif, have you verified that the cluster has enough resources for all 4
programs?

-Andrew

2015-12-18 5:52 GMT-08:00 :

> Hello everyone,
>
> I am testing some parallel program submission to a stand alone cluster.
> Everything works alright; the problem is that, for some reason, I can’t
> submit more than 3 programs to the cluster.
> The fourth one, whether legacy or REST, simply hangs until one of the
> first three completes.
> I am not sure how to debug this. I have tried increasing the number of
> connections per peer and the number of Akka threads, with no luck. Any
> ideas?
>
> Thanks,
> Saif
>
>


Re: Spark job submission REST API

2015-12-10 Thread Andrew Or
Hello,

The hidden API was implemented for use internally and there are no plans to
make it public at this point. It was originally introduced to provide
backward compatibility in submission protocol across multiple versions of
Spark. A full-fledged stable REST API for submitting applications would
require a detailed design consensus among the community.

-Andrew

2015-12-10 8:26 GMT-08:00 mvle :

> Hi,
>
> I would like to use Spark as a service through REST API calls
> for uploading and submitting a job, getting results, etc.
>
> There is a project by the folks at Ooyala:
> https://github.com/spark-jobserver/spark-jobserver
>
> I also encountered some hidden job REST APIs in Spark:
> http://arturmkrtchyan.com/apache-spark-hidden-rest-api
>
> To help determine which set of APIs to use, I would like to know
> the plans for those hidden Spark APIs.
> Will they be made public and supported at some point?
>
> Thanks,
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-submission-REST-API-tp25670.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andrew Or
Hi Andy,

You must be running in cluster mode. The Spark Master accepts client mode
submissions on port 7077 and cluster mode submissions on port 6066. This is
because standalone cluster mode uses a REST API to submit applications by
default. If you submit to port 6066 instead, the warning should go away.

-Andrew


2015-12-10 18:13 GMT-08:00 Andy Davidson :

> Hi Jakob
>
> The cluster was set up using the spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
> script
>
> Given my limited knowledge I think this looks okay?
>
> Thanks
>
> Andy
>
> $ sudo netstat -peant | grep 7077
>
> tcp  0  0 :::172-31-30-51:7077  :::*                   LISTEN       0  3116414  27355/java
> tcp  0  0 :::172-31-30-51:7077  :::172-31-30-51:57311  ESTABLISHED  0  3115919  27355/java
> tcp  0  0 :::172-31-30-51:7077  :::172-31-30-51:42333  ESTABLISHED  0  3736664  27355/java
> tcp  0  0 :::172-31-30-51:7077  :::172-31-30-51:49796  ESTABLISHED  0  3115925  27355/java
> tcp  0  0 :::172-31-30-51:7077  :::172-31-30-51:42290  ESTABLISHED  0  3115923  27355/java
>
>
> $ ps -aux | grep 27355
>
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.8/FAQ
>
> ec2-user 23867  0.0  0.0 110404   872 pts/0  S+   02:06   0:00 grep 27355
>
> root 27355  0.5  6.7 3679096 515836 ?  Sl   Nov26 107:04
> /usr/java/latest/bin/java -cp
> /root/spark/sbin/../conf/:/root/spark/lib/spark-assembly-1.5.1-hadoop1.2.1.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/ephemeral-hdfs/conf/
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip
> ec2-54-215-217-122.us-west-1.compute.amazonaws.com --port 7077
> --webui-port 8080
>
> From: Jakob Odersky 
> Date: Thursday, December 10, 2015 at 5:55 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: Warning: Master endpoint spark://ip:7077 was not a REST
> server. Falling back to legacy submission gateway instead.
>
> Is there any other process using port 7077?
>
> On 10 December 2015 at 08:52, Andy Davidson  > wrote:
>
>> Hi
>>
>> I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning.
>> My job seems to run with out any problem.
>>
>> Kind regards
>>
>> Andy
>>
>> + /root/spark/bin/spark-submit --class
>> com.pws.spark.streaming.IngestDriver --master spark://
>> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077
>> --total-executor-cores 2 --deploy-mode cluster
>> hdfs:///home/ec2-user/build/ingest-all.jar --clusterMode --dirPath week_3
>>
>> Running Spark using the REST application submission protocol.
>>
>> 15/12/10 16:46:33 WARN RestSubmissionClient: Unable to connect to server
>> spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077.
>>
>> Warning: Master endpoint
>> ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077 was not a REST
>> server. Falling back to legacy submission gateway instead.
>>
>
>


Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is:

import org.apache.spark.sql.functions._

val df = ...
df.withColumn("id", monotonicallyIncreasingId())

-Andrew
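
A slightly fuller sketch, assuming Spark 1.5+, the third-party spark-csv
package on the classpath, an existing SparkContext `sc`, and a hypothetical
HDFS path:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")  // spark-csv package (pre-2.0 era)
  .option("header", "true")            // pick up col1, col2, col3 from the header
  .load("hdfs:///data/input.csv")      // hypothetical path

// Note: the generated ids are unique and monotonically increasing, but not
// consecutive -- there is no built-in auto-increment column.
val withId = df.withColumn("id", monotonicallyIncreasingId())
withId.registerTempTable("my_table")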

2015-11-19 8:23 GMT-08:00 xiaohe lan :

> Hi,
>
> I have some csv files in HDFS with headers like col1, col2, col3. I want to
> add a column named id, so that a record would be 
>
> How can I do this using Spark SQL ? Can id be auto increment ?
>
> Thanks,
> Xiaohe
>


Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Andrew Or
Hi Tom,

I believe a workaround is to set `spark.dynamicAllocation.initialExecutors`
to 0. As others have mentioned, from Spark 1.5.2 onwards this should no
longer be necessary.

-Andrew
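
A sketch of that workaround expressed as application configuration, assuming
Spark 1.5.1 (minExecutors is lowered to 0 here so the initial value stays
within bounds; the allocator then ramps up on demand):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynAllocWorkaround")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.initialExecutors", "0")  // the 1.5.0/1.5.1 workaround
val sc = new SparkContext(conf)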

2015-11-09 8:19 GMT-08:00 Jonathan Kelly :

> Tom,
>
> You might be hitting https://issues.apache.org/jira/browse/SPARK-10790,
> which was introduced in Spark 1.5.0 and fixed in 1.5.2. Spark 1.5.2 just
> passed release candidate voting, so it should be tagged, released and
> announced soon. If you are able to build from source yourself and run with
> that, you might want to try building from the v1.5.2-rc2 tag to see if it
> fixes your issue. Otherwise, hopefully Spark 1.5.2 will be available for
> download very soon.
>
> ~ Jonathan
>
> On Mon, Nov 9, 2015 at 6:08 AM, Akhil Das 
> wrote:
>
>> Did you go through
>> http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup
>> for yarn, i guess you will have to copy the spark-1.5.1-yarn-shuffle.jar to
>> the classpath of all nodemanagers in your cluster.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 30, 2015 at 7:41 PM, Tom Stewart <
>> stewartthom...@yahoo.com.invalid> wrote:
>>
>>> I am running the following command on a Hadoop cluster to launch Spark
>>> shell with DRA:
>>> spark-shell  --conf spark.dynamicAllocation.enabled=true --conf
>>> spark.shuffle.service.enabled=true --conf
>>> spark.dynamicAllocation.minExecutors=4 --conf
>>> spark.dynamicAllocation.maxExecutors=12 --conf
>>> spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=120 --conf
>>> spark.dynamicAllocation.schedulerBacklogTimeout=300 --conf
>>> spark.dynamicAllocation.executorIdleTimeout=60 --executor-memory 512m
>>> --master yarn-client --queue default
>>>
>>> This is the code I'm running within the Spark Shell - just demo stuff
>>> from teh web site.
>>>
>>> import org.apache.spark.mllib.clustering.KMeans
>>> import org.apache.spark.mllib.linalg.Vectors
>>>
>>> // Load and parse the data
>>> val data = sc.textFile("hdfs://ns/public/sample/kmeans_data.txt")
>>>
>>> val parsedData = data.map(s => Vectors.dense(s.split('
>>> ').map(_.toDouble))).cache()
>>>
>>> // Cluster the data into two classes using KMeans
>>> val numClusters = 2
>>> val numIterations = 20
>>> val clusters = KMeans.train(parsedData, numClusters, numIterations)
>>>
>>> This works fine on Spark 1.4.1 but is failing on Spark 1.5.1. Did
>>> something change that I need to do differently for DRA on 1.5.1?
>>>
>>> This is the error I am getting:
>>> 15/10/29 21:44:19 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>> 15/10/29 21:44:34 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>> 15/10/29 21:44:49 WARN YarnScheduler: Initial job has not accepted any
>>> resources; check your cluster UI to ensure that workers are registered and
>>> have sufficient resources
>>>
>>> That happens to be the same error you get if you haven't followed the
>>> steps to enable DRA, however I have done those and as I said if I just flip
>>> to Spark 1.4.1 on the same cluster it works with my YARN config.
>>>
>>>
>>
>


Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andrew Or
Hi all,

Both the history server and the shuffle service are backward compatible,
but not forward compatible. This means that as long as you have the latest
version of the history server / shuffle service running in your cluster,
you're fine (you don't need multiple of them).

That said, an old shuffle service (e.g. 1.2) also happens to work with, say,
Spark 1.4 because the shuffle file formats haven't changed. However, there
are no guarantees that this will remain the case.

-Andrew

2015-10-05 16:37 GMT-07:00 Alex Rovner :

> We are running CDH 5.4 with Spark 1.3 as our main version and that version
> is configured to use the external shuffling service. We have also installed
> Spark 1.5 and have configured it not to use the external shuffling service
> and that works well for us so far. I would be interested myself how to
> configure multiple versions to use the same shuffling service.
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
> * *
>
> On Mon, Oct 5, 2015 at 11:06 AM, Andreas Fritzler <
> andreas.fritz...@gmail.com> wrote:
>
>> Hi Steve, Alex,
>>
>> how do you handle the distribution and configuration of
>> the spark-*-yarn-shuffle.jar on your NodeManagers if you want to use 2
>> different Spark versions?
>>
>> Regards,
>> Andreas
>>
>> On Mon, Oct 5, 2015 at 4:54 PM, Steve Loughran 
>> wrote:
>>
>>>
>>> > On 5 Oct 2015, at 16:48, Alex Rovner  wrote:
>>> >
>>> > Hey Steve,
>>> >
>>> > Are you referring to the 1.5 version of the history server?
>>> >
>>>
>>>
>>> Yes. I should warn, however, that there's no guarantee that a history
>>> server running the 1.4 code will handle the histories of a 1.5+ job. In
>>> fact, I'm fairly confident it won't, as the events to get replayed are
>>> different.
>>>
>>
>>
>


Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
Hi Joshua,

What cluster manager are you using, standalone or YARN? (Note that
standalone here does not mean local mode).

If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7077")`,
where CLUSTER_URL is the machine that started the standalone Master. If
YARN, you need to do `setMaster("yarn")`, assuming that all the Hadoop
configurations files such as core-site.xml are already set up properly.

-Andrew
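
A minimal sketch of the standalone case in Scala (the Python form is
analogous), with a hypothetical master hostname; note that on Spark 1.x the
YARN client-mode master string is commonly written "yarn-client":

import org.apache.spark.{SparkConf, SparkContext}

// Standalone: point at the machine running the Master, not at local[2].
val conf = new SparkConf()
  .setAppName("ClusterExample")
  .setMaster("spark://master-host.internal:7077")  // hypothetical host
// For YARN instead: .setMaster("yarn-client"), with HADOOP_CONF_DIR pointing
// at the cluster's Hadoop configuration files.
val sc = new SparkContext(conf)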


2015-09-21 8:53 GMT-07:00 Hemant Bhanawat :

> When you specify master as local[2], it starts the spark components in a
> single jvm. You need to specify the master correctly.
>
> I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I
> run a Spark process, it works fine -- but only on the master, as if it were
> standalone.
>
> The web-UI and logging code shows only 1 executor, the localhost.
>
> How can I diagnose this?
>
> (I create *SparkConf*, in Python, with *setMaster('local[2]')*.)
>
> (Strangely, though I don't think that this causes the problem, there is
> almost nothing spark-related on the slave machine: */usr/lib/spark* has a
> few jars, but that's it: *datanucleus-api-jdo.jar  datanucleus-core.jar
> datanucleus-rdbms.jar  spark-yarn-shuffle.jar*. But this is an AWS EMR
> cluster as created by *create-cluster*, so I would assume that the slave
> and master are configured OK out of the box.)
>
> Joshua
>


Re: Spark ec2 lunch problem

2015-08-24 Thread Andrew Or
Hey Garry,

Have you verified that your particular VPC and subnet are open to the
world? In particular, have you verified the route table attached to your
VPC / subnet contains an internet gateway open to the public?

I've run into this issue myself recently and that was the problem for me.

-Andrew

2015-08-24 5:58 GMT-07:00 Robin East :

> spark-ec2 is the way to go; however, you may need to debug connectivity
> issues. For example do you know that the servers were correctly setup in
> AWS and can you access each node using ssh? If no then you need to work out
> why (it’s not a spark issue). If yes then you will need to work out why ssh
> via the spark-ec2 script is not working.
>
> I’ve used spark-ec2 successfully many times but have never used the
> —vpc-id and —subnet-id options and that may be the source of your problems,
> especially since it appears to be a hostname resolution issue. If you could
> confirm the above questions then maybe someone on the list can help
> diagnose the specific problem.
>
>
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/malak/
>
> On 24 Aug 2015, at 13:45, Garry Chen  wrote:
>
> So what is the best way to deploy a spark cluster in an EC2 environment?
> Any suggestions?
>
> Garry
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com
> ]
> *Sent:* Friday, August 21, 2015 4:27 PM
> *To:* Garry Chen 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark ec2 lunch problem
>
>
> It may happen that the version of the spark-ec2 script you are using is
> buggy, or sometimes AWS has problems provisioning machines.
> On Aug 21, 2015 7:56 AM, "Garry Chen"  wrote:
>
> Hi All,
> I am trying to lunch a spark ec2 cluster by running
>  spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
> --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cluster but
> getting following message endless.  Please help.
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host:
> SSH return code: 255
> SSH output: ssh: Could not resolve hostname : Name or service not known
>
>
>


Re: DAG related query

2015-08-20 Thread Andrew Or
Hi Bahubali,

Once RDDs are created, they are immutable (in most cases). In your case you
end up with 3 RDDs:

(1) the original rdd1 that reads from the text file
(2) rdd2, that applies a map function on (1), and
(3) the new rdd1 that applies a map function on (2)

There's no cycle because you have 3 distinct RDDs. All you're doing is
reassigning a reference `rdd1`, but the underlying RDD doesn't change.

-Andrew
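
The same structure sketched in Scala, with the three distinct RDDs named
explicitly for clarity (hypothetical input path):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("DagExample").setMaster("local[2]"))

val original = sc.textFile("/tmp/input.txt")  // (1) the original RDD
val mapped   = original.map(_.toUpperCase)    // (2) derived from (1)
val remapped = mapped.map(_.trim)             // (3) derived from (2)

// Reassigning a variable (rdd1 = rdd2.map(...) in the Java version) only
// moves a driver-side reference: (1) still exists, and the lineage
// (1) -> (2) -> (3) is a straight line, not a cycle.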

2015-08-20 6:21 GMT-07:00 Sean Owen :

> No. The third line creates a third RDD whose reference simply replaces
> the reference to the first RDD in your local driver program. The first
> RDD still exists.
>
> On Thu, Aug 20, 2015 at 2:15 PM, Bahubali Jain  wrote:
> > Hi,
> > How would the DAG look like for the below code
> >
> > JavaRDD<String> rdd1 = context.textFile();
> > JavaRDD<String> rdd2 = rdd1.map();
> > rdd1 = rdd2.map();
> >
> > Does this lead to any kind of cycle?
> >
> > Thanks,
> > Baahu
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
Hi Canan,

The event log dir is a per-application setting whereas the history server
is an independent service that serves history UIs from many applications.
If you use history server locally then the `spark.history.fs.logDirectory`
will happen to point to `spark.eventLog.dir`, but the use case it provides
is broader than that.

-Andrew

2015-08-19 5:13 GMT-07:00 canan chen :

> Anyone know about this? Or am I missing something here?
>
> On Fri, Aug 7, 2015 at 4:20 PM, canan chen  wrote:
>
>> Is there any reason that the history server uses another property for the
>> event log dir? Thanks
>>
>
>


Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
Hi Axel, what spark version are you using? Also, what do your
configurations look like for the following?

- spark.cores.max (also --total-executor-cores)
- spark.executor.cores (also --executor-cores)


2015-08-19 9:27 GMT-07:00 Axel Dahl :

> Hmm, maybe I spoke too soon.
>
> I have an Apache Zeppelin instance running and have configured it to use
> 48 cores (each node only has 16 cores), so I figured setting it to 48
> would mean that Spark would grab 3 nodes. What happens instead is that
> Spark reports that 48 cores are being used, but executes everything on 1
> node; it looks like it's not grabbing the extra nodes.
>
> On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl  wrote:
>
>> That worked great, thanks Andrew.
>>
>> On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or  wrote:
>>
>>> Hi Axel,
>>>
>>> You can try setting `spark.deploy.spreadOut` to false (through your
>>> conf/spark-defaults.conf file). What this does is essentially try to
>>> schedule as many cores on one worker as possible before spilling over to
>>> other workers. Note that you *must* restart the cluster through the sbin
>>> scripts.
>>>
>>> For more information see:
>>> http://spark.apache.org/docs/latest/spark-standalone.html.
>>>
>>> Feel free to let me know whether it works,
>>> -Andrew
>>>
>>>
>>> 2015-08-18 4:49 GMT-07:00 Igor Berman :
>>>
>>>> By default, standalone creates 1 executor on every worker machine per
>>>> application. The overall number of cores is configured with
>>>> --total-executor-cores, so if you specify --total-executor-cores=1 there
>>>> will be only 1 core on some executor and you'll get what you want.
>>>>
>>>> On the other hand, if your application needs all the cores of your
>>>> cluster and only some specific job should run on a single executor,
>>>> there are a few methods to achieve this,
>>>> e.g. coalesce(1) or dummyRddWithOnePartitionOnly.foreachPartition
>>>>
>>>>
>>>> On 18 August 2015 at 01:36, Axel Dahl  wrote:
>>>>
>>>>> I have a 4 node cluster and have been playing around with the
>>>>> num-executors parameters, executor-memory and executor-cores
>>>>>
>>>>> I set the following:
>>>>> --executor-memory=10G
>>>>> --num-executors=1
>>>>> --executor-cores=8
>>>>>
>>>>> But when I run the job, I see that each worker, is running one
>>>>> executor which has  2 cores and 2.5G memory.
>>>>>
>>>>> What I'd like to do instead is have Spark just allocate the job to a
>>>>> single worker node?
>>>>>
>>>>> Is that possible in standalone mode or do I need a job/resource
>>>>> scheduler like Yarn to do that?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> -Axel
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
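
For reference, a sketch of the application-side knobs discussed in this
thread, assuming Spark 1.4+ standalone and 16-core nodes as in Axel's
cluster; note that `spark.deploy.spreadOut=false` is a cluster-wide setting
that belongs in the Master's conf/spark-defaults.conf and requires a cluster
restart, so it is not shown here:

import org.apache.spark.{SparkConf, SparkContext}

// Cap the application at one node's worth of cores so it cannot spread:
// a single 16-core executor fits on exactly one 16-core worker.
val conf = new SparkConf()
  .setAppName("SingleWorkerJob")
  .set("spark.cores.max", "16")       // total cores across the application
  .set("spark.executor.cores", "16")  // cores per executor
val sc = new SparkContext(conf)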


Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no
consolidation), but a segment of the partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>:

> Thanks Andrew for a detailed response,
>
> So the reason why key-value pairs with the same keys are always found in a
> single bucket in hash-based shuffle but not in sort-based is that in
> sort-based shuffle each mapper writes a single partitioned file, and it is
> up to the reducer to fetch the correct partitions from those files?
>
> On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or  wrote:
>
>> Hi Muhammad,
>>
>> On a high level, in hash-based shuffle each mapper M writes R shuffle
>> files, one for each reducer where R is the number of reduce partitions.
>> This results in M * R shuffle files. Since it is not uncommon for M and R
>> to be O(1000), this quickly becomes expensive. An optimization with
>> hash-based shuffle is consolidation, where all mappers running on the same
>> core write one file per reducer, resulting in C * R files (C being the
>> number of cores). This is a strict
>> improvement, but it is still relatively expensive.
>>
>> Instead, in sort-based shuffle each mapper writes a single partitioned
>> file. This allows a particular reducer to request a specific portion of
>> each mapper's single output file. In more detail, the mapper first fills up
>> an internal buffer in memory and continually spills the contents of the
>> buffer to disk, then finally merges all the spilled files together to form
>> one final output file. This places much less stress on the file system and
>> requires far fewer I/O operations, especially on the read side.
>>
>> -Andrew
>>
>>
>>
>> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
>> 11besemja...@seecs.edu.pk>:
>>
>>> I did check it out, and although I did get a general understanding of the
>>> various classes used to implement Sort and Hash shuffles, these slides
>>> lack details as to how they are implemented and why sort generally has
>>> better performance than hash.
>>>
>>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
>>> wrote:
>>>
>>>> Have a look at this presentation.
>>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>>> of help to you.
>>>>
>>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>>> 11besemja...@seecs.edu.pk> wrote:
>>>>
>>>>> What are the major differences between how Sort based and Hash based
>>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>>> than Hash?
>>>>> Any talks that discuss both shuffles in detail, how they are
>>>>> implemented and the performance gains ?
>>>>>
>>>>
>>>>
>>>
>>
>


Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
Hi Satish,

The problem is that `--jars` accepts a comma-delimited list of jars! E.g.

spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar

where main.jar is your main application jar (the one that starts a
SparkContext), and lib*.jar refer to additional libraries that your main
application jar uses.

-Andrew
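
Applied to the command shown in the log below, a corrected invocation might
look roughly like this (paths taken from the thread; note that --jars must
come before the application jar; in the failing run it came after, which is
why the verbose output reports "jars: null" and the flag shows up under
"childArgs"):

dse spark-submit --master local --class HelloWorld \
  --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar \
  /home/missingmerch/etl-0.0.1-SNAPSHOT.jar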

2015-08-13 3:22 GMT-07:00 Javier Domingo Cansino :

> Please notice that 'jars: null'
>
> I don't know why you put ///. but I would propose you just put normal
> absolute paths.
>
> dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
> --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar
> /home/missingmerch/dse.jar
> /home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar
> /home/missingmerch/etl-0.0.1-SNAPSHOT.jar
>
> Hope this is helpful!
>
> Javier Domingo Cansino, Research & Development Engineer
>
> On Tue, Aug 11, 2015 at 3:42 PM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> HI,
>>
>> Please find the log details below:
>>
>>
>> dse spark-submit --verbose --master local --class HelloWorld
>> etl-0.0.1-SNAPSHOT.jar --jars
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>> file:/home/missingmerch/dse.jar
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> Using properties file: /etc/dse/spark/spark-defaults.conf
>>
>> Adding default property:
>> spark.cassandra.connection.factory=com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>> Adding default property: spark.ssl.keyStore=.keystore
>>
>> Adding default property: spark.ssl.enabled=false
>>
>> Adding default property: spark.ssl.trustStore=.truststore
>>
>> Adding default property:
>> spark.cassandra.auth.conf.factory=com.datastax.bdp.spark.DseAuthConfFactory
>>
>> Adding default property: spark.ssl.keyPassword=cassandra
>>
>> Adding default property: spark.ssl.keyStorePassword=cassandra
>>
>> Adding default property: spark.ssl.protocol=TLS
>>
>> Adding default property: spark.ssl.useNodeLocalConf=true
>>
>> Adding default property: spark.ssl.trustStorePassword=cassandra
>>
>> Adding default property:
>> spark.ssl.enabledAlgorithms=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
>>
>> Parsed arguments:
>>
>>   master  local
>>
>>   deployMode  null
>>
>>   executorMemory  null
>>
>>   executorCores   null
>>
>>   totalExecutorCores  null
>>
>>   propertiesFile  /etc/dse/spark/spark-defaults.conf
>>
>>   driverMemory512M
>>
>>   driverCores null
>>
>>   driverExtraClassPathnull
>>
>>   driverExtraLibraryPath  null
>>
>>   driverExtraJavaOptions  -Dcassandra.username=missingmerch
>> -Dcassandra.password=STMbrjrlb -XX:MaxPermSize=256M
>>
>>   supervise   false
>>
>>   queue   null
>>
>>   numExecutorsnull
>>
>>   files   null
>>
>>   pyFiles null
>>
>>   archivesnull
>>
>>   mainClass   HelloWorld
>>
>>   primaryResource file:/home/missingmerch/etl-0.0.1-SNAPSHOT.jar
>>
>>   nameHelloWorld
>>
>>   childArgs   [--jars
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>> file:/home/missingmerch/dse.jar
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar]
>>
>>   jarsnull
>>
>>   verbose true
>>
>>
>>
>> Spark properties used, including those specified through
>>
>> --conf and those from the properties file
>> /etc/dse/spark/spark-defaults.conf:
>>
>>   spark.cassandra.connection.factory ->
>> com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>>   spark.ssl.useNodeLocalConf -> true
>>
>>   spark.ssl.enabled -> false
>>
>>   spark.executor.extraJavaOptions -> -XX:MaxPermSize=256M
>>
>>   spark.ssl.keyStore -> .keystore
>>
>>   spark.ssl.trustStore -> .truststore
>>
>>   spark.ssl.trustStorePassword -> cassandra
>>
>>   spark.ssl.enabledAlgorithms ->
>> TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
>>
>>   spark.cassandra.auth.conf.factory ->
>> com.datastax.bdp.spark.DseAuthConfFactory
>>
>>   spark.ssl.protocol -> TLS
>>
>>   spark.ssl.keyPassword -> cassandra
>>
>>   spark.ssl.keyStorePassword -> cassandra
>>
>>
>>
>>
>>
>> Main class:
>>
>> HelloWorld
>>
>> Arguments:
>>
>> --jars
>>
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> file:/home/missingmerch/dse.jar
>>
>> file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar
>>
>> System properties:
>>
>> spark.cassandra.connection.factory ->
>> com.datastax.bdp.spark.DseCassandraConnectionFactory
>>
>> spark.driver.memory -> 512M
>>
>> spark.ssl.useNodeLocalConf -> true
>>
>> spark.ssl.enabled -> false
>>
>> SPARK_SUBMIT -> true
>>
>> spark.executor.extraJavaOptions -> -XX:MaxPermSize=256M
>>
>> spark.app.name -> HelloWorld
>>
>> spark.ssl.enable

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad,

On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive. An optimization with
hash-based shuffle is consolidation, where all mappers running on the same
core write to one file per reducer, resulting in C * R files, where C is the
number of cores. This is a strict
improvement, but it is still relatively expensive.

Instead, in sort-based shuffle each mapper writes a single partitioned
file. This allows a particular reducer to request a specific portion of
each mapper's single output file. In more detail, the mapper first fills up
an internal buffer in memory and continually spills the contents of the
buffer to disk, then finally merges all the spilled files together to form
one final output file. This places much less stress on the file system and
requires far fewer I/O operations, especially on the read side.

-Andrew
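
As a back-of-the-envelope illustration of these file counts (the numbers are
hypothetical): with M = 1,000 map tasks and R = 1,000 reduce partitions,
plain hash-based shuffle writes M * R = 1,000,000 files; with consolidation
on a cluster running C = 64 concurrent task slots it writes C * R = 64,000
files; sort-based shuffle writes roughly one output file per mapper, i.e.
about 1,000 files in total.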



2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>
:

> I did check it out, and although I did get a general understanding of the
> various classes used to implement Sort and Hash shuffles, these slides lack
> details as to how they are implemented and why sort generally has better
> performance than hash.
>
> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
> wrote:
>
>> Have a look at this presentation.
>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>> of help to you.
>>
>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>> 11besemja...@seecs.edu.pk> wrote:
>>
>>> What are the major differences between how Sort based and Hash based
>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>> than Hash?
>>> Any talks that discuss both shuffles in detail, how they are implemented
>>> and the performance gains ?
>>>
>>
>>
>


Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
Hi Andreas,

I believe the distinction is not between standalone and YARN mode, but
between client and cluster mode.

In client mode, your Spark submit JVM runs your driver code. In cluster
mode, one of the workers (or NodeManagers if you're using YARN) in the
cluster runs your driver code. In the latter case, it doesn't really make
sense to call `setMaster` in your driver because Spark needs to know which
cluster you're submitting the application to.

Instead, the recommended way is to set the master through the `--master`
flag in the command line, e.g.

$ bin/spark-submit
--master spark://1.2.3.4:7077
--class some.user.Clazz
--name "My app name"
--jars lib1.jar,lib2.jar
--deploy-mode cluster
app.jar

Both YARN and standalone modes support client and cluster modes, and the
spark-submit script is the common interface through which you can launch
your application. In other words, you shouldn't have to do anything more
than providing a different value to `--master` to use YARN.

-Andrew
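
A minimal sketch of what this looks like in code (Scala; the app name is
illustrative), with the master deliberately left out so that spark-submit
can supply it:

val conf = new SparkConf().setAppName("MySparkApp")  // no setMaster here
val sc = new SparkContext(conf)  // master comes from --master at submit time
// ... run jobs ...
sc.stop()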

2015-08-17 0:34 GMT-07:00 Andreas Fritzler :

> Hi all,
>
> when runnig the Spark cluster in standalone mode I am able to create the
> Spark context from Java via the following code snippet:
>
> SparkConf conf = new SparkConf()
>>     .setAppName("MySparkApp")
>>     .setMaster("spark://SPARK_MASTER:7077")
>>     .setJars(jars);
>> JavaSparkContext sc = new JavaSparkContext(conf);
>
>
> As soon as I'm done with my processing, I can just close it via
>
>> sc.stop();
>>
> Now my question: Is the same also possible when running Spark on YARN? I
> currently don't see how this should be possible without submitting your
> application as a packaged jar file. Is there a way to get this kind of
> interactivity from within your Scala/Java code?
>
> Regards,
> Andrea
>


Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
Hi Axel,

You can try setting `spark.deploy.spreadOut` to false (through your
conf/spark-defaults.conf file). What this does is essentially try to
schedule as many cores on one worker as possible before spilling over to
other workers. Note that you *must* restart the cluster through the sbin
scripts.

For more information see:
http://spark.apache.org/docs/latest/spark-standalone.html.

Feel free to let me know whether it works,
-Andrew
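
For reference, the setting itself is a single line (remember the cluster
restart mentioned above):

# conf/spark-defaults.conf
spark.deploy.spreadOut  false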


2015-08-18 4:49 GMT-07:00 Igor Berman :

> by default standalone creates 1 executor on every worker machine per
> application;
> the number of overall cores is configured with --total-executor-cores
> so in general if you'll specify --total-executor-cores=1 then there would
> be only 1 core on some executor and you'll get what you want
>
> on the other hand, if your application needs all cores of your cluster and
> only some specific job should run on a single executor, there are a few
> methods to achieve this,
> e.g. coalesce(1) or dummyRddWithOnePartitionOnly.foreachPartition
>
>
> On 18 August 2015 at 01:36, Axel Dahl  wrote:
>
>> I have a 4 node cluster and have been playing around with the
>> num-executors, executor-memory and executor-cores parameters.
>>
>> I set the following:
>> --executor-memory=10G
>> --num-executors=1
>> --executor-cores=8
>>
>> But when I run the job, I see that each worker is running one executor
>> which has 2 cores and 2.5G memory.
>>
>> What I'd like to do instead is have Spark just allocate the job to a
>> single worker node?
>>
>> Is that possible in standalone mode or do I need a job/resource scheduler
>> like Yarn to do that?
>>
>> Thanks in advance,
>>
>> -Axel
>>
>>
>>
>


Re: Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread Andrew Or
Hi Canan,

This is mainly for legacy reasons. The default behavior in standalone mode
is that the application grabs all available resources in the cluster.
This effectively means we want one executor per worker, where each executor
grabs all the available cores and memory on that worker. In this model, it
doesn't really make sense to express number of executors, because that's
equivalent to the number of workers.

In 1.4+, however, we do support multiple executors per worker, but that's
not the default so we decided not to add support for the --num-executors
setting to avoid potential confusion.

-Andrew


2015-08-18 2:35 GMT-07:00 canan chen :

> num-executor only works for yarn mode. In standalone mode, I have to set
> the --total-executor-cores and --executor-cores. Isn't this way rather
> unintuitive? Any reason for that?
>


Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but now a class. It is
never meant to be a fully public API, but if you wish to use it you can
just instantiate a new one:

val sqlContext = new TestSQLContext

or just create a new SQLContext from a SparkContext.

-Andrew
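
The second option, creating a SQLContext from an existing SparkContext, is
sketched below (the local master string is illustrative):

val sc = new SparkContext("local[2]", "test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)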

2015-08-15 20:33 GMT-07:00 canan chen :

> I am not sure other people's spark debugging environment ( I mean for the
> master branch) , Anyone can share his experience ?
>
>
> On Sun, Aug 16, 2015 at 10:40 AM, canan chen  wrote:
>
>> I imported the spark source code into intellij, and want to run SparkPi in
>> intellij, but I hit the following weird compilation error. I googled it and
>> sbt clean doesn't work for me. I am not sure whether anyone else has met
>> this issue also; any help is appreciated
>>
>> Error:scalac:
>>  while compiling:
>> /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
>> during phase: jvm
>>  library version: version 2.10.4
>> compiler version: version 2.10.4
>>   reconstructed args: -nobootcp -javabootclasspath : -deprecation
>> -feature -classpath
>>
>
>


Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
Hi Saif,

You need to run your application with `spark.eventLog.enabled` set to true.
Then if you are using standalone mode, you can view the Master UI at port
8080. Otherwise, you may start a history server through
`sbin/start-history-server.sh`, which by default starts the history UI at
port 18080.

For more information on how to set this up, visit:
http://spark.apache.org/docs/latest/monitoring.html

-Andrew
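
A minimal sketch of that setup (the log directory is illustrative and should
exist before the application starts; the history server reads
spark.history.fs.logDirectory, which should point at the same location):

# conf/spark-defaults.conf
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-logs

# then start the history UI on port 18080
sbin/start-history-server.sh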


2015-08-07 13:16 GMT-07:00 François Pelletier <
newslett...@francoispelletier.org>:

>
> look at
> spark.history.ui.port, if you use standalone
> spark.yarn.historyServer.address, if you use YARN
>
> in your Spark config file
>
> Mine is located at
> /etc/spark/conf/spark-defaults.conf
>
> If you use Apache Ambari you can find this settings in the Spark / Configs
> / Advanced spark-defaults tab
>
> François
>
>
> On 2015-08-07 15:58, saif.a.ell...@wellsfargo.com wrote:
>
> Hello, thank you, but that port is unreachable for me. Can you please
> share where can I find that port equivalent in my environment?
>
>
>
> Thank you
>
> Saif
>
>
>
> *From:* François Pelletier [mailto:newslett...@francoispelletier.org
> ]
> *Sent:* Friday, August 07, 2015 4:38 PM
> *To:* user@spark.apache.org
> *Subject:* Re: Spark master driver UI: How to keep it after process
> finished?
>
>
>
> Hi, all spark processes are saved in the Spark History Server
>
> look at your host on port 18080 instead of 4040
>
> François
>
> On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote:
>
> Hi,
>
>
>
> A silly question here. The Driver Web UI dies when the spark-submit
> program finish. I would like some time to analyze after the program ends,
> as the page does not refresh it self, when I hit F5 I lose all the info.
>
>
>
> Thanks,
>
> Saif
>
>
>
>
>
>
>


Re: No event logs in yarn-cluster mode

2015-08-01 Thread Andrew Or
Hi Akmal,

It might be on HDFS, since you provided a path without a URI scheme
(/opt/spark/spark-events) to `spark.eventLog.dir`; scheme-less paths are
resolved against the default filesystem, which on a YARN cluster is usually
HDFS.

-Andrew
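
If the logs are indeed landing in HDFS unintentionally, a sketch of the fix
is to qualify the path with an explicit scheme in spark-defaults.conf
(values illustrative):

# either keep the logs on the local disk:
spark.eventLog.dir  file:/opt/spark/spark-events
# or put them in HDFS deliberately:
# spark.eventLog.dir  hdfs:///spark-events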

2015-08-01 9:25 GMT-07:00 Akmal Abbasov :

> Hi, I am trying to configure a history server for application.
> When I running locally(./run-example SparkPi), the event logs are being
> created, and I can start history server.
> But when I am trying
> ./spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn-cluster file:///opt/hadoop/spark/examples/src/main/python/pi.py
> I am getting
> 15/08/01 18:18:50 INFO yarn.Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: 192.168.56.192
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1438445890676
> final status: SUCCEEDED
> tracking URL: http://sp-m1:8088/proxy/application_1438444529840_0009/A
> user: hadoop
> 15/08/01 18:18:50 INFO util.Utils: Shutdown hook called
> 15/08/01 18:18:50 INFO util.Utils: Deleting directory
> /tmp/spark-185f7b83-cb3b-4134-a10c-452366204f74
> So it is succeeded, but there is no event logs for this application.
>
> here are my configs
> spark-defaults.conf
> spark.master yarn-cluster
> spark.eventLog.dir   /opt/spark/spark-events
> spark.eventLog.enabled  true
>
> spark-env.sh
> export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
> -Dspark.deploy.zookeeper.url="zk1:2181,zk2:2181"
> export
> SPARK_HISTORY_OPTS="-Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
> -Dspark.history.fs.logDirectory=file:/opt/spark/spark-events
> -Dspark.history.fs.cleaner.enabled=true"
>
> Any ideas?
>
> Thank you
>


Re: spark.executor.memory and spark.driver.memory have no effect in yarn-cluster mode (1.4.x)?

2015-07-22 Thread Andrew Or
Hi Michael,

In general, driver related properties should not be set through the
SparkConf. This is because by the time the SparkConf is created, we have
already started the driver JVM, so it's too late to change the memory,
class paths and other properties.

In cluster mode, executor related properties should also not be set through
the SparkConf. This is because the driver is run on the cluster just like
the executors, and the executors are launched independently by whatever the
cluster manager (e.g. YARN) is configured to do.

The recommended way of setting these properties is either through the
conf/spark-defaults.conf properties file, or through the spark-submit
command line, e.g.:

bin/spark-shell --master yarn --executor-memory 2g --driver-memory 5g

Let me know if that answers your question,
-Andrew
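
The spark-defaults.conf equivalent of that command line would be roughly
(values from the example above):

spark.driver.memory    5g
spark.executor.memory  2g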


2015-07-22 12:38 GMT-07:00 Michael Misiewicz :

> Hi group,
>
> I seem to have encountered a weird problem with 'spark-submit' and
> manually setting sparkconf values in my applications.
>
> It seems like setting the configuration values spark.executor.memory
> and spark.driver.memory don't have any effect, when they are set from
> within my application (i.e. prior to creating a SparkContext).
>
> In yarn-cluster mode, only the values specified on the command line via
> spark-submit for driver and executor memory are respected, and if not, it
> appears spark falls back to defaults. For example,
>
> Correct behavior noted in Driver's logs on YARN when --executor-memory is
> specified:
>
> 15/07/22 19:25:59 INFO yarn.YarnAllocator: Will request 200 executor 
> containers, each with 1 cores and 13824 MB memory including 1536 MB overhead
> 15/07/22 19:25:59 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:13824, vCores:1>)
>
>
> But not when spark.executor.memory is specified prior to spark context 
> initialization:
>
> 15/07/22 19:22:22 INFO yarn.YarnAllocator: Will request 200 executor 
> containers, each with 1 cores and 2560 MB memory including 1536 MB overhead
> 15/07/22 19:22:22 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:2560, vCores:1>)
>
>
> In both cases, executor mem should be 10g. Interestingly, I set a parameter 
> spark.yarn.executor.memoryOverhead which appears to be respected whether or 
> not I'm in yarn-cluster or yarn-client mode.
>
>
> Has anyone seen this before? Any idea what might be causing this behavior?
>
>


Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan,

`map2` is a broadcast variable, not your map. To access the map on the
executors you need to do `map2.value(a)`.

-Andrew
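
Putting the pieces of this thread together, a minimal working sketch (Scala;
the map contents and names come from the thread) is:

val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
val map2 = sc.broadcast(map1)   // Broadcast[Map[String,Int]], not an RDD
val matchs = Vecs.map(term => term.map { case (a, b) => (map2.value(a), b) })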

2015-07-22 12:20 GMT-07:00 Dan Dong :

> Hi, Andrew,
>   If I broadcast the Map:
> val map2=sc.broadcast(map1)
>
> I will get compilation error:
> org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]]
> does not take parameters
> [error]  val matchs= Vecs.map(term=>term.map{case (a,b)=>(map2(a),b)})
>
> Seems it's still an RDD, so how to access it by value=map2(key) ? Thanks!
>
> Cheers,
> Dan
>
>
>
> 2015-07-22 2:20 GMT-05:00 Andrew Or :
>
>> Hi Dan,
>>
>> If the map is small enough, you can just broadcast it, can't you? It
>> doesn't have to be an RDD. Here's an example of broadcasting an array and
>> using it on the executors:
>> https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
>> .
>>
>> -Andrew
>>
>> 2015-07-21 19:56 GMT-07:00 ayan guha :
>>
>>> Either you have to do rdd.collect and then broadcast or you can do a join
>>> On 22 Jul 2015 07:54, "Dan Dong"  wrote:
>>>
>>>> Hi, All,
>>>>
>>>>
>>>> I am trying to access a Map from RDDs that are on different compute
>>>> nodes, but without success. The Map is like:
>>>>
>>>> val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
>>>>
>>>> All RDDs will have to check against it to see if the key is in the Map
>>>> or not, so seems I have to make the Map itself global, the problem is that
>>>> if the Map is stored as RDDs and spread across the different nodes, each
>>>> node will only see a piece of the Map and the info will not be complete to
>>>> check against the Map( an then replace the key with the corresponding
>>>> value) E,g:
>>>>
>>>> val matchs= Vecs.map(term=>term.map{case (a,b)=>(map1(a),b)})
>>>>
>>>> But if the Map is not an RDD, how to share it like sc.broadcast(map1)
>>>>
>>>> Any idea about this? Thanks!
>>>>
>>>>
>>>> Cheers,
>>>> Dan
>>>>
>>>>
>>
>


Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth,

I was able to reproduce the issue by setting `spark.cores.max` to a number
greater than the number of cores on a worker. I've filed SPARK-9260 which I
believe is already being fixed in https://github.com/apache/spark/pull/7274.

Thanks for reporting the issue!
-Andrew

2015-07-22 11:49 GMT-07:00 Andrew Or :

> Hi Srikanth,
>
> It does look like a bug. Did you set `spark.executor.cores` in your
> application by any chance?
>
> -Andrew
>
> 2015-07-22 8:05 GMT-07:00 Srikanth :
>
>> Hello,
>>
>> I've set spark.deploy.spreadOut=false in spark-env.sh.
>>
>>> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4
>>> -Dspark.deploy.spreadOut=false"
>>
>>
>> There are 3 workers each with 4 cores. Spark-shell was started with noof
>> cores = 6.
>> Spark UI show that one executor was used with 6 cores.
>>
>> Is this a bug? This is with Spark 1.4.
>>
>> [image: Inline image 1]
>>
>> Srikanth
>>
>
>


Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth,

It does look like a bug. Did you set `spark.executor.cores` in your
application by any chance?

-Andrew

2015-07-22 8:05 GMT-07:00 Srikanth :

> Hello,
>
> I've set spark.deploy.spreadOut=false in spark-env.sh.
>
>> export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4
>> -Dspark.deploy.spreadOut=false"
>
>
> There are 3 workers each with 4 cores. Spark-shell was started with noof
> cores = 6.
> Spark UI show that one executor was used with 6 cores.
>
> Is this a bug? This is with Spark 1.4.
>
> [image: Inline image 1]
>
> Srikanth
>


Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
Hi,

It would be whatever's left in the JVM. This is not explicitly controlled
by a fraction like storage or shuffle. However, the computation usually
doesn't need to use that much space. In my experience it's almost always
the caching or the aggregation during shuffles that's the most memory
intensive.

-Andrew
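
In rough numbers, with the defaults quoted in the original post (storage 0.6,
shuffle 0.2), the leftover share is about 1 - 0.6 - 0.2 = 0.2 of the executor
heap (ignoring the safety fractions, as the post does), and that is the space
in which the per-partition flatMap() and map() work happens.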

2015-07-21 13:47 GMT-07:00 wdbaruni :

> I am new to Spark and I understand that Spark divides the executor memory
> into the following fractions:
>
> *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or
> .cache() and can be defined by setting spark.storage.memoryFraction
> (default
> 0.6)
>
> *Shuffle and aggregation buffers:* Which Spark uses to store shuffle
> outputs. It can defined using spark.shuffle.memoryFraction. If shuffle
> output exceeds this fraction, then Spark will spill data to disk (default
> 0.2)
>
> *User code:* Spark uses this fraction to execute arbitrary user code
> (default 0.2)
>
> I am not mentioning the storage and shuffle safety fractions for
> simplicity.
>
> My question is, which memory fraction is Spark using to compute and
> transform RDDs that are not going to be persisted? For example:
>
> lines = sc.textFile("i am a big file.txt")
> count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x,
> 1)).reduceByKey(add)
> count.saveAsTextFile("output")
>
> Here Spark will not load the whole file at once and will partition the
> input
> file and do all these transformations per partition in a single stage.
> However, which memory fraction Spark will use to load the partitioned
> lines,
> compute flatMap() and map()?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Which-memory-fraction-is-Spark-using-to-compute-RDDs-that-are-not-going-to-be-persisted-tp23942.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan,

If the map is small enough, you can just broadcast it, can't you? It
doesn't have to be an RDD. Here's an example of broadcasting an array and
using it on the executors:
https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
.

-Andrew

2015-07-21 19:56 GMT-07:00 ayan guha :

> Either you have to do rdd.collect and then broadcast or you can do a join
> On 22 Jul 2015 07:54, "Dan Dong"  wrote:
>
>> Hi, All,
>>
>>
>> I am trying to access a Map from RDDs that are on different compute
>> nodes, but without success. The Map is like:
>>
>> val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
>>
>> All RDDs will have to check against it to see if the key is in the Map or
>> not, so seems I have to make the Map itself global, the problem is that if
>> the Map is stored as RDDs and spread across the different nodes, each node
>> will only see a piece of the Map and the info will not be complete to check
>> against the Map( an then replace the key with the corresponding value) E,g:
>>
>> val matchs= Vecs.map(term=>term.map{case (a,b)=>(map1(a),b)})
>>
>> But if the Map is not an RDD, how to share it like sc.broadcast(map1)
>>
>> Any idea about this? Thanks!
>>
>>
>> Cheers,
>> Dan
>>
>>


Re: Spark spark.shuffle.memoryFraction has no affect

2015-07-22 Thread Andrew Or
Hi,

The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using
caching at all, have you tried trying something more extreme, like 0.1 /
0.9? Since disabling spark.shuffle.spill didn't cause an OOM this setting
should be fine. Also, one thing you could do is to verify the shuffle bytes
spilled on the UI before and after the change.

Let me know if that helped.
-Andrew
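
Concretely, the more extreme split suggested above would look like this in
spark-defaults.conf:

spark.storage.memoryFraction  0.1
spark.shuffle.memoryFraction  0.9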

2015-07-21 13:50 GMT-07:00 wdbaruni :

> Hi
> I am testing Spark on Amazon EMR using Python and the basic wordcount
> example shipped with Spark.
>
> After running the application, I realized that in Stage 0 reduceByKey(add),
> around 2.5GB shuffle is spilled to memory and 4GB shuffle is spilled to
> disk. Since in the wordcount example I am not caching or persisting any
> data, so I thought I can increase the performance of this application by
> giving more shuffle memoryFraction. So, in spark-defaults.conf, I added the
> following:
>
> spark.storage.memoryFraction    0.2
> spark.shuffle.memoryFraction    0.6
>
> However, I am still getting the same performance and the same amount of
> shuffle data is being spilled to disk and memory. I validated that Spark is
> reading these configurations using Spark UI/Environment and I can see my
> changes. Moreover, I tried setting spark.shuffle.spill to false and I got
> the performance I am looking for and all shuffle data was spilled to memory
> only.
>
> So, what am I getting wrong here and why not the extra shuffle memory
> fraction is not utilized?
>
> *My environment:*
> Amazon EMR with Spark 1.3.1 running using -x argument
> 1 Master node: m3.xlarge
> 3 Core nodes: m3.xlarge
> Application: wordcount.py
> Input: 10 .gz files 90MB each (~350MB unarchived) stored in S3
>
> *Submit command:*
> /home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py
> s3n://
>
> *spark-defaults.conf:*
> spark.eventLog.enabled  false
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
> spark.driver.extraJavaOptions   -Dspark.driver.log.level=INFO
> spark.master    yarn
> spark.executor.instances    3
> spark.executor.cores    4
> spark.executor.memory   9404M
> spark.default.parallelism   12
> spark.eventLog.enabled  true
> spark.eventLog.dir  hdfs:///spark-logs/
> spark.storage.memoryFraction    0.2
> spark.shuffle.memoryFraction    0.6
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-spark-shuffle-memoryFraction-has-no-affect-tp23944.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Or
Hi Andrew,

Based on your driver logs, it seems the issue is that the shuffle service
is actually not running on the NodeManagers, but your application is trying
to provide a "spark_shuffle" secret anyway. One way to verify whether the
shuffle service is actually started is to look at the NodeManager logs for
the following lines:

*Initializing YARN shuffle service for Spark*
*Started YARN shuffle service for Spark on port X*

These should be logged under the INFO level. Also, could you verify whether
*all* the executors have this problem, or just a subset? If even one of the
NM doesn't have the shuffle service, you'll see the stack trace that you
ran into. It would be good to confirm whether the yarn-site.xml change is
actually reflected on all NMs if the log statements above are missing.

Let me know if you can get it working. I've run the shuffle service myself
on the master branch (which will become Spark 1.5.0) recently following the
instructions and have not encountered any problems.

-Andrew
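
One way to look for those lines (a sketch only; the NodeManager log location
varies by distribution, so this path is just an example):

grep "YARN shuffle service" /var/log/hadoop-yarn/*nodemanager*.log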


Re: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Or
Hi all,

Did you forget to restart the node managers after editing yarn-site.xml by
any chance?

-Andrew

2015-07-17 8:32 GMT-07:00 Andrew Lee :

> I have encountered the same problem after following the document.
>
> Here's my spark-defaults.conf
>
> spark.shuffle.service.enabled true
> spark.dynamicAllocation.enabled  true
> spark.dynamicAllocation.executorIdleTimeout 60
> spark.dynamicAllocation.cachedExecutorIdleTimeout 120
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 8
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.schedulerBacklogTimeout 10
>
>
>
> and yarn-site.xml configured.
>
> <property>
>   <name>yarn.nodemanager.aux-services</name>
>   <value>spark_shuffle,mapreduce_shuffle</value>
> </property>
> ...
> <property>
>   <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>   <value>org.apache.spark.network.yarn.YarnShuffleService</value>
> </property>
>
>
> and deployed the 2 JARs to NodeManager's classpath
> /opt/hadoop/share/hadoop/mapreduce/. (I also checked the NodeManager log
> and the JARs appear in the classpath). I notice that the JAR location is
> not the same as the document in 1.4. I found them under network/yarn/target
> and network/shuffle/target/ after building it with "-Phadoop-2.4 -Psparkr
> -Pyarn -Phive -Phive-thriftserver" in maven.
>
>
> spark-network-yarn_2.10-1.4.1.jar
>
> spark-network-shuffle_2.10-1.4.1.jar
>
>
> and still getting the following exception.
>
> Exception in thread "ContainerLauncher #0" java.lang.Error: 
> org.apache.spark.SparkException: Exception while starting container 
> container_1437141440985_0003_01_02 on host 
> alee-ci-2058-slave-2.test.altiscale.com
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Exception while starting 
> container container_1437141440985_0003_01_02 on host 
> alee-ci-2058-slave-2.test.altiscale.com
>   at 
> org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:116)
>   at 
> org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:67)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   ... 2 more
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The 
> auxService:spark_shuffle does not exist
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>
>
> Not sure what else am I missing here or doing wrong?
>
> Appreciate any insights or feedback, thanks.
>
>
> --
> Date: Wed, 8 Jul 2015 09:25:39 +0800
> Subject: Re: The auxService:spark_shuffle does not exist
> From: zjf...@gmail.com
> To: rp...@njit.edu
> CC: user@spark.apache.org
>
>
> Did you enable the dynamic resource allocation ? You can refer to this
> page for how to configure spark shuffle service for yarn.
>
> https://spark.apache.org/docs/1.4.0/job-scheduling.html
>
>
> On Tue, Jul 7, 2015 at 10:55 PM, roy  wrote:
>
> we tried "--master yarn-client" with no different result.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead.

2015-07-15 14:29 GMT-07:00 Kelly, Jonathan :

>  Thanks! Is there an existing JIRA I should watch?
>
>
>  ~ Jonathan
>
>   From: Sandy Ryza 
> Date: Wednesday, July 15, 2015 at 2:27 PM
> To: Jonathan Kelly 
> Cc: "user@spark.apache.org" 
> Subject: Re: Unable to use dynamicAllocation if spark.executor.instances
> is set in spark-defaults.conf
>
>   Hi Jonathan,
>
>  This is a problem that has come up for us as well, because we'd like
> dynamic allocation to be turned on by default in some setups, but not break
> existing users with these properties.  I'm hoping to figure out a way to
> reconcile these by Spark 1.5.
>
>  -Sandy
>
> On Wed, Jul 15, 2015 at 3:18 PM, Kelly, Jonathan 
> wrote:
>
>>   Would there be any problem in having spark.executor.instances (or
>> --num-executors) be completely ignored (i.e., even for non-zero values) if
>> spark.dynamicAllocation.enabled is true (i.e., rather than throwing an
>> exception)?
>>
>>  I can see how the exception would be helpful if, say, you tried to pass
>> both "-c spark.executor.instances" (or --num-executors) *and* "-c
>> spark.dynamicAllocation.enabled=true" to spark-submit on the command line
>> (as opposed to having one of them in spark-defaults.conf and one of them in
>> the spark-submit args), but currently there doesn't seem to be any way to
>> distinguish between arguments that were actually passed to spark-submit and
>> settings that simply came from spark-defaults.conf.
>>
>>  If there were a way to distinguish them, I think the ideal situation
>> would be for the validation exception to be thrown only if
>> spark.executor.instances and spark.dynamicAllocation.enabled=true were both
>> passed via spark-submit args or were both present in spark-defaults.conf,
>> but passing spark.dynamicAllocation.enabled=true to spark-submit would take
>> precedence over spark.executor.instances configured in spark-defaults.conf,
>> and vice versa.
>>
>>
>>  Jonathan Kelly
>>
>> Elastic MapReduce - SDE
>>
>> Blackfoot (SEA33) 06.850.F0
>>
>>   From: Jonathan Kelly 
>> Date: Tuesday, July 14, 2015 at 4:23 PM
>> To: "user@spark.apache.org" 
>> Subject: Unable to use dynamicAllocation if spark.executor.instances is
>> set in spark-defaults.conf
>>
>> I've set up my cluster with a pre-calculated value for
>> spark.executor.instances in spark-defaults.conf such that I can run a job
>> and have it maximize the utilization of the cluster resources by default.
>> However, if I want to run a job with dynamicAllocation (by passing -c
>> spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:
>>
>>  Exception in thread "main" java.lang.IllegalArgumentException:
>> Explicitly setting the number of executors is not compatible with
>> spark.dynamicAllocation.enabled!
>> at
>> org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
>> at
>> org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
>>  …
>>
>>  The exception makes sense, of course, but ideally I would like it to
>> ignore what I've put in spark-defaults.conf for spark.executor.instances if
>> I've enabled dynamicAllocation. The most annoying thing about this is that
>> if I have spark.executor.instances present in spark-defaults.conf, I cannot
>> figure out any way to spark-submit a job with
>> spark.dynamicAllocation.enabled=true without getting this error. That is,
>> even if I pass "-c spark.executor.instances=0 -c
>> spark.dynamicAllocation.enabled=true", I still get this error because the
>> validation in ClientArguments.parseArgs() that's checking for this
>> condition simply checks for the presence of spark.executor.instances rather
>> than whether or not its value is > 0.
>>
>>  Should the check be changed to allow spark.executor.instances to be set
>> to 0 if spark.dynamicAllocation.enabled is true? That would be an OK
>> compromise, but I'd really prefer to be able to enable dynamicAllocation
>> simply by setting spark.dynamicAllocation.enabled=true rather than by also
>> having to set spark.executor.instances to 0.
>>
>>
>>  Thanks,
>>
>> Jonathan
>>
>
>


Re: How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Andrew Or
Hi Peter,

AFAIK Spark assumes infinite disk space, so there isn't really a way to
limit how much space it uses. Unfortunately I'm not aware of a simpler
workaround than to simply provision your cluster with more disk space. By
the way, are you sure that it's disk space that exceeded the limit, but not
the number of inodes? If it's the latter, maybe you could control the
ulimit of the container.

To answer your other question: if it can't persist to disk then yes it will
fail. It will only recompute from the data source if for some reason
someone evicted our blocks from memory, but that shouldn't happen in your
case since you're using MEMORY_AND_DISK_SER.

-Andrew


2015-07-10 3:51 GMT-07:00 Peter Rudenko :

>  Hi, i have a spark ML worklflow. It uses some persist calls. When i
> launch it with 1 tb dataset - it puts down all cluster, becauses it fills
> all disk space at /yarn/nm/usercache/root/appcache:
> http://i.imgur.com/qvRUrOp.png
>
> I found a yarn settings:
> *yarn*.nodemanager.localizer.*cache*.target-size-mb - Target size of
> localizer cache in MB, per nodemanager. It is a target retention size that
> only includes resources with PUBLIC and PRIVATE visibility and excludes
> resources with APPLICATION visibility
>
> But it excludes resources with APPLICATION visibility, and spark cache as
> i understood is of APPLICATION type.
>
> Is it possible to restrict a disk space for spark application? Will spark
> fail if it wouldn't be able to persist on disk
> (StorageLevel.MEMORY_AND_DISK_SER) or it would recompute from data source?
>
> Thanks,
> Peter Rudenko
>
>
>
>
>


Re: Starting Spark-Application without explicit submission to cluster?

2015-07-10 Thread Andrew Or
Hi Jan,

Most SparkContext constructors are there for legacy reasons. The point of
going through spark-submit is to set up all the classpaths, system
properties, and resolve URIs properly *with respect to the deployment mode*.
For instance, jars are distributed differently between YARN cluster mode
and standalone client mode, and this is not something the Spark user should
have to worry about.

As an example, if you pass jars through the SparkContext constructor, it
won't actually work in cluster mode if the jars are local. This is because
the driver is launched on the cluster and the SparkContext will try to find
the jars on the cluster in vain.

So the more concise answer to your question is: yes technically you don't
need to go through spark-submit, but you'll have to deal with all the
bootstrapping complexity yourself.

-Andrew

2015-07-10 3:37 GMT-07:00 algermissen1971 :

> Hi,
>
> I am a bit confused about the steps I need to take to start a Spark
> application on a cluster.
>
> So far I had this impression from the documentation that I need to
> explicitly submit the application using for example spark-submit.
>
> However, from the SparkContext constructor signature I get the impression
> that maybe I do not have to do that after all:
>
> In
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.SparkContext
> the first constructor has (among other things) a parameter 'jars' which
> indicates the "Collection of JARs to send to the cluster".
>
> To me this suggests that I can simply start the application anywhere and
> that it will deploy itself to the cluster in the same way a call to
> spark-submit would.
>
> Is that correct?
>
> If not, can someone explain why I can / need to provide master and jars
> etc. in the call to SparkContext because they essentially only duplicate
> what I would specify in the call to spark-submit.
>
> Jan
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-submit

2015-07-10 Thread Andrew Or
Hi Ashutosh, I believe the class is
org.apache.spark.examples.graphx.Analytics?
If you're running page rank on live journal you could just use
org.apache.spark.examples.graphx.LiveJournalPageRank.

-Andrew
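
Applied to the command in the quoted message, the fixed invocation might look
roughly like this (jar and master taken from the thread; treat the argument
list as a sketch, since LiveJournalPageRank parses its options slightly
differently from Analytics):

./bin/spark-submit \
  --class org.apache.spark.examples.graphx.LiveJournalPageRank \
  --master spark://172.17.27.12:7077 \
  assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.2.0.jar \
  soc-LiveJournal1.txt --numEPart=100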

2015-07-10 3:42 GMT-07:00 AshutoshRaghuvanshi <
ashutosh.raghuvans...@gmail.com>:

> when I do run this command:
>
> ashutosh@pas-lab-server7:~/spark-1.4.0$ ./bin/spark-submit \
> > --class org.apache.spark.graphx.lib.Analytics \
> > --master spark://172.17.27.12:7077 \
> > assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.2.0.jar \
> > pagerank soc-LiveJournal1.txt --numEPart=100 --nverts=4847571
> --numIter=10
> > --partStrategy=EdgePartition2D
>
> I get an error:
>
> java.lang.ClassNotFoundException: org.apache.spark.graphx.lib.Analytics
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:633)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/07/10 15:31:35 INFO Utils: Shutdown hook called
>
>
>
> where is this class, what path should I give?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-tp23761.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark serialization in closure

2015-07-09 Thread Andrew Or
Hi Chen,

I believe the issue is that `object foo` is a member of `object testing`,
so the only way to access `object foo` is to first pull `object testing`
into the closure, then access a pointer to get to `object foo`. There are
two workarounds that I'm aware of:

(1) Move `object foo` outside of `object testing`. This is only a problem
because of the nested objects. Also, by design it's simpler to reason about
but that's a separate discussion.

(2) Create a local variable for `foo.v`. If all your closure cares about is
the integer, then it makes sense to add a `val v = foo.v` inside `func` and
use this in your closure instead. This avoids pulling in $outer pointers
into your closure at all since it only references local variables.

As others have commented, I think this is more of a Scala problem than a
Spark one.

Let me know if these work,
-Andrew
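
Workaround (2) applied to the example from this thread looks like the
following sketch; only the captured value changes:

object testing {
  object foo {
    val v = 42
  }
  val list = List(1, 2, 3)
  val rdd = sc.parallelize(list)
  def func = {
    val v = foo.v                        // local copy: the closure now
    val after = rdd.foreachPartition {   // captures only an Int, not the
      it => println(v)                   // enclosing `testing` object
    }
  }
}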

2015-07-09 13:36 GMT-07:00 Richard Marscher :

> Reading that article and applying it to your observations of what happens
> at runtime:
>
> shouldn't the closure require serializing testing? The foo singleton
> object is a member of testing, and then you call this foo value in the
> closure func and further in the foreachPartition closure. So, following
> that article, Scala will attempt to serialize the containing object/class
> testing to get the foo instance.
>
> On Thu, Jul 9, 2015 at 4:11 PM, Chen Song  wrote:
>
>> Repost the code example,
>>
>> object testing extends Serializable {
>> object foo {
>>   val v = 42
>> }
>> val list = List(1,2,3)
>> val rdd = sc.parallelize(list)
>> def func = {
>>   val after = rdd.foreachPartition {
>> it => println(foo.v)
>>   }
>> }
>>   }
>>
>> On Thu, Jul 9, 2015 at 4:09 PM, Chen Song  wrote:
>>
>>> Thanks Erik. I saw the document too. That is why I am confused because
>>> as per the article, it should be good as long as *foo *is serializable.
>>> However, what I have seen is that it would work if *testing* is
>>> serializable, even foo is not serializable, as shown below. I don't know if
>>> there is something specific to Spark.
>>>
>>> For example, the code example below works.
>>>
>>> object testing extends Serializable {
>>>
>>> object foo {
>>>
>>>   val v = 42
>>>
>>> }
>>>
>>> val list = List(1,2,3)
>>>
>>> val rdd = sc.parallelize(list)
>>>
>>> def func = {
>>>
>>>   val after = rdd.foreachPartition {
>>>
>>> it => println(foo.v)
>>>
>>>   }
>>>
>>> }
>>>
>>>   }
>>>
>>> On Thu, Jul 9, 2015 at 12:11 PM, Erik Erlandson  wrote:
>>>
 I think you have stumbled across this idiosyncrasy:


 http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/




 - Original Message -
 > I am not sure this is more of a question for Spark or just Scala but
 I am
 > posting my question here.
 >
 > The code snippet below shows an example of passing a reference to a
 closure
 > in rdd.foreachPartition method.
 >
 > ```
 > object testing {
 > object foo extends Serializable {
 >   val v = 42
 > }
 > val list = List(1,2,3)
 > val rdd = sc.parallelize(list)
 > def func = {
 >   val after = rdd.foreachPartition {
 > it => println(foo.v)
 >   }
 > }
 >   }
 > ```
 > When running this code, I got an exception
 >
 > ```
 > Caused by: java.io.NotSerializableException:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$
 > Serialization stack:
 > - object not serializable (class:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$, value:
 > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$@10b7e824)
 > - field (class:
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
 > name: $outer, type: class
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$)
 > - object (class
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$testing$$anonfun$1,
 > )
 > ```
 >
 > It looks like Spark needs to serialize `testing` object. Why is it
 > serializing testing even though I only pass foo (another serializable
 > object) in the closure?
 >
 > A more general question is, how can I prevent Spark from serializing
 the
 > parent class where RDD is defined, with still support of passing in
 > function defined in other classes?
 >
 > --
 > Chen Song
 >

>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>>
>> --
>> Chen Song
>>
>>
>
>
> --
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
>


Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only
affects local mode. I've filed a JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-8911

2015-07-08 14:20 GMT-07:00 Lincoln Atkinson :

>  Brilliant! Thanks.
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Wednesday, July 08, 2015 2:15 PM
> *To:* Lincoln Atkinson
> *Cc:* user@spark.apache.org
> *Subject:* Re: Disable heartbeat messages in REPL
>
>
>
> I was thinking the same thing! Try sc.setLogLevel("ERROR")
>
>
>
> On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson 
> wrote:
>
>  “WARN Executor: Told to re-register on heartbeat” is logged repeatedly
> in the spark shell, which is very distracting and corrupts the display of
> whatever set of commands I’m currently typing out.
>
>
>
> Is there an option to disable the logging of this message?
>
>
>
> Thanks,
>
> -Lincoln
>
>
>


Re: Submitting Spark Applications using Spark Submit

2015-06-22 Thread Andrew Or
ka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Also, in the above error it says: connection refused to
> ec2-XXX.compute-1.amazonaws.com/10.165.103.16:7077. I don't understand
> where it gets the 10.165.103.16 from. I never specify that in the master
> url command line parameter. Any ideas on what I might be doing wrong?
>
>
> On Jun 19, 2015, at 7:19 PM, Andrew Or  wrote:
>
> Hi Raghav,
>
> I'm assuming you're using standalone mode. When using the Spark EC2
> scripts you need to make sure that every machine has the most updated jars.
> Once you have built on one of the nodes, you must rsync the Spark directory
> to the rest of the nodes (see /root/spark-ec2/copy-dir).
>
> That said, I usually build it locally on my laptop and scp the assembly
> jar to the cluster instead of building it there. The EC2 machines often
> take much longer to build for some reason. Also it's cumbersome to set up
> proper IDE there.
>
> -Andrew
>
>
> 2015-06-19 19:11 GMT-07:00 Raghav Shankar :
> Thanks Andrew! Is this all I have to do when using the spark ec2 script to
> setup a spark cluster? It seems to be getting an assembly jar that is not
> from my project(perhaps from a maven repo). Is there a way to make the ec2
> script use the assembly jar that I created?
>
> Thanks,
> Raghav
>
>
> On Friday, June 19, 2015, Andrew Or  wrote:
> Hi Raghav,
>
> If you want to make changes to Spark and run your application with it, you
> may follow these steps.
>
> 1. git clone g...@github.com:apache/spark
> 2. cd spark; build/mvn clean package -DskipTests [...]
> 3. make local changes
> 4. build/mvn package -DskipTests [...] (no need to clean again here)
> 5. bin/spark-submit --master spark://[...] --class your.main.class your.jar
>
> No need to pass in extra --driver-java-options or --driver-extra-classpath
> as others have suggested. When using spark-submit, the main jar comes from
> assembly/target/scala_2.10, which is prepared through "mvn package". You
> just have to make sure that you re-package the assembly jar after each
> modification.
>
> -Andrew
>
> 2015-06-18 16:35 GMT-07:00 maxdml :
> You can specify the jars of your application to be included with
> spark-submit
> with the /--jars/ switch.
>
> Otherwise, are you sure that your newly compiled spark jar assembly is in
> assembly/target/scala-2.10/?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: PySpark on YARN "port out of range"

2015-06-22 Thread Andrew Or
Unfortunately there is not a great way to do it without modifying Spark to
print more things it reads from the stream.

2015-06-20 23:10 GMT-07:00 John Meehan :

> Yes it seems to be consistently "port out of range:1315905645". Is there
> any way to see what the python process is actually outputting (in hopes
> that yields a clue)?
>
> On Jun 19, 2015, at 6:47 PM, Andrew Or  wrote:
>
> Hm, one thing to see is whether the same port appears many times (1315905645).
> The way pyspark works today is that the JVM reads the port from the stdout
> of the python process. If there is some interference in output from the
> python side (e.g. any print statements, exception messages), then the Java
> side will think that it's actually a port even when it's not.
>
> I'm not sure why it fails sometimes but not others, but 2/3 of the time is
> a lot...
>
> 2015-06-19 14:57 GMT-07:00 John Meehan :
>
>> Has anyone encountered this “port out of range” error when launching
>> PySpark jobs on YARN?  It is sporadic (e.g. 2/3 jobs get this error).
>>
>> LOG:
>>
>> 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage
>> 39.0 (TID 211) on executor xxx.xxx.xxx.com:
>> java.lang.IllegalArgumentException (port out of range:1315905645)
>> [duplicate 7]
>> Traceback (most recent call last):
>>  File "", line 1, in 
>> 15/06/19 11:49:44 INFO cluster.YarnScheduler: Removed TaskSet 39.0, whose
>> tasks have all completed, from pool
>>  File "/home/john/spark-1.4.0/python/pyspark/rdd.py", line 745, in collect
>>port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>>  File
>> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>  File
>> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>> line 300, in get_return_value
>> py4j.protocol.Py4JJavaError15/06/19 11:49:44 INFO
>> storage.BlockManagerInfo: Removed broadcast_38_piece0 on
>> 17.134.160.35:47455 in memory (size: 2.2 KB, free: 265.4 MB)
>> : An error occurred while calling
>> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 1 in stage 39.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>> 39.0 (TID 210, xxx.xxx.xxx.com): java.lang.IllegalArgumentException:
>> port out of range:1315905645
>> at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>> at java.net.InetSocketAddress.<init>(InetSocketAddress.java:185)
>> at java.net.Socket.<init>(Socket.java:241)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>> at
>> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
>> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>> at
>> org.apache.spark.sched

Re: Submitting Spark Applications using Spark Submit

2015-06-19 Thread Andrew Or
Hi Raghav,

I'm assuming you're using standalone mode. When using the Spark EC2 scripts
you need to make sure that every machine has the most updated jars. Once
you have built on one of the nodes, you must *rsync* the Spark directory to
the rest of the nodes (see /root/spark-ec2/copy-dir).

That said, I usually build it locally on my laptop and *scp* the assembly
jar to the cluster instead of building it there. The EC2 machines often
take much longer to build for some reason. Also it's cumbersome to set up
a proper IDE there.
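
As a sketch of both routes, assuming the default spark-ec2 layout under
/root/spark:

/root/spark-ec2/copy-dir /root/spark   # after rebuilding on the EC2 master
# or, building locally and copying only the assembly jar up:
scp assembly/target/scala-2.10/spark-assembly-*.jar root@<master>:/root/spark/assembly/target/scala-2.10/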

-Andrew


2015-06-19 19:11 GMT-07:00 Raghav Shankar :

> Thanks Andrew! Is this all I have to do when using the spark ec2 script to
> setup a spark cluster? It seems to be getting an assembly jar that is not
> from my project (perhaps from a Maven repo). Is there a way to make the
> ec2 script use the assembly jar that I created?
>
> Thanks,
> Raghav
>
>
> On Friday, June 19, 2015, Andrew Or  wrote:
>
>> Hi Raghav,
>>
>> If you want to make changes to Spark and run your application with it,
>> you may follow these steps.
>>
>> 1. git clone g...@github.com:apache/spark
>> 2. cd spark; build/mvn clean package -DskipTests [...]
>> 3. make local changes
>> 4. build/mvn package -DskipTests [...] (no need to clean again here)
>> 5. bin/spark-submit --master spark://[...] --class your.main.class
>> your.jar
>>
>> No need to pass in extra --driver-java-options or
>> --driver-extra-classpath as others have suggested. When using spark-submit,
>> the main jar comes from assembly/target/scala-2.10, which is prepared
>> through "mvn package". You just have to make sure that you re-package the
>> assembly jar after each modification.
>>
>> -Andrew
>>
>> 2015-06-18 16:35 GMT-07:00 maxdml :
>>
>>> You can specify the jars of your application to be included with
>>> spark-submit
>>> with the /--jars/ switch.
>>>
>>> Otherwise, are you sure that your newly compiled spark jar assembly is in
>>> assembly/target/scala-2.10/?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-Applications-using-Spark-Submit-tp23352p23400.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: What files/folders/jars does the spark-submit script depend on?

2015-06-19 Thread Andrew Or
Hi Elkhan,

Spark submit depends on several things: the launcher jar (1.3.0+ only), the
spark-core jar, and the spark-yarn jar (in your case). Why do you want to
put it in HDFS though? AFAIK you can't execute scripts directly from HDFS;
you need to copy them to a local file system first. I don't see clear
benefits of not just running Spark submit from source or from one of the
distributions.

-Andrew

2015-06-19 10:12 GMT-07:00 Elkhan Dadashov :

> Hi all,
>
> If I want to ship the spark-submit script to HDFS and then call it from an
> HDFS location to start Spark jobs, which other files/folders/jars need to be
> transferred into HDFS with the spark-submit script?
>
> Due to some dependency issues, we cannot include Spark in our Java
> application, so instead we will allow limited usage of Spark only with
> Python files.
>
> So if I want to put the spark-submit script into HDFS, and call it to execute
> a Spark job on a YARN cluster, what else needs to be put into HDFS with it?
>
> (Using Spark only for execution Spark jobs written in Python)
>
> Thanks.
>
>


Re: About Jobs UI in yarn-client mode

2015-06-19 Thread Andrew Or
Did you make sure that the YARN IP is not an internal address? If it still
doesn't work then it seems like an issue on the YARN side...

2015-06-19 8:48 GMT-07:00 Sea <261810...@qq.com>:

> Hi, all:
> I run Spark on YARN, and I want to see the Jobs UI http://ip:4040/,
> but it redirects to http://
> ${yarn.ip}/proxy/application_1428110196022_924324/ which cannot be
> found. Why?
> Can anyone help?
>


Re: Spark on Yarn - How to configure

2015-06-19 Thread Andrew Or
Hi Ashish,

For Spark on YARN, you actually only need the Spark files on one machine -
the submission client. This machine could even live outside of the cluster.
Then all you need to do is point YARN_CONF_DIR to the directory containing
your hadoop configuration files (e.g. yarn-site.xml) on that machine. All
the jars will be automatically distributed to the nodes in the cluster
accordingly.
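
As a sketch, on the submission client only (paths are examples):

export YARN_CONF_DIR=/etc/hadoop/conf   # directory containing yarn-site.xml etc.
bin/spark-submit --master yarn-cluster --class your.main.Class your.jar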

-Andrew

2015-06-19 12:35 GMT-07:00 Ashish Soni :

> Can someone please let me know everything I need to configure to have Spark
> run using YARN.
>
> There is a lot of documentation, but none of it says how and which files
> need to be changed.
>
> Let's say I have 4 nodes for Spark - SparkMaster, SparkSlave1, SparkSlave2,
> SparkSlave3.
>
> Now, on which nodes do which files need to change to make sure my master node
> is SparkMaster and slave nodes are 1, 2, 3, and how do I tell / configure YARN?
>
> Ashish
>


Re: PySpark on YARN "port out of range"

2015-06-19 Thread Andrew Or
Hm, one thing to see is whether the same port appears many times (1315905645).
The way pyspark works today is that the JVM reads the port from the stdout
of the python process. If there is some interference in output from the
python side (e.g. any print statements, exception messages), then the Java
side will think that it's actually a port even when it's not.

I'm not sure why it fails sometimes but not others, but 2/3 of the time is
a lot...
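
For what it's worth, that particular value decodes cleanly as text: 1315905645
= 0x4E6F206D, i.e. the ASCII bytes 'N', 'o', ' ', 'm' -- which is what you
would get if a Python error message like "No module named ..." were read as a
4-byte big-endian int where the port was expected. Decoding your own bogus
port value the same way may point directly at the stray output.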

2015-06-19 14:57 GMT-07:00 John Meehan :

> Has anyone encountered this “port out of range” error when launching
> PySpark jobs on YARN?  It is sporadic (e.g. 2/3 jobs get this error).
>
> LOG:
>
> 15/06/19 11:49:44 INFO scheduler.TaskSetManager: Lost task 0.3 in stage
> 39.0 (TID 211) on executor xxx.xxx.xxx.com:
> java.lang.IllegalArgumentException (port out of range:1315905645)
> [duplicate 7]
> Traceback (most recent call last):
>  File "", line 1, in 
> 15/06/19 11:49:44 INFO cluster.YarnScheduler: Removed TaskSet 39.0, whose
> tasks have all completed, from pool
>  File "/home/john/spark-1.4.0/python/pyspark/rdd.py", line 745, in collect
>port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>  File
> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>  File
> "/home/john/spark-1.4.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError15/06/19 11:49:44 INFO
> storage.BlockManagerInfo: Removed broadcast_38_piece0 on
> 17.134.160.35:47455 in memory (size: 2.2 KB, free: 265.4 MB)
> : An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1 in stage 39.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 39.0 (TID 210, xxx.xxx.xxx.com): java.lang.IllegalArgumentException: port
> out of range:1315905645
> at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
> at java.net.InetSocketAddress.<init>(InetSocketAddress.java:185)
> at java.net.Socket.<init>(Socket.java:241)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
> at
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
> at
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> * Spark 1.4.0 build:
>
> build/mvn -Pyarn -Phive -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.1.4
> -DskipTests clean package
>
> LAUNCH CMD:
>
> export HADOOP_CONF_DIR=/path/to/conf
> export PYSPARK_PYTHON=/path/to/python-2.7.2/bin/python
> ~/spark-1.4.0/bin/pyspark \
> --conf
> spark.yarn.jar=/home/john/spark-1.4.0/assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.3.0-cdh5.1.4.jar
> \
> --master yarn-client \
> --num-executors 3 \
> --executor-cores 18 \
> --executor-memory 48g
>
> TEST JOB IN REPL:
>
> words = ['hi', 'there', 'yo', 'baby']
> wordsRdd = sc.parallelize(words)
> words.map

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Andrew Or
Hi Patrick,

The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704.
If possible, you may cherry-pick the following commit into your Spark
deployment and it should resolve the issue:

https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948

Note that this commit is only for the 1.3 branch. If you could upgrade to
1.4.0 then you do not need to apply that commit yourself.
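
For reference, applying that commit to a 1.3.1 source checkout looks roughly
like this (a sketch, assuming a git clone of apache/spark):

git checkout -b patched v1.3.1
git cherry-pick 98ac39d2f5828fbdad8c9a4e563ad1169e3b9948
build/mvn -DskipTests package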

-Andrew



2015-06-13 12:01 GMT-07:00 Patrick Woody :

> Hey Sandy,
>
> I'll test it out on 1.4. Do you have a bug number or PR that I could
> reference as well?
>
> Thanks!
> -Pat
>
> Sent from my iPhone
>
> On Jun 13, 2015, at 11:38 AM, Sandy Ryza  wrote:
>
> Hi Patrick,
>
> I'm noticing that you're using Spark 1.3.1.  We fixed a bug in dynamic
> allocation in 1.4 that permitted requesting negative numbers of executors.
> Any chance you'd be able to try with the newer version and see if the
> problem persists?
>
> -Sandy
>
> On Fri, Jun 12, 2015 at 7:42 PM, Patrick Woody 
> wrote:
>
>> Hey all,
>>
>> I've recently run into an issue where spark dynamicAllocation has asked
>> for -1 executors from YARN. Unfortunately, this raises an exception that
>> kills the executor-allocation thread and the application can't request more
>> resources.
>>
>> Has anyone seen this before? It is spurious and the application usually
>> works, but when this gets hit it becomes unusable, stuck at
>> minimum YARN resources.
>>
>> Stacktrace below.
>>
>> Thanks!
>> -Pat
>>
>> 470 ERROR [2015-06-12 16:44:39,724] org.apache.spark.util.Utils: Uncaught
>> exception in thread spark-dynamic-executor-allocation-0
>> 471 ! java.lang.IllegalArgumentException: Attempted to request a negative
>> number of executor(s) -1 from the cluster manager. Please specify a
>> positive number!
>> 472 ! at
>> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
>> ~[spark-core_2.10-1.3.1.jar:1.
>> 473 ! at
>> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 474 ! at
>> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 475 ! at
>> org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 476 ! at 
>> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
>> ~[spark-core_2.10-1.3.1.j
>> 477 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 478 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 479 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 480 ! at
>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> 481 ! at
>> org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
>> [spark-core_2.10-1.3.1.jar:1.3.1]
>> 482 ! at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> [na:1.7.0_71]
>> 483 ! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>> [na:1.7.0_71]
>> 484 ! at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>> [na:1.7.0_71]
>> 485 ! at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> [na:1.7.0_71]
>> 486 ! at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> [na:1.7.0_71]
>> 487 ! at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> [na:1.7.0_71]
>>
>
>


Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Andrew Or
Hi Peng,

Setting properties through --conf should still work in Spark 1.4. From the
warning it looks like the config you are trying to set does not start with
the prefix "spark.". What is the config that you are trying to set?

-Andrew

2015-06-12 11:17 GMT-07:00 Peng Cheng :

> In Spark 1.3.x and earlier, the system properties of the driver could be set
> via the --conf option, which was shared between Spark properties and system
> properties.
>
> In Spark 1.4.0 this feature is removed; the driver instead logs the following
> warning:
>
> Warning: Ignoring non-spark config property: xxx.xxx=v
>
> How do I set the driver's system properties in 1.4.0? Is there a reason it was
> removed without a deprecation warning?
>
> Thanks a lot for your advice.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-0-How-to-set-driver-s-system-property-using-spark-submit-options-tp23298.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Andrew Or
Hi Deepak,

This is a notorious bug that is being tracked at
https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source
of this bug (it turns out Snappy had a bug in buffer reuse that caused data
corruption). There are other known sources that are being addressed in
outstanding patches currently.

Since you're using 1.3.1 my guess is that you don't have this patch:
https://github.com/apache/spark/pull/6176, which I believe should fix the
issue in your case. It's merged for 1.3.2 (not yet released) but not in
time for 1.3.1, so feel free to patch it yourself and see if it works.

-Andrew


2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :

> Any suggestions ?
>
> I am using Spark 1.3.1 to read a sequence file stored in Sequence File format
> (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
> )
>
> with this code and settings
> sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
> org.apache.spark.HashPartitioner(2053))
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryoserializer.buffer.mb",
> arguments.get("buffersize").get)
>   .set("spark.kryoserializer.buffer.max.mb",
> arguments.get("maxbuffersize").get)
>   .set("spark.driver.maxResultSize",
> arguments.get("maxResultSize").get)
>   .set("spark.yarn.maxAppAttempts", "0")
>   //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
>   //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
>   //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
>
> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>
>
> and values are
> buffersize=128 maxbuffersize=1068 maxResultSize=200G
>
>
> And i see this exception in each executor task
>
> FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
> 54757), shuffleId=6, mapId=2810, reduceId=1117, message=
>
> org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*
>
> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>
> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>
> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>
> at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>
> at
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
>
> at
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
>
> at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>
> at
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
>
> at
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)
>
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at scala.util.Success.map(Try.scala:206)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(Shuffle

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Andrew Or
Hi all,

As the author of the dynamic allocation feature I can offer a few insights
here.

Gerard's explanation was both correct and concise: dynamic allocation is
not intended to be used in Spark streaming at the moment (1.4 or before).
This is because of two things:

(1) Number of receivers is necessarily fixed, and these are started in
executors. Since we need a receiver for each InputDStream, if we kill these
receivers we essentially stop the stream, which is not what we want. It
makes little sense to close and restart a stream the same way we kill and
relaunch executors.

(2) Records come in every batch, and when there is data to process your
executors are not idle. If your idle timeout is less than the batch
duration, then you'll end up having to constantly kill and restart
executors. If your idle timeout is greater than the batch duration, then
you'll never kill executors.
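
As a concrete sketch of (2): with a 10-second batch interval, an idle timeout
below 10s would churn executors constantly, while anything above it would
never fire. The relevant knob (name and units may vary by version, so
double-check yours) is something like:

--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.executorIdleTimeout=60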

Long answer short, with Spark streaming there is currently no
straightforward way to scale the size of your cluster. I had a long
discussion with TD (Spark streaming lead) about what needs to be done to
provide some semblance of dynamic scaling to streaming applications, e.g.
take into account the batch queue instead. We came up with a few ideas that
I will not detail here, but we are looking into this and do intend to
support it in the near future.

-Andrew



2015-05-28 8:02 GMT-07:00 Evo Eftimov :

> Probably you should ALWAYS keep the RDD storage policy at MEMORY AND DISK
> – it will be your insurance policy against crashes due to memory leaks.
> While there is free RAM, spark streaming (spark) will NOT resort to disk –
> it resorts to disk from time to time (i.e. when there is no free RAM),
> taking a performance hit from that, but only while there is no free RAM
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
> *Sent:* Thursday, May 28, 2015 2:34 PM
> *To:* Evo Eftimov
> *Cc:* Gerard Maas; spark users
> *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
> sizes/rate of growth in Kafka or Spark's metrics?
>
>
>
> Evo, good points.
>
>
>
> On the dynamic resource allocation, I'm surmising this only works within a
> particular cluster setup.  So it improves the usage of current cluster
> resources but it doesn't make the cluster itself elastic. At least, that's
> my understanding.
>
>
>
> Memory + disk would be good and hopefully it'd take *huge* load on the
> system to start exhausting the disk space too.  I'd guess that falling onto
> disk will make things significantly slower due to the extra I/O.
>
>
>
> Perhaps we'll really want all of these elements eventually.  I think we'd
> want to start with memory only, keeping maxRate low enough not to overwhelm
> the consumers; implement the cluster autoscaling.  We might experiment with
> dynamic resource allocation before we get to implement the cluster
> autoscale.
>
>
>
>
>
>
>
> On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov 
> wrote:
>
> You can also try Dynamic Resource Allocation
>
>
>
>
> https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation
>
>
>
> Also re the Feedback Loop for automatic message consumption rate
> adjustment – there is a “dumb” solution option – simply set the storage
> policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
> exhausted spark streaming will resort to keeping new RDDs on disk which
> will prevent it from crashing and hence loosing them. Then some memory will
> get freed and it will resort back to RAM and so on and so forth
>
>
>
>
>
> Sent from Samsung Mobile
>
>  Original message 
>
> From: Evo Eftimov
>
> Date:2015/05/28 13:22 (GMT+00:00)
>
> To: Dmitry Goldenberg
>
> Cc: Gerard Maas ,spark users
>
> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
> in Kafka or Spark's metrics?
>
>
>
> You can always spin up new boxes in the background and bring them into the
> cluster fold when fully operational, and time that with a job relaunch and
> parameter change
>
>
>
> Kafka offsets are managed automatically for you by the Kafka clients, which
> keep them in ZooKeeper – don't worry about that as long as you shut down your
> job gracefully. Besides, managing the offsets explicitly is not a big deal if
> necessary
>
>
>
>
>
> Sent from Samsung Mobile
>
>
>
>  Original message 
>
> From: Dmitry Goldenberg
>
> Date:2015/05/28 13:16 (GMT+00:00)
>
> To: Evo Eftimov
>
> Cc: Gerard Maas ,spark users
>
> Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
> in Kafka or Spark's metrics?
>
>
>
> Thanks, Evo.  Per the last part of your comment, it sounds like we will
> need to implement a job manager which will be in control of starting the
> jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
> marking them as ones to relaunch, scaling the cluster up/down by
> adding/removing machines, and relaunching the 'suspended' (shut down) jobs.
>
>
>
> I suspect that relaunching the j

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or
Hi Jianshi,

For YARN, there may be an issue with how a recently patch changes the
accessibility of the shuffle files by the external shuffle service:
https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you
will hit this with 1.2.1, actually. For this reason I would have to
recommend that you use 1.2.2 when it is released, but for now you should
use 1.2.0 for this specific use case.

-Andrew

2015-02-10 23:38 GMT-08:00 Reynold Xin :

> I think we made the binary protocol compatible across all versions, so you
> should be fine with using any one of them. 1.2.1 is probably the best since
> it is the most recent stable release.
>
> On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang 
> wrote:
>
>> Hi,
>>
>> I need to use branch-1.2 and sometimes master builds of Spark for my
>> project. However the officially supported Spark version by our Hadoop admin
>> is only 1.2.0.
>>
>> So, my question is which version/build of spark-yarn-shuffle.jar should I
>> use that works for all four versions? (1.2.0, 1.2.1, 1.2.2, 1.3.0)
>>
>> Thanks,
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


Re: Aggregate order semantics when spilling

2015-01-20 Thread Andrew Or
Hi Justin,

I believe the intended semantics of groupByKey or cogroup is that the
ordering *within a key* is not preserved if you spill. In fact, the test
cases for the ExternalAppendOnlyMap only assert that the Set representation
of the results is as expected (see this line).
This is because these Spark primitives literally just group the values by a
key but does not provide any ordering guarantees.

However, if ordering within a key is a requirement for your application,
then you may need to write your own PairRDDFunction that calls
combineByKey. You can model your method after groupByKey, but change the
combiner function slightly to take ordering into account. This may add some
overhead to your application since you need to insert every value in the
appropriate place, but since you're spilling anyway the overhead will
likely be shadowed by disk I/O.
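
As a rough sketch of a simpler variant (not what groupByKey does internally):
carry an explicit timestamp with each value and sort within each key after
combining, so the result no longer depends on the order in which spilled
combiners are merged:

import scala.reflect.ClassTag
import org.apache.spark.SparkContext._  // brings in the pair RDD functions
import org.apache.spark.rdd.RDD

// values are (timestamp, payload) pairs
def groupByKeyOrdered[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, (Long, V))]): RDD[(K, Seq[(Long, V)])] =
  rdd.combineByKey[Seq[(Long, V)]](
    (v: (Long, V)) => Vector(v),                         // createCombiner
    (c: Seq[(Long, V)], v: (Long, V)) => c :+ v,         // mergeValue: arrival order
    (a: Seq[(Long, V)], b: Seq[(Long, V)]) => a ++ b     // spill merge: order unknown
  ).mapValues(_.sortBy(_._1))                            // final sort restores order

This trades in-place insertion for one sort per key at the end, which is
usually acceptable if you are already paying for disk I/O.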

Let me know if that works.
-Andrew


2015-01-20 9:18 GMT-08:00 Justin Uang :

> Hi,
>
> I am trying to aggregate a key based on some timestamp, and I believe that
> spilling to disk is changing the order of the data fed into the combiner.
>
> I have some timeseries data that is of the form: ("key", "date", "other
> data")
>
> Partition 1
> ("A", 2, ...)
> ("B", 4, ...)
> ("A", 1, ...)
> ("A", 3, ...)
> ("B", 6, ...)
>
> which I then partition by key, then sort within the partition:
>
> Partition 1
> ("A", 1, ...)
> ("A", 2, ...)
> ("A", 3, ...)
> ("A", 4, ...)
>
> Partition 2
> ("B", 4, ...)
> ("B", 6, ...)
>
> If I run a combineByKey with the same partitioner, then the items for each
> key will be fed into the ExternalAppendOnlyMap in the correct order.
> However, if I spill, then the time slices are spilled to disk as multiple
> partial combiners. When its time to merge the spilled combiners for each
> key, the combiners are combined in the wrong order.
>
> For example, if during a groupByKey, [("A", 1, ...), ("A", 2...)] and
> [("A", 3, ...), ("A", 4, ...)] are spilled separately, it's possible that
> the combiners can be combined in the wrong order, like [("A", 3, ...),
> ("A", 4, ...), ("A", 1, ...), ("A", 2, ...)], which invalidates the
> invariant that all the values for A are passed in order to the combiners.
>
> I'm not an expert, but I suspect that this is because we use a heap
> ordered by key when iterating, which doesn't retain the order of the spilled
> combiners. Perhaps we can order our mergeHeap by (hash_key, spill_index),
> where spill_index is incremented each time we spill? This would mean that
> we would pop and merge the combiners of each key in order, resulting in
> [("A", 1, ...), ("A", 2, ...), ("A", 3, ...), ("A", 4, ...)].
>
> Thanks in advance for the help! If there is a way to do this already in
> Spark 1.2, can someone point it out to me?
>
> Best,
>
> Justin
>


Re: spark-submit --py-files remote: "Only local additional python files are supported"

2015-01-20 Thread Andrew Or
Hi Vladimir,

Yes, as the error messages suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until this is supported, the straightforward workaround then is to just
copy the files to your local machine.
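
As a sketch, with the paths from the message below:

aws s3 cp s3://pathtomybucket/mylibrary.py /home/hadoop/mylibrary.py
spark-submit --py-files /home/hadoop/mylibrary.py main.py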

-Andrew

2015-01-20 7:38 GMT-08:00 Vladimir Grigor :

> Hi all!
>
> I found this problem when I tried running python application on Amazon's
> EMR yarn cluster.
>
> It is possible to run bundled example applications on EMR, but I cannot
> figure out how to run a slightly more complex Python application which
> depends on some other Python scripts. I tried adding those files with
> '--py-files' and it works fine in local mode, but it fails and gives me the
> following message when run in EMR:
> "Error: Only local python files are supported:
> s3://pathtomybucket/mylibrary.py".
>
> Simplest way to reproduce in local:
> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>
> Actual commands to run it in EMR
> #launch cluster
> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
> --instance-type m1.medium --instance-count 2  --ec2-attributes
> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
> --enable-debugging --use-default-roles  --bootstrap-action
> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","
> http://pathtomybucket/bootstrap-actions/spark
> ","-l","WARN","-v","1.2","-b","2014121700","-x"]
> #{
> #   "ClusterId": "j-2Y58DME79MPQJ"
> #}
>
> #run application
> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
> #{
> #"StepIds": [
> #"s-2UP4PP75YX0KU"
> #]
> #}
> And in stderr of that step I get "Error: Only local python files are
> supported: s3://pathtomybucket/tasks/demo/main.py".
>
> What is the workaround or correct way to do it? Using hadoop's distcp to
> copy dependency files from s3 to nodes as another pre-step?
>
> Regards, Vladimir
>


Re: PySpark Client

2015-01-20 Thread Andrew Or
Hi Chris,

Short answer is no, not yet.

Longer answer is that PySpark only supports client mode, which means your
driver runs on the same machine as your submission client. By corollary
this means your submission client must currently depend on all of Spark and
its dependencies. There is a patch that supports this for *cluster* mode
(as opposed to client mode), which would be the first step towards what you
want.

-Andrew

2015-01-20 8:36 GMT-08:00 Chris Beavers :

> Hey all,
>
> Is there any notion of a lightweight python client for submitting jobs to
> a Spark cluster remotely? If I essentially install Spark on the client
> machine, and that machine has the same OS, same version of Python, etc.,
> then I'm able to communicate with the cluster just fine. But if Python
> versions differ slightly, then I start to see a lot of opaque errors that
> often bubble up as EOFExceptions. Furthermore, this just seems like a very
> heavyweight way to set up a client.
>
> Does anyone have any suggestions for setting up a thin pyspark client on a
> node which doesn't necessarily conform to the homogeneity of the target
> Spark cluster?
>
> Best,
> Chris
>


Re: Failing jobs runs twice

2015-01-13 Thread Andrew Or
Hi Anders, are you using YARN by any chance?
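
(If it is YARN, the re-run usually comes from the ApplicationMaster retry
policy. As a sketch, capping it looks like
bin/spark-submit --conf spark.yarn.maxAppAttempts=1 ... on newer versions, or
yarn.resourcemanager.am.max-attempts in yarn-site.xml -- treat the exact knob
as version-dependent.)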

2015-01-13 0:32 GMT-08:00 Anders Arpteg :

> Since starting to use Spark 1.2, I've experienced an annoying issue with
> failing apps that get executed twice. I'm not talking about tasks inside a
> job, which should be executed multiple times before failing the whole app.
> I'm talking about the whole app, which seems to close the previous Spark
> context, start a new one, and rerun the app again.
>
> This is annoying since it overwrites the log files as well, and it becomes
> hard to troubleshoot the failing app. Does anyone know how to turn this
> "feature" off?
>
> Thanks,
> Anders
>


Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2015-01-07 Thread Andrew Or
Did you end up getting it working? By the way this might be a nicer view of
the docs:
https://github.com/apache/spark/blob/60e2d9e2902b132b14191c9791c71e8f0d42ce9d/docs/job-scheduling.md

We will update the latest Spark docs to include this shortly.
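
The short version, as a sketch (the shuffle service must also be running on
each node):

--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true
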
-Andrew

2015-01-04 4:44 GMT-08:00 Tsuyoshi Ozawa :

> Please check the document added by Andrew. I could run tasks with Spark
> 1.2.0.
>
> *
> https://github.com/apache/spark/pull/3731/files#diff-c3cbe4cabe90562520f22d2306aa9116R86
> *
> https://github.com/apache/spark/pull/3757/files#diff-c3cbe4cabe90562520f22d2306aa9116R101
>
> Thanks,
> - Tsuyoshi
>
> On Sun, Jan 4, 2015 at 11:54 AM, firemonk9 
> wrote:
> > I am running into a similar problem. Have you found any resolution to this
> > issue?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Elastic-allocation-spark-dynamicAllocation-enabled-results-in-task-never-being-executed-tp18969p20957.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to increase parallelism in Yarn

2014-12-18 Thread Andrew Or
Hi Suman,

I'll assume that you are using spark-submit to run your application. You
can pass the --num-executors flag to ask for more containers. If you want
to allocate more memory for each executor, you may also pass in the
--executor-memory flag (this accepts a string in the format 1g, 512m etc.).
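
For example (numbers are illustrative):

bin/spark-submit --master yarn-cluster \
  --num-executors 25 \
  --executor-memory 2g \
  --class your.main.Class your.jar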

-Andrew

2014-12-18 14:37 GMT-08:00 Suman Somasundar :
>
> Hi,
>
>
>
> I am using Spark 1.1.1 on Yarn. When I try to run K-Means, I see from the
> Yarn dashboard that only 3 containers are being used. How do I increase the
> number of containers used?
>
>
>
> P.S: When I run K-Means on Mahout with the same settings, I see that there
> are 25-30 containers being used.
>
>
>
> Thanks,
> Suman.
>


Re: Standalone Spark program

2014-12-18 Thread Andrew Or
Hey Akshat,

What is the class that is not found, is it a Spark class or classes that
you define in your own application? If the latter, then Akhil's solution
should work (alternatively you can also pass the jar through the --jars
command line option in spark-submit).

If it's a Spark class, however, it's likely that the Spark assembly jar is
not present on the worker nodes. When you build Spark on the cluster, you
will need to rsync it to the same path on all the nodes in your cluster.
For more information, see
http://spark.apache.org/docs/latest/spark-standalone.html.
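
As a sketch of both routes:

bin/spark-submit --master spark://<master>:7077 \
  --jars /path/to/dep.jar --class your.main.Class your-app.jar
# or keep the Spark build identical everywhere:
rsync -az /path/to/spark/ worker-host:/path/to/spark/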

-Andrew

2014-12-18 10:29 GMT-08:00 Akhil Das :
>
> You can build a jar of your project and add it to the sparkContext
> (sc.addJar("/path/to/your/project.jar")) then it will get shipped to the
> worker and hence no classNotfoundException!
>
> Thanks
> Best Regards
>
> On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya  wrote:
>>
>> Hi,
>>
>> I am building a Spark-based service which requires initialization of a
>> SparkContext in a main():
>>
>> def main(args: Array[String]) {
>> val conf = new SparkConf(false)
>>   .setMaster("spark://foo.example.com:7077")
>>   .setAppName("foobar")
>>
>> val sc = new SparkContext(conf)
>> val rdd = sc.parallelize(0 until 255)
>> val res =  rdd.mapPartitions(it => it).take(1)
>> println(s"res=$res")
>> sc.stop()
>> }
>>
>> This code works fine via REPL, but not as a standalone program; it causes
>> a ClassNotFoundException.  This has me confused about how code is shipped
>> out to executors.  When used via the REPL, does the mapPartitions closure,
>> it=>it, get sent out when the REPL statement is executed?  When this code
>> is run as a standalone program (not via spark-submit), is the compiled code
>> expected to be present on the executor?
>>
>> Thanks,
>> Akshat
>>
>>


Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Arun, I've seen that behavior before. It happens when the cluster
doesn't have enough resources to offer and the RM hasn't given us our
containers yet. Can you check the RM Web UI at port 8088 to see whether
your application is requesting more resources than the cluster has to offer?

2014-12-05 12:51 GMT-08:00 Sandy Ryza :

> Hey Arun,
>
> The sleeps would only cause maximum like 5 second overhead.  The idea was
> to give executors some time to register.  On more recent versions, they
> were replaced with the spark.scheduler.minRegisteredResourcesRatio and
> spark.scheduler.maxRegisteredResourcesWaitingTime.  As of 1.1, by default
> YARN will wait until either 30 seconds have passed or 80% of the requested
> executors have registered.
>
> -Sandy
>
> On Fri, Dec 5, 2014 at 12:46 PM, Ashish Rangole 
> wrote:
>
>> Likely this is not the case here, yet one thing to point out with YARN
>> parameters like --num-executors is that they should be specified *before*
>> the app jar and app args on the spark-submit command line; otherwise the app
>> only gets the default number of containers, which is 2.
>> On Dec 5, 2014 12:22 PM, "Sandy Ryza"  wrote:
>>
>>> Hi Denny,
>>>
>>> Those sleeps were only at startup, so if jobs are taking significantly
>>> longer on YARN, that should be a different problem.  When you ran on YARN,
>>> did you use the --executor-cores, --executor-memory, and --num-executors
>>> arguments?  When running against a standalone cluster, by default Spark
>>> will make use of all the cluster resources, but when running against YARN,
>>> Spark defaults to a couple tiny executors.
>>>
>>> -Sandy
>>>
>>> On Fri, Dec 5, 2014 at 11:32 AM, Denny Lee 
>>> wrote:
>>>
>>>> My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand
>>>> steps. If I ran this in standalone cluster mode, the query finished
>>>> in 55s, but on YARN the query was still running 30 min later. Would the hard
>>>> coded sleeps potentially be in play here?
>>>> On Fri, Dec 5, 2014 at 11:23 Sandy Ryza 
>>>> wrote:
>>>>
>>>>> Hi Tobias,
>>>>>
>>>>> What version are you using?  In some recent versions, we had a couple
>>>>> of large hardcoded sleeps on the Spark side.
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or 
>>>>> wrote:
>>>>>
>>>>>> Hey Tobias,
>>>>>>
>>>>>> As you suspect, the reason why it's slow is because the resource
>>>>>> manager in YARN takes a while to grant resources. This is because YARN
>>>>>> needs to first set up the application master container, and then this AM
>>>>>> needs to request more containers for Spark executors. I think this 
>>>>>> accounts
>>>>>> for most of the overhead. The remaining source probably comes from how 
>>>>>> our
>>>>>> own YARN integration code polls application (every second) and cluster
>>>>>> resource states (every 5 seconds IIRC). I haven't explored in detail
>>>>>> whether there are optimizations there that can speed this up, but I 
>>>>>> believe
>>>>>> most of the overhead comes from YARN itself.
>>>>>>
>>>>>> In other words, no I don't know of any quick fix on your end that you
>>>>>> can do to speed this up.
>>>>>>
>>>>>> -Andrew
>>>>>>
>>>>>>
>>>>>> 2014-12-03 20:10 GMT-08:00 Tobias Pfeiffer :
>>>>>>
>>>>>> Hi,
>>>>>>>
>>>>>>> I am using spark-submit to submit my application to YARN in
>>>>>>> "yarn-cluster" mode. I have both the Spark assembly jar file as well as 
>>>>>>> my
>>>>>>> application jar file put in HDFS and can see from the logging output 
>>>>>>> that
>>>>>>> both files are used from there. However, it still takes about 10 seconds
>>>>>>> for my application's yarnAppState to switch from ACCEPTED to RUNNING.
>>>>>>>
>>>>>>> I am aware that this is probably not a Spark issue, but some YARN
>>>>>>> configuration setting (or YARN-inherent slowness), I was just wondering 
>>>>>>> if
>>>>>>> anyone has an advice for how to speed this up.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Tobias
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>


Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Andrew Or
Increasing max failures is a way to do it, but it's probably a better idea
to keep your tasks from failing in the first place. Are your tasks failing
with exceptions from Spark or your application code? If from Spark, what is
the stack trace? There might be a legitimate Spark bug such that even
increasing this max failures won't fix your problem.
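
(For reference, the knob Daniel points to below can be raised at submit time,
e.g. bin/spark-submit --conf spark.task.maxFailures=8 ... -- but do chase the
root cause first.)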

2014-12-05 5:12 GMT-08:00 Daniel Darabos :

> It is controlled by "spark.task.maxFailures". See
> http://spark.apache.org/docs/latest/configuration.html#scheduling.
>
> On Fri, Dec 5, 2014 at 11:02 AM, shahab  wrote:
>
>> Hello,
>>
>> For some (unknown) reason, some of my tasks that fetch data from
>> Cassandra are failing quite often, and apparently the master removes a task
>> which fails more than 4 times (in my case).
>>
>> Is there any way to increase the number of retries?
>>
>> best,
>> /Shahab
>>
>
>


Re: Any ideas why a few tasks would stall

2014-12-05 Thread Andrew Or
Hi Steve et al.,

It is possible that there's just a lot of skew in your data, in which case
repartitioning is a good idea. Depending on how large your input data is
and how much skew you have, you may want to repartition to a larger number
of partitions. By the way you can just call rdd.repartition(1000); this is
the same as rdd.coalesce(1000, shuffle = true). Note that
repartitioning is only a good idea if your straggler task is taking a long
time. Otherwise, it can be quite expensive since it requires a full shuffle.

Another possibility is that you might just have bad nodes in your cluster.
To mitigate stragglers, you can try enabling speculative execution through
spark.speculation to true. This attempts to re-run any task that takes a
long time to complete on a different node in parallel.
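
As a sketch:

bin/spark-submit --conf spark.speculation=true ...

and, if the defaults re-launch too eagerly, tune spark.speculation.interval,
spark.speculation.quantile and spark.speculation.multiplier.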

-Andrew

2014-12-04 11:43 GMT-08:00 akhandeshi :

> This did not work for me.  that is, rdd.coalesce(200, forceShuffle) .  Does
> anyone have ideas on how to distribute your data evenly and co-locate
> partitions of interest?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Any-ideas-why-a-few-tasks-would-stall-tp20207p20387.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Issue in executing Spark Application from Eclipse

2014-12-05 Thread Andrew Or
Hey Stuti,

Did you start your standalone Master and Workers? You can do this through
sbin/start-all.sh (see
http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I
would recommend launching your application from the command line through
bin/spark-submit. I am not sure if we officially support launching Spark
applications from an IDE, because spark-submit handles very specific cases
of how we set up class paths and JVM memory etc.
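
As a sketch of that flow (class and jar names are placeholders):

sbin/start-all.sh   # starts the standalone Master and Workers
bin/spark-submit --master spark://<master-host>:7077 \
  --class your.main.Class your-app.jar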

-Andrew

2014-12-03 22:05 GMT-08:00 Stuti Awasthi :

>  Hi All,
>
> I have a standalone Spark(1.1) cluster on one machine and I have installed
> scala Eclipse IDE (scala 2.10) on my desktop. I am trying to run
> Spark code against my standalone cluster but am getting errors.
>
> Please guide me to resolve this.
>
>
>
> Code:
>
>   val logFile = "<some file on your system>" // Should be some file on
> your system
>
> val conf = new SparkConf().setAppName("Simple
> Application").setMaster("spark://:").setSparkHome("/home/stuti/Spark/spark-1.1.0-bin-hadoop1");
>
> val sc = new SparkContext(conf)
>
>
> println(sc.master)
> // Print correct  master
>
>val logData = sc.textFile(logFile, 2).cache()
>
>
> println(logData.count)
> // throws error
>
>
>
>
>
>
>
> Error :
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
>
> 14/12/04 11:05:38 INFO SecurityManager: Changing view acls to:
> stutiawasthi,
>
> 14/12/04 11:05:38 INFO SecurityManager: Changing modify acls to:
> stutiawasthi,
>
> 14/12/04 11:05:38 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(stutiawasthi,
> ); users with modify permissions: Set(stutiawasthi, )
>
> 14/12/04 11:05:39 INFO Slf4jLogger: Slf4jLogger started
>
> 14/12/04 11:05:39 INFO Remoting: Starting remoting
>
> 14/12/04 11:05:40 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@:62308]
>
> 14/12/04 11:05:40 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sparkDriver@:62308]
>
> 14/12/04 11:05:40 INFO Utils: Successfully started service 'sparkDriver'
> on port 62308.
>
> 14/12/04 11:05:40 INFO SparkEnv: Registering MapOutputTracker
>
> 14/12/04 11:05:40 INFO SparkEnv: Registering BlockManagerMaster
>
> 14/12/04 11:05:40 INFO DiskBlockManager: Created local directory at
> C:\Users\STUTIA~1\AppData\Local\Temp\spark-local-20141204110540-ad60
>
> 14/12/04 11:05:40 INFO Utils: Successfully started service 'Connection
> manager for block manager' on port 62311.
>
> 14/12/04 11:05:40 INFO ConnectionManager: Bound socket to port 62311 with
> id = ConnectionManagerId(,62311)
>
> 14/12/04 11:05:41 INFO MemoryStore: MemoryStore started with capacity
> 133.6 MB
>
> 14/12/04 11:05:41 INFO BlockManagerMaster: Trying to register BlockManager
>
> 14/12/04 11:05:41 INFO BlockManagerMasterActor: Registering block manager
> :62311 with 133.6 MB RAM
>
> 14/12/04 11:05:41 INFO BlockManagerMaster: Registered BlockManager
>
> 14/12/04 11:05:41 INFO HttpFileServer: HTTP File server directory is
> C:\Users\STUTIA~1\AppData\Local\Temp\spark-b65e69f4-69b9-4bb2-b41f-67165909e4c7
>
> 14/12/04 11:05:41 INFO HttpServer: Starting HTTP Server
>
> 14/12/04 11:05:41 INFO Utils: Successfully started service 'HTTP file
> server' on port 62312.
>
> 14/12/04 11:05:42 INFO Utils: Successfully started service 'SparkUI' on
> port 4040.
>
> 14/12/04 11:05:42 INFO SparkUI: Started SparkUI at http://
> :4040
>
> 14/12/04 11:05:43 INFO AppClient$ClientActor: Connecting to master
> spark://10.112.67.80:7077...
>
> 14/12/04 11:05:43 INFO SparkDeploySchedulerBackend: SchedulerBackend is
> ready for scheduling beginning after reached minRegisteredResourcesRatio:
> 0.0
>
> spark://10.112.67.80:7077
>
> 14/12/04 11:05:44 WARN SizeEstimator: Failed to check whether
> UseCompressedOops is set; assuming yes
>
> 14/12/04 11:05:45 INFO MemoryStore: ensureFreeSpace(31447) called with
> curMem=0, maxMem=140142182
>
> 14/12/04 11:05:45 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 30.7 KB, free 133.6 MB)
>
> 14/12/04 11:05:45 INFO MemoryStore: ensureFreeSpace(3631) called with
> curMem=31447, maxMem=140142182
>
> 14/12/04 11:05:45 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 3.5 KB, free 133.6 MB)
>
> 14/12/04 11:05:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on :62311 (size: 3.5 KB, free: 133.6 MB)
>
> 14/12/04 11:05:45 INFO BlockManagerMaster: Updated info of block
> broadcast_0_piece0
>
> 14/12/04 11:05:45 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 14/12/04 11:05:45 WARN LoadSnappy: Snappy native library not loaded
>
> 14/12/04 11:05:46 INFO FileInputFormat: Total input paths to process : 1
>
> 14/12/04 11:05:46 INFO SparkContext: Starting job: count at Test.scala:15
>
> 14/12/04 11:05:46 INFO DAGScheduler: Got job 0 (count at Test.scala:15)
> with 2 output partitions (all

Re: Monitoring Spark

2014-12-05 Thread Andrew Or
If you're only interested in a particular instant, a simpler way is to
check the executors page on the Spark UI:
http://spark.apache.org/docs/latest/monitoring.html. By default each
executor runs one task per core, so you can see how many tasks are being
run at a given time and this translates directly to how many cores are
being used for execution.

2014-12-02 21:49 GMT-08:00 Otis Gospodnetic :

> Hi Isca,
>
> I think SPM can do that for you:
> http://blog.sematext.com/2014/10/07/apache-spark-monitoring/
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Dec 2, 2014 at 11:57 PM, Isca Harmatz  wrote:
>
>> hello,
>>
>> I'm running Spark on a cluster and I want to monitor how many nodes/cores
>> are active at different (specific) points of the program.
>>
>> Is there any way to do this?
>>
>> thanks,
>>   Isca
>>
>
>


Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Tobias,

As you suspect, the reason why it's slow is because the resource manager in
YARN takes a while to grant resources. This is because YARN needs to first
set up the application master container, and then this AM needs to request
more containers for Spark executors. I think this accounts for most of the
overhead. The remaining source probably comes from how our own YARN
integration code polls application (every second) and cluster resource
states (every 5 seconds IIRC). I haven't explored in detail whether there
are optimizations there that can speed this up, but I believe most of the
overhead comes from YARN itself.

In other words, no I don't know of any quick fix on your end that you can
do to speed this up.

-Andrew


2014-12-03 20:10 GMT-08:00 Tobias Pfeiffer :

> Hi,
>
> I am using spark-submit to submit my application to YARN in "yarn-cluster"
> mode. I have both the Spark assembly jar file as well as my application jar
> file put in HDFS and can see from the logging output that both files are
> used from there. However, it still takes about 10 seconds for my
> application's yarnAppState to switch from ACCEPTED to RUNNING.
>
> I am aware that this is probably not a Spark issue, but some YARN
> configuration setting (or YARN-inherent slowness), I was just wondering if
> anyone has an advice for how to speed this up.
>
> Thanks
> Tobias
>


Re: Unable to run applications on clusters on EC2

2014-12-05 Thread Andrew Or
Hey, the default port is 7077. Not sure if you actually meant to put 7070.
As a rule of thumb, you can go to the Master web UI and copy and paste the
URL at the top left corner. That almost always works unless your cluster
has a weird proxy set up.
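
For example, with the host from your log (note 7077, not 7070):

bin/spark-submit --master spark://ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7077 ...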

2014-12-04 14:26 GMT-08:00 Xingwei Yang :

> I think it is related to my previous questions, but I'll separate them. In my
> previous question, I could not connect to WebUI even though I could log
> into the cluster without any problem.
>
> Also, I tried lynx localhost:8080 and I could get the information about
> the cluster;
>
> I could also use spark-submit to submit the job locally by setting master to
> localhost
>
>
> However, I could not submit the job to the cluster master and I get the
> error like this:
>
> 14/12/04 22:14:39 WARN scheduler.TaskSchedulerImpl: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
> 14/12/04 22:14:42 INFO client.AppClient$ClientActor: Connecting to master
> spark://ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070...
> 14/12/04 22:14:42 WARN client.AppClient$ClientActor: Could not connect to
> akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070:
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070]
> 14/12/04 22:14:42 WARN client.AppClient$ClientActor: Could not connect to
> akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070:
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070]
> 14/12/04 22:14:42 WARN client.AppClient$ClientActor: Could not connect to
> akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070:
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070]
> 14/12/04 22:14:42 WARN client.AppClient$ClientActor: Could not connect to
> akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070:
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://
> sparkmas...@ec2-54-149-92-187.us-west-2.compute.amazonaws.com:7070]
>
>
> Please let me know if you have any clue about it. Thanks a lot.
>
> --
> Sincerely Yours
> Xingwei Yang
> https://sites.google.com/site/xingweiyang1223/
>
>


Re: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Andrew Or
Hey Sourav, are you able to run a simple shuffle in a spark-shell?
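
For instance, pasting something along these lines into spark-shell forces a
shuffle and should return 10 if the BlockManager is healthy:

  // reduceByKey forces a shuffle across the 10 partitions.
  val counts = sc.parallelize(1 to 1000, 10)
    .map(x => (x % 10, 1))
    .reduceByKey(_ + _)
  counts.count()  // expect 10 distinct keys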

2014-12-05 1:20 GMT-08:00 Shao, Saisai :

>  Hi,
>
>
>
> I don't think it's a problem with Spark Streaming; judging from the call
> stack, the problem occurs while the BlockManager is initializing itself.
> Would you mind checking your Spark configuration, hardware, and deployment?
> Mostly I think it's not a problem with Spark.
>
>
>
> Thanks
>
> Saisai
>
>
>
> *From:* Sourav Chandra [mailto:sourav.chan...@livestream.com]
> *Sent:* Friday, December 5, 2014 4:36 PM
> *To:* user@spark.apache.org
> *Subject:* Spark streaming for v1.1.1 - unable to start application
>
>
>
> Hi,
>
>
>
> I am getting the below error, and due to this there are no completed
> stages; all of them are waiting.
>
>
>
> *14/12/05 03:31:59 WARN AkkaUtils: Error sending message in 1 attempts*
>
> *java.util.concurrent.TimeoutException: Futures timed out after [30
> seconds]*
>
> *at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)*
>
> *at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)*
>
> *at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)*
>
> *at
> akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)*
>
> *at
> scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)*
>
> *at
> akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)*
>
> *at scala.concurrent.Await$.result(package.scala:107)*
>
> *at
> org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)*
>
> *at
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:213)*
>
> *at
> org.apache.spark.storage.BlockManagerMaster.tell(BlockManagerMaster.scala:203)*
>
> *at
> org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:47)*
>
> *at
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:177)*
>
> *at
> org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:147)*
>
> *at
> org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:168)*
>
> *at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)*
>
> *at org.apache.spark.executor.Executor.<init>(Executor.scala:78)*
>
> *at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:60)*
>
> *at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)*
>
> *at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)*
>
> *at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)*
>
> *at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)*
>
> *at
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)*
>
> *at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)*
>
> *at
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)*
>
> *at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)*
>
> *at akka.actor.ActorCell.invoke(ActorCell.scala:456)*
>
> *at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)*
>
> *at akka.dispatch.Mailbox.run(Mailbox.scala:219)*
>
> *at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)*
>
> *at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)*
>
> *at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)*
>
> *at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)*
>
> *at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)*
>
>
>
>
>
> Could you please let me know the reason and fix for this? Spark version is
> 1.1.1
>
>
>
> --
>
> *Sourav Chandra*
>
> Senior Software Engineer
>
> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
>
> sourav.chan...@livestream.com
>
> o: +91 80 4121 8723
>
> m: +91 988 699 3746
>
> skype: sourav.chandra
>
> *Livestream*
>
> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
> Block, Koramangala Industrial Area,
>
> Bangalore 560034
>
> www.livestream.com
>
>
>


Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
This should be fixed now. Thanks for bringing this to our attention.

2014-12-03 13:31 GMT-08:00 Andrew Or :

> Yeah this is currently broken for 1.1.1. I will submit a fix later today.
>
> 2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu>:
>
> +Andrew
>>
>> Actually I think this is because we haven't uploaded the Spark binaries
>> to cloudfront / pushed the change to mesos/spark-ec2.
>>
>> Andrew, can you take care of this ?
>>
>>
>>
>> On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Interesting. Do you have any problems when launching in us-east-1? What
>>> is the full output of spark-ec2 when launching a cluster? (Post it to a
>>> gist if it’s too big for email.)
>>> ​
>>>
>>> On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis  wrote:
>>>
>>>> I've been trying to create a Spark cluster on EC2 using the
>>>> documentation at https://spark.apache.org/docs/latest/ec2-scripts.html
>>>> (with Spark 1.1.1).
>>>>
>>>> Running the script successfully creates some EC2 instances, HDFS etc.,
>>>> but appears to fail to copy the actual files needed to run Spark
>>>> across.
>>>>
>>>> I ran the following commands:
>>>>
>>>> $ cd ~/src/spark-1.1.1/ec2
>>>> $ ./spark-ec2 --key-pair=* --identity-file=* --slaves=1
>>>> --region=eu-west-1 --zone=eu-west-1a --instance-type=m3.medium
>>>> --no-ganglia launch foocluster
>>>>
>>>> I see the following in the script's output:
>>>>
>>>> (instance and HDFS set up happens here)
>>>> ...
>>>> Persistent HDFS installed, won't start by default...
>>>> ~/spark-ec2 ~/spark-ec2
>>>> Setting up spark-standalone
>>>> RSYNC'ing /root/spark/conf to slaves...
>>>> *.eu-west-1.compute.amazonaws.com
>>>> RSYNC'ing /root/spark-ec2 to slaves...
>>>> *.eu-west-1.compute.amazonaws.com
>>>> ./spark-standalone/setup.sh: line 22: /root/spark/sbin/stop-all.sh: No
>>>> such file or directory
>>>> ./spark-standalone/setup.sh: line 27:
>>>> /root/spark/sbin/start-master.sh: No such file or directory
>>>> ./spark-standalone/setup.sh: line 33:
>>>> /root/spark/sbin/start-slaves.sh: No such file or directory
>>>> Setting up tachyon
>>>> RSYNC'ing /root/tachyon to slaves...
>>>> ...
>>>> (Tachyon setup happens here without any problem)
>>>>
>>>> I can ssh to the master (using the ./spark-ec2 login), and looking in
>>>> /root/, it contains:
>>>>
>>>> $ ls /root
>>>> ephemeral-hdfs  hadoop-native  mapreduce  persistent-hdfs  scala
>>>> shark  spark  spark-ec2  tachyon
>>>>
>>>> If I look in /root/spark (where the sbin directory should be found),
>>>> it only contains a single 'conf' directory:
>>>>
>>>> $ ls /root/spark
>>>> conf
>>>>
>>>> Any idea why spark-ec2 might have failed to copy these files across?
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: Problem creating EC2 cluster using spark-ec2

2014-12-03 Thread Andrew Or
Yeah this is currently broken for 1.1.1. I will submit a fix later today.

2014-12-02 17:17 GMT-08:00 Shivaram Venkataraman :

> +Andrew
>
> Actually I think this is because we haven't uploaded the Spark binaries to
> cloudfront / pushed the change to mesos/spark-ec2.
>
> Andrew, can you take care of this ?
>
>
>
> On Tue, Dec 2, 2014 at 5:11 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Interesting. Do you have any problems when launching in us-east-1? What
>> is the full output of spark-ec2 when launching a cluster? (Post it to a
>> gist if it’s too big for email.)
>> ​
>>
>> On Mon, Dec 1, 2014 at 10:34 AM, Dave Challis 
>> wrote:
>>
>>> I've been trying to create a Spark cluster on EC2 using the
>>> documentation at https://spark.apache.org/docs/latest/ec2-scripts.html
>>> (with Spark 1.1.1).
>>>
>>> Running the script successfully creates some EC2 instances, HDFS etc.,
>>> but appears to fail to copy the actual files needed to run Spark
>>> across.
>>>
>>> I ran the following commands:
>>>
>>> $ cd ~/src/spark-1.1.1/ec2
>>> $ ./spark-ec2 --key-pair=* --identity-file=* --slaves=1
>>> --region=eu-west-1 --zone=eu-west-1a --instance-type=m3.medium
>>> --no-ganglia launch foocluster
>>>
>>> I see the following in the script's output:
>>>
>>> (instance and HDFS set up happens here)
>>> ...
>>> Persistent HDFS installed, won't start by default...
>>> ~/spark-ec2 ~/spark-ec2
>>> Setting up spark-standalone
>>> RSYNC'ing /root/spark/conf to slaves...
>>> *.eu-west-1.compute.amazonaws.com
>>> RSYNC'ing /root/spark-ec2 to slaves...
>>> *.eu-west-1.compute.amazonaws.com
>>> ./spark-standalone/setup.sh: line 22: /root/spark/sbin/stop-all.sh: No
>>> such file or directory
>>> ./spark-standalone/setup.sh: line 27:
>>> /root/spark/sbin/start-master.sh: No such file or directory
>>> ./spark-standalone/setup.sh: line 33:
>>> /root/spark/sbin/start-slaves.sh: No such file or directory
>>> Setting up tachyon
>>> RSYNC'ing /root/tachyon to slaves...
>>> ...
>>> (Tachyon setup happens here without any problem)
>>>
>>> I can ssh to the master (using the ./spark-ec2 login), and looking in
>>> /root/, it contains:
>>>
>>> $ ls /root
>>> ephemeral-hdfs  hadoop-native  mapreduce  persistent-hdfs  scala
>>> shark  spark  spark-ec2  tachyon
>>>
>>> If I look in /root/spark (where the sbin directory should be found),
>>> it only contains a single 'conf' directory:
>>>
>>> $ ls /root/spark
>>> conf
>>>
>>> Any idea why spark-ec2 might have failed to copy these files across?
>>>
>>> Thanks,
>>> Dave
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Announcing Spark 1.1.1!

2014-12-03 Thread Andrew Or
By the Spark server do you mean the standalone Master? It is best if they
are upgraded together because there have been changes to the Master in
1.1.1. Although it might "just work", it's highly recommended to restart
your cluster manager too.

2014-12-03 13:19 GMT-08:00 Romi Kuntsman :

> About version compatibility and upgrade path -  can the Java application
> dependencies and the Spark server be upgraded separately (i.e. will 1.1.0
> library work with 1.1.1 server, and vice versa), or do they need to be
> upgraded together?
>
> Thanks!
>
> *Romi Kuntsman*, *Big Data Engineer*
>  http://www.totango.com
>
> On Tue, Dec 2, 2014 at 11:36 PM, Andrew Or  wrote:
>
>> I am happy to announce the availability of Spark 1.1.1! This is a
>> maintenance release with many bug fixes, most of which are concentrated in
>> the core. This list includes various fixes to sort-based shuffle, memory
>> leak, and spilling issues. Contributions from this release came from 55
>> developers.
>>
>> Visit the release notes [1] to read about the new features, or
>> download [2] the release today.
>>
>> [1] http://spark.apache.org/releases/spark-release-1-1-1.html
>> [2] http://spark.apache.org/downloads.html
>>
Please e-mail me directly for any typos in the release notes or name
>> listing.
>>
>> Thanks for everyone who contributed, and congratulations!
>> -Andrew
>>
>
>


Announcing Spark 1.1.1!

2014-12-02 Thread Andrew Or
I am happy to announce the availability of Spark 1.1.1! This is a
maintenance release with many bug fixes, most of which are concentrated in
the core. This list includes various fixes to sort-based shuffle, memory
leak, and spilling issues. Contributions from this release came from 55
developers.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

[1] http://spark.apache.org/releases/spark-release-1-1-1.html
[2] http://spark.apache.org/downloads.html

Please e-mail me directly for any typos in the release notes or name
listing.

Thanks for everyone who contributed, and congratulations!
-Andrew


Re: Spark 1.1.1 released but not available on maven repositories

2014-11-28 Thread Andrew Or
Hi Luis,

There seems to be a delay in the 1.1.1 artifacts being pushed to our Apache
mirrors. We are working with the infra people to get them up as soon as
possible. Unfortunately, due to the national holiday weekend in the US, this
may take a little longer than expected. For now you may use
https://repository.apache.org/content/repositories/orgapachespark-1043/ as
your Spark 1.1.1 repository instead. This is the one that hosted the final
RC that was later made into the official release.
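
For sbt users, one way to wire in the staging repository (a sketch for a
standard build.sbt) is:

  // Temporary resolver until the 1.1.1 artifacts reach Maven Central.
  resolvers += "Apache Spark 1.1.1 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1043/"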

-Andrew



2014-11-28 2:16 GMT-08:00 Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com>:

> Are there any news about this issue? I have checked again maven central
> and the artefacts are still not there.
>
> Regards,
>
> Luis
>
> 2014-11-27 10:42 GMT+00:00 Luis Ángel Vicente Sánchez <
> langel.gro...@gmail.com>:
>
>> I have just read on the website that spark 1.1.1 has been released but
>> when I upgraded my project to use 1.1.1 I discovered that the artefacts are
>> not on maven yet.
>>
>> [info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
>>>
>>> [warn] module not found: org.apache.spark#spark-streaming-kafka_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]  /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]  https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]  https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
>>>
>>> [info] Resolving org.apache.spark#spark-core_2.10;1.1.1 ...
>>>
>>> [warn] module not found: org.apache.spark#spark-core_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]  /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-core_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]  https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]  https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
>>>
>>> [info] Resolving org.apache.spark#spark-streaming_2.10;1.1.1 ...
>>>
>>> [warn] module not found: org.apache.spark#spark-streaming_2.10;1.1.1
>>>
>>> [warn]  local: tried
>>>
>>> [warn]  /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming_2.10/1.1.1/ivys/ivy.xml
>>>
>>> [warn]  public: tried
>>>
>>> [warn]  https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
>>>
>>> [warn]  sonatype snapshots: tried
>>>
>>> [warn]  https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom

>>>
>


Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Andrew Or
Hey Egor,

Have you checked the AM logs? My guess is that it threw an exception or
something such that no executors (not even the initial set) have registered
with your driver. You may already know this, but you can go to the
http://<RM address>:8088 page and click into the application to access this.
Alternatively you could run "yarn logs -applicationId <app ID>" after
quitting the application.

-Andrew

2014-11-14 14:23 GMT-08:00 Sandy Ryza :

> That would be helpful as well.  Can you confirm that when you try it with
> dynamic allocation the cluster has free resources?
>
> On Fri, Nov 14, 2014 at 12:17 PM, Egor Pahomov 
> wrote:
>
>> It's successful without dynamic allocation. I can provide spark log for
>> that scenario if it can help.
>>
>> 2014-11-14 21:36 GMT+02:00 Sandy Ryza :
>>
>>> Hi Egor,
>>>
>>> Is it successful without dynamic allocation? From your log, it looks
>>> like the job is unable to acquire resources from YARN, which could be
>>> because other jobs are using up all the resources.
>>>
>>> -Sandy
>>>
>>> On Fri, Nov 14, 2014 at 11:32 AM, Egor Pahomov 
>>> wrote:
>>>
 Hi.
 I execute ipython notebook + pyspark with
 spark.dynamicAllocation.enabled = true. Task never ends.
 Code:

 import sys
 from random import random
 from operator import add
 partitions = 10
 n = 10 * partitions

 def f(_):
 x = random() * 2 - 1
 y = random() * 2 - 1
 return 1 if x ** 2 + y ** 2 < 1 else 0

 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
 print "Pi is roughly %f" % (4.0 * count / n)



 Run notebook:

 IPYTHON_ARGS="notebook --profile=ydf --port $IPYTHON_PORT --port-retries=0 
 --ip='*' --no-browser"
 pyspark \
 --verbose \
 --master yarn-client \
 --conf spark.driver.port=$((RANDOM_PORT + 2)) \
 --conf spark.broadcast.port=$((RANDOM_PORT + 3)) \
 --conf spark.replClassServer.port=$((RANDOM_PORT + 4)) \
 --conf spark.blockManager.port=$((RANDOM_PORT + 5)) \
 --conf spark.executor.port=$((RANDOM_PORT + 6)) \
 --conf spark.fileserver.port=$((RANDOM_PORT + 7)) \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.minExecutors=1 \
 --conf spark.dynamicAllocation.maxExecutors=10 \
 --conf spark.ui.port=$SPARK_UI_PORT


 Spark/Ipython log is in attachment.

 --



 *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

>>>
>>>
>>
>>
>> --
>>
>>
>>
>> *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*
>>
>
>


Re: No module named pyspark - latest built

2014-11-14 Thread Andrew Or
I see. The general known constraints on building your assembly jar for
pyspark on Yarn are:

Java 6
NOT RedHat
Maven

Some of these are documented here
<http://spark.apache.org/docs/latest/building-with-maven.html> (bottom).
Maybe we should make it more explicit.

2014-11-13 2:31 GMT-08:00 jamborta :

> it was built with 1.6 (tried 1.7, too)
>
> On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User
> List] <[hidden email]>
> wrote:
>
> > Hey Jamborta,
> >
> > What java version did you build the jar with?
> >
> > 2014-11-12 16:48 GMT-08:00 jamborta <[hidden email]>:
> >>
> >> I have figured out that building the fat jar with sbt does not seem to
> >> include the pyspark scripts using the following command:
> >>
> >> sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
> >> publish-local assembly
> >>
> >> however the maven command works OK:
> >>
> >> mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
> >> clean package
> >>
> >> am I running the correct sbt command?
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html
> >> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >
> >
> >
>
> --
> View this message in context: Re: No module named pyspark - latest built
> <http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18833.html>
>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


Re: No module named pyspark - latest built

2014-11-12 Thread Andrew Or
Hey Jamborta,

What java version did you build the jar with?

2014-11-12 16:48 GMT-08:00 jamborta :

> I have figured out that building the fat jar with sbt does not seem to
> include the pyspark scripts using the following command:
>
> sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
> publish-local assembly
>
> however the maven command works OK:
>
> mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
> clean package
>
> am I running the correct sbt command?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Yarn-Client Python

2014-10-28 Thread Andrew Or
Hey TJ,

It appears that your ApplicationMaster thinks it's on the same node as your
driver. Are you setting "spark.driver.host" by any chance? Can you post the
value of this config here? (You can access it through the SparkUI)
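
If it is easier than digging through the UI, a quick sketch for reading the
resolved value back from a Scala shell or driver:

  // Print the driver host that executors will try to connect back to.
  // If this says "localhost", an AM on another node cannot reach it.
  val driverHost = sc.getConf.getOption("spark.driver.host")
  println("spark.driver.host = " + driverHost.getOrElse("<not set>"))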

2014-10-28 12:50 GMT-07:00 TJ Klein :

> Hi there,
>
> I am trying to run Spark on a YARN-managed cluster using Python (which
> requires yarn-client mode). However, I cannot get it running (same with the
> example apps).
>
> Using spark-submit to launch the script I get the following warning:
> WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory
>
> The log file states:
> ERROR yarn.ExecutorLauncher: Failed to connect to driver at
> localhost:52006,
> retrying ...
>
> Any idea? Would greatly appreciate that.
> Thanks,
>  Tassilo
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Client-Python-tp17551.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread Andrew Or
Did you `export` the environment variables? Also, are you running in client
mode or cluster mode? If it still doesn't work, you can try to set these
through the spark-submit command-line options --num-executors,
--executor-cores, and --executor-memory.
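
For example, a submission matching the spark-env.sh values quoted below
might look like this (the application class and jar are placeholders):

  spark-submit --master yarn-cluster \
    --num-executors 5 \
    --executor-cores 1 \
    --executor-memory 3G \
    --class com.example.MyApp myapp.jar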

2014-10-23 19:25 GMT-07:00 firemonk9 :

> Hi,
>
>    I am facing the same problem. My spark-env.sh has the below entries, yet
> I see the YARN container with only 1G, and YARN only spawns two workers.
>
> SPARK_EXECUTOR_CORES=1
> SPARK_EXECUTOR_MEMORY=3G
> SPARK_EXECUTOR_INSTANCES=5
>
> Please let me know if you are able to resolve this issue.
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-on-yarn-cluster-problem-tp7560p17175.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Shuffle issues in the current master

2014-10-23 Thread Andrew Or
To add to Aaron's response, `spark.shuffle.consolidateFiles` only applies
to hash-based shuffle, so you shouldn't have to set it for sort-based
shuffle. And yes, since you changed neither `spark.shuffle.compress` nor
`spark.shuffle.spill.compress` you can't possibly have run into what #2890
fixes.

I'm assuming you're running master? Was it before or after this commit:
https://github.com/apache/spark/commit/6b79bfb42580b6bd4c4cd99fb521534a94150693
?

-Andrew

2014-10-22 16:37 GMT-07:00 Aaron Davidson :

> You may be running into this issue:
> https://issues.apache.org/jira/browse/SPARK-4019
>
> You could check by having 2000 or fewer reduce partitions.
>
> On Wed, Oct 22, 2014 at 1:48 PM, DB Tsai  wrote:
>
>> PS, sorry for spamming the mailing list. Based on my knowledge, both
>> spark.shuffle.spill.compress and spark.shuffle.compress default to
>> true, so in theory, we should not run into this issue if we don't
>> change any setting. Is there any other bug we could be running into?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Oct 22, 2014 at 1:37 PM, DB Tsai  wrote:
>> > Or can it be solved by setting both of the following setting into true
>> for now?
>> >
>> > spark.shuffle.spill.compress true
>> > spark.shuffle.compress true
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > ---
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Wed, Oct 22, 2014 at 1:34 PM, DB Tsai  wrote:
>> >> It seems that this issue should be addressed by
>> >> https://github.com/apache/spark/pull/2890 ? Am I right?
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> ---
>> >> My Blog: https://www.dbtsai.com
>> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >>
>> >>
>> >> On Wed, Oct 22, 2014 at 11:54 AM, DB Tsai  wrote:
>> >>> Hi all,
>> >>>
>> >>> With SPARK-3948, the exception in Snappy PARSING_ERROR is gone, but
>> >>> I've another exception now. I've no clue about what's going on; has
>> >>> anyone run into a similar issue? Thanks.
>> >>>
>> >>> This is the configuration I use.
>> >>> spark.rdd.compress true
>> >>> spark.shuffle.consolidateFiles true
>> >>> spark.shuffle.manager SORT
>> >>> spark.akka.frameSize 128
>> >>> spark.akka.timeout  600
>> >>> spark.core.connection.ack.wait.timeout  600
>> >>> spark.core.connection.auth.wait.timeout 300
>> >>>
>> >>>
>> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
>> >>>
>>  
>> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>> >>>
>>  java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>> >>> java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
>> >>>
>>  
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:57)
>> >>>
>>  
>> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:57)
>> >>>
>>  
>> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:95)
>> >>>
>>  
>> org.apache.spark.storage.BlockManager.getLocalShuffleFromDisk(BlockManager.scala:351)
>> >>>
>>  
>> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$fetchLocalBlocks$1$$anonfun$apply$4.apply(ShuffleBlockFetcherIterator.scala:196)
>> >>>
>>  
>> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$fetchLocalBlocks$1$$anonfun$apply$4.apply(ShuffleBlockFetcherIterator.scala:196)
>> >>>
>>  
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>> >>>
>>  
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>> >>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> >>>
>>  
>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>> >>>
>>  
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> >>>
>>  org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
>> >>>
>>  
>> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
>> >>> org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>> >>>
>>  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >>> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> >>>
>>  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >>>
>>  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >>>
>>  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >>>
>>  org.apache.spark.scheduler.Sh

Re: Setting only master heap

2014-10-23 Thread Andrew Or
Yeah, as Sameer commented, there is unfortunately not an equivalent
`SPARK_MASTER_MEMORY` that you can set. You can work around this by
starting the master and the slaves separately with different settings of
SPARK_DAEMON_MEMORY each time.
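
Concretely, that workaround might look like the sketch below. Note that
SPARK_DAEMON_MEMORY is usually exported from conf/spark-env.sh, and the
exact start-slave.sh arguments vary across Spark versions, so treat this as
an outline only:

  # On the master node: give only the Master a bigger heap.
  export SPARK_DAEMON_MEMORY=2g
  sbin/start-master.sh

  # On each worker node: keep the default-sized daemon heap.
  export SPARK_DAEMON_MEMORY=512m
  sbin/start-slave.sh 1 spark://<master-host>:7077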

AFAIK there haven't been any major changes in the standalone master in
1.1.0, so I don't see an immediate explanation for what you're observing.
In general the Spark master doesn't use that much memory, and even if there
are many applications it will discard the old ones appropriately, so unless
you have a ton (like thousands) of concurrently running applications
connecting to it there's little likelihood for it to OOM. At least that's
my understanding.

-Andrew

2014-10-22 15:51 GMT-07:00 Sameer Farooqui :

> Hi Keith,
>
> Would be helpful if you could post the error message.
>
> Are you running Spark in Standalone mode or with YARN?
>
> In general, the Spark Master is only used for scheduling and it should be
> fine with the default setting of 512 MB RAM.
>
> Is it actually the Spark Driver's memory that you intended to change?
>
>
>
> *++ If in Standalone mode ++*
> You're right that SPARK_DAEMON_MEMORY set the memory to allocate to the
> Spark Master, Worker and even HistoryServer daemons together.
>
> SPARK_WORKER_MEMORY is slightly confusing. In Standalone mode, it is the
> amount of memory that a worker advertises as available for drivers to
> launch executors. The sum of the memory used by executors spawned from a
> worker cannot exceed SPARK_WORKER_MEMORY.
>
> Unfortunately, I'm not aware of a way to set the memory for Master and
> Worker individually, other than launching them manually. You can also try
> setting the config differently on each machine's spark-env.sh file.
>
>
> *++ If in YARN mode ++*
> In YARN, there is no setting for SPARK_DAEMON_MEMORY. Therefore this is
> only in the Standalone documentation.
>
> Remember that in YARN mode there is no Spark Worker, instead the YARN
> NodeManagers launches the Executors. And in YARN, there is no need to run a
> Spark Master JVM (since the YARN ResourceManager takes care of the
> scheduling).
>
> So, with YARN use SPARK_EXECUTOR_MEMORY to set the Executor's memory. And
> use SPARK_DRIVER_MEMORY to set the Driver's memory.
>
> Just an FYI - for compatibility's sake, even in YARN mode there is a
> setting for SPARK_WORKER_MEMORY, but this has been deprecated. If you do
> set it, it just does the same thing as setting SPARK_EXECUTOR_MEMORY would
> have done.
>
>
> - Sameer
>
>
> On Wed, Oct 22, 2014 at 1:46 PM, Keith Simmons  wrote:
>
>> We've been getting some OOMs from the spark master since upgrading to
>> Spark 1.1.0.  I've found SPARK_DAEMON_MEMORY, but that also seems to
>> increase the worker heap, which as far as I know is fine.  Is there any
>> setting which *only* increases the master heap size?
>>
>> Keith
>>
>
>


Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Andrew Or
Hm, it works for me. Are you sure you have provided the right jars? What
happens if you pass in the `--verbose` flag?
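
For reference, note that the quoted command below misspells
--executor-memory as --execur-memory, and that spark-submit expects the
primary application jar before the application arguments. A corrected
sketch (paths taken from the quoted command) would be roughly:

  bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans \
    --master yarn-cluster \
    --executor-memory 2g \
    --jars lib/spark-mllib_2.10-1.0.0.jar \
    lib/spark-examples-1.0.2-hadoop2.2.0.jar \
    hdfs://master:8000/srcdata/kmeans 8 4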

2014-10-16 23:51 GMT-07:00 eric wong :

> Hi,
>
> I'm using the comma-separated style to submit multiple jar files in the
> following shell command, but it does not work:
>
> bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
> --master yarn-cluster --execur-memory 2g *--jars
> lib/spark-examples-1.0.2-hadoop2.2.0.jar,lib/spark-mllib_2.10-1.0.0.jar 
> *hdfs://master:8000/srcdata/kmeans
> 8 4
>
>
> Thanks!
>
>
> --
> WangHaihua
>


Re: executors not created yarn-cluster mode

2014-10-08 Thread Andrew Or
Hi Jamborta,

It could be that your executors are requesting too much memory. I'm not
sure why it works in client mode but not in cluster mode, however. Have you
checked the RM logs for messages that complain about container memory
requested being too high? How much memory is each of your containers asking
for (check AM logs for this)? As a sanity check you could try lowering the
executor memory to see if that's in fact the issue. Another factor at play
here is "spark.yarn.executor.memoryOverhead", which adds to the amount of
memory requested and is 384MB by default.

Let me know if you find anything,
-Andrew

2014-10-08 12:00 GMT-07:00 jamborta :

> Hi all,
>
> I have a setup that works fine in yarn-client mode, but when I change that
> to yarn-cluster, the executors don't get created, apart from the driver (it
> seems that it does not even appear in yarn's resource manager). There is
> nothing in the spark logs, either (even when debug is enabled). When I try
> to submit something to the sparkcontext I get the following error:
>
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient memory
>
> Just wondering where can I find the logs that show the executor creation
> process? I looked at yarn logs, also the spark event logs, couldn't find
> anything.
>
> many thanks,
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/executors-not-created-yarn-cluster-mode-tp15957.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Running Spark cluster on local machine, cannot connect to master error

2014-10-08 Thread Andrew Or
Hi Russell and Theodore,

This usually means your Master / Workers / client machine are running
different versions of Spark. On a local machine, you may want to restart
your master and workers (sbin/stop-all.sh, then sbin/start-all.sh). On a
real cluster, you want to make sure that every node (including the submit
client) has the same Spark assembly jar before restarting the master and
the workers.

Let me know if that fixes it.
-Andrew

2014-10-08 13:29 GMT-07:00 rrussell25 :

> Theodore, did you ever get this resolved?  I just ran into the same thing.
> Before digging, I figured I'd ask.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-cluster-on-local-machine-cannot-connect-to-master-error-tp12743p15972.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark on YARN driver memory allocation bug?

2014-10-08 Thread Andrew Or
Hi Greg,

It does seem like a bug. What is the particular exception message that you
see?

Andrew

2014-10-08 12:12 GMT-07:00 Greg Hill :

>  So, I think this is a bug, but I wanted to get some feedback before I
> reported it as such.  On Spark on YARN, 1.1.0, if you specify the
> --driver-memory value to be higher than the memory available on the client
> machine, Spark errors out due to failing to allocate enough memory.  This
> happens even in yarn-cluster mode.  Shouldn't it only allocate that memory
> on the YARN node that is going to run the driver process, not the local
> client machine?
>
>  Greg
>
>


Re: anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Andrew Or
Hi Steve, what Spark version are you running?

2014-10-07 14:45 GMT-07:00 Steve Lewis :

> java.lang.NullPointerException
> at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
>
> My Spark application running on Windows 8 keeps crashing with this error,
> and I can find no workaround.
>


Re: Is there a way to provide individual property to each Spark executor?

2014-10-02 Thread Andrew Or
Hi Vladimir,

This is not currently supported, but users have asked for it in the past. I
have filed an issue for it here:
https://issues.apache.org/jira/browse/SPARK-3767 so we can track its
progress.

Andrew

2014-10-02 5:25 GMT-07:00 Vladimir Tretyakov <
vladimir.tretya...@sematext.com>:

> Hi, here at Sematext we are almost done with Spark monitoring:
> http://www.sematext.com/spm/index.html
>
> But we need 1 thing from Spark, something like
> https://groups.google.com/forum/#!topic/storm-user/2fNCF341yqU in Storm.
>
> Something like a 'placeholder' in the java opts which Spark will fill in
> for each executor, with the executorId (0, 1, 2, 3, ...).
>
> For example I will write in spark-defaults.conf:
>
> spark.executor.extraJavaOptions -Dcom.sun.management.jmxremote
> -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-spark.jar=myValue-
> *%executorId*:spark-executor:default
>
> and will get in executor processes:
> -Dcom.sun.management.jmxremote
> -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-spark.jar=myValue-*0*
> :spark-executor:default
> -Dcom.sun.management.jmxremote
> -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-spark.jar=myValue-*1*
> :spark-executor:default
> -Dcom.sun.management.jmxremote
> -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-spark.jar=myValue-*2*
> :spark-executor:default
> ...
> ...
> ...
>
>
>
> Can I do something like that in Spark for executors? If not, maybe it can
> be done in the future? It would be useful.
>
> Thx, best redgards, Vladimir Tretyakov.
>
>


Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
Ah I see, you were running in yarn cluster mode so the logs are the same.
Glad you figured it out.

2014-10-02 10:31 GMT-07:00 Greg Hill :

>  So, I actually figured it out, and it's all my fault.  I had an older
> version of spark on the datanodes and was passing
> in spark.executor.extraClassPath to pick it up.  It was a holdover from
> some initial work before I got everything working right.  Once I removed
> that, it picked up the spark JAR from hdfs instead and ran without issue.
>
>  Sorry for the false alarm.
>
>  The AM container logs were what I had pasted in the original email, btw.
>
>  Greg
>
>   From: Andrew Or 
> Date: Thursday, October 2, 2014 12:24 PM
> To: Greg 
> Cc: "user@spark.apache.org" 
> Subject: Re: weird YARN errors on new Spark on Yarn cluster
>
>   Hi Greg,
>
>  Have you looked at the AM container logs? (You may already know this,
> but) you can get these through the RM web UI or through:
>
>  yarn logs -applicationId <app ID>
>
>  If an AM throws an exception then the executors may not be started
> properly.
>
>  -Andrew
>
>
>
> 2014-10-02 9:47 GMT-07:00 Greg Hill :
>
>>  I haven't run into this until today.  I spun up a fresh cluster to do
>> some more testing, and it seems that every single executor fails because it
>> can't connect to the driver.  This is in the YARN logs:
>>
>>  14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend:
>> Connecting to driver: akka.tcp://sparkDriver@GATEWAY-1
>> :60855/user/CoarseGrainedScheduler
>> 14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver
>> Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] ->
>> [akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.
>>
>>  And this is what shows up from the driver:
>>
>>  14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered
>> executor: 
>> Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113]
>> with ID 2
>> 14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to
>> /rack/node8da83a04def73517bf437e95aeefa2469b1daf14
>> 14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2
>> disconnected, so removing it
>>
>> It doesn't appear to be a networking issue.  Networking works both
>> directions and there's no firewall blocking ports.  Googling the issue, it
>> sounds like the most common problem is overallocation of memory, but I'm
>> not doing that.  I've got these settings for a 3 * 128GB node cluster:
>>
>>  spark.executor.instances17
>>  spark.executor.memory   12424m
>> spark.yarn.executor.memoryOverhead  3549
>>
>>  That makes it 6 * 15973 = 95838 MB per node, which is well beneath the
>> 128GB limit.
>>
>>  Frankly I'm stumped.  It worked fine when I spun up a cluster last
>> week, but now it doesn't.  The logs give me no indication as to what the
>> problem actually is.  Any pointers to where else I might look?
>>
>>  Thanks in advance.
>>
>>  Greg
>>
>
>


Re: weird YARN errors on new Spark on Yarn cluster

2014-10-02 Thread Andrew Or
Hi Greg,

Have you looked at the AM container logs? (You may already know this, but)
you can get these through the RM web UI or through:

yarn logs -applicationId <app ID>

If an AM throws an exception then the executors may not be started properly.

-Andrew



2014-10-02 9:47 GMT-07:00 Greg Hill :

>  I haven't run into this until today.  I spun up a fresh cluster to do
> some more testing, and it seems that every single executor fails because it
> can't connect to the driver.  This is in the YARN logs:
>
>  14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting
> to driver: akka.tcp://sparkDriver@GATEWAY-1
> :60855/user/CoarseGrainedScheduler
> 14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver
> Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] ->
> [akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.
>
>  And this is what shows up from the driver:
>
>  14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered
> executor: 
> Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113]
> with ID 2
> 14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to
> /rack/node8da83a04def73517bf437e95aeefa2469b1daf14
> 14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2
> disconnected, so removing it
>
> It doesn't appear to be a networking issue.  Networking works both
> directions and there's no firewall blocking ports.  Googling the issue, it
> sounds like the most common problem is overallocation of memory, but I'm
> not doing that.  I've got these settings for a 3 * 128GB node cluster:
>
>  spark.executor.instances17
>  spark.executor.memory   12424m
> spark.yarn.executor.memoryOverhead  3549
>
>  That makes it 6 * 15973 = 95838 MB per node, which is well beneath the
> 128GB limit.
>
>  Frankly I'm stumped.  It worked fine when I spun up a cluster last week,
> but now it doesn't.  The logs give me no indication as to what the problem
> actually is.  Any pointers to where else I might look?
>
>  Thanks in advance.
>
>  Greg
>


Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Andrew Or
Hi Tamas,

Yes, Marcelo is right. The reason why it doesn't make sense to set
"spark.driver.memory" in your SparkConf is because your application code,
by definition, *is* the driver. This means by the time you get to the code
that initializes your SparkConf, your driver JVM has already started with
some heap size, and you can't easily change the size of the JVM once it has
started. Note that this is true regardless of the deploy mode (client or
cluster).

Alternatives to set this include the following: (1) You can set
"spark.driver.memory" in your `spark-defaults.conf` on the node that
submits the application, (2) You can use the --driver-memory command line
option if you are using Spark submit (bin/pyspark goes through this path,
as you have discovered on your own).
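
Concretely, the two alternatives look like this:

  # (1) spark-defaults.conf on the node that submits the application
  spark.driver.memory  1g

  # (2) the command line, e.g. for PySpark
  bin/pyspark --driver-memory 1g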

Does that make sense?


2014-10-01 10:17 GMT-07:00 Tamas Jambor :

> when you say "respective backend code to launch it", I thought this was
> the way to do that.
>
> thanks,
> Tamas
>
> On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin 
> wrote:
> > Because that's not how you launch apps in cluster mode; you have to do
> > it through the command line, or by calling directly the respective
> > backend code to launch it.
> >
> > (That being said, it would be nice to have a programmatic way of
> > launching apps that handled all this - this has been brought up in a
> > few different contexts, but I don't think there's an "official"
> > solution yet.)
> >
> > On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor  wrote:
> >> thanks Marcelo.
> >>
> >> What's the reason it is not possible in cluster mode, either?
> >>
> >> On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin 
> wrote:
> >>> You can't set up the driver memory programatically in client mode. In
> >>> that mode, the same JVM is running the driver, so you can't modify
> >>> command line options anymore when initializing the SparkContext.
> >>>
> >>> (And you can't really start cluster mode apps that way, so the only
> >>> way to set this is through the command line / config files.)
> >>>
> >>> On Wed, Oct 1, 2014 at 9:26 AM, jamborta  wrote:
>  Hi all,
> 
>  I cannot figure out why this command is not setting the driver memory
> (it is
>  setting the executor memory):
> 
>  conf = (SparkConf()
>  .setMaster("yarn-client")
>  .setAppName("test")
>  .set("spark.driver.memory", "1G")
>  .set("spark.executor.memory", "1G")
>  .set("spark.executor.instances", 2)
>  .set("spark.executor.cores", 4))
>  sc = SparkContext(conf=conf)
> 
>  whereas if I run the spark console:
>  ./bin/pyspark --driver-memory 1G
> 
>  it sets it correctly. Seemingly they both generate the same commands
> in the
>  logs.
> 
>  thanks a lot,
> 
> 
> 
> 
> 
>  --
>  View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
>  Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> 
>  -
>  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>  For additional commands, e-mail: user-h...@spark.apache.org
> 
> >>>
> >>>
> >>>
> >>> --
> >>> Marcelo
> >
> >
> >
> > --
> > Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SPARK UI - Details post job processiong

2014-09-25 Thread Andrew Or
Hi Harsha,

You can turn on `spark.eventLog.enabled` as documented here:
http://spark.apache.org/docs/latest/monitoring.html. Then, if you are
running standalone mode, you can access the finished SparkUI through the
Master UI. Otherwise, you can start a HistoryServer to display finished UIs.
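
A sketch of the relevant settings (the event log directory here is an
assumption; pick any location that all nodes can read, e.g. on HDFS):

  # spark-defaults.conf
  spark.eventLog.enabled  true
  spark.eventLog.dir      hdfs:///spark-events

  # Non-standalone deployments: serve finished UIs with the history server.
  # Depending on the Spark version, the log directory is passed as an
  # argument or set via spark.history.fs.logDirectory.
  sbin/start-history-server.sh hdfs:///spark-events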

-Andrew

2014-09-25 12:55 GMT-07:00 Harsha HN <99harsha.h@gmail.com>:

> Hi,
>
> The details laid out in the Spark UI for a job in progress are really
> interesting and very useful.
> But these vanish once the job is done.
> Is there a way to get the job details after processing?
>
> Looking for Spark UI data, not standard input,output and error info.
>
> Thanks,
> Harsha
>


Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Andrew Or
Yes... good find. I have filed a JIRA here:
https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it
shortly. Both of these fixes will be available in 1.1.1. Until both of
these are merged in, it appears that the only way you can do it now is
through --driver-memory.

-Andrew

2014-09-23 7:23 GMT-07:00 Greg Hill :

>  Thanks for looking into it.  I'm trying to avoid making the user pass in
> any parameters by configuring it to use the right values for the cluster
> size by default, hence my reliance on the configuration.  I'd rather just
> use spark-defaults.conf than the environment variables, and looking at the
> code you modified, I don't see any place it's picking up
> spark.driver.memory either.  Is that a separate bug?
>
>  Greg
>
>
>   From: Andrew Or 
> Date: Monday, September 22, 2014 8:11 PM
> To: Nishkam Ravi 
> Cc: Greg , "user@spark.apache.org" <
> user@spark.apache.org>
>
> Subject: Re: clarification for some spark on yarn configuration options
>
>   Hi Greg,
>
>  From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not
> actually picked up in cluster mode. This is a bug and I have opened a PR to
> fix it: https://github.com/apache/spark/pull/2500.
> For now, please use --driver-memory instead, which should work for both
> client and cluster mode.
>
>  Thanks for pointing this out,
> -Andrew
>
> 2014-09-22 14:04 GMT-07:00 Nishkam Ravi :
>
>> Maybe try --driver-memory if you are using spark-submit?
>>
>>  Thanks,
>> Nishkam
>>
>> On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill 
>> wrote:
>>
>>>  Ah, I see.  It turns out that my problem is that that comparison is
>>> ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m.  Is that
>>> a bug that's since fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the
>>> master.  'yarn-client' seems to pick up the values and works fine.
>>>
>>>  Greg
>>>
>>>   From: Nishkam Ravi 
>>> Date: Monday, September 22, 2014 3:30 PM
>>> To: Greg 
>>> Cc: Andrew Or , "user@spark.apache.org" <
>>> user@spark.apache.org>
>>>
>>> Subject: Re: clarification for some spark on yarn configuration options
>>>
>>>   Greg, if you look carefully, the code is enforcing that the
>>> memoryOverhead be lower (and not higher) than spark.driver.memory.
>>>
>>>  Thanks,
>>> Nishkam
>>>
>>> On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill 
>>> wrote:
>>>
>>>>  I thought I had this all figured out, but I'm getting some weird
>>>> errors now that I'm attempting to deploy this on production-size servers.
>>>> It's complaining that I'm not allocating enough memory to the
>>>> memoryOverhead values.  I tracked it down to this code:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70
>>>>
>>>>  Unless I'm reading it wrong, those checks are enforcing that you set
>>>> spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but
>>>> that makes no sense to me since that memory is just supposed to be what
>>>> YARN needs on top of what you're allocating for Spark.  My understanding
>>>> was that the overhead values should be quite a bit lower (and by default
>>>> they are).
>>>>
>>>>  Also, why must the executor be allocated less memory than the
>>>> driver's memory overhead value?
>>>>
>>>>  What am I misunderstanding here?
>>>>
>>>>  Greg
>>>>
>>>>   From: Andrew Or 
>>>> Date: Tuesday, September 9, 2014 5:49 PM
>>>> To: Greg 
>>>> Cc: "user@spark.apache.org" 
>>>> Subject: Re: clarification for some spark on yarn configuration options
>>>>
>>>>   Hi Greg,
>>>>
>>>>  SPARK_EXECUTOR_INSTANCES is the total number of workers in the
>>>> cluster. The equivalent "spark.executor.instances" is just another way to
>>>> set the same thing in your spark-defaults.conf. Maybe this should be
>>>> documented. :)
>>>>
>>>>  "spark.yarn.executor.memoryOverhead" is just an additional margin
>>>> added to "spark.executor.memory" for the container. In addition to the
>>>> executor's memory, the

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Andrew Or
Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not
actually picked up in cluster mode. This is a bug and I have opened a PR to
fix it: https://github.com/apache/spark/pull/2500.
For now, please use --driver-memory instead, which should work for both
client and cluster mode.

Thanks for pointing this out,
-Andrew

2014-09-22 14:04 GMT-07:00 Nishkam Ravi :

> Maybe try --driver-memory if you are using spark-submit?
>
> Thanks,
> Nishkam
>
> On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill 
> wrote:
>
>>  Ah, I see.  It turns out that my problem is that that comparison is
>> ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m.  Is that
>> a bug that's since fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the
>> master.  'yarn-client' seems to pick up the values and works fine.
>>
>>  Greg
>>
>>   From: Nishkam Ravi 
>> Date: Monday, September 22, 2014 3:30 PM
>> To: Greg 
>> Cc: Andrew Or , "user@spark.apache.org" <
>> user@spark.apache.org>
>>
>> Subject: Re: clarification for some spark on yarn configuration options
>>
>>   Greg, if you look carefully, the code is enforcing that the
>> memoryOverhead be lower (and not higher) than spark.driver.memory.
>>
>>  Thanks,
>> Nishkam
>>
>> On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill 
>> wrote:
>>
>>>  I thought I had this all figured out, but I'm getting some weird
>>> errors now that I'm attempting to deploy this on production-size servers.
>>> It's complaining that I'm not allocating enough memory to the
>>> memoryOverhead values.  I tracked it down to this code:
>>>
>>>
>>> https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70
>>>
>>>  Unless I'm reading it wrong, those checks are enforcing that you set
>>> spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but
>>> that makes no sense to me since that memory is just supposed to be what
>>> YARN needs on top of what you're allocating for Spark.  My understanding
>>> was that the overhead values should be quite a bit lower (and by default
>>> they are).
>>>
>>>  Also, why must the executor be allocated less memory than the driver's
>>> memory overhead value?
>>>
>>>  What am I misunderstanding here?
>>>
>>>  Greg
>>>
>>>   From: Andrew Or 
>>> Date: Tuesday, September 9, 2014 5:49 PM
>>> To: Greg 
>>> Cc: "user@spark.apache.org" 
>>> Subject: Re: clarification for some spark on yarn configuration options
>>>
>>>   Hi Greg,
>>>
>>>  SPARK_EXECUTOR_INSTANCES is the total number of workers in the
>>> cluster. The equivalent "spark.executor.instances" is just another way to
>>> set the same thing in your spark-defaults.conf. Maybe this should be
>>> documented. :)
>>>
>>>  "spark.yarn.executor.memoryOverhead" is just an additional margin
>>> added to "spark.executor.memory" for the container. In addition to the
>>> executor's memory, the container in which the executor is launched needs
>>> some extra memory for system processes, and this is what this "overhead"
>>> (somewhat of a misnomer) is for. The same goes for the driver equivalent.
>>>
>>>  "spark.driver.memory" behaves differently depending on which version
>>> of Spark you are using. If you are using Spark 1.1+ (this was released very
>>> recently), you can directly set "spark.driver.memory" and this will take
>>> effect. Otherwise, setting this doesn't actually do anything for client
>>> deploy mode, and you have two alternatives: (1) set the environment
>>> variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are
>>> using Spark submit (or bin/spark-shell, or bin/pyspark, which go through
>>> bin/spark-submit), pass the "--driver-memory" command line argument.
>>>
>>>  If you want your PySpark application (driver) to pick up extra class
>>> path, you can pass the "--driver-class-path" to Spark submit. If you are
>>> using Spark 1.1+, you may set "spark.driver.extraClassPath" in your
>>> spark-defaults.conf. There is also an environment variable you could set
>>> (SPARK_CLASSPATH), though this is now deprecated.
>>>
>>>  
