Re: Using Spark on Data size larger than Memory size

2014-06-01 Thread Aaron Davidson
There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not dataset) to fit in memory, but there are not
many instances of this left (sorting comes to mind, and this is being
worked on).

In general, one problem with Spark today is that you *can* OOM under
certain configurations, and it's possible you'll need to change from the
default configuration if you're doing very memory-intensive jobs. However,
there are very few cases where Spark would simply fail as a matter of
course -- for instance, you can always increase the number of partitions
to decrease the size of any given one, or repartition data to eliminate
skew.
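To illustrate the partition-count knob, a minimal sketch against the 1.0-era Scala API (the master URL, input path, and partition counts are placeholders):

```scala
import org.apache.spark.SparkContext

// Placeholders: master URL, app name, and input path are illustrative.
val sc = new SparkContext("spark://master-host:7077", "partition-demo")

// Asking for more partitions up front keeps each partition small enough
// to fit in memory even when the whole dataset does not.
val lines = sc.textFile("hdfs:///data/big-input", 512)

// An existing RDD can also be split further; repartition incurs a shuffle.
val finer = lines.repartition(1024)
```

The same repartition call can smooth out skewed partition sizes, at the cost of the shuffle.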

Regarding impact on performance, as Mayur said, there may absolutely be an
impact depending on your jobs. If you're doing a join on a very large
amount of data with few partitions, then we'll have to spill to disk. If
you can't cache your working set of data in memory, you will also see a
performance degradation. Spark enables the use of memory to make things
fast, but if you just don't have enough memory, it won't be terribly fast.
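A minimal sketch of controlling that trade-off through the storage level (1.0-era API; the master URL and path are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("spark://master-host:7077", "cache-demo")

// MEMORY_AND_DISK keeps what fits in RAM and spills the remainder to disk,
// instead of dropping and recomputing partitions as MEMORY_ONLY does.
val workingSet = sc.textFile("hdfs:///data/events")
  .persist(StorageLevel.MEMORY_AND_DISK)
```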


On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 Clearly there will be an impact on performance, but frankly it depends on
 what you are trying to achieve with the dataset.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use spark with HBase, where I generate RDD by reading
 data from HBase Table.

 I want to know: in the case when the HBase table grows larger than the
 RAM available in the cluster, will the application fail, or will there
 only be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore





Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Thanks for the fast reply.

I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone
mode.

On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

 First issue was because your cluster was configured incorrectly. You could
 probably read 1 file because that was done on the driver node, but when it
 tried to run a job on the cluster, it failed.

 Second issue, it seems that the jar containing avro is not getting
 propagated to the Executors. What version of Spark are you running on? What
 deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com
 wrote:

 Now I get this:

 scala> rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 console:41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first
 at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
 partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 console:41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 console:41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first
 at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks
 from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
 TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as
 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
 TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
 TID 94 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as
 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
 TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
 TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
 TID 97 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
 TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as
 TID 99 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as
 TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as
 TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as
 TID 102 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as
 TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as
 1294 bytes in 0 ms

 14/05/31 

Re: Spark on EC2

2014-06-01 Thread Jeremy Lee
Hmm.. you've gotten further than me. Which AMIs are you using?


On Sun, Jun 1, 2014 at 2:21 PM, superback andrew.matrix.c...@gmail.com
wrote:

 Hi,
 I am trying to run an example on AMAZON EC2 and have successfully
 set up one cluster with two nodes on EC2. However, when I was testing an
 example using the following command,

  ./run-example org.apache.spark.examples.GroupByTest
  spark://`hostname`:7077

 I got the following warnings and errors. Can anyone help one solve this
 problem? Thanks very much!

 46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial
 job has not accepted any resources; check your cluster UI to ensure that
 workers are registered and have sufficient memory
 61544 [spark-akka.actor.default-dispatcher-3] ERROR
 org.apache.spark.deploy.client.AppClient$ClientActor - All masters are
 unresponsive! Giving up.
 61544 [spark-akka.actor.default-dispatcher-3] ERROR
 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark
 cluster looks dead, giving up.
 61546 [spark-akka.actor.default-dispatcher-3] INFO
 org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool
 61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run
 count at GroupByTest.scala:50
 Exception in thread "main" org.apache.spark.SparkException: Job aborted:
 Spark cluster looks down
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)







 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Could you post how exactly you are invoking spark-ec2? And are you having
trouble just with r3 instances, or with any instance type?

On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 It's been another day of spinning up dead clusters...

 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:

 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start

 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.

 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.

 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Have I missed something? What AMIs are people using? I've just gone back
 through the archives, and I'm seeing a lot of "I can't get EC2 to work" and
 not a single "My EC2 has post-install issues".

 The quickstart page says you "...can have a spark cluster up and running in
 five minutes". But it's been three days for me so far. I'm about to bite
 the bullet and start building my own AMIs from scratch... if anyone can
 save me from that, I'd be most grateful.

 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers



Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
If you are explicitly specifying the AMI in your invocation of spark-ec2,
may I suggest simply removing any explicit mention of AMI from your
invocation? spark-ec2 automatically selects an appropriate AMI based on the
specified instance type.
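For example, a hedged sketch of such an invocation with no AMI flag (the key pair, key file, and cluster name are placeholders; check ./spark-ec2 --help for the exact flags in your version):

```
# No -a/--ami flag: the script picks the AMI for the instance type.
./spark-ec2 -k my-keypair -i ~/my-keypair.pem \
  -s 2 --instance-type=m1.large launch my-spark-cluster
```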

On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?

 On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 It's been another day of spinning up dead clusters...

 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:

 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start

 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.

 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.

 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.




SparkSQL Table schema in Java

2014-06-01 Thread Kuldeep Bora
Hello,

Congrats for 1.0.0 release.

I would like to ask: why does table creation require a proper class in
Scala and Java, while in Python you can just use a map?
I think that requiring a class to define a table is a bit too restrictive.
Using a plain map, on the other hand, could be very handy for creating
tables dynamically.

Are there any alternative APIs for Spark SQL which can work with plain Java
maps, as in Python?
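For reference, a sketch of the class-based Scala route the poster finds restrictive, as the 1.0 docs describe it (the file name and field layout here are assumptions):

```scala
import org.apache.spark.sql.SQLContext

// A case class supplies both the column names and the column types.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)  // assumes sc: SparkContext in scope
import sqlContext.createSchemaRDD    // implicit RDD[Person] => SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
```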

Regards


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
*sigh* OK, I figured it out. (Thank you Nick, for the hint)

m1.large works (I swear I tested that earlier and had similar issues...)

It was my obsession with starting r3.*large instances. Clearly I hadn't
patched the script in all the places, which I think caused it to default
to the Amazon AMI. I'll have to take a closer look at the code and see if I
can't fix it correctly, because I really, really do want nodes with 2x the
CPU and 4x the memory for the same low spot price. :-)

I've got a cluster up now, at least. Time for the fun stuff...

Thanks everyone for the help!



On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 If you are explicitly specifying the AMI in your invocation of spark-ec2,
 may I suggest simply removing any explicit mention of AMI from your
 invocation? spark-ec2 automatically selects an appropriate AMI based on
 the specified instance type.

 On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?

 On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 It's been another day of spinning up dead clusters...

 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:

 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start

 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.

 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.

 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre B
Hi all!

We've been using the sbt-pack sbt plugin
(https://github.com/xerial/sbt-pack) for building our standalone Spark
application for a while now. Until version 1.0.0, that worked nicely.

For those who don't know the sbt-pack plugin, it basically copies all the
dependency JARs from your local ivy/maven cache to your target folder
(in target/pack/lib), and creates launch scripts (in target/pack/bin) for
your application (notably setting all these jars on the classpath).

Now, since Spark 1.0.0 was released, we are encountering a weird error:
running our project with "sbt run" is fine, but running our app with the
launch scripts generated by sbt-pack fails.

After a (quite painful) investigation, it turns out some JARs are NOT copied
from the local ivy2 cache to the lib folder. I noticed that all the missing
jars contain "shaded" in their file name (though not all jars with such
names are missing).
One of the missing JARs is explicitly from the Spark definition
(SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.

This file is clearly present in my local ivy cache, but is not copied by
sbt-pack.

Is there an evident reason for that?

I don't know much about the shading mechanism, maybe I'm missing something
here?


Any help would be appreciated!

Cheers

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Chanwit Kaewkasi
Hi all,

This is what I found:

1. Like Aaron suggested, an executor will be killed silently when the
OS's memory runs out. I've seen this enough times to conclude it's real.
Adding swap and increasing the JVM heap solved the problem, but you will
then encounter OS paging and full GCs.

2. OS paging and full GC did not affect my benchmark much while processing
data from HDFS. But Akka processes were randomly killed during
network-related stages (for example, sorting). I found that an Akka
process could not fetch results fast enough. Increasing the block manager
timeout helped a lot; I doubled the value several times, as the network of
our ARM cluster is quite slow.

3. We'd like to collect the times spent in all stages of our benchmark, so
we always re-run when some tasks fail. Failures happened a lot, but that's
understandable as Spark is designed on top of Akka's let-it-crash
philosophy. To make the benchmark run more cleanly (without task failures),
I called .cache() before calling the transformation of the next stage, and
it helped a lot.

Combining the above with other tuning, we have now boosted the performance
of our ARM cluster to 2.8 times faster than our first report.
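The block manager timeout mentioned in point 2 is a plain configuration property. A sketch of setting it; the property names below are assumptions taken from the 0.9/1.0-era configuration docs and should be verified against the docs for your Spark version:

```scala
import org.apache.spark.SparkConf

// Property names are assumptions from the 0.9/1.0-era docs; verify against
// the configuration page for your Spark version before relying on them.
val conf = new SparkConf()
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000") // block manager heartbeat timeout
  .set("spark.akka.timeout", "200")                          // Akka communication timeout, in seconds
```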

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi chan...@gmail.com wrote:
 Maybe that explains mine too.
 Thank you very much, Aaron !!

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit


 On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote:
 Spark should effectively turn Akka's failure detector off, because we
 historically had problems with GCs and other issues causing disassociations.
 The only thing that should cause these messages nowadays is if the TCP
 connection (which Akka sustains between Actor Systems on different machines)
 actually drops. TCP connections are pretty resilient, so one common cause of
 this is actual Executor failure -- recently, I have experienced a
 similar-sounding problem due to my machine's OOM killer terminating my
 Executors, such that they didn't produce any error output.


 On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 On an ARM cluster, I have been testing a wordcount program with JRE 7
 and everything is OK. But when changing to the embedded version of
 Java SE (Oracle's eJRE), the same program cannot complete all
 computing stages.

 It is failed by many Akka's disassociation.

 - I've been trying to increase Akka's timeout but still stuck. I am
 not sure what is the right way to do so? (I suspected that GC pausing
 the world is causing this).

 - Another question is that how could I properly turn on Akka's logging
 to see what's the root cause of this disassociation problem? (If my
 guess about GC is wrong).

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit




Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Aaron Davidson
Thanks for the update! I've also run into the block manager timeout issue;
it might be a good idea to increase the default significantly (it would
probably time out immediately if the TCP connection itself dropped anyway).


On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

 Hi all,

 This is what I found:

 1. Like Aaron suggested, an executor will be killed silently when the
 OS's memory runs out. I've seen this enough times to conclude it's real.
 Adding swap and increasing the JVM heap solved the problem, but you will
 then encounter OS paging and full GCs.

 2. OS paging and full GC did not affect my benchmark much while processing
 data from HDFS. But Akka processes were randomly killed during
 network-related stages (for example, sorting). I found that an Akka
 process could not fetch results fast enough. Increasing the block manager
 timeout helped a lot; I doubled the value several times, as the network of
 our ARM cluster is quite slow.

 3. We'd like to collect the times spent in all stages of our benchmark, so
 we always re-run when some tasks fail. Failures happened a lot, but that's
 understandable as Spark is designed on top of Akka's let-it-crash
 philosophy. To make the benchmark run more cleanly (without task failures),
 I called .cache() before calling the transformation of the next stage, and
 it helped a lot.

 Combining the above with other tuning, we have now boosted the performance
 of our ARM cluster to 2.8 times faster than our first report.

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit


 On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi chan...@gmail.com
 wrote:
  May be that's explaining mine too.
  Thank you very much, Aaron !!
 
  Best regards,
 
  -chanwit
 
  --
  Chanwit Kaewkasi
  linkedin.com/in/chanwit
 
 
  On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com
 wrote:
  Spark should effectively turn Akka's failure detector off, because we
  historically had problems with GCs and other issues causing
 disassociations.
  The only thing that should cause these messages nowadays is if the TCP
  connection (which Akka sustains between Actor Systems on different
 machines)
  actually drops. TCP connections are pretty resilient, so one common
 cause of
  this is actual Executor failure -- recently, I have experienced a
  similar-sounding problem due to my machine's OOM killer terminating my
  Executors, such that they didn't produce any error output.
 
 
  On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi chan...@gmail.com
 wrote:
 
  Hi all,
 
  On an ARM cluster, I have been testing a wordcount program with JRE 7
  and everything is OK. But when changing to the embedded version of
  Java SE (Oracle's eJRE), the same program cannot complete all
  computing stages.
 
  It is failed by many Akka's disassociation.
 
  - I've been trying to increase Akka's timeout but still stuck. I am
  not sure what is the right way to do so? (I suspected that GC pausing
  the world is causing this).
 
  - Another question is that how could I properly turn on Akka's logging
  to see what's the root cause of this disassociation problem? (If my
  guess about GC is wrong).
 
  Best regards,
 
  -chanwit
 
  --
  Chanwit Kaewkasi
  linkedin.com/in/chanwit
 
 



Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Gotcha. The easiest way to get your dependencies to your Executors would
probably be to construct your SparkContext with all necessary jars passed
in (as the jars parameter), or inside a SparkConf with setJars(). Avro is
a necessary jar, but it's possible your application also needs to
distribute other ones to the cluster.

An easy way to make sure all your dependencies get shipped to the cluster
is to create an assembly jar of your application, and then you just need to
tell Spark about that jar, which includes all your application's transitive
dependencies. Maven and sbt both have pretty straightforward ways of
producing assembly jars.
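A minimal sketch of the SparkConf route described above (the master URL, app name, and jar path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The assembly jar bundles the application plus its transitive dependencies
// (e.g. avro); Spark ships it to every Executor. The path is a placeholder.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("avro-reader")
  .setJars(Seq("target/scala-2.10/myapp-assembly-0.1.jar"))
val sc = new SparkContext(conf)
```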


On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com
wrote:

 Thanks for the fast reply.

 I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
 standalone mode.


 On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

 First issue was because your cluster was configured incorrectly. You
 could probably read 1 file because that was done on the driver node, but
 when it tried to run a job on the cluster, it failed.

 Second issue, it seems that the jar containing avro is not getting
 propagated to the Executors. What version of Spark are you running on? What
 deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com
  wrote:

 Now I get this:

  scala> rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 console:41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
 partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 console:41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 console:41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
 tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
 TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as
 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
 TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
 TID 94 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as
 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
 TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
 TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
 TID 97 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
 TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as
 TID 99 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as
 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as
 TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: 

Re: sc.textFileGroupByPath(*/*.txt)

2014-06-01 Thread Anwar Rizal
I presume that you need to have access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS, I need to
read the files myself, something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

def openWithPath(inputPath: String, sc: SparkContext) = {
  val fs      = new Path(inputPath).getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(new Path(inputPath), false)
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One RDD per file, tagged with that file's path, then unioned together.
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce { _ ++ _ }
}
...
...

I would be interested if there is a better way to do the same thing ...
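One candidate, for those on Spark 1.0 or later: SparkContext.wholeTextFiles reads a directory into (path, content) pairs directly. A sketch, assuming sc is in scope (e.g. the shell) and the path is a placeholder:

```scala
// Each record is (fileName, fileContent); best suited to many small files.
val byPath = sc.wholeTextFiles("hdfs:///data/texts")

// For the RDD[(PATH, Seq[TEXT])] shape asked about, split each file's lines.
val grouped = byPath.mapValues(content => content.split("\n").toSeq)
```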

Cheers,
a:


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Could you provide an example of what you mean?

 I know it's possible to create an RDD from a path with wildcards, like in
 the subject.

 For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
 provide a comma delimited list of paths.

 Nick

  On Sunday, June 1, 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 Is it possible to create an RDD from a directory tree of the following
 form?

 RDD[(PATH, Seq[TEXT])]

 Thank you,
 Oleg




Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey, just to clarify this - my understanding is that the poster
(Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
launch spark-ec2 from my laptop. And he was looking for an AMI that
had a high enough version of Python.

spark-ec2 itself has a flag "-a" that allows you to give a specific
AMI. This flag is really an internal tool that we use for testing when
we spin new AMIs. Users can't set it to an arbitrary AMI because we
tightly control things like the Java and OS versions, libraries, etc.


On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
 *sigh* OK, I figured it out. (Thank you Nick, for the hint)

 m1.large works, (I swear I tested that earlier and had similar issues... )

 It was my obsession with starting r3.*large instances. Clearly I hadn't
 patched the script in all the places.. which I think caused it to default to
 the Amazon AMI. I'll have to take a closer look at the code and see if I
 can't fix it correctly, because I really, really do want nodes with 2x the
 CPU and 4x the memory for the same low spot price. :-)

 I've got a cluster up now, at least. Time for the fun stuff...

 Thanks everyone for the help!



 On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 If you are explicitly specifying the AMI in your invocation of spark-ec2,
 may I suggest simply removing any explicit mention of AMI from your
 invocation? spark-ec2 automatically selects an appropriate AMI based on the
 specified instance type.
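For reference, a typical spark-ec2 invocation that omits any AMI option looks like this (key pair name, identity file, and cluster name are placeholders):

```shell
# With no -a/--ami flag, spark-ec2 selects the matching Spark AMI
# for the requested instance type on its own.
./spark-ec2 -k my-keypair -i ~/my-keypair.pem \
  -s 2 -t m1.large launch my-spark-cluster
```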

 On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?

 On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 It's been another day of spinning up dead clusters...

 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:

 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start

 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.

 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.

 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.

 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.




 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
One potential issue here is that mesos is using classifiers now to
publish their jars. It might be that sbt-pack has trouble with
dependencies that are published using classifiers. I'm pretty sure
mesos is the only dependency in Spark that is using classifiers, so
that's why I mention it.

On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
 Hi all!

 We've been using the sbt-pack sbt plugin
 (https://github.com/xerial/sbt-pack) for building our standalone Spark
 application for a while now. Until version 1.0.0, that worked nicely.

 For those who don't know the sbt-pack plugin, it basically copies all the
 dependency JARs from your local ivy/maven cache to your target folder
 (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
 your application (notably setting all these jars on the classpath).
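For readers unfamiliar with the plugin, a minimal setup of that era looked roughly like the following (plugin version, project name, and main class are illustrative, not taken from Pierre's build):

```scala
// project/plugins.sbt -- version is illustrative for the 2014-era plugin
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.4.5")

// build.sbt
import xerial.sbt.Pack._

packSettings

// Produces target/pack/bin/myapp after running `sbt pack`.
packMain := Map("myapp" -> "com.example.Main")
```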

 Now, since Spark 1.0.0 was released, we are encountering a weird error where
 running our project with sbt run is fine but running our app with the
 launch scripts generated by sbt-pack fails.

 After a (quite painful) investigation, it turns out some JARs are NOT copied
 from the local ivy2 cache to the lib folder. I noticed that all the missing
 jars contain "shaded" in their file name (but not all jars with such
 names are missing).
 One of the missing JARs is explicitly from the Spark definition
 (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
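For context, an sbt dependency published with a classifier is declared with the `classifier` method; the mesos dependency in Spark's build of that era was along these lines (a reconstruction, not a verbatim quote of SparkBuild.scala):

```scala
// The classifier selects the "shaded-protobuf" variant of the artifact,
// i.e. mesos-0.18.1-shaded-protobuf.jar rather than the default jar.
libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"
```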

 This file is clearly present in my local ivy cache, but is not copied by
 sbt-pack.

 Is there an evident reason for that?

 I don't know much about the shading mechanism, maybe I'm missing something
 here?


 Any help would be appreciated!

 Cheers

 Pierre



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350

On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote:
 One potential issue here is that mesos is using classifiers now to
 publish their jars. It might be that sbt-pack has trouble with
 dependencies that are published using classifiers. I'm pretty sure
 mesos is the only dependency in Spark that is using classifiers, so
 that's why I mention it.

 On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com wrote:
 Hi all!

  We've been using the sbt-pack sbt plugin
 (https://github.com/xerial/sbt-pack) for building our standalone Spark
 application for a while now. Until version 1.0.0, that worked nicely.

 For those who don't know the sbt-pack plugin, it basically copies all the
  dependency JARs from your local ivy/maven cache to your target folder
 (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
 your application (notably setting all these jars on the classpath).

 Now, since Spark 1.0.0 was released, we are encountering a weird error where
 running our project with sbt run is fine but running our app with the
 launch scripts generated by sbt-pack fails.

 After a (quite painful) investigation, it turns out some JARs are NOT copied
 from the local ivy2 cache to the lib folder. I noticed that all the missing
  jars contain "shaded" in their file name (but not all jars with such
  names are missing).
 One of the missing JARs is explicitly from the Spark definition
 (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.

 This file is clearly present in my local ivy cache, but is not copied by
 sbt-pack.

 Is there an evident reason for that?

 I don't know much about the shading mechanism, maybe I'm missing something
 here?


 Any help would be appreciated!

 Cheers

 Pierre



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre Borckmans
You're right Patrick! 

Just had a chat with the sbt-pack creator, and indeed dependencies with classifiers
are ignored, to avoid problems with a dirty cache...

Should be fixed in next version of the plugin.

Cheers

Pierre 

Message sent from a mobile device - excuse typos and abbreviations 

 Le 1 juin 2014 à 20:04, Patrick Wendell pwend...@gmail.com a écrit :
 
 https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
 
 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote:
 One potential issue here is that mesos is using classifiers now to
  publish their jars. It might be that sbt-pack has trouble with
 dependencies that are published using classifiers. I'm pretty sure
 mesos is the only dependency in Spark that is using classifiers, so
 that's why I mention it.
 
 On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
 pierre.borckm...@realimpactanalytics.com wrote:
 Hi all!
 
  We've been using the sbt-pack sbt plugin
 (https://github.com/xerial/sbt-pack) for building our standalone Spark
 application for a while now. Until version 1.0.0, that worked nicely.
 
 For those who don't know the sbt-pack plugin, it basically copies all the
  dependency JARs from your local ivy/maven cache to your target folder
 (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
 your application (notably setting all these jars on the classpath).
 
 Now, since Spark 1.0.0 was released, we are encountering a weird error where
 running our project with sbt run is fine but running our app with the
 launch scripts generated by sbt-pack fails.
 
 After a (quite painful) investigation, it turns out some JARs are NOT copied
 from the local ivy2 cache to the lib folder. I noticed that all the missing
  jars contain "shaded" in their file name (but not all jars with such
  names are missing).
 One of the missing JARs is explicitly from the Spark definition
 (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
 
 This file is clearly present in my local ivy cache, but is not copied by
 sbt-pack.
 
 Is there an evident reason for that?
 
 I don't know much about the shading mechanism, maybe I'm missing something
 here?
 
 
 Any help would be appreciated!
 
 Cheers
 
 Pierre
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
Note that everything works fine in spark 0.9, which is packaged in CDH5: I
can launch a spark-shell and interact with workers spawned on my yarn
cluster.

So in my /opt/hadoop/conf/yarn-site.xml, I have:
...
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
...
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
</property>
...

And the other usual stuff.
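Since two resource managers are named, the client side also needs the ResourceManager HA switches for rm1/rm2 to be consulted at all; a sketch of the usual Hadoop 2.x properties (the cluster id value is illustrative):

```xml
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>my-yarn-cluster</value>
</property>
```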

So spark 1.0 is launched like this:
Spark Command: java -cp
::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
-XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
--class org.apache.spark.repl.Main

I do see /opt/hadoop/conf included, but not sure it's the right place.

Thanks..
-Simon



On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 I would agree with your guess, it looks like the yarn library isn't
 correctly finding your yarn-site.xml file. If you look in
 yarn-site.xml do you definitely the resource manager
 address/addresses?

 Also, you can try running this command with
 SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
 set-up correctly.
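Put together, those checks amount to something like the following (paths follow the /opt/hadoop/conf layout mentioned in this thread):

```shell
# Point the YARN client at the directory containing yarn-site.xml,
# then have Spark print the exact JVM launch command it builds.
export HADOOP_CONF_DIR=/opt/hadoop/conf
export YARN_CONF_DIR=/opt/hadoop/conf
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell --master yarn-client
```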

 - Patrick

 On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Hi all,
 
  I tried a couple ways, but couldn't get it to work..
 
  The following seems to be what the online document
  (http://spark.apache.org/docs/latest/running-on-yarn.html) is
 suggesting:
 
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
  YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
 
  Help info of spark-shell seems to be suggesting --master yarn
 --deploy-mode
  cluster.
 
  But either way, I am seeing the following messages:
  14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
  /0.0.0.0:8032
  14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
 
  My guess is that spark-shell is trying to talk to resource manager to
 setup
  spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
  though. I am running CDH5 with two resource managers in HA mode. Their
  IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
  HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
 
  Any ideas? Thanks.
  -Simon



Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Ah yes, looking back at the first email in the thread, indeed that was the
case. For the record, I too launch clusters from my laptop, where I have
Python 2.7 installed.


On Sun, Jun 1, 2014 at 2:01 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey just to clarify this - my understanding is that the poster
 (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.

 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.


 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
  m1.large works, (I swear I tested that earlier and had similar
 issues... )
 
  It was my obsession with starting r3.*large instances. Clearly I hadn't
  patched the script in all the places.. which I think caused it to
 default to
  the Amazon AMI. I'll have to take a closer look at the code and see if I
  can't fix it correctly, because I really, really do want nodes with 2x
 the
  CPU and 4x the memory for the same low spot price. :-)
 
  I've got a cluster up now, at least. Time for the fun stuff...
 
  Thanks everyone for the help!
 
 
 
  On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  If you are explicitly specifying the AMI in your invocation of
 spark-ec2,
  may I suggest simply removing any explicit mention of AMI from your
  invocation? spark-ec2 automatically selects an appropriate AMI based on
 the
  specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  Could you post how exactly you are invoking spark-ec2? And are you
 having
  trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
 
  It's been another day of spinning up dead clusters...
 
  I thought I'd finally worked out what everyone else knew - don't use
 the
  default AMI - but I've now run through all of the official
 quick-start
  linux releases and I'm none the wiser:
 
  Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
  Provisions servers, connects, installs, but the webserver on the master
  will not start
 
  Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
  Spot instance requests are not supported for this AMI.
 
  SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
  Not tested - costs 10x more for spot instances, not economically
 viable.
 
  Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
  Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
 
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers



Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
More specifically with the -a flag, you *can* set your own AMI, but you’ll need 
to base it off ours. This is because spark-ec2 assumes that some packages (e.g. 
java, Python 2.6) are already available on the AMI.

Matei

On Jun 1, 2014, at 11:01 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey just to clarify this - my understanding is that the poster
 (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.
 
 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.
 
 
 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
 *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
 m1.large works, (I swear I tested that earlier and had similar issues... )
 
 It was my obsession with starting r3.*large instances. Clearly I hadn't
 patched the script in all the places.. which I think caused it to default to
 the Amazon AMI. I'll have to take a closer look at the code and see if I
 can't fix it correctly, because I really, really do want nodes with 2x the
 CPU and 4x the memory for the same low spot price. :-)
 
 I've got a cluster up now, at least. Time for the fun stuff...
 
 Thanks everyone for the help!
 
 
 
 On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 
 If you are explicitly specifying the AMI in your invocation of spark-ec2,
 may I suggest simply removing any explicit mention of AMI from your
 invocation? spark-ec2 automatically selects an appropriate AMI based on the
 specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
 
 It's been another day of spinning up dead clusters...
 
 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the official quick-start
 linux releases and I'm none the wiser:
 
 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start
 
 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.
 
 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.
 
 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.
 
 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but git is not pre-installed, so the cluster setup
 fails.
 
 
 
 
 --
 Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers



Re: sc.textFileGroupByPath(*/*.txt)

2014-06-01 Thread Oleg Proudnikov
Anwar,

Will try this as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.

I am now thinking that I could start with an RDD of file paths and map it
into (path, content) pairs, provided I could read a file on the server.

Thank you,
Oleg



On 1 June 2014 18:41, Anwar Rizal anriza...@gmail.com wrote:

 I presume that you need to have access to the path of each file you are
 reading.

 I don't know whether there is a good way to do that for HDFS, I need to
 read the files myself, something like:

  def openWithPath(inputPath: String, sc: SparkContext) = {
    val path = new Path(inputPath)
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    val filesIt = fs.listFiles(path, false)
    val paths = new ListBuffer[URI]
    while (filesIt.hasNext) {
      paths += filesIt.next.getPath.toUri
    }
    val withPaths = paths.toList.map { p =>
      sc.newAPIHadoopFile[LongWritable, Text,
        TextInputFormat](p.toString).map { case (_, s) => (p, s.toString) }
    }
    withPaths.reduce { _ ++ _ }
  }
 ...

 I would be interested if there is a better way to do the same thing ...

 Cheers,
 a:


 On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Could you provide an example of what you mean?

 I know it's possible to create an RDD from a path with wildcards, like in
 the subject.

 For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
 provide a comma delimited list of paths.

 Nick

  On Sunday, June 1, 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 Is it possible to create an RDD from a directory tree of the following
 form?

 RDD[(PATH, Seq[TEXT])]

 Thank you,
 Oleg





-- 
Kind regards,

Oleg


[Spark Streaming] Distribute custom receivers evenly across excecutors

2014-06-01 Thread Guang Gao
Dear All,

I'm running Spark Streaming (1.0.0) with Yarn (2.2.0) on a 10-node cluster.
I setup 10 custom receivers to hear from 10 data streams. I want one
receiver per node in order to maximize the network bandwidth. However, if I
set --executor-cores 4, the 10 receivers only run on 3 of the nodes in
the cluster, each running 4, 4, 2 receivers; if I set --executor-cores 1,
each node will run exactly one receiver, and it seems that Spark can't make
any progress to process these streams.

I read the documentation on configuration and also googled but didn't find
a clue. Is there a way to configure how the receivers are distributed?
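One relevant knob: the Receiver API exposes a preferredLocation hint that lets a custom receiver name the host it would like to run on, though Spark treats it as a preference rather than a guarantee. A sketch, assuming a hypothetical receiver class of your own:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver pinned to a given host via preferredLocation.
class PinnedReceiver(host: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def preferredLocation: Option[String] = Some(host)
  def onStart(): Unit = { /* open the connection and call store(...) */ }
  def onStop(): Unit = { /* close the connection */ }
}
```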

Thanks!

Here are some details:

How I created 10 receivers:

val conf = new SparkConf().setAppName(jobId)
val sc = new StreamingContext(conf, Seconds(1))
var lines: DStream[String] =
  sc.receiverStream(new CustomReceiver(...))
for (i <- 1 to 9) {
  lines = lines.union(
    sc.receiverStream(new CustomReceiver(...))
  )
}
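The same ten streams can also be built without the mutable var and unioned in one call (CustomReceiver stands in for the poster's class, its constructor arguments elided as in the original):

```scala
// Build all ten receiver streams, then union them with the
// StreamingContext's Seq-based union.
val streams = (1 to 10).map { _ =>
  sc.receiverStream(new CustomReceiver(/* ... */))
}
val lines: DStream[String] = sc.union(streams)
```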

How I submit a job to Yarn:

spark-submit \
--class $JOB_CLASS \
--master yarn-client \
--num-executors 10 \
--driver-memory 1g \
--executor-memory 2g \
--executor-cores 4 \
$JAR_NAME


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
You can avoid that by using the constructor that takes a SparkConf, a la

val conf = new SparkConf()
conf.setJars(Seq("avro.jar", ...))
val sc = new SparkContext(conf)


On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com
wrote:

 Followup question: the docs to make a new SparkContext require that I know
 where $SPARK_HOME is. However, I have no idea. Any idea where that might be?


 On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com
 wrote:

 Gotcha. The easiest way to get your dependencies to your Executors would
 probably be to construct your SparkContext with all necessary jars passed
 in (as the jars parameter), or inside a SparkConf with setJars(). Avro is
 a necessary jar, but it's possible your application also needs to
 distribute other ones to the cluster.

 An easy way to make sure all your dependencies get shipped to the cluster
 is to create an assembly jar of your application, and then you just need to
 tell Spark about that jar, which includes all your application's transitive
 dependencies. Maven and sbt both have pretty straightforward ways of
 producing assembly jars.
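With sbt, for example, the sbt-assembly plugin of that era produced such a jar; a sketch (the plugin version is illustrative, and spark-core is marked provided so the assembly stays lean):

```scala
// project/assembly.sbt -- version illustrative for the pre-0.14 plugin API
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt (old-style sbt-assembly settings)
import AssemblyKeys._

assemblySettings

// Spark itself is on the cluster already, so exclude it from the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
```

Running `sbt assembly` then yields a single jar containing the application and its transitive dependencies, which is the one jar you need to tell Spark about.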


 On Sat, May 31, 2014 at 11:23 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Thanks for the fast reply.

 I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
 standalone mode.


 On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

 First issue was because your cluster was configured incorrectly. You
 could probably read 1 file because that was done on the driver node, but
 when it tried to run a job on the cluster, it failed.

 Second issue, it seems that the jar containing avro is not getting
 propagated to the Executors. What version of Spark are you running on? What
 deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Now I get this:

 scala rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 console:41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
 partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 console:41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 console:41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
 tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
 TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
 TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
 TID 94 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
 TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
 TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
 TID 97 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
 TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 

Re: Trouble with EC2

2014-06-01 Thread PJ$
Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
gotten any further. No clue what's wrong. I'd really appreciate any
guidance y'all could offer.

Best,
PJ$


On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 What instance types did you launch on?

 Sometimes you also get a bad individual machine from EC2. It might help to
 remove the node it’s complaining about from the conf/slaves file.

 Matei

 On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:

 Hey Folks,

 I'm really having quite a bit of trouble getting spark running on ec2. I'm
 not using the scripts at https://github.com/apache/spark/tree/master/ec2
 because I'd like to know how everything works. But I'm going a little
 crazy. I think that something about the networking configuration must be
 messed up, but I'm at a loss. Shortly after starting the cluster, I get a
 lot of this:

 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO actor.LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
 was not delivered. [5] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485





Re: Trouble with EC2

2014-06-01 Thread Matei Zaharia
So to run spark-ec2, you should use the default AMI that it launches with if 
you don’t pass -a. Those are based on Amazon Linux, not Debian. Passing your 
own AMI is an advanced option but people need to install some stuff on their 
AMI in advance for it to work with our scripts.

Matei


On Jun 1, 2014, at 3:11 PM, PJ$ p...@chickenandwaffl.es wrote:

 Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't 
 gotten any further. No clue what's wrong. I'd really appreciate any guidance 
 y'all could offer. 
 
 Best, 
 PJ$
 
 
 On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 What instance types did you launch on?
 
 Sometimes you also get a bad individual machine from EC2. It might help to 
 remove the node it’s complaining about from the conf/slaves file.
 
 Matei
 
 On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:
 
 Hey Folks, 
 
 I'm really having quite a bit of trouble getting spark running on ec2. I'm 
 not using the scripts at https://github.com/apache/spark/tree/master/ec2 
 because I'd like to know how everything works. But I'm going a little crazy. 
 I think that something about the networking configuration must be messed up, 
 but I'm at a loss. Shortly after starting the cluster, I get a lot of this: 
 
 14/05/30 18:03:22 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:22 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO actor.LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
  was not delivered. [5] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 
 



Re: Trouble with EC2

2014-06-01 Thread Jeremy Lee
Ha yes... I just went through this.

(a) You have to use the 'default' Spark AMI (ami-7a320f3f at the moment)
and not any of the other Linux distros. They don't work.
(b) Start with m1.large instances. I tried going for r3.large at first,
and had no end of self-caused trouble. m1.large works.
(c) It's possible for the script to choose the wrong AMI, especially if one
has been messing with it to allow other instance types. (ahem)

But it will work in the end... just start simple. (Yeah, I know m1.large
doesn't look that large anymore. :-)


On Mon, Jun 2, 2014 at 8:11 AM, PJ$ p...@chickenandwaffl.es wrote:

 Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
 gotten any further. No clue what's wrong. I'd really appreciate any
 guidance y'all could offer.

 Best,
 PJ$


 On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 What instance types did you launch on?

 Sometimes you also get a bad individual machine from EC2. It might help
 to remove the node it’s complaining about from the conf/slaves file.

 Matei

 On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:

 Hey Folks,

 I'm really having quite a bit of trouble getting spark running on ec2.
 I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2
 because I'd like to know how everything works. But I'm going a little
 crazy. I think that something about the networking configuration must be
 messed up, but I'm at a loss. Shortly after starting the cluster, I get a
 lot of this:

 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:22 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO actor.LocalActorRef: Message
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
 Actor[akka://sparkMaster/deadLetters] to
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
 was not delivered. [5] dead letters encountered. This logging can be turned
 off or adjusted with configuration settings 'akka.log-dead-letters' and
 'akka.log-dead-letters-during-shutdown'.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 INFO master.Master:
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
 -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
 [Association failed with
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with [
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485






-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
As a debugging step, does it work if you use a single resource manager
with the key yarn.resourcemanager.address instead of using two named
resource managers? I wonder if somehow the YARN client can't detect
this multi-master set-up.
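A hedged sketch of what that debugging step could look like (reusing the controller-1 host and port from the yarn-site.xml quoted below; the scratch-file path is arbitrary): swap the two named rm1/rm2 properties for one plain yarn.resourcemanager.address.

```shell
# A sketch of the single-RM debugging step: write one unnamed RM address
# to a scratch file; in a real test this fragment would replace the
# rm1/rm2 entries inside yarn-site.xml.
cat > /tmp/yarn-single-rm.xml <<'EOF'
<property>
  <name>yarn.resourcemanager.address</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
EOF
grep -c '<property>' /tmp/yarn-single-rm.xml   # prints 1
```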

On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote:
 Note that everything works fine in spark 0.9, which is packaged in CDH5: I
 can launch a spark-shell and interact with workers spawned on my yarn
 cluster.

 So in my /opt/hadoop/conf/yarn-site.xml, I have:
 ...
 <property>
   <name>yarn.resourcemanager.address.rm1</name>
   <value>controller-1.mycomp.com:23140</value>
 </property>
 ...
 <property>
   <name>yarn.resourcemanager.address.rm2</name>
   <value>controller-2.mycomp.com:23140</value>
 </property>
 ...

 And the other usual stuff.

 So spark 1.0 is launched like this:
 Spark Command: java -cp
 ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
 -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
 org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class
 org.apache.spark.repl.Main

 I do see /opt/hadoop/conf included, but not sure it's the right place.

 Thanks..
 -Simon



 On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 I would agree with your guess, it looks like the yarn library isn't
 correctly finding your yarn-site.xml file. If you look in
 yarn-site.xml, do you definitely see the resource manager
 address/addresses?

 Also, you can try running this command with
 SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
 set-up correctly.

 - Patrick

 On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Hi all,
 
 I tried a couple of ways, but couldn't get it to work.
 
  The following seems to be what the online document
  (http://spark.apache.org/docs/latest/running-on-yarn.html) is
  suggesting:
 
  SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
  YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
 
  Help info of spark-shell seems to be suggesting --master yarn
  --deploy-mode
  cluster.
 
  But either way, I am seeing the following messages:
  14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
  /0.0.0.0:8032
  14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
 
  My guess is that spark-shell is trying to talk to resource manager to
  setup
  spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
  though. I am running CDH5 with two resource managers in HA mode. Their
  IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
  HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
 
  Any ideas? Thanks.
  -Simon
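A minimal sketch of the first thing to rule out here (assuming a POSIX shell; /opt/hadoop/conf is the path named above): confirm the conf variables are exported in the same shell that launches spark-shell, since a conf dir that the launcher cannot see is exactly what produces the 0.0.0.0:8032 fallback.

```shell
# Point both conf variables at the directory holding yarn-site.xml,
# then launch. With neither variable visible to the launcher, the
# YARN client falls back to the default RM address 0.0.0.0:8032.
export HADOOP_CONF_DIR=/opt/hadoop/conf
export YARN_CONF_DIR=/opt/hadoop/conf
# ./bin/spark-shell --master yarn-client   # actual launch, left commented
echo "$HADOOP_CONF_DIR"   # prints /opt/hadoop/conf
```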




Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
Sort of.. there were two separate issues, but both related to AWS..

I've sorted the confusion about the Master/Worker AMI ... use the version
chosen by the scripts. (and use the right instance type so the script can
choose wisely)

But yes, one also needs a launch machine to kick off the cluster, and for
that I _also_ was using an Amazon instance... (made sense: I have a team
that will need to do things as well, not just me) and I was just pointing
out that if you use the AMI most recommended by Amazon (for your free
micro instance, for example) you get Python 2.6 and the ec2 scripts fail.

That merely needs a line in the documentation saying use Ubuntu for your
cluster controller, not Amazon Linux or somesuch. But yeah, for a newbie,
it was hard working out when to use default or custom AMIs for various
parts of the setup.


On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey just to clarify this - my understanding is that the poster
 (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.

 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.


 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
  m1.large works, (I swear I tested that earlier and had similar
 issues... )
 
  It was my obsession with starting r3.*large instances. Clearly I hadn't
  patched the script in all the places.. which I think caused it to
 default to
  the Amazon AMI. I'll have to take a closer look at the code and see if I
  can't fix it correctly, because I really, really do want nodes with 2x
 the
  CPU and 4x the memory for the same low spot price. :-)
 
  I've got a cluster up now, at least. Time for the fun stuff...
 
  Thanks everyone for the help!
 
 
 
  On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  If you are explicitly specifying the AMI in your invocation of
 spark-ec2,
  may I suggest simply removing any explicit mention of AMI from your
  invocation? spark-ec2 automatically selects an appropriate AMI based on
 the
  specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  Could you post how exactly you are invoking spark-ec2? And are you
 having
  trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
 
  It's been another day of spinning up dead clusters...
 
  I thought I'd finally worked out what everyone else knew - don't use
 the
  default AMI - but I've now run through all of the official
 quick-start
  linux releases and I'm none the wiser:
 
  Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
  Provisions servers, connects, installs, but the webserver on the master
  will not start
 
  Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
  Spot instance requests are not supported for this AMI.
 
  SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
  Not tested - costs 10x more for spot instances, not economically
 viable.
 
  Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
  Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
  Provisions servers, but git is not pre-installed, so the cluster
 setup
  fails.
 
 
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers




-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
That helped a bit... Now I have a different failure: the startup process
is stuck in an infinite loop outputting the following message:

14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
 appMasterRpcPort: -1
 appStartTime: 1401672868277
 yarnAppState: ACCEPTED

I am using the Hadoop 2 prebuilt package. Probably it doesn't have the
latest YARN client.

-Simon




On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote:

 As a debugging step, does it work if you use a single resource manager
 with the key yarn.resourcemanager.address instead of using two named
 resource managers? I wonder if somehow the YARN client can't detect
 this multi-master set-up.

 On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Note that everything works fine in spark 0.9, which is packaged in CDH5:
 I
  can launch a spark-shell and interact with workers spawned on my yarn
  cluster.
 
  So in my /opt/hadoop/conf/yarn-site.xml, I have:
  ...
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>controller-1.mycomp.com:23140</value>
  </property>
  ...
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>controller-2.mycomp.com:23140</value>
  </property>
  ...
 
  And the other usual stuff.
 
  So spark 1.0 is launched like this:
  Spark Command: java -cp
 
 ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
  -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
  org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
 --class
  org.apache.spark.repl.Main
 
  I do see /opt/hadoop/conf included, but not sure it's the right place.
 
  Thanks..
  -Simon
 
 
 
  On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  I would agree with your guess, it looks like the yarn library isn't
  correctly finding your yarn-site.xml file. If you look in
   yarn-site.xml, do you definitely see the resource manager
  address/addresses?
 
  Also, you can try running this command with
  SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
  set-up correctly.
 
  - Patrick
 
  On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
  wrote:
   Hi all,
  
    I tried a couple of ways, but couldn't get it to work.
  
   The following seems to be what the online document
   (http://spark.apache.org/docs/latest/running-on-yarn.html) is
   suggesting:
  
  
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
   YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
  
   Help info of spark-shell seems to be suggesting --master yarn
   --deploy-mode
   cluster.
  
   But either way, I am seeing the following messages:
   14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager
 at
   /0.0.0.0:8032
   14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
   14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
  
   My guess is that spark-shell is trying to talk to resource manager to
   setup
   spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
 from
   though. I am running CDH5 with two resource managers in HA mode. Their
   IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
   HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
  
   Any ideas? Thanks.
   -Simon
 
 



Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Thanks again. Run results here:
https://gist.github.com/rjurney/dc0efae486ba7d55b7d5

This time I get a port already in use exception on 4040, but it isn't
fatal. Then when I run rdd.first, I get this over and over:

14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient memory



On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson ilike...@gmail.com wrote:

 You can avoid that by using the constructor that takes a SparkConf, a la

 val conf = new SparkConf()
 conf.setJars(Seq("avro.jar", ...))
 val sc = new SparkContext(conf)


 On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com
 wrote:

 Followup question: the docs to make a new SparkContext require that I
 know where $SPARK_HOME is. However, I have no idea. Any idea where that
 might be?


 On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com
 wrote:

 Gotcha. The easiest way to get your dependencies to your Executors would
 probably be to construct your SparkContext with all necessary jars passed
 in (as the jars parameter), or inside a SparkConf with setJars(). Avro is
 a necessary jar, but it's possible your application also needs to
 distribute other ones to the cluster.

 An easy way to make sure all your dependencies get shipped to the
 cluster is to create an assembly jar of your application, and then you just
 need to tell Spark about that jar, which includes all your application's
 transitive dependencies. Maven and sbt both have pretty straightforward
 ways of producing assembly jars.


 On Sat, May 31, 2014 at 11:23 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Thanks for the fast reply.

 I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
 standalone mode.


 On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

 First issue was because your cluster was configured incorrectly. You
 could probably read 1 file because that was done on the driver node, but
 when it tried to run a job on the cluster, it failed.

 Second issue, it seems that the jar containing avro is not getting
 propagated to the Executors. What version of Spark are you running on? 
 What
 deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Now I get this:

 scala rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 console:41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
 partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 console:41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 console:41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
 tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set
 5.0 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0
 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3
 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1
 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2
 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4
 as TID 96 on executor 3: 

Re: Spark on EC2

2014-06-01 Thread superback
I haven't set up an AMI yet. I am just trying to run a simple job on the EC2
cluster. So, is setting up an AMI a prerequisite for running a simple Spark
example like org.apache.spark.examples.GroupByTest?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638p6681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this.

Matei

On Jun 1, 2014, at 6:14 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote:

 Sort of.. there were two separate issues, but both related to AWS..
 
 I've sorted the confusion about the Master/Worker AMI ... use the version 
 chosen by the scripts. (and use the right instance type so the script can 
 choose wisely)
 
 But yes, one also needs a launch machine to kick off the cluster, and for 
 that I _also_ was using an Amazon instance... (made sense: I have a team 
 that will need to do things as well, not just me) and I was just pointing 
 out that if you use the AMI most recommended by Amazon (for your free micro 
 instance, for example) you get Python 2.6 and the ec2 scripts fail.
 
 That merely needs a line in the documentation saying use Ubuntu for your 
 cluster controller, not Amazon Linux or somesuch. But yeah, for a newbie, it 
 was hard working out when to use default or custom AMIs for various parts 
 of the setup.
 
 
 On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hey just to clarify this - my understanding is that the poster
 (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
 launch spark-ec2 from my laptop. And he was looking for an AMI that
 had a high enough version of python.
 
 Spark-ec2 itself has a flag -a that allows you to give a specific
 AMI. This flag is just an internal tool that we use for testing when
 we spin new AMI's. Users can't set that to an arbitrary AMI because we
 tightly control things like the Java and OS versions, libraries, etc.
 
 
 On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 unorthodox.engine...@gmail.com wrote:
  *sigh* OK, I figured it out. (Thank you Nick, for the hint)
 
  m1.large works, (I swear I tested that earlier and had similar issues... )
 
  It was my obsession with starting r3.*large instances. Clearly I hadn't
  patched the script in all the places.. which I think caused it to default to
  the Amazon AMI. I'll have to take a closer look at the code and see if I
  can't fix it correctly, because I really, really do want nodes with 2x the
  CPU and 4x the memory for the same low spot price. :-)
 
  I've got a cluster up now, at least. Time for the fun stuff...
 
  Thanks everyone for the help!
 
 
 
  On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  If you are explicitly specifying the AMI in your invocation of spark-ec2,
  may I suggest simply removing any explicit mention of AMI from your
  invocation? spark-ec2 automatically selects an appropriate AMI based on the
  specified instance type.
 
  On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:
 
  Could you post how exactly you are invoking spark-ec2? And are you having
  trouble just with r3 instances, or with any instance type?
 
  On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
 
  It's been another day of spinning up dead clusters...
 
  I thought I'd finally worked out what everyone else knew - don't use the
  default AMI - but I've now run through all of the official quick-start
  linux releases and I'm none the wiser:
 
  Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
  Provisions servers, connects, installs, but the webserver on the master
  will not start
 
  Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
  Spot instance requests are not supported for this AMI.
 
  SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
  Not tested - costs 10x more for spot instances, not economically viable.
 
  Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
  Provisions servers, but git is not pre-installed, so the cluster setup
  fails.
 
  Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
  Provisions servers, but git is not pre-installed, so the cluster setup
  fails.
 
 
 
 
  --
  Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers
 
 
 
 -- 
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers



Please put me into the mail list, thanks.

2014-06-01 Thread Yunmeng Ban



Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Yunmeng Ban
Hi,

I'm running the example of JavaKafkaWordCount in a standalone cluster. I
want to set 1600MB memory for each slave node. I wrote in the
spark/conf/spark-env.sh

SPARK_WORKER_MEMORY=1600m

But the logs on the slave nodes look like this:
Spark Executor Command: /usr/java/latest/bin/java -cp
:/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
-Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend

The memory seems to be the default value, not 1600 MB.
I don't know how to make SPARK_WORKER_MEMORY work.
Can anyone help me?
Many thanks in advance.

Yunmeng
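A hedged sketch of the distinction likely at play (assuming Spark 0.9 standalone; spark.executor.memory and SPARK_JAVA_OPTS are standard Spark 0.9 knobs, not values from this thread): SPARK_WORKER_MEMORY only caps what a worker may hand out, while the executor heap (-Xms/-Xmx) comes from the application's spark.executor.memory, which defaults to 512m — hence the -Xmx512M in the executor command above.

```shell
# Worker-side cap: total memory this worker may offer to executors.
export SPARK_WORKER_MEMORY=1600m
# Application-side request: the heap an executor actually gets. In 0.9
# this is the spark.executor.memory property, settable e.g. via:
#   SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m" ./bin/spark-shell
echo "$SPARK_WORKER_MEMORY"   # prints 1600m
```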


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Sounds like you have two shells running, and the first one is taking all
your resources. Do a jps and kill the other guy, then try again.

By the way, you can look at http://localhost:8080 (replace localhost with
the server your Spark Master is running on) to see what applications are
currently started, and what resource allocations they have.


On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney russell.jur...@gmail.com
wrote:

 Thanks again. Run results here:
 https://gist.github.com/rjurney/dc0efae486ba7d55b7d5

 This time I get a port already in use exception on 4040, but it isn't
 fatal. Then when I run rdd.first, I get this over and over:

 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not 
 accepted any resources; check your cluster UI to ensure that workers are 
 registered and have sufficient memory





 On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson ilike...@gmail.com wrote:

 You can avoid that by using the constructor that takes a SparkConf, a la

 val conf = new SparkConf()
  conf.setJars(Seq("avro.jar", ...))
 val sc = new SparkContext(conf)


 On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com
 wrote:

 Followup question: the docs to make a new SparkContext require that I
 know where $SPARK_HOME is. However, I have no idea. Any idea where that
 might be?


 On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com
 wrote:

 Gotcha. The easiest way to get your dependencies to your Executors
 would probably be to construct your SparkContext with all necessary jars
 passed in (as the jars parameter), or inside a SparkConf with setJars().
 Avro is a necessary jar, but it's possible your application also needs to
 distribute other ones to the cluster.

 An easy way to make sure all your dependencies get shipped to the
 cluster is to create an assembly jar of your application, and then you just
 need to tell Spark about that jar, which includes all your application's
 transitive dependencies. Maven and sbt both have pretty straightforward
 ways of producing assembly jars.


 On Sat, May 31, 2014 at 11:23 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Thanks for the fast reply.

 I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
 standalone mode.


 On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

 First issue: your cluster was configured incorrectly. You could probably
 read one file because that was done on the driver node, but when it tried
 to run a job on the cluster, it failed.

 Second issue: it seems that the jar containing Avro is not getting
 propagated to the Executors. What version of Spark are you running,
 and what deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Now I get this:

 scala rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 console:41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final
 stage: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the
 requested partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 console:41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 console:41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 console:41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
 (first at console:41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final
 stage: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
 tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set
 5.0 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0
 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task
 5.0:0 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3
 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task
 5.0:3 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1
 as TID 94 on executor 4: hivecluster4 

Re: Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Aaron Davidson
In addition to setting the Standalone memory, you'll also need to tell your
SparkContext to claim the extra resources. Set spark.executor.memory to
1600m as well. This should be a system property set in SPARK_JAVA_OPTS in
conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g.,
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
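
Putting the two settings together, conf/spark-env.sh on each worker node
would look something like this (a sketch for Spark 0.9.x standalone; the
1600m figure follows the thread):

```shell
# conf/spark-env.sh (Spark 0.9.x standalone)
# Memory the Worker daemon is allowed to hand out to executors on this node:
SPARK_WORKER_MEMORY=1600m
# Memory each executor actually requests (must be <= SPARK_WORKER_MEMORY):
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
```

Restart the workers after editing the file so the new settings take effect.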


On Sun, Jun 1, 2014 at 7:36 PM, Yunmeng Ban banyunm...@gmail.com wrote:

 Hi,

 I'm running the example of JavaKafkaWordCount in a standalone cluster. I
 want to set 1600MB memory for each slave node. I wrote in the
 spark/conf/spark-env.sh

 SPARK_WORKER_MEMORY=1600m

 But the logs on slave nodes looks this:
 Spark Executor Command: /usr/java/latest/bin/java -cp
 :/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
 -Xms512M -Xmx512M
 org.apache.spark.executor.CoarseGrainedExecutorBackend

 The memory seems to be the default number, not 1600M.
 I don't know how to make SPARK_WORKER_MEMORY work.
 Can anyone help me?
 Many thanks in advance.

 Yunmeng



Re: apache whirr for spark

2014-06-01 Thread chirag lakhani
Thanks for letting me know. I am leaning towards using Whirr to set up a
YARN cluster with Hive, Pig, HBase, etc., and then adding Spark on YARN.
 Is it pretty straightforward to install Spark on a YARN cluster?


On Fri, May 30, 2014 at 5:51 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I don’t think Whirr provides support for this, but Spark’s own EC2 scripts
 also launch a Hadoop cluster:
 http://spark.apache.org/docs/latest/ec2-scripts.html.

 Matei

 On May 30, 2014, at 12:59 PM, chirag lakhani chirag.lakh...@gmail.com
 wrote:

  Does anyone know if it is possible to use Whirr to set up a Spark cluster
 on AWS?  I would like to be able to use Whirr to set up a cluster that has
 all of the standard Hadoop and Spark tools.  I want to automate this
 process because I anticipate I will have to create and destroy often enough
 that I would like to have it all automated.  Could anyone provide any
 pointers into how this could be done or whether it is documented somewhere?
 
  Chirag Lakhani




Re: Spark on EC2

2014-06-01 Thread Nicholas Chammas
No, you don't have to set up your own AMI. Actually, it's probably simpler
and less error-prone to let spark-ec2 manage that for you as you first
start to get comfortable with Spark. Just spin up a cluster without any
explicit mention of an AMI and it will do the right thing.
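
For example, a first launch with the stock scripts looks something like
this (key names, paths, and cluster name are placeholders):

```shell
# Run from the ec2/ directory of a Spark distribution; -s sets the slave count.
# With no AMI flag, the script picks its default Spark AMI automatically.
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem -s 2 launch my-spark-cluster
```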

On Sunday, June 1, 2014, superback andrew.matrix.c...@gmail.com wrote:

 I haven't set up an AMI yet. I am just trying to run a simple job on the EC2
 cluster. So, is setting up an AMI a prerequisite for running a simple Spark
 example like org.apache.spark.examples.GroupByTest?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638p6681.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-06-01 Thread Wei Da
Hi guys,
I'm using IntelliJ IDEA 13.1.2 Community Edition, and I have installed the
Scala plugin and Maven 3.2.1. I want to develop Spark applications in
IntelliJ IDEA through Maven.

In IntelliJ, I created a Maven project with the archetype ID
spark-core_2.10, but got the following messages when running the Maven
goal:

=

[WARNING] Archetype not found in any catalog. Falling back to central
repository (http://repo1.maven.org/maven2).
[WARNING] Use -DarchetypeRepository=your repository if archetype's
repository is elsewhere.
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 20.064 s
[INFO] Finished at: 2014-06-02T11:50:14+08:00
[INFO] Final Memory: 9M/65M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-archetype-plugin:2.2:generate (default-cli)
on project standalone-pom: The defined artifact is not an archetype -
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] Maven execution terminated abnormally (exit code 1)

=

I have spent several days on this, but have not had any success.
The instructions on the Spark website (
http://spark.apache.org/docs/latest/building-with-maven.html) may be too
brief for newbies like me. Are there any more detailed instructions on how
to build a Spark app with IntelliJ IDEA? Thanks a lot!
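
For context, the error message means that spark-core_2.10 is a library
artifact, not a Maven archetype. One way forward (a sketch; versions are
assumptions) is to generate an ordinary Scala project and then declare
Spark as a dependency in the pom:

```xml
<!-- pom.xml fragment: spark-core is declared as a dependency, not an archetype -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.1</version>
</dependency>
```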