Re: Using Spark on Data size larger than Memory size
There is no fundamental issue if you're running on data that is larger than cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many instances of this left (sorting comes to mind, and this is being worked on).

In general, one problem with Spark today is that you *can* OOM under certain configurations, and you may need to change the default configuration if you're running very memory-intensive jobs. However, there are very few cases where Spark would simply fail as a matter of course -- for instance, you can always increase the number of partitions to decrease the size of any given one, or repartition data to eliminate skew.

Regarding impact on performance, as Mayur said, there may absolutely be an impact depending on your jobs. If you're doing a join on a very large amount of data with few partitions, then we'll have to spill to disk. If you can't cache your working set of data in memory, you will also see a performance degradation. Spark enables the use of memory to make things fast, but if you just don't have enough memory, it won't be terribly fast.

On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com wrote:

Some inputs will be really helpful.

Thanks,
-Vibhor

On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote:

Hi all, I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table. I want to know: in the case when the size of the HBase table grows larger than the size of RAM available in the cluster, will the application fail, or will there be an impact on performance?

Any thoughts in this direction will be helpful and are welcome.

Thanks,
-Vibhor

--
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore
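As a concrete illustration of the advice in this thread (increase the partition count so that no single partition has to fit in memory), here is a minimal spark-shell sketch. It assumes an RDD named `hbaseRdd` has already been created from the HBase table (a hypothetical name; e.g. via sc.newAPIHadoopRDD against TableInputFormat, as the poster describes):

```scala
// Sketch only: `hbaseRdd` is a placeholder for an RDD built
// from the HBase table, e.g. with sc.newAPIHadoopRDD.

// Check how many partitions the RDD currently has.
val numPartitions = hbaseRdd.partitions.size

// If individual partitions are too large, raise the count.
// repartition() performs a shuffle, so the resulting partitions
// are smaller and more evenly sized (this also helps with skew).
val finer = hbaseRdd.repartition(numPartitions * 4)

// Streaming operations like map/filter/reduce then only need one
// (smaller) partition in memory at a time, not the whole dataset.
val rowCount = finer.map(_ => 1L).reduce(_ + _)
```

The multiplier 4 is arbitrary; the point is only that partition size, not total data size, is what must fit in memory for the operations discussed above.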
Re: hadoopRDD stalls reading entire directory
Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode.

On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

First issue was because your cluster was configured incorrectly. You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed.

Second issue: it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)?

On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com wrote:

Now I get this:

scala> rdd.first
14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
14/05/31
Re: Spark on EC2
Hmm.. you've gotten further than me. Which AMIs are you using?

On Sun, Jun 1, 2014 at 2:21 PM, superback andrew.matrix.c...@gmail.com wrote:

Hi, I am trying to run an example on Amazon EC2 and have successfully set up one cluster with two nodes on EC2. However, when I was testing an example using the following command,

./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077

I got the following warnings and errors. Can anyone help me solve this problem? Thanks very much!

46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.deploy.client.AppClient$ClientActor - All masters are unresponsive! Giving up.
61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark cluster looks dead, giving up.
61546 [spark-akka.actor.default-dispatcher-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool
61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run count at GroupByTest.scala:50
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type?

On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser:

Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit): Provisions servers, connects, installs, but the webserver on the master will not start.
Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419: Spot instance requests are not supported for this AMI.
SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f: Not tested - costs 10x more for spot instances, not economically viable.
Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3: Provisions servers, but git is not pre-installed, so the cluster setup fails.
Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f: Provisions servers, but git is not pre-installed, so the cluster setup fails.

Have I missed something? What AMIs are people using? I've just gone back through the archives, and I'm seeing a lot of "I can't get EC2 to work" and not a single "My EC2 has post-install issues". The quickstart page says you "...can have a spark cluster up and running in five minutes." But it's been three days for me so far. I'm about to bite the bullet and start building my own AMIs from scratch... if anyone can save me from that, I'd be most grateful.

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
Re: Yay for 1.0.0! EC2 Still has problems.
If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of the AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type.

On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type?

On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

It's been another day of spinning up dead clusters...
SparkSQL Table schema in Java
Hello,

Congrats on the 1.0.0 release. I would like to ask why table creation requires a proper class in Scala and Java, while in Python you can just use a map. I think that requiring a class for the definition of a table is a bit too restrictive. Using a plain map, on the other hand, could be very handy for creating tables dynamically. Are there any alternative APIs for Spark SQL that can work with plain Java maps, like in Python?

Regards
Re: Yay for 1.0.0! EC2 Still has problems.
*sigh* OK, I figured it out. (Thank you Nick, for the hint.) m1.large works (I swear I tested that earlier and had similar issues...). It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places, which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price. :-)

I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help!

On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of the AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type.

On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type?

On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:

It's been another day of spinning up dead clusters...

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
Using sbt-pack with Spark 1.0.0
Hi all! We've been using the sbt-pack sbt plugin (https://github.com/xerial/sbt-pack) to build our standalone Spark application for a while now. Until version 1.0.0, that worked nicely.

For those who don't know the sbt-pack plugin: it basically copies all the dependency JARs from your local ivy/maven cache to your target folder (in target/pack/lib), and creates launch scripts (in target/pack/bin) for your application (notably setting all these jars on the classpath).

Now, since Spark 1.0.0 was released, we are encountering a weird error where running our project with sbt run is fine, but running our app with the launch scripts generated by sbt-pack fails. After a (quite painful) investigation, it turns out some JARs are NOT copied from the local ivy2 cache to the lib folder. I noticed that all the missing jars contain "shaded" in their file name (but not all jars with such a name are missing). One of the missing JARs is explicitly listed in the Spark build definition (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``. This file is clearly present in my local ivy cache, but is not copied by sbt-pack.

Is there an evident reason for that? I don't know much about the shading mechanism; maybe I'm missing something here? Any help would be appreciated!

Cheers

Pierre

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Akka disassociation on Java SE Embedded
Hi all, This is what I found:

1. As Aaron suggested, an executor will be killed silently when the OS is running out of memory. I've seen this enough times to conclude it's real. Adding swap and increasing the JVM heap solved the problem, but you will then encounter OS paging and full GCs.

2. OS paging and full GCs are not likely to affect my benchmark much while processing data from HDFS, but an Akka process gets randomly killed during the network-bound stages (for example, sorting). I've found that an Akka process cannot fetch results fast enough. Increasing the block manager timeout helped a lot; I've doubled the value several times, as the network of our ARM cluster is quite slow.

3. We'd like to collect the time spent in every stage of our benchmark, so we always re-run when some tasks fail. Failures happened a lot, but that's understandable, as Spark is designed on top of Akka's let-it-crash philosophy. To make the benchmark run cleanly (without a task failure), I called .cache() before invoking the transformation of the next stage, and that helped a lot.

Combining the above with other tuning, we can now boost the performance of our ARM cluster to 2.8 times faster than in our first report.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

Maybe that's explaining mine too. Thank you very much, Aaron!!

Best regards,
-chanwit

On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote:

Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing disassociations. The only thing that should cause these messages nowadays is if the TCP connection (which Akka sustains between Actor Systems on different machines) actually drops. TCP connections are pretty resilient, so one common cause of this is actual Executor failure -- recently, I have experienced a similar-sounding problem due to my machine's OOM killer terminating my Executors, such that they didn't produce any error output.

On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

Hi all, On an ARM cluster, I have been testing a wordcount program with JRE 7 and everything is OK. But when changing to the embedded version of Java SE (Oracle's eJRE), the same program cannot complete all computing stages; it fails with many Akka disassociations.

- I've been trying to increase Akka's timeout but am still stuck. What is the right way to do so? (I suspect that GC pausing the world is causing this.)

- Another question: how could I properly turn on Akka's logging to see the root cause of this disassociation problem? (In case my guess about GC is wrong.)

Best regards,
-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit
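For reference, the block manager timeout that helped here can be raised when constructing the SparkContext. A sketch, assuming the 0.9/1.0-era property name `spark.storage.blockManagerSlaveTimeoutMs` (double-check the name and default against your Spark version's configuration docs):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: give slow networks (like the ARM cluster described above)
// much longer before the block manager marks a slave as dead.
// The property name and value here are assumptions to verify.
val conf = new SparkConf()
  .setAppName("arm-benchmark")
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // 5 minutes

val sc = new SparkContext(conf)
```

Doubling the value repeatedly, as described above, is a reasonable way to find the smallest timeout that stops the spurious disassociations.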
Re: Akka disassociation on Java SE Embedded
Thanks for the update! I've also run into the block manager timeout issue; it might be a good idea to increase the default significantly (it would probably time out immediately if the TCP connection itself dropped anyway).

On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi chan...@gmail.com wrote:

Hi all, This is what I found: ...
Re: hadoopRDD stalls reading entire directory
Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all necessary jars passed in (as the jars parameter), or inside a SparkConf with setJars(). Avro is a necessary jar, but it's possible your application also needs to distribute other ones to the cluster.

An easy way to make sure all your dependencies get shipped to the cluster is to create an assembly jar of your application; then you just need to tell Spark about that one jar, which includes all of your application's transitive dependencies. Maven and sbt both have pretty straightforward ways of producing assembly jars.

On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com wrote:

Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode.

On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote:

First issue was because your cluster was configured incorrectly. You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed. Second issue: it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)?
On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com wrote:

Now I get this:

scala> rdd.first
14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41 ...
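Aaron's suggestion of passing the application's jars when constructing the SparkContext looks roughly like this; the jar paths below are placeholders, not the poster's actual files:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: list the jars that must be shipped to the Executors
// (the avro jar, plus anything else the job needs). With an
// assembly jar, a single entry covers all transitive dependencies.
val conf = new SparkConf()
  .setMaster("spark://hivecluster2:7077") // master from the logs above
  .setAppName("avro-reader")
  .setJars(Seq(
    "/path/to/avro-1.7.x.jar",          // placeholder path
    "/path/to/my-app-assembly.jar"      // placeholder path
  ))

val sc = new SparkContext(conf)
```

Spark then serves these jars to each Executor when tasks are launched, which should resolve the "avro classes not found on the cluster" symptom described above.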
Re: sc.textFileGroupByPath(*/*.txt)
I presume that you need access to the path of each file you are reading. I don't know whether there is a good way to do that for HDFS; I list the files myself, with something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def openWithPath(inputPath: String, sc: SparkContext) = {
  val path = new Path(inputPath)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)
  val paths = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce { _ ++ _ }
}

... I would be interested if there is a better way to do the same thing ...

Cheers,

a:

On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Could you provide an example of what you mean? I know it's possible to create an RDD from a path with wildcards, like in the subject. For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also provide a comma-delimited list of paths.

Nick

On Sunday, June 1, 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

Hi All, Is it possible to create an RDD from a directory tree of the following form?

RDD[(PATH, Seq[TEXT])]

Thank you,
Oleg
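If you are on Spark 1.0, `sc.wholeTextFiles` may be the simpler route the poster is asking about: it reads a directory of text files directly as (path, content) pairs, with the caveat that each file's entire content becomes a single record and so must fit in memory. A sketch (the HDFS path is a placeholder):

```scala
// Spark 1.0+: read a directory tree as an RDD of (path, fileContents).
val withPaths = sc.wholeTextFiles("hdfs:///data/tree/*/*.txt")

// Shape it into the RDD[(PATH, Seq[TEXT])] form from the question,
// splitting each file's contents into its lines.
val byPath = withPaths.map { case (path, contents) =>
  (path, contents.split("\n").toSeq)
}
```

For many small files this avoids the manual FileSystem listing above entirely; for large files, the one-record-per-file behavior makes the newAPIHadoopFile approach the safer choice.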
Re: Yay for 1.0.0! EC2 Still has problems.
Hey, just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python.

spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin new AMIs. Users can't set it to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc.

On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote:

*sigh* OK, I figured it out. (Thank you Nick, for the hint.) m1.large works (I swear I tested that earlier and had similar issues...). It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places, which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price. :-)

I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help!

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
Re: Using sbt-pack with Spark 1.0.0
One potential issue here is that mesos is now using classifiers to publish their jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that uses classifiers, so that's why I mention it. On Sun, Jun 1, 2014 at 2:34 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi all! We've been using the sbt-pack sbt plugin (https://github.com/xerial/sbt-pack) for building our standalone Spark application for a while now. Until version 1.0.0, that worked nicely. For those who don't know the sbt-pack plugin, it basically copies all the dependency JARs from your local ivy/maven cache to your target folder (in target/pack/lib), and creates launch scripts (in target/pack/bin) for your application (notably setting all these jars on the classpath). Now, since Spark 1.0.0 was released, we are encountering a weird error where running our project with sbt run is fine, but running our app with the launch scripts generated by sbt-pack fails. After a (quite painful) investigation, it turns out some JARs are NOT copied from the local ivy2 cache to the lib folder. I noticed that all the missing jars contain shaded in their file name (but not all jars with such a name are missing). One of the missing JARs is explicitly from the Spark build definition (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``. This file is clearly present in my local ivy cache, but is not copied by sbt-pack. Is there an evident reason for that? I don't know much about the shading mechanism, maybe I'm missing something here? Any help would be appreciated! Cheers Pierre -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
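For context, a classifier-published dependency is declared in sbt along these lines (a sketch of the pattern; the coordinates match the mesos jar named in the message above, but check SparkBuild.scala for the exact form):

```scala
// A dependency resolved with a Maven classifier: the artifact fetched is
// mesos-0.18.1-shaded-protobuf.jar rather than the default mesos-0.18.1.jar.
// Tools that only copy default artifacts can miss it — the symptom above.
libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"
```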
Re: Using sbt-pack with Spark 1.0.0
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote: One potential issue here is that mesos is now using classifiers to publish their jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that uses classifiers, so that's why I mention it.
Re: Using sbt-pack with Spark 1.0.0
You're right Patrick! Just had a chat with the sbt-pack creator and indeed dependencies with classifiers are ignored to avoid problems with a dirty cache... Should be fixed in the next version of the plugin. Cheers Pierre Message sent from a mobile device - excuse typos and abbreviations On 1 June 2014 at 20:04, Patrick Wendell pwend...@gmail.com wrote: https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
Re: spark 1.0.0 on yarn
Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have:

...
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
...
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
</property>
...

And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but not sure it's the right place. Thanks.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple of ways, but couldn't get it to work.
The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to resource manager to setup spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
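As a baseline, the documented way to point the YARN client at the HA config is via the environment (the thread shows this alone didn't resolve Simon's case, so treat it as the starting point to verify, not a confirmed fix; paths are the ones from this thread):

```shell
# Make the Hadoop config containing the yarn.resourcemanager.address.*
# entries visible to Spark's YARN client; without it the client falls
# back to the default 0.0.0.0:8032 seen in the retry logs above.
export HADOOP_CONF_DIR=/opt/hadoop/conf
export YARN_CONF_DIR=/opt/hadoop/conf
# then: ./bin/spark-shell --master yarn-client
```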
Re: Yay for 1.0.0! EC2 Still has problems.
Ah yes, looking back at the first email in the thread, indeed that was the case. For the record, I too launch clusters from my laptop, where I have Python 2.7 installed. On Sun, Jun 1, 2014 at 2:01 PM, Patrick Wendell pwend...@gmail.com wrote: Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of python. Spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin up new AMIs. Users can't set that to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc.
Re: Yay for 1.0.0! EC2 Still has problems.
More specifically with the -a flag, you *can* set your own AMI, but you’ll need to base it off ours. This is because spark-ec2 assumes that some packages (e.g. java, Python 2.6) are already available on the AMI. Matei On Jun 1, 2014, at 11:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of python. Spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin up new AMIs. Users can't set that to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc.
Re: sc.textFileGroupByPath(*/*.txt)
Anwar, Will try this as it might do exactly what I need. I will follow your pattern but use sc.textFile() for each file. I am now thinking that I could start with an RDD of file paths and map it into (path, content) pairs, provided I could read a file on the server. Thank you, Oleg On 1 June 2014 18:41, Anwar Rizal anriza...@gmail.com wrote: I presume that you need access to the path of each file you are reading. I don't know whether there is a good way to do that for HDFS; I need to read the files myself, something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def openWithPath(inputPath: String, sc: SparkContext) = {
  val path = new Path(inputPath)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)
  val paths = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString).map {
      case (_, s) => (p, s.toString)
    }
  }
  withPaths.reduce { _ ++ _ }
}

... I would be interested if there is a better way to do the same thing ... Cheers, a: On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Could you provide an example of what you mean? I know it's possible to create an RDD from a path with wildcards, like in the subject. For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also provide a comma-delimited list of paths. Nick On Sunday, June 1, 2014, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Hi All, Is it possible to create an RDD from a directory tree of the following form? RDD[(PATH, Seq[TEXT])] Thank you, Oleg -- Kind regards, Oleg
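A possibly simpler route: Spark 1.0 added SparkContext.wholeTextFiles, which yields (path, content) pairs directly; a minimal sketch (the HDFS path is a placeholder, not one from this thread):

```scala
// wholeTextFiles (new in Spark 1.0) reads each file as a single record,
// keyed by its full path — no manual FileSystem listing required.
// Best suited to many small files, since each file becomes one record.
val withPaths: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("hdfs://namenode/data/*.txt")
```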
[Spark Streaming] Distribute custom receivers evenly across executors
Dear All, I'm running Spark Streaming (1.0.0) with Yarn (2.2.0) on a 10-node cluster. I set up 10 custom receivers to listen to 10 data streams. I want one receiver per node in order to maximize the network bandwidth. However, if I set --executor-cores 4, the 10 receivers only run on 3 of the nodes in the cluster, running 4, 4, and 2 receivers respectively; if I set --executor-cores 1, each node will run exactly one receiver, and it seems that Spark can't make any progress processing these streams. I read the documentation on configuration and also googled but didn't find a clue. Is there a way to configure how the receivers are distributed? Thanks! Here are some details: How I created 10 receivers:

val conf = new SparkConf().setAppName(jobId)
val sc = new StreamingContext(conf, Seconds(1))
var lines: DStream[String] = sc.receiverStream(new CustomReceiver(...))
for (i <- 1 to 9) {
  lines = lines.union(sc.receiverStream(new CustomReceiver(...)))
}

How I submit a job to Yarn:

spark-submit \
  --class $JOB_CLASS \
  --master yarn-client \
  --num-executors 10 \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 4 \
  $JAR_NAME
Re: hadoopRDD stalls reading entire directory
You can avoid that by using the constructor that takes a SparkConf, a la

val conf = new SparkConf()
conf.setJars(Seq("avro.jar", ...))
val sc = new SparkContext(conf)

On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com wrote: Followup question: the docs for making a new SparkContext require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be? On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com wrote: Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all necessary jars passed in (as the jars parameter), or inside a SparkConf with setJars(). Avro is a necessary jar, but it's possible your application also needs to distribute other ones to the cluster. An easy way to make sure all your dependencies get shipped to the cluster is to create an assembly jar of your application; then you just need to tell Spark about that jar, which includes all your application's transitive dependencies. Maven and sbt both have pretty straightforward ways of producing assembly jars. On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode. On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote: First issue was because your cluster was configured incorrectly. You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed. Second issue, it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)?
On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com wrote: Now I get this: scala rdd.first 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at console:41 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at console:41) with 1 output partitions (allowLocal=true) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at console:41) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at console:41, took 0.037371256 s 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at console:41 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at console:41) with 16 output partitions (allowLocal=true) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at console:41) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37) 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan 
(NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL) 14/05/31
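A sketch of the sbt route Aaron mentions, using the sbt-assembly plugin (the plugin version here is an assumption and may need adjusting for your sbt):

```scala
// project/assembly.sbt — add the sbt-assembly plugin; `sbt assembly` then
// builds one jar containing the app and its transitive dependencies,
// which can be handed to Spark via SparkConf.setJars(...).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```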
Re: Trouble with EC2
Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia matei.zaha...@gmail.com wrote: What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote: Hey Folks, I'm really having quite a bit of trouble getting spark running on ec2. I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2 because I'd like to know how everything works. But I'm going a little crazy. I think that something about the networking configuration must be messed up, but I'm at a loss. Shortly after starting the cluster, I get a lot of this: 14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ] 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ] 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it. 
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [ akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
Re: Trouble with EC2
So to run spark-ec2, you should use the default AMI that it launches with if you don’t pass -a. Those are based on Amazon Linux, not Debian. Passing your own AMI is an advanced option but people need to install some stuff on their AMI in advance for it to work with our scripts. Matei On Jun 1, 2014, at 3:11 PM, PJ$ p...@chickenandwaffl.es wrote: Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$
Re: Trouble with EC2
Ha yes... I just went through this. (a) You have to use the 'default' spark AMI (ami-7a320f3f at the moment) and not any of the other linux distros. They don't work. (b) Start with m1.large instances. I tried going for r3.large at first, and had no end of self-caused trouble. m1.large works. (c) It's possible for the script to choose the wrong AMI, especially if one has been messing with it to allow other instance types. (ahem) But it will work in the end... just start simple. (yeah, I know m1.large doesn't look that large anymore. :-) On Mon, Jun 2, 2014 at 8:11 AM, PJ$ p...@chickenandwaffl.es wrote: Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: spark 1.0.0 on yarn
As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... <property> <name>yarn.resourcemanager.address.rm1</name> <value>controller-1.mycomp.com:23140</value> </property> ... <property> <name>yarn.resourcemanager.address.rm2</name> <value>controller-2.mycomp.com:23140</value> </property> ... And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but not sure it's the right place. Thanks.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the YARN library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple of ways, but couldn't get it to work..
The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to resource manager to setup spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
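For reference, the flattened property fragments quoted in this thread correspond to yarn-site.xml entries roughly like the following. This is only a sketch of the poster's HA setup: the rm1/rm2 addresses are copied from the thread, while the yarn.resourcemanager.ha.* entries are assumptions about what a typical CDH5 RM-HA configuration of that era would also contain.

```xml
<!-- Sketch of /opt/hadoop/conf/yarn-site.xml; ha.* properties are assumed, not quoted -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
</property>
```

The "Connecting to ResourceManager at /0.0.0.0:8032" message is consistent with none of these properties being picked up, since 8032 is YARN's default yarn.resourcemanager.address port.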
Re: Yay for 1.0.0! EC2 Still has problems.
Sort of... there were two separate issues, but both related to AWS. I've sorted the confusion about the Master/Worker AMI ... use the version chosen by the scripts. (and use the right instance type so the script can choose wisely) But yes, one also needs a launch machine to kick off the cluster, and for that I _also_ was using an Amazon instance... (made sense... I have a team that will need to do things as well, not just me) and I was just pointing out that if you use the AMI most recommended by Amazon (for your free micro instance, for example) you get Python 2.6 and the ec2 scripts fail. That merely needs a line in the documentation saying use Ubuntu for your cluster controller, not Amazon Linux, or some such. But yeah, for a newbie, it was hard working out when to use default or custom AMIs for various parts of the setup. On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell pwend...@gmail.com wrote: Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python. spark-ec2 itself has a flag -a that allows you to give a specific AMI. This flag is just an internal tool that we use for testing when we spin new AMIs. Users can't set that to an arbitrary AMI because we tightly control things like the Java and OS versions, libraries, etc. On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: *sigh* OK, I figured it out. (Thank you Nick, for the hint) m1.large works. (I swear I tested that earlier and had similar issues...) It was my obsession with starting r3.*large instances. Clearly I hadn't patched the script in all the places, which I think caused it to default to the Amazon AMI. I'll have to take a closer look at the code and see if I can't fix it correctly, because I really, really do want nodes with 2x the CPU and 4x the memory for the same low spot price.
:-) I've got a cluster up now, at least. Time for the fun stuff... Thanks everyone for the help! On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. On Sunday, June 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote: Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote: It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser: Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit): Provisions servers, connects, installs, but the webserver on the master will not start. Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419: Spot instance requests are not supported for this AMI. SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f: Not tested - costs 10x more for spot instances, not economically viable. Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3: Provisions servers, but git is not pre-installed, so the cluster setup fails. Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f: Provisions servers, but git is not pre-installed, so the cluster setup fails. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: spark 1.0.0 on yarn
That helped a bit... Now I have a different failure: the startup process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTED I am using the hadoop 2 prebuilt package. Probably it doesn't have the latest YARN client. -Simon
Re: hadoopRDD stalls reading entire directory
Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5 This time I get a port already in use exception on 4040, but it isn't fatal. Then when I run rdd.first, I get this over and over: 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson ilike...@gmail.com wrote: You can avoid that by using the constructor that takes a SparkConf, a la val conf = new SparkConf(); conf.setJars(Seq("avro.jar", ...)); val sc = new SparkContext(conf) On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney russell.jur...@gmail.com wrote: Followup question: the docs to make a new SparkContext require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be? On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com wrote: Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all necessary jars passed in (as the jars parameter), or inside a SparkConf with setJars(). Avro is a necessary jar, but it's possible your application also needs to distribute other ones to the cluster. An easy way to make sure all your dependencies get shipped to the cluster is to create an assembly jar of your application, and then you just need to tell Spark about that jar, which includes all your application's transitive dependencies. Maven and sbt both have pretty straightforward ways of producing assembly jars. On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode. On Saturday, May 31, 2014, Aaron Davidson ilike...@gmail.com wrote: First issue was because your cluster was configured incorrectly.
You could probably read 1 file because that was done on the driver node, but when it tried to run a job on the cluster, it failed. Second issue, it seems that the jar containing avro is not getting propagated to the Executors. What version of Spark are you running on? What deployment mode (YARN, standalone, Mesos)? On Sat, May 31, 2014 at 9:37 PM, Russell Jurney russell.jur...@gmail.com wrote: Now I get this: scala rdd.first 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at console:41 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at console:41) with 1 output partitions (allowLocal=true) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at console:41) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at console:41, took 0.037371256 s 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at console:41 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at console:41) with 16 output partitions (allowLocal=true) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at console:41) 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List() 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at console:37), which has no missing parents 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at console:37) 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks 14/05/31 21:36:28 INFO 
scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL) 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3:
Re: Spark on EC2
I haven't set up an AMI yet. I am just trying to run a simple job on the EC2 cluster. So, is setting up an AMI a prerequisite for running a simple Spark example like org.apache.spark.examples.GroupByTest? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638p6681.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Yay for 1.0.0! EC2 Still has problems.
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei
Please put me into the mail list, thanks.
Can anyone help me set memory for standalone cluster?
Hi, I'm running the example of JavaKafkaWordCount in a standalone cluster. I want to set 1600MB of memory for each slave node. I wrote in the spark/conf/spark-env.sh SPARK_WORKER_MEMORY=1600m But the logs on the slave nodes look like this: Spark Executor Command: /usr/java/latest/bin/java -cp :/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend The memory seems to be the default number, not 1600M. I don't know how to make SPARK_WORKER_MEMORY work. Can anyone help me? Many thanks in advance. Yunmeng
Re: hadoopRDD stalls reading entire directory
Sounds like you have two shells running, and the first one is taking all your resources. Do a jps and kill the other guy, then try again. By the way, you can look at http://localhost:8080 (replace localhost with the server your Spark Master is running on) to see what applications are currently started, and what resource allocations they have.
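Aaron's suggestion in this thread can be sketched in Scala as follows. This is only a sketch assuming a Spark 0.9.x standalone deployment: the application name, master URL, and jar paths below are placeholders, not values from the thread, and the code needs a running cluster to actually do anything.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ship the Avro jar (and ideally one assembly jar with all app
// dependencies) to the executors via SparkConf.setJars, which takes a Seq
// of jar paths. All names below are hypothetical placeholders.
val conf = new SparkConf()
  .setAppName("AvroReader")                      // hypothetical app name
  .setMaster("spark://master-host:7077")         // hypothetical standalone master URL
  .setJars(Seq(
    "/path/to/avro.jar",                         // the jar the executors were missing
    "/path/to/my-app-assembly.jar"))             // or a single assembly jar instead
val sc = new SparkContext(conf)
```

Passing jars this way avoids needing to know $SPARK_HOME on the driver, which was the follow-up question in the thread.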
Re: Can anyone help me set memory for standalone cluster?
In addition to setting the Standalone memory, you'll also need to tell your SparkContext to claim the extra resources. Set spark.executor.memory to 1600m as well. This should be a system property set in SPARK_JAVA_OPTS in conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g., export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
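Putting the two settings from this thread together, conf/spark-env.sh on a Spark 0.9.x standalone cluster would look roughly like this (a sketch; the 1600m figure comes from the question above):

```shell
# conf/spark-env.sh (Spark 0.9.x standalone) -- sketch based on this thread
SPARK_WORKER_MEMORY=1600m                                # total memory a worker may hand out
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"   # memory each executor JVM requests
```

SPARK_WORKER_MEMORY only caps what the worker can offer; without spark.executor.memory, executors still launch with the 512 MB default seen in the log.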
Re: apache whirr for spark
Thanks for letting me know. I am leaning towards using Whirr to set up a YARN cluster with Hive, Pig, HBase, etc., and then adding Spark on YARN. Is it pretty straightforward to install Spark on a YARN cluster? On Fri, May 30, 2014 at 5:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don’t think Whirr provides support for this, but Spark’s own EC2 scripts also launch a Hadoop cluster: http://spark.apache.org/docs/latest/ec2-scripts.html. Matei On May 30, 2014, at 12:59 PM, chirag lakhani chirag.lakh...@gmail.com wrote: Does anyone know if it is possible to use Whirr to set up a Spark cluster on AWS? I would like to be able to use Whirr to set up a cluster that has all of the standard Hadoop and Spark tools. I want to automate this process because I anticipate I will have to create and destroy clusters often enough that I would like to have it all automated. Could anyone provide any pointers on how this could be done, or whether it is documented somewhere? Chirag Lakhani
Re: Spark on EC2
No, you don't have to set up your own AMI. Actually, it's probably simpler and less error-prone if you let spark-ec2 manage that for you as you first start to get comfortable with Spark. Just spin up a cluster without any explicit mention of an AMI and it will do the right thing.
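Letting spark-ec2 pick the AMI, as suggested above, looks roughly like this. The -k, -i, -s, and -t flags are the script's key-pair, identity-file, slave-count, and instance-type options; the key name, file path, and cluster name here are placeholders, and running it requires AWS credentials.

```
# Sketch of a spark-ec2 launch from that era; names are placeholders
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 -t m1.large launch my-spark-cluster
```

Note there is no -a/--ami flag here: the script chooses an appropriate AMI from the instance type, which is exactly what the EC2 threads above recommend.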
Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?
Hi guys, I'm using IntelliJ IDEA 13.1.2 Community Edition, and I have installed the Scala plugin and Maven 3.2.1. I want to develop Spark applications with IntelliJ IDEA through Maven. In IntelliJ, I created a Maven project with the archetype ID spark-core_2.10, but got the following messages in the Messages window: = [WARNING] Archetype not found in any catalog. Falling back to central repository (http://repo1.maven.org/maven2). [WARNING] Use -DarchetypeRepository=your repository if archetype's repository is elsewhere. [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 20.064 s [INFO] Finished at: 2014-06-02T11:50:14+08:00 [INFO] Final Memory: 9M/65M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-archetype-plugin:2.2:generate (default-cli) on project standalone-pom: The defined artifact is not an archetype - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] Maven execution terminated abnormally (exit code 1) = I have spent several days on this, but did not get any success. The instructions on the Spark website (http://spark.apache.org/docs/latest/building-with-maven.html) may be too brief for newbies like me. Are there any more detailed instructions on how to build a Spark app with IntelliJ IDEA? Thanks a lot!
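The "The defined artifact is not an archetype" error above is because spark-core_2.10 is an ordinary library artifact on Maven Central, not a Maven archetype, so it cannot be used to generate a project. One way forward (a sketch, not from the thread itself) is to create a plain Maven project and add Spark as a dependency in pom.xml, with the version matching the 1.0.0 release discussed in these threads:

```xml
<!-- pom.xml fragment: depend on Spark instead of using it as an archetype -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
```

After re-importing the Maven project, IntelliJ will resolve the Spark classes and the Scala plugin can compile applications against them.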