Exception failure: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver
Hi, the Spark version I am using is spark-0.9.1-bin-hadoop2, and I built spark-assembly_2.10-0.9.1-hadoop2.2.0.jar. I moved JavaKafkaWordCount.java from the examples into a new directory to play with it.

My compile commands:

javac -cp libs/spark-streaming_2.10-0.9.1.jar:libs/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar:libs/spark-streaming-kafka_2.10-0.9.1.jar:libs/kafka_2.9.1-0.8.1.1.jar:libs/zkclient-0.4.jar JavaKafkaWordCount.java
jar -cvf JavaKafkaWordCount.jar JavaKafkaWordCount*

And the run command:

java -cp libs/spark-streaming_2.10-0.9.1.jar:libs/zkclient-0.4.jar:libs/kafka_2.9.1-0.8.1.1.jar:libs/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar:libs/spark-streaming-kafka_2.10-0.9.1.jar:./JavaKafkaWordCount.jar JavaKafkaWordCount spark://dlvm1:7077 vm37.dbweb.ee demogroup kafkademo1 1

The job is visible in the UI, but I am getting:

log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/05/30 11:53:42 INFO SparkEnv: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/05/30 11:53:42 INFO SparkEnv: Registering BlockManagerMaster
14/05/30 11:53:42 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140530115342-0b66
14/05/30 11:53:42 INFO MemoryStore: MemoryStore started with capacity 386.3 MB.
14/05/30 11:53:42 INFO ConnectionManager: Bound socket to port 49250 with id = ConnectionManagerId(dlvm1,49250)
14/05/30 11:53:42 INFO BlockManagerMaster: Trying to register BlockManager
14/05/30 11:53:42 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager dlvm1:49250 with 386.3 MB RAM
14/05/30 11:53:42 INFO BlockManagerMaster: Registered BlockManager
14/05/30 11:53:42 INFO HttpServer: Starting HTTP Server
14/05/30 11:53:42 INFO HttpBroadcast: Broadcast server started at http://90.190.106.47:42861
14/05/30 11:53:42 INFO SparkEnv: Registering MapOutputTracker
14/05/30 11:53:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-76fd126e-7fcd-4df1-a967-bf4b8d356973
14/05/30 11:53:42 INFO HttpServer: Starting HTTP Server
14/05/30 11:53:43 INFO SparkUI: Started Spark Web UI at http://dlvm1:4040
14/05/30 11:53:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/30 11:53:44 INFO SparkContext: Added JAR /opt/spark-0.9.1-bin-hadoop2/margusja_kafka/JavaKafkaWordCount.jar at http://90.190.106.47:57550/jars/JavaKafkaWordCount.jar with timestamp 1401440024153
14/05/30 11:53:44 INFO AppClient$ClientActor: Connecting to master spark://dlvm1:7077...
... ... ...
14/05/30 11:53:56 INFO SparkContext: Job finished: collect at NetworkInputTracker.scala:178, took 10.617582853 s
14/05/30 11:53:56 INFO TaskSetManager: Finished TID 70 in 41 ms on dlvm1 (progress: 20/20)
14/05/30 11:53:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/05/30 11:53:56 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 1 to spark@dlvm1:48363
14/05/30 11:53:56 INFO TaskSetManager: Finished TID 71 in 58 ms on dlvm1 (progress: 1/1)
14/05/30 11:53:56 INFO DAGScheduler: Completed ResultTask(4, 1)
14/05/30 11:53:56 INFO DAGScheduler: Stage 4 (take at DStream.scala:586) finished in 0.815 s
14/05/30 11:53:56 INFO SparkContext: Job finished: take at DStream.scala:586, took 0.844135774 s
--- Time: 1401440028000 ms ---
14/05/30 11:53:56 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
14/05/30 11:53:56 INFO JobScheduler: Finished job streaming job 1401440028000 ms.0 from job set of time 1401440028000 ms
14/05/30 11:53:56 INFO JobScheduler: Total delay: 8.189 s for time 1401440028000 ms (execution: 7.847 s)
14/05/30 11:53:56 INFO SparkContext: Starting job: take at DStream.scala:586
14/05/30 11:53:56 INFO DAGScheduler: Registering RDD 17 (combineByKey at ShuffledDStream.scala:42)
14/05/30 11:53:56 INFO DAGScheduler: Got job 3 (take at DStream.scala:586) with 1 output partitions (allowLocal=true)
14/05/30 11:53:56 INFO DAGScheduler: Final stage: Stage 6 (take at DStream.scala:586)
14/05/30 11:53:56 INFO DAGScheduler: Parents of final stage: List(Stage 7)
14/05/30 11:53:56 INFO DAGScheduler: Missing parents: List()
14/05/30 11:53:56 INFO DAGScheduler: Submitting Stage 6 (MapPartitionsRDD[19] at combineByKey at ShuffledDStream.scala:42), which has no missing parents
14/05/30 11:53:56 INFO JobScheduler: Starting job streaming job 140144003 ms.0 from job set of time 140144003 ms
14/05/30 11:53:56 INFO SparkContext: Starting job: runJob at NetworkInputTracker.scala:182
14/05/30 11:53:56 INFO DAGScheduler: Submitting 1 missing tasks from Stage 6 (MapPartitionsRDD[19] at combineByKey at ShuffledDStream.scala:42)
14/05/30 11:53:56 INFO
Announcing Spark 1.0.0
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release - it was truly a community effort, with fixes, features, and optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python APIs.

Those features only scratch the surface - check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours.

- Patrick
Re: Announcing Spark 1.0.0
Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen
Re: Announcing Spark 1.0.0
Please update the http://spark.apache.org/docs/latest/ link.

On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:
Is it possible to download a pre-built package? http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz gives me a 404.

Best regards,
Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
Re: Announcing Spark 1.0.0
It is updated - try holding Shift + refresh in your browser, you are probably caching the page.
Re: Announcing Spark 1.0.0
Now I can download. Thanks.

Best regards,
Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
RE: Announcing Spark 1.0.0
Hi all,

In https://spark.apache.org/downloads.html, the URL for the 1.0.0 release notes seems to be wrong. It should be https://spark.apache.org/releases/spark-release-1-0-0.html but links to https://spark.apache.org/releases/spark-release-1.0.0.html.

Best Regards,
Kousuke

From: prabeesh k [mailto:prabsma...@gmail.com]
Sent: Friday, May 30, 2014 8:18 PM
To: user@spark.apache.org
Subject: Re: Announcing Spark 1.0.0

I forgot to hard refresh. Thanks.
Re: Announcing Spark 1.0.0
All: In the pom.xml file I see the MapR repository, but it's not included in the ./project/SparkBuild.scala file. Is this expected? I know that to build I have to add it there, otherwise sbt hates me with evil red messages and such.

John
Re: Announcing Spark 1.0.0
Awesome work
Re: Announcing Spark 1.0.0
By the way: this is great work. I am new to the Spark world, and have been like a kid in a candy store learning all it can do. Is there a good list of build variables? What I mean is something like the SPARK_HIVE variable described on the Spark SQL page. I'd like to include that, but once I found it I wondered if there were other options I should consider before building. Thanks!
Re: Selecting first ten values in an RDD/partition
My primary goal: to get the top 10 hashtags for every 5-minute interval, and to do this efficiently. I have already done this by using reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval, taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD, because only these could appear in the final answer. I am stuck at: how do I retain only the top 10 hashtags in each RDD of my DStream? Basically I need to transform my DStream so that each RDD contains only the top 10 hashtags, keeping the number of hashtags in the 5-minute interval low. If there is some more efficient way of doing this, then please let me know that as well.

Thanks,
Nilesh
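For reference, a minimal sketch of the transform-plus-top approach in PySpark (the host, port, and window lengths below are made-up placeholders, and the Python streaming API itself only appeared in Spark releases after 1.0, so treat this purely as an illustration of the pattern): each batch's RDD is cut down to its ten highest-count pairs before anything downstream sees it.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="top-hashtags")
ssc = StreamingContext(sc, 5)  # 5-second batches (placeholder)

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
hashtags = lines.flatMap(lambda line: line.split()) \
                .filter(lambda word: word.startswith("#"))
counts = hashtags.map(lambda tag: (tag, 1)) \
                 .reduceByKeyAndWindow(lambda a, b: a + b, None, 300, 5)

# Keep only the 10 highest-count (hashtag, count) pairs of every RDD,
# so later stages only ever handle tiny RDDs.
top10 = counts.transform(
    lambda rdd: rdd.context.parallelize(rdd.top(10, key=lambda kv: kv[1])))

top10.pprint()
ssc.start()
ssc.awaitTermination()

The same transform/top idea applies to the Scala and Java DStream APIs that existed in 0.9/1.0.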
Re: pyspark MLlib examples don't work with Spark 1.0.0
Thanks for the reply. I am definitely running 1.0.0; I set it up manually. To answer my own question, I found out from the examples that it needs a new data type called LabeledPoint instead of a NumPy array.
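For anyone hitting the same thing, a minimal sketch of that usage (the file path and data format here are invented for illustration): each training example becomes a LabeledPoint whose label is a float and whose features are a plain list (or NumPy array), and the RDD of those is what the MLlib trainers expect.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="mllib-labeledpoint")

def parse_point(line):
    # Assumed input format: "label f1 f2 f3 ..." separated by spaces.
    values = [float(x) for x in line.split()]
    return LabeledPoint(values[0], values[1:])

points = sc.textFile("data/sample_points.txt").map(parse_point)  # placeholder path
model = LogisticRegressionWithSGD.train(points)
print(model.weights)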
Re: Announcing Spark 1.0.0
How exciting! Congratulations! :-)

Ognen
Re: Announcing Spark 1.0.0
Congratulations !!

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit
Re: SparkContext startup time out
I was annoyed by this as well. It appears that just permuting the order of dependency inclusion solves this problem: first Spark, then your CDH Hadoop distro.

HTH,
Pierre
Re: KryoSerializer Exception
Hi, I just migrated to 1.0 and am still having the same issue, either with or without the custom registrator. Just using the KryoSerializer triggers the exception immediately. I set the Kryo settings through the properties:

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "it.unipi.thesis.andrea.esposito.onjag.test.TestKryoRegistrator")

The registrator is just a sequence of:

kryo.register(classOf[MyClass])

I also tried with a very small RDD (a few MB of serialized data) and the problem still occurs. The problem seems to be about broadcast, but I'm completely stuck. The complete log follows:

2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 5 (task 3.0:1)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Loss was due to java.io.EOFException
java.io.EOFException
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:119)
at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:205)
at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:135)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 4 (task 3.0:0)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 6 (task 3.0:1)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 7 (task 3.0:0)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 8 (task 3.0:1)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 9 (task 3.0:0)
2014-05-30 15:47:36 WARN TaskSetManager:70 - Lost TID 10 (task 3.0:1)
2014-05-30 15:47:36 ERROR TaskSetManager:74 - Task 3.0:1 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:1 failed 4 times, most recent failure: Exception failure in TID 10 on host Andrea-Laptop.unipi.it: java.io.EOFException
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:119)
org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:205)
org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89)
sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
Re: Announcing Spark 1.0.0
Congratulations!!

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Local file being referenced in mapper function
Hi, I recently posted a question on Stack Overflow but didn't get any reply, so I have joined the mailing list now. Can any of you suggest an approach for the problem described at http://stackoverflow.com/questions/23923966/writing-the-rdd-data-in-excel-file-along-mapping-in-apache-spark ?

Thanks in advance

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
Monitoring / Instrumenting jobs in 1.0
The Spark 1.0.0 release notes state "Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs." Can anyone point me to the docs for this?

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
RE: Announcing Spark 1.0.0
Congrats

Sent from my Windows Phone
Using Spark on Data size larger than Memory size
Hi all, I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table. I want to know: in the case where the HBase table grows larger than the RAM available in the cluster, will the application fail, or will there only be an impact on performance?

Any thoughts in this direction will be helpful and are welcome.

Thanks,
-Vibhor
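Broadly, Spark computes RDDs one partition at a time, so the whole table does not have to fit in memory at once; only partitions you explicitly cache compete for RAM, and a storage level that spills to disk avoids failures when they don't fit. A minimal PySpark-flavoured sketch (the input path is a placeholder standing in for the RDD built from the HBase table):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="larger-than-memory")

records = sc.textFile("hdfs:///some/large/input")  # placeholder for the HBase-backed RDD

# MEMORY_AND_DISK lets cached partitions that do not fit in RAM spill to
# local disk instead of failing; uncached RDDs are simply recomputed
# partition by partition.
records.persist(StorageLevel.MEMORY_AND_DISK)

print(records.count())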
Re: Problem using Spark with Hbase
Thanks Mayur for the reply. Actually, the issue was that I was running the Spark application on hadoop-2.2.0, where the HBase version was 0.95.2, but Spark by default is built against an older HBase version. So I had to rebuild Spark with the HBase version set to 0.95.2 in the Spark build file, and it worked.

Thanks,
-Vibhor

On Wed, May 28, 2014 at 11:34 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Try this..
Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, May 28, 2014 at 7:40 PM, Vibhor Banga vibhorba...@gmail.com wrote:
Anyone who has used Spark this way or has faced a similar issue, please help.
Thanks, -Vibhor

On Wed, May 28, 2014 at 6:03 PM, Vibhor Banga vibhorba...@gmail.com wrote:
Hi all, I am facing issues while using Spark with HBase. I am getting a NullPointerException at org.apache.hadoop.hbase.TableName.valueOf (TableName.java:288). Can someone please help to resolve this issue? What am I missing? I am using the following snippet of code -

Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.znode.parent", "hostname1");
config.set("hbase.zookeeper.quorum", "hostname1");
config.set("hbase.zookeeper.property.clientPort", "2181");
config.set("hbase.master", "hostname1:
config.set("fs.defaultFS", "hdfs://hostname1/");
config.set("dfs.namenode.rpc-address", "hostname1:8020");
config.set(TableInputFormat.INPUT_TABLE, tableName);

JavaSparkContext ctx = new JavaSparkContext(args[0], "Simple", System.getenv("sparkHome"), JavaSparkContext.jarOfClass(Simple.class));
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = ctx.newAPIHadoopRDD(config, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
Map<ImmutableBytesWritable, Result> rddMap = hBaseRDD.collectAsMap();

But when I go to the Spark cluster and check the logs, I see the following error -

INFO NewHadoopRDD: Input split: w3-target1.nm.flipkart.com:,
14/05/28 16:48:51 ERROR TableInputFormat: java.lang.NullPointerException
at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:288)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:154)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:99)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:92)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks,
-Vibhor

--
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore
Re: Announcing Spark 1.0.0
You guys were up late, eh? :) I'm looking forward to using this latest version. Is there any place we can get a list of the new functions in the Python API? The release notes don't enumerate them.

Nick
Subscribing to news releases
Is there a way to subscribe to news releases http://spark.apache.org/news/index.html? That would be swell.

Nick
RE: Announcing Spark 1.0.0
Great work!
Spark 1.0.0 - Java 8
Great news! I've been awaiting this release to start doing some coding with Spark using Java 8. Can I run the Spark 1.0 examples on a virtual host with 16 GB RAM and a fairly decent amount of hard disk, or do I really need to use a cluster of machines? Second, are there any good examples of using MLlib on Spark? Please point me in the right direction.

Thanks
Upender
Re: Spark 1.0.0 - Java 8
With respect to virtual hosts, my team uses Vagrant/VirtualBox. We have 3 CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node. Everything works fine, though if you are using MapR, you have to make sure they are all on the same subnet.

-Suren

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
Re: Spark 1.0.0 - Java 8
Also, the Spark examples can run out of the box on a single machine, as well as a cluster. See the Master URLs heading here: http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
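As a quick sketch, a driver program needs no code changes to move between the two; only the master URL differs (the host name and input path below are placeholders):

from pyspark import SparkConf, SparkContext

# Single machine: use every local core.
conf = SparkConf().setAppName("word-count").setMaster("local[*]")
# Against a standalone cluster, only the master URL would change, e.g.:
#   .setMaster("spark://master-host:7077")

sc = SparkContext(conf=conf)
counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()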
Re: Local file being referenced in mapper function
Hi Rahul, I'll just copy-paste your question here to aid with context, and reply afterwards.

-
Can I write the RDD data to an Excel file along with the mapping in Apache Spark? Is that a correct way? Isn't it that the writing will be a local function and can't be passed over the cluster? Below is the Python code (it's just an example to clarify my question; I understand that this implementation may not actually be required):

import xlsxwriter
import sys
import math
from pyspark import SparkContext

# get the spark context in sc.
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

data = sc.textFile("xyz.txt")  # xyz.txt is a file whose each line contains strings delimited by SPACE

row = 0
def mapperFunc(x):
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1
    return len(x.split())

data2 = data.map(mapperFunc)
workbook.close()

There are 2 questions:
1. Is using row in 'mapperFunc' like this a correct way? Will it increment row each time?
2. Is writing to the Excel file using worksheet.write() inside the mapper function a correct way?

Also, if #2 is correct, then please clarify this doubt: I am thinking the worksheet is created on the local machine, so how does it work?
-

--
Marcelo
Re: Local file being referenced in mapper function
Hello there,

On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:

Is using row in 'mapperFunc' like this a correct way? Will it increment row each time?

No. mapperFunc will be executed somewhere else, not in the same process running this script. I'm not familiar with how serializing closures works in Spark/Python, but you'll most certainly be updating the local copy of row in the executor, and your driver's copy will remain at 0.

In general, in a distributed execution environment like Spark you want to avoid state as much as possible. row in your code is state, so to do what you want you'd have to use other means (like Spark's accumulators). But those are generally expensive in a distributed system, and to be avoided if possible.

Is writing to the Excel file using worksheet.write() inside the mapper function a correct way?

No, for the same reasons. Your executor will have a copy of your workbook variable. So the write() will happen locally to the executor, and after mapperFunc() returns, that will be discarded - so your driver won't see anything.

As a rule of thumb, your closures should try to use only their arguments as input, or at most use local variables as read-only, and only produce output in the form of return values. There are cases where you might want to break these rules, of course, but in general that's the mindset you should be in.

Also note that you're not actually executing anything here. data.map() is a transformation, so you're just building the execution graph for the computation. You need to execute an action (like collect() or take()) if you want the computation to actually occur.

--
Marcelo
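To make the accumulator alternative concrete, here is a small sketch (the input file name is a placeholder): the counter lives on the driver, tasks only add to it, and its value is meaningful only after an action has forced the computation.

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-example")
rows_processed = sc.accumulator(0)

def mapper(line):
    # Each task adds to its own copy; Spark merges the updates on the driver.
    rows_processed.add(1)
    return len(line.split())

lengths = sc.textFile("xyz.txt").map(mapper)  # placeholder input
total = lengths.count()                       # the action forces the map to run
print("lines:", total, "accumulator:", rows_processed.value)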
Re: Local file being referenced in mapper function
Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your program, in pseudo-Python:

# connect to Spark cluster
sc = SparkContext(...)

# load input data
input_data = load_xls(file('input.xls'))
input_rows = input_data['Sheet1'].rows

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD
result_rdd = input_rdd.map(munge_row)

# collect result RDD to local process
result_rows = result_rdd.collect()

# write output file
write_xls(file('output.xls', 'w'), result_rows)

Hope that helps, -Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin van...@cloudera.com wrote: [...]
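[Editor's note: for what it's worth, one way the pseudo-code above could be filled in - a sketch only, assuming an existing SparkContext named sc and the xlrd/xlsxwriter packages standing in for load_xls/write_xls; file names and munge_row are placeholders.]

import xlrd         # assumed available for reading the input workbook
import xlsxwriter   # assumed available for writing the output workbook

# read the input workbook locally on the driver
book = xlrd.open_workbook("input.xls")
sheet = book.sheet_by_name("Sheet1")
input_rows = [sheet.row_values(i) for i in range(sheet.nrows)]

def munge_row(row):
    # placeholder transformation; replace with the real per-row logic
    return [str(cell).upper() for cell in row]

# ship rows to the cluster, transform them there, bring the result back
result_rows = sc.parallelize(input_rows).map(munge_row).collect()

# write the output workbook locally on the driver
out = xlsxwriter.Workbook("output.xlsx")
ws = out.add_worksheet()
for r, row in enumerate(result_rows):
    for c, value in enumerate(row):
        ws.write(r, c, value)
out.close()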
Trouble with EC2
Hey Folks, I'm really having quite a bit of trouble getting Spark running on EC2. I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2 because I'd like to know how everything works. But I'm going a little crazy. I think something about the networking configuration must be messed up, but I'm at a loss. Shortly after starting the cluster, I get a lot of this:

14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ]
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485 ]
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
Re: Is uberjar a recommended way of running Spark/Scala applications?
Thanks, Stephen. I have eventually decided to go with assembly, but I leave the Spark and Hadoop jars out of it and instead use `spark-submit` to provide those dependencies automatically. This way no resource conflicts arise and mergeStrategy needs no modification. To record this stable setup and share it with the community, I've created a project [1] with a minimal working config. It is an SBT project with the assembly plugin, Spark 1.0 and Cloudera's Hadoop client. I hope it helps somebody get a Spark setup going quicker. Though I'm fine with this setup for final builds, I'm still looking for a more interactive dev setup - something that doesn't require a full rebuild.

[1]: https://github.com/faithlessfriend/sample-spark-project

Thanks and have a good weekend, Andrei

On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch java...@gmail.com wrote: The MergeStrategy combined with sbt assembly did work for me. This is not painless: it takes some trial and error, and the assembly may take multiple minutes. You will likely want to filter out some additional classes from the generated jar file. Here is an SOF answer that explains this, with IMHO the best answer snippet included here (in this case the OP understandably did not want to include javax.servlet.Servlet): http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar

mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
  ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
}

There is a setting to not include the project files in the assembly, but I do not recall it at this moment.

2014-05-29 10:13 GMT-07:00 Andrei faithlessfri...@gmail.com: Thanks, Jordi, your gist looks pretty much like what I have in my project currently (with a few exceptions that I'm going to borrow). I like the idea of using sbt package, since it doesn't require third-party plugins and, most importantly, doesn't create a mess of classes and resources. But in this case I'll have to handle the jar list manually via the Spark context. Is there a way to automate this process? E.g. when I was a Clojure guy, I could run lein deps (lein is a build tool similar to sbt) to download all dependencies and then just enumerate them from my app. Maybe you have heard of something like that for Spark/SBT? Thanks, Andrei

On Thu, May 29, 2014 at 3:48 PM, jaranda jordi.ara...@bsc.es wrote: Hi Andrei, I think the preferred way to deploy Spark jobs is by using the sbt package task instead of the sbt assembly plugin. In any case, as you comment, the mergeStrategy in combination with some dependency exclusions should fix your problems. Have a look at this gist https://gist.github.com/JordiAranda/bdbad58d128c14277a05 for further details (I just followed some recommendations from the sbt assembly plugin documentation). Up to now I haven't found a proper way to combine my development/deployment phases, although I must say my experience with Spark is pretty limited (it also depends on your deployment requirements). In this case, I think someone else could give you further insights. Best, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-uberjar-a-recommended-way-of-running-Spark-Scala-applications-tp6518p6520.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Local file being referenced in mapper function
Thanks Marcelo, it actually cleared up a few concepts for me. (y)

On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin van...@cloudera.com wrote: [...]

-- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Re: Local file being referenced in mapper function
Thanks Jey, that was helpful.

On Sat, May 31, 2014 at 12:45 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: [...]

-- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
Failed to remove RDD error
I'm running some Kafka streaming Spark contexts (on 0.9.1), and they seem to be dying after 10 or so minutes with a lot of these errors. I can't really tell what's going on here, except that maybe the driver is unresponsive somehow? Has anyone seen this before?

14/05/31 01:13:30 ERROR BlockManagerMaster: Failed to remove RDD 12635
akka.pattern.AskTimeoutException: Timed out
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
    at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:691)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:688)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
    at java.lang.Thread.run(Thread.java:744)

Thanks, Mike
possible typos in spark 1.0 documentation
Congrats on the new 1.0 release. Amazing work! It looks like there may be some typos in the latest http://spark.apache.org/docs/latest/sql-programming-guide.html in the Running SQL on RDDs section, when choosing the Java example:

1. ctx is an instance of JavaSQLContext, but the textFile method is called as a member of ctx. According to the API, JavaSQLContext does not have such a member, so I'm guessing this should be sc instead.
2. In that same code example the object sqlCtx is referenced, but it is never instantiated in the code. Should this be ctx?

Cheers, Yadid
Re: Yay for 1.0.0! EC2 Still has problems.
Hi Jeremy, That's interesting; I don't think anyone has ever reported an issue running these scripts due to Python incompatibility, but they may require Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a good place to start. But if there is a straightforward way to make them compatible with 2.6, we should do that. For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this? - Patrick

On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hi there! I'm relatively new to the list, so sorry if this is a repeat: I just wanted to mention there are still problems with the EC2 scripts. Basically, they don't work.

First, if you run the scripts on Amazon's own suggested version of Linux, they break because Amazon installs Python 2.6.9 and the scripts use a couple of Python 2.7 commands. I have to sudo yum install python27 and then edit the spark-ec2 shell script to use that specific version. Annoying, but minor. (The base python command isn't upgraded to 2.7 on many systems, apparently because it would break yum.)

The second minor problem is that the script doesn't know about the r3.large servers... also easily fixed by adding them to the spark_ec2.py script. Minor.

The big problem is that after the EC2 cluster is provisioned, installed, and set up, it fails to start the webserver on the master. Here's the tail of the log:

Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory [FAILED]

Basically, the AMI you have chosen does not seem to have a full install of Apache, and is missing several modules that are referred to in the httpd.conf file that is installed. The full list of missing modules is:

authn_alias_module modules/mod_authn_alias.so
authn_default_module modules/mod_authn_default.so
authz_default_module modules/mod_authz_default.so
ldap_module modules/mod_ldap.so
authnz_ldap_module modules/mod_authnz_ldap.so
disk_cache_module modules/mod_disk_cache.so

Alas, even if these modules are commented out, the server still fails to start:

[root@ip-172-31-11-193 ~]$ service httpd start
Starting httpd: AH00534: httpd: Configuration error: No MPM loaded.

That means Spark 1.0.0 clusters on EC2 are dead on arrival when run according to the instructions. Sorry. Any suggestions on how to proceed? I'll keep trying to fix the webserver, but (a) changes to httpd.conf get blown away by resume, and (b) anything I do has to be redone every time I provision another cluster. Ugh. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers