Exception failure: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-05-30 Thread Margusja

Hi

The Spark version I am using is spark-0.9.1-bin-hadoop2.

I built spark-assembly_2.10-0.9.1-hadoop2.2.0.jar and moved JavaKafkaWordCount.java
from the examples to a new directory to play with it.


My compile commands:
javac -cp libs/spark-streaming_2.10-0.9.1.jar:libs/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar:libs/spark-streaming-kafka_2.10-0.9.1.jar:libs/kafka_2.9.1-0.8.1.1.jar:libs/zkclient-0.4.jar JavaKafkaWordCount.java

jar -cvf JavaKafkaWordCount.jar JavaKafkaWordCount*

And run:
java -cp libs/spark-streaming_2.10-0.9.1.jar:libs/zkclient-0.4.jar:libs/kafka_2.9.1-0.8.1.1.jar:libs/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar:libs/spark-streaming-kafka_2.10-0.9.1.jar:./JavaKafkaWordCount.jar JavaKafkaWordCount spark://dlvm1:7077 vm37.dbweb.ee demogroup kafkademo1 1


The job is visible in the UI, but I am getting:
log4j:WARN No appenders could be found for logger 
(akka.event.slf4j.Slf4jLogger).

log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.
14/05/30 11:53:42 INFO SparkEnv: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties

14/05/30 11:53:42 INFO SparkEnv: Registering BlockManagerMaster
14/05/30 11:53:42 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140530115342-0b66
14/05/30 11:53:42 INFO MemoryStore: MemoryStore started with capacity 
386.3 MB.
14/05/30 11:53:42 INFO ConnectionManager: Bound socket to port 49250 
with id = ConnectionManagerId(dlvm1,49250)

14/05/30 11:53:42 INFO BlockManagerMaster: Trying to register BlockManager
14/05/30 11:53:42 INFO BlockManagerMasterActor$BlockManagerInfo: 
Registering block manager dlvm1:49250 with 386.3 MB RAM

14/05/30 11:53:42 INFO BlockManagerMaster: Registered BlockManager
14/05/30 11:53:42 INFO HttpServer: Starting HTTP Server
14/05/30 11:53:42 INFO HttpBroadcast: Broadcast server started at 
http://90.190.106.47:42861

14/05/30 11:53:42 INFO SparkEnv: Registering MapOutputTracker
14/05/30 11:53:42 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-76fd126e-7fcd-4df1-a967-bf4b8d356973

14/05/30 11:53:42 INFO HttpServer: Starting HTTP Server
14/05/30 11:53:43 INFO SparkUI: Started Spark Web UI at http://dlvm1:4040
14/05/30 11:53:43 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
14/05/30 11:53:44 INFO SparkContext: Added JAR 
/opt/spark-0.9.1-bin-hadoop2/margusja_kafka/JavaKafkaWordCount.jar at 
http://90.190.106.47:57550/jars/JavaKafkaWordCount.jar with timestamp 
1401440024153
14/05/30 11:53:44 INFO AppClient$ClientActor: Connecting to master 
spark://dlvm1:7077...

...
...
...
14/05/30 11:53:56 INFO SparkContext: Job finished: collect at 
NetworkInputTracker.scala:178, took 10.617582853 s
14/05/30 11:53:56 INFO TaskSetManager: Finished TID 70 in 41 ms on dlvm1 
(progress: 20/20)
14/05/30 11:53:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose 
tasks have all completed, from pool
14/05/30 11:53:56 INFO MapOutputTrackerMasterActor: Asked to send map 
output locations for shuffle 1 to spark@dlvm1:48363
14/05/30 11:53:56 INFO TaskSetManager: Finished TID 71 in 58 ms on dlvm1 
(progress: 1/1)

14/05/30 11:53:56 INFO DAGScheduler: Completed ResultTask(4, 1)
14/05/30 11:53:56 INFO DAGScheduler: Stage 4 (take at DStream.scala:586) 
finished in 0.815 s
14/05/30 11:53:56 INFO SparkContext: Job finished: take at 
DStream.scala:586, took 0.844135774 s

---
Time: 1401440028000 ms
---

14/05/30 11:53:56 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose 
tasks have all completed, from pool
14/05/30 11:53:56 INFO JobScheduler: Finished job streaming job 
1401440028000 ms.0 from job set of time 1401440028000 ms
14/05/30 11:53:56 INFO JobScheduler: Total delay: 8.189 s for time 
1401440028000 ms (execution: 7.847 s)

14/05/30 11:53:56 INFO SparkContext: Starting job: take at DStream.scala:586
14/05/30 11:53:56 INFO DAGScheduler: Registering RDD 17 (combineByKey at 
ShuffledDStream.scala:42)
14/05/30 11:53:56 INFO DAGScheduler: Got job 3 (take at 
DStream.scala:586) with 1 output partitions (allowLocal=true)
14/05/30 11:53:56 INFO DAGScheduler: Final stage: Stage 6 (take at 
DStream.scala:586)

14/05/30 11:53:56 INFO DAGScheduler: Parents of final stage: List(Stage 7)
14/05/30 11:53:56 INFO DAGScheduler: Missing parents: List()
14/05/30 11:53:56 INFO DAGScheduler: Submitting Stage 6 
(MapPartitionsRDD[19] at combineByKey at ShuffledDStream.scala:42), 
which has no missing parents
14/05/30 11:53:56 INFO JobScheduler: Starting job streaming job 
140144003 ms.0 from job set of time 140144003 ms
14/05/30 11:53:56 INFO SparkContext: Starting job: runJob at 
NetworkInputTracker.scala:182
14/05/30 11:53:56 INFO DAGScheduler: Submitting 1 missing tasks from 
Stage 6 (MapPartitionsRDD[19] at combineByKey at ShuffledDStream.scala:42)

14/05/30 11:53:56 INFO 

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.

Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyone involved in this release -
it was truly a community effort with fixes, features, and
optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL
package (SparkSQL) which lets users integrate SQL queries into
existing Spark workflows. MLlib, Spark's machine learning library, is
expanded with sparse vector support and several new algorithms. The
GraphX and Streaming libraries also introduce new features and
optimizations. Spark's core engine adds support for secured YARN
clusters, a unified tool for submitting Spark applications, and
several performance and stability improvements. Finally, Spark adds
support for Java 8 lambda syntax and improves coverage of the Java and
Python API's.

Those features only scratch the surface - check out the release notes here:
http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.

- Patrick


Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick



Re: Announcing Spark 1.0.0

2014-05-30 Thread prabeesh k
Please update the http://spark.apache.org/docs/latest/  link


On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:

 Is it possible to download pre build package?
 http://mirror.symnds.com/software/Apache/incubator/
 spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz - gives me 404

 Best regards, Margus (Margusja) Roo
 +372 51 48 780
 http://margus.roo.ee
 http://ee.linkedin.com/in/margusroo
 skype: margusja
 ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)


 On 30/05/14 13:18, Christopher Nguyen wrote:

 Awesome work, Pat et al.!

 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen




 On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release
 notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick






Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
It is updated - try holding Shift + refresh in your browser, you are
probably caching the page.

On Fri, May 30, 2014 at 3:46 AM, prabeesh k prabsma...@gmail.com wrote:
 Please update the http://spark.apache.org/docs/latest/  link


 On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:

 Is it possible to download pre build package?

 http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz
 - gives me 404

 Best regards, Margus (Margusja) Roo
 +372 51 48 780
 http://margus.roo.ee
 http://ee.linkedin.com/in/margusroo
 skype: margusja
 ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)


 On 30/05/14 13:18, Christopher Nguyen wrote:

 Awesome work, Pat et al.!

 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen




 On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new
 SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java
 and
 Python API's.

 Those features only scratch the surface - check out the release
 notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick






Re: Announcing Spark 1.0.0

2014-05-30 Thread Margusja

Now I can download. Thanks.

Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)

On 30/05/14 13:48, Patrick Wendell wrote:

It is updated - try holding Shift + refresh in your browser, you are
probably caching the page.

On Fri, May 30, 2014 at 3:46 AM, prabeesh k prabsma...@gmail.com wrote:

Please update the http://spark.apache.org/docs/latest/  link


On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:

Is it possible to download pre build package?

http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz
- gives me 404

Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)


On 30/05/14 13:18, Christopher Nguyen wrote:

Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen




On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
mailto:pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new
SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java
and
 Python API's.

 Those features only scratch the surface - check out the release
 notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick






RE: Announcing Spark 1.0.0

2014-05-30 Thread Kousuke Saruta
Hi all

 

In https://spark.apache.org/downloads.html, the URL for the 1.0.0 release notes seems to be wrong.

The URL should be https://spark.apache.org/releases/spark-release-1-0-0.html 
but links to https://spark.apache.org/releases/spark-release-1.0.0.html

 

Best Regards,

Kousuke

 

From: prabeesh k [mailto:prabsma...@gmail.com] 
Sent: Friday, May 30, 2014 8:18 PM
To: user@spark.apache.org
Subject: Re: Announcing Spark 1.0.0

 

I forgot to hard refresh.

thanks

 

 

On Fri, May 30, 2014 at 4:18 PM, Patrick Wendell pwend...@gmail.com wrote:

It is updated - try holding Shift + refresh in your browser, you are
probably caching the page.


On Fri, May 30, 2014 at 3:46 AM, prabeesh k prabsma...@gmail.com wrote:
 Please update the http://spark.apache.org/docs/latest/  link


 On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:

 Is it possible to download pre build package?

 http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz
 - gives me 404

 Best regards, Margus (Margusja) Roo
 +372 51 48 780
 http://margus.roo.ee
 http://ee.linkedin.com/in/margusroo
 skype: margusja
 ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)


 On 30/05/14 13:18, Christopher Nguyen wrote:

 Awesome work, Pat et al.!

 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao http://adatao.com
 linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen




 On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new
 SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java
 and
 Python API's.

 Those features only scratch the surface - check out the release
 notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick





 



Re: Announcing Spark 1.0.0

2014-05-30 Thread John Omernik
All:

In the pom.xml file I see the MapR repository, but it's not included in the
./project/SparkBuild.scala file. Is this expected? I know that to build I have
to add it there, otherwise sbt hates me with evil red messages and such.

John


On Fri, May 30, 2014 at 6:24 AM, Kousuke Saruta saru...@oss.nttdata.co.jp
wrote:

 Hi all



 In https://spark.apache.org/downloads.html, the URL for release note of
 1.0.0 seems to be wrong.

 The URL should be
 https://spark.apache.org/releases/spark-release-1-0-0.html but links to
 https://spark.apache.org/releases/spark-release-1.0.0.html



 Best Regards,

 Kousuke



 *From:* prabeesh k [mailto:prabsma...@gmail.com]
 *Sent:* Friday, May 30, 2014 8:18 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Announcing Spark 1.0.0



 I forgot to hard refresh.

 thanks





 On Fri, May 30, 2014 at 4:18 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 It is updated - try holding Shift + refresh in your browser, you are
 probably caching the page.


 On Fri, May 30, 2014 at 3:46 AM, prabeesh k prabsma...@gmail.com wrote:
  Please update the http://spark.apache.org/docs/latest/  link
 
 
  On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:
 
  Is it possible to download pre build package?
 
 
 http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz
  - gives me 404
 
  Best regards, Margus (Margusja) Roo
  +372 51 48 780
  http://margus.roo.ee
  http://ee.linkedin.com/in/margusroo
  skype: margusja
  ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
 
 
  On 30/05/14 13:18, Christopher Nguyen wrote:
 
  Awesome work, Pat et al.!
 
  --
  Christopher T. Nguyen
  Co-founder & CEO, Adatao http://adatao.com
  linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen
 
 
 
 
  On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
 
  I'm thrilled to announce the availability of Spark 1.0.0! Spark
 1.0.0
  is a milestone release as the first in the 1.0 line of releases,
  providing API stability for Spark's core interfaces.
 
  Spark 1.0.0 is Spark's largest release ever, with contributions
 from
  117 developers. I'd like to thank everyone involved in this
 release -
  it was truly a community effort with fixes, features, and
  optimizations contributed from dozens of organizations.
 
  This release expands Spark's standard libraries, introducing a new
  SQL
  package (SparkSQL) which lets users integrate SQL queries into
  existing Spark workflows. MLlib, Spark's machine learning library,
 is
  expanded with sparse vector support and several new algorithms. The
  GraphX and Streaming libraries also introduce new features and
  optimizations. Spark's core engine adds support for secured YARN
  clusters, a unified tool for submitting Spark applications, and
  several performance and stability improvements. Finally, Spark adds
  support for Java 8 lambda syntax and improves coverage of the Java
  and
  Python API's.
 
  Those features only scratch the surface - check out the release
  notes here:
  http://spark.apache.org/releases/spark-release-1-0-0.html
 
  Note that since release artifacts were posted recently, certain
  mirrors may not have working downloads for a few hours.
 
  - Patrick
 
 
 
 





Re: Announcing Spark 1.0.0

2014-05-30 Thread jose farfan
Awesome work



On Fri, May 30, 2014 at 12:12 PM, Patrick Wendell pwend...@gmail.com
wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick



Re: Announcing Spark 1.0.0

2014-05-30 Thread John Omernik
By the way:

This is great work. I am new to the Spark world, and have been like a kid
in a candy store learning all it can do.

Is there a good list of build variables? What I mean is something like the SPARK_HIVE
variable described on the Spark SQL page. I'd like to include that, but once I
found it I wondered whether there were other options I should consider
before building.

Thanks!



On Fri, May 30, 2014 at 6:52 AM, John Omernik j...@omernik.com wrote:

 All:

 In the pom.xml file I see the MapR repository, but it's not included in
 the ./project/SparkBuild.scala file. Is this expected?  I know to build I
 have to add it there otherwise sbt hates me with evil red messages and
 such.

 John


 On Fri, May 30, 2014 at 6:24 AM, Kousuke Saruta saru...@oss.nttdata.co.jp
  wrote:

 Hi all



 In https://spark.apache.org/downloads.html, the URL for release note
 of 1.0.0 seems to be wrong.

 The URL should be
 https://spark.apache.org/releases/spark-release-1-0-0.html but links to
 https://spark.apache.org/releases/spark-release-1.0.0.html



 Best Regards,

 Kousuke



 *From:* prabeesh k [mailto:prabsma...@gmail.com]
 *Sent:* Friday, May 30, 2014 8:18 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Announcing Spark 1.0.0



 I forgot to hard refresh.

 thanks





 On Fri, May 30, 2014 at 4:18 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 It is updated - try holding Shift + refresh in your browser, you are
 probably caching the page.


 On Fri, May 30, 2014 at 3:46 AM, prabeesh k prabsma...@gmail.com wrote:
  Please update the http://spark.apache.org/docs/latest/  link
 
 
  On Fri, May 30, 2014 at 4:03 PM, Margusja mar...@roo.ee wrote:
 
  Is it possible to download pre build package?
 
 
 http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz
  - gives me 404
 
  Best regards, Margus (Margusja) Roo
  +372 51 48 780
  http://margus.roo.ee
  http://ee.linkedin.com/in/margusroo
  skype: margusja
  ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
 
 
  On 30/05/14 13:18, Christopher Nguyen wrote:
 
  Awesome work, Pat et al.!
 
  --
  Christopher T. Nguyen
  Co-founder & CEO, Adatao http://adatao.com
  linkedin.com/in/ctnguyen http://linkedin.com/in/ctnguyen
 
 
 
 
  On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
 
  I'm thrilled to announce the availability of Spark 1.0.0! Spark
 1.0.0
  is a milestone release as the first in the 1.0 line of releases,
  providing API stability for Spark's core interfaces.
 
  Spark 1.0.0 is Spark's largest release ever, with contributions
 from
  117 developers. I'd like to thank everyone involved in this
 release -
  it was truly a community effort with fixes, features, and
  optimizations contributed from dozens of organizations.
 
  This release expands Spark's standard libraries, introducing a new
  SQL
  package (SparkSQL) which lets users integrate SQL queries into
  existing Spark workflows. MLlib, Spark's machine learning
 library, is
  expanded with sparse vector support and several new algorithms.
 The
  GraphX and Streaming libraries also introduce new features and
  optimizations. Spark's core engine adds support for secured YARN
  clusters, a unified tool for submitting Spark applications, and
  several performance and stability improvements. Finally, Spark
 adds
  support for Java 8 lambda syntax and improves coverage of the Java
  and
  Python API's.
 
  Those features only scratch the surface - check out the release
  notes here:
  http://spark.apache.org/releases/spark-release-1-0-0.html
 
  Note that since release artifacts were posted recently, certain
  mirrors may not have working downloads for a few hours.
 
  - Patrick
 
 
 
 







Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval.

I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval,
taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD, because
only these could appear in the final answer.

I am stuck at: how do I retain only the top 10 hashtags in each RDD of my DStream?
Basically I need to transform my DStream so that each RDD contains only the
top 10 hashtags, so that the number of hashtags per 5-minute interval stays low.

If there is a more efficient way of doing this, please let me know that as well.

Thanks,
Nilesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517p6577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
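
For reference, a minimal pyspark sketch of the per-RDD idea described above,
using made-up sample data: keep only the 10 largest (count, hashtag) pairs of a
single RDD with top(). The same top-N-per-RDD step can be applied to every RDD
of the DStream via transform() in the Scala/Java streaming API.

from pyspark import SparkContext

sc = SparkContext("local[2]", "TopHashtags")

# Hypothetical (hashtag, count) pairs for one batch.
counts = sc.parallelize([("#spark", 42), ("#hadoop", 17),
                         ("#kafka", 99), ("#scala", 5)])

# Flip to (count, hashtag) so top() can use the natural tuple ordering,
# then keep only the 10 largest counts.
top10 = counts.map(lambda kv: (kv[1], kv[0])).top(10)
print(top10)

Retaining only the top 10 per RDD before the windowed aggregation keeps the
number of candidate hashtags per 5-minute interval small, which is the
reduction described above.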


Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-30 Thread jamborta
Thanks for the reply. I am definitely running 1.0.0; I set it up manually.

To answer my own question: I found out from the examples that it needs a
new data type called LabeledPoint instead of a numpy array.


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-MLlib-examples-don-t-work-with-Spark-1-0-0-tp6546p6579.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
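
For anyone hitting the same issue, here is a minimal pyspark sketch (with
made-up data) of the change described above: wrap each label/feature pair in a
LabeledPoint before handing the RDD to an MLlib training routine.

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext("local[2]", "LabeledPointExample")

# Each training example becomes LabeledPoint(label, features)
# instead of a bare numpy array.
raw = [(1.0, np.array([1.0, 0.0, 3.0])),
       (0.0, np.array([0.0, 2.0, 1.0]))]
points = sc.parallelize([LabeledPoint(label, features) for label, features in raw])

model = LogisticRegressionWithSGD.train(points, iterations=10)
print(model.predict(np.array([1.0, 0.0, 2.0])))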


Re: Announcing Spark 1.0.0

2014-05-30 Thread Ognen Duzlevski

How exciting! Congratulations! :-)
Ognen

On 5/30/14, 5:12 AM, Patrick Wendell wrote:

I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.

Spark 1.0.0 is Spark's largest release ever, with contributions from
117 developers. I'd like to thank everyone involved in this release -
it was truly a community effort with fixes, features, and
optimizations contributed from dozens of organizations.

This release expands Spark's standard libraries, introducing a new SQL
package (SparkSQL) which lets users integrate SQL queries into
existing Spark workflows. MLlib, Spark's machine learning library, is
expanded with sparse vector support and several new algorithms. The
GraphX and Streaming libraries also introduce new features and
optimizations. Spark's core engine adds support for secured YARN
clusters, a unified tool for submitting Spark applications, and
several performance and stability improvements. Finally, Spark adds
support for Java 8 lambda syntax and improves coverage of the Java and
Python API's.

Those features only scratch the surface - check out the release notes here:
http://spark.apache.org/releases/spark-release-1-0-0.html

Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.

- Patrick




Re: Announcing Spark 1.0.0

2014-05-30 Thread Chanwit Kaewkasi
Congratulations !!

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Fri, May 30, 2014 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick


Re: SparkContext startup time out

2014-05-30 Thread Pierre B
I was annoyed by this as well.
It appears that just permuting the order of dependency inclusion solves this
problem:

first Spark, then your CDH Hadoop distro.

HTH,

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-tp1753p6582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: KryoSerializer Exception

2014-05-30 Thread Andrea Esposito
Hi,

I just migrated to 1.0 and I am still having the same issue.

It happens either with or without the custom registrator; just using the
KryoSerializer triggers the exception immediately.

I set the Kryo settings through system properties:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "it.unipi.thesis.andrea.esposito.onjag.test.TestKryoRegistrator")

The registrator is just a sequence of:
kryo.register(classOf[MyClass])

I also tried with a very small RDD (a few MB of serialized data) and the
problem still occurs.

The problem seems to be related to broadcast, but I'm completely stuck.

The complete log follows:

2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 5 (task 3.0:1)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Loss was due to java.io.EOFException
java.io.EOFException
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:119)
    at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:205)
    at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:135)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 4 (task 3.0:0)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 6 (task 3.0:1)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 7 (task 3.0:0)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 8 (task 3.0:1)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 9 (task 3.0:0)
2014-05-30 15:47:36 WARN  TaskSetManager:70 - Lost TID 10 (task 3.0:1)
2014-05-30 15:47:36 ERROR TaskSetManager:74 - Task 3.0:1 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:1 failed 4 times, most recent failure: Exception failure in TID 10 on host Andrea-Laptop.unipi.it: java.io.EOFException
        org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:119)
        org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:205)
        org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89)
        sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)

Re: Announcing Spark 1.0.0

2014-05-30 Thread Dean Wampler
Congratulations!!


On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick




-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Local file being refrenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Hi,

I recently posted a question on Stack Overflow but didn't get any reply. I
have joined the mailing list now. Can anyone guide me towards a solution for
the problem described here:

http://stackoverflow.com/questions/23923966/writing-the-rdd-data-in-excel-file-along-mapping-in-apache-spark

Thanks in advance

-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Daniel Siegmann
The Spark 1.0.0 release notes state that "internal instrumentation has been
added to allow applications to monitor and instrument Spark jobs." Can
anyone point me to the docs for this?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


RE: Announcing Spark 1.0.0

2014-05-30 Thread Ian Ferreira
Congrats

Sent from my Windows Phone

From: Dean Wampler deanwamp...@gmail.com
Sent: 5/30/2014 6:53 AM
To: user@spark.apache.org
Subject: Re: Announcing Spark 1.0.0

Congratulations!!


On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick




--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Using Spark on Data size larger than Memory size

2014-05-30 Thread Vibhor Banga
Hi all,

I am planning to use Spark with HBase, where I generate an RDD by reading data
from an HBase table.

I want to know: in the case where the HBase table grows larger than the RAM
available in the cluster, will the application fail, or will there just be an
impact on performance?

Any thoughts in this direction will be helpful and are welcome.

Thanks,
-Vibhor
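
One thing worth adding: Spark does not need the whole input in memory at once,
and if the data is explicitly cached, a storage level that spills to disk avoids
failures when it no longer fits in RAM. A minimal pyspark sketch (the input path
is hypothetical; the same persist() call applies to an RDD built from an HBase
table):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "LargerThanMemory")

# Hypothetical input; stands in for an RDD read from an HBase table.
records = sc.textFile("hdfs:///data/huge_table")

# MEMORY_AND_DISK keeps partitions that fit in RAM in memory and writes the
# rest to local disk instead of dropping them; uncached partitions are simply
# recomputed as needed rather than held in memory all at once.
records.persist(StorageLevel.MEMORY_AND_DISK)

print(records.count())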


Re: Problem using Spark with Hbase

2014-05-30 Thread Vibhor Banga
Thanks Mayur for the reply.

Actually, the issue was that I was running the Spark application on hadoop-2.2.0,
and the HBase version there was 0.95.2.

But Spark by default gets built against an older HBase version, so I had to
build Spark again with the HBase version set to 0.95.2 in the Spark build file.
That worked.

Thanks,
-Vibhor


On Wed, May 28, 2014 at 11:34 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 Try this..

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, May 28, 2014 at 7:40 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Any one who has used spark this way or has faced similar issue, please
 help.

 Thanks,
 -Vibhor


 On Wed, May 28, 2014 at 6:03 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am facing issues while using spark with HBase. I am getting
 NullPointerException at org.apache.hadoop.hbase.TableName.valueOf
 (TableName.java:288)

 Can someone please help to resolve this issue. What am I missing ?


 I am using following snippet of code -

 Configuration config = HBaseConfiguration.create();

 config.set("hbase.zookeeper.znode.parent", "hostname1");
 config.set("hbase.zookeeper.quorum", "hostname1");
 config.set("hbase.zookeeper.property.clientPort", "2181");
 config.set(hbase.master, hostname1:
 config.set("fs.defaultFS", "hdfs://hostname1/");
 config.set("dfs.namenode.rpc-address", "hostname1:8020");

 config.set(TableInputFormat.INPUT_TABLE, tableName);

 JavaSparkContext ctx = new JavaSparkContext(args[0], "Simple",
 System.getenv("sparkHome"),
 JavaSparkContext.jarOfClass(Simple.class));

 JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD
 = ctx.newAPIHadoopRDD(config, TableInputFormat.class,
 ImmutableBytesWritable.class, Result.class);

 Map<ImmutableBytesWritable, Result> rddMap =
 hBaseRDD.collectAsMap();


 But when I go to the spark cluster and check the logs, I see following
 error -

 INFO NewHadoopRDD: Input split: w3-target1.nm.flipkart.com:,
 14/05/28 16:48:51 ERROR TableInputFormat: java.lang.NullPointerException
 at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:288)
 at org.apache.hadoop.hbase.client.HTable.init(HTable.java:154)
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:99)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:92)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Thanks,

 -Vibhor








-- 
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore


Re: Announcing Spark 1.0.0

2014-05-30 Thread Nicholas Chammas
You guys were up late, eh? :) I'm looking forward to using this latest
version.

Is there any place we can get a list of the new functions in the Python
API? The release notes don't enumerate them.

Nick



On Fri, May 30, 2014 at 10:15 AM, Ian Ferreira ianferre...@hotmail.com
wrote:

  Congrats

 Sent from my Windows Phone
  --
 From: Dean Wampler deanwamp...@gmail.com
 Sent: ‎5/‎30/‎2014 6:53 AM

 To: user@spark.apache.org
 Subject: Re: Announcing Spark 1.0.0

   Congratulations!!


 On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick




  --
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com



Subscribing to news releases

2014-05-30 Thread Nick Chammas
Is there a way to subscribe to news releases
http://spark.apache.org/news/index.html? That would be swell.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Subscribing-to-news-releases-tp6592.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Announcing Spark 1.0.0

2014-05-30 Thread giive chen
Great work!
On May 30, 2014 10:15 PM, Ian Ferreira ianferre...@hotmail.com wrote:

  Congrats

 Sent from my Windows Phone
  --
 From: Dean Wampler deanwamp...@gmail.com
 Sent: 5/30/2014 6:53 AM
 To: user@spark.apache.org
 Subject: Re: Announcing Spark 1.0.0

   Congratulations!!


 On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick




  --
 Dean Wampler, Ph.D.
 Typesafe
 @deanwampler
 http://typesafe.com
 http://polyglotprogramming.com



Spark 1.0.0 - Java 8

2014-05-30 Thread Upender Nimbekar
Great news! I've been awaiting this release to start doing some coding
with Spark using Java 8. Can I run the Spark 1.0 examples on a virtual host
with 16 GB of RAM and a fairly decent amount of hard disk, or do I really need
to use a cluster of machines?
Second, are there any good examples of using MLlib on Spark? Please point me
in the right direction.

Thanks
Upender

On Fri, May 30, 2014 at 6:12 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick



Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Surendranauth Hiraman
With respect to virtual hosts, my team uses Vagrant/Virtualbox. We have 3
CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node.

Everything works fine, though if you are using MapR, you have to make sure
they are all on the same subnet.

-Suren



On Fri, May 30, 2014 at 12:20 PM, Upender Nimbekar upent...@gmail.com
wrote:

 Great News ! I've been awaiting this release to start doing some coding
 with Spark using Java 8. Can I run Spark 1.0 examples on a virtual host
 with 16 GB ram and fair descent amount of hard disk ? Or do I reaaly need
 to use a cluster of machines.
 Second, are there any good exmaples of using MLIB on Spark. Please shoot
 me in the right direction.

 Thanks
 Upender

 On Fri, May 30, 2014 at 6:12 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes
 here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick





-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Aaron Davidson
Also, the Spark examples can run out of the box on a single machine, as
well as a cluster. See the Master URLs heading here:
http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
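
As an illustration in pyspark (host name and port below are placeholders), the
same application can target either a local master or a standalone cluster just
by changing the master URL:

from pyspark import SparkContext

# Run locally with 4 worker threads...
sc = SparkContext("local[4]", "MyApp")

# ...or against a standalone cluster by swapping the master URL:
# sc = SparkContext("spark://master-host:7077", "MyApp")

print(sc.parallelize(range(100)).sum())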


On Fri, May 30, 2014 at 9:24 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 With respect to virtual hosts, my team uses Vagrant/Virtualbox. We have 3
 CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node.

 Everything works fine, though if you are using MapR, you have to make sure
 they are all on the same subnet.

 -Suren



 On Fri, May 30, 2014 at 12:20 PM, Upender Nimbekar upent...@gmail.com
 wrote:

 Great News ! I've been awaiting this release to start doing some coding
 with Spark using Java 8. Can I run Spark 1.0 examples on a virtual host
 with 16 GB ram and fair descent amount of hard disk ? Or do I reaaly need
 to use a cluster of machines.
 Second, are there any good exmaples of using MLIB on Spark. Please shoot
 me in the right direction.

 Thanks
 Upender

 On Fri, May 30, 2014 at 6:12 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
 is a milestone release as the first in the 1.0 line of releases,
 providing API stability for Spark's core interfaces.

 Spark 1.0.0 is Spark's largest release ever, with contributions from
 117 developers. I'd like to thank everyone involved in this release -
 it was truly a community effort with fixes, features, and
 optimizations contributed from dozens of organizations.

 This release expands Spark's standard libraries, introducing a new SQL
 package (SparkSQL) which lets users integrate SQL queries into
 existing Spark workflows. MLlib, Spark's machine learning library, is
 expanded with sparse vector support and several new algorithms. The
 GraphX and Streaming libraries also introduce new features and
 optimizations. Spark's core engine adds support for secured YARN
 clusters, a unified tool for submitting Spark applications, and
 several performance and stability improvements. Finally, Spark adds
 support for Java 8 lambda syntax and improves coverage of the Java and
 Python API's.

 Those features only scratch the surface - check out the release notes
 here:
 http://spark.apache.org/releases/spark-release-1-0-0.html

 Note that since release artifacts were posted recently, certain
 mirrors may not have working downloads for a few hours.

 - Patrick





 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io




Re: Local file being refrenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hi Rahul,

I'll just copy & paste your question here to aid with context, and
reply afterwards.

-

Can I write the RDD data to an Excel file, along with the mapping, in
Apache Spark? Is that a correct way? Isn't the writing a local
function that can't be passed over the cluster?

Below is the Python code (it's just an example to clarify my
question; I understand that this implementation may not actually be
required):

import xlsxwriter
import sys
import math
from pyspark import SparkContext

# get the spark context in sc.

workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()

data = sc.textFile("xyz.txt")
# xyz.txt is a file where each line contains strings delimited by spaces

row = 0

def mapperFunc(x):
    global row  # update the module-level counter
    for i in range(0, 4):
        worksheet.write(row, i, x.split(" ")[i])
    row += 1
    return len(x.split())

data2 = data.map(mapperFunc)

workbook.close()

There are 2 questions:

1. Is using row in 'mapperFunc' like this a correct way? Will it
increment row each time?
2. Is writing to the Excel file using worksheet.write() inside the
mapper function a correct way?

Also, if #2 is correct, then please clarify this doubt: since the
worksheet is created on the local machine, how does it work?
-


On Fri, May 30, 2014 at 6:55 AM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
 Hi,

 I recently posted a question on Stack Overflow but didn't get any reply. I
 have joined the mailing list now. Can any of you guide me on the
 problem mentioned in

 http://stackoverflow.com/questions/23923966/writing-the-rdd-data-in-excel-file-along-mapping-in-apache-spark

 Thanks in advance

 --
 Rahul K Bhojwani
 3rd Year B.Tech
 Computer Science and Engineering
 National Institute of Technology, Karnataka



-- 
Marcelo


Re: Local file being referenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hello there,

On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
 workbook = xlsxwriter.Workbook('output_excel.xlsx')
 worksheet = workbook.add_worksheet()

 data = sc.textFile("xyz.txt")
 # xyz.txt is a file where each line contains strings delimited by spaces

 row = 0

 def mapperFunc(x):
     global row  # update the module-level counter
     for i in range(0, 4):
         worksheet.write(row, i, x.split(" ")[i])
     row += 1
     return len(x.split())

 data2 = data.map(mapperFunc)

 Is using row in 'mapperFunc' like this a correct way? Will it
 increment row each time?

No. mapperFunc will be executed somewhere else, not in the same
process running this script. I'm not familiar with how serializing
closures works in Spark/Python, but you'll most certainly be updating
the local copy of row in the executor, and your driver's copy will
remain at 0.
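
A quick way to see this for yourself, as a minimal sketch (assuming sc is a
live SparkContext):

counter = 0

def bump(_):
    global counter
    counter += 1  # increments the copy living in the Python worker process

sc.parallelize(range(10)).foreach(bump)
print(counter)  # still 0 in the driver process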

In general, in a distributed execution environment like Spark you want
to avoid as much as possible using state. row in your code is state,
so to do what you want you'd have to use other means (like Spark's
accumulators). But those are generally expensive in a distributed
system, and to be avoided if possible.
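
For completeness, a minimal sketch of the accumulator route (assuming sc and
data are the SparkContext and RDD from the snippet above):

row_count = sc.accumulator(0)

def count_and_measure(line):
    row_count.add(1)          # executors send their increments back to the driver
    return len(line.split())

lengths = data.map(count_and_measure).collect()  # an action must run for updates to arrive
print(row_count.value)        # total number of lines processed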

 Is writing to the Excel file using worksheet.write() inside the
 mapper function a correct way?

No, for the same reasons. Your executor will have a copy of your
workbook variable. So the write() will happen locally to the
executor, and after the mapperFunc() returns, that will be discarded -
so your driver won't see anything.

As a rule of thumb, your closures should try to use only their
arguments as input, or at most use local variables as read-only, and
only produce output in the form of return values. There are cases
where you might want to break these rules, of course, but in general
that's the mindset you should be in.

Also note that you're not actually executing anything here.
data.map() is a transformation, so you're just building the
execution graph for the computation. You need to execute an action
(like collect() or take()) if you want the computation to actually
occur.
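
Putting both points together, a side-effect-free mapper might look like this
sketch (again assuming sc and data from the code above):

def mapper_func(line):
    # Return data instead of writing files from inside the closure.
    return line.split(" ")[:4]

rows_rdd = data.map(mapper_func)   # lazy: only builds the execution graph
first_rows = rows_rdd.take(5)      # action: this is what actually runs the maps

# Any local file output (e.g. with xlsxwriter) belongs here, on the driver,
# using the collected results.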

-- 
Marcelo


Re: Local file being referenced in mapper function

2014-05-30 Thread Jey Kottalam
Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your
program, in pseudo-Python:


# connect to Spark cluster
sc = SparkContext(...)

# load input data
input_data = load_xls(file("input.xls"))
input_rows = input_data['Sheet1'].rows

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD
result_rdd = input_rdd.map(munge_row)

# collect result RDD to local process
result_rows = result_rdd.collect()

# write output file
write_xls(file("output.xls", "w"), result_rows)
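
As a more concrete sketch of the same driver-side pattern - assuming
xlsxwriter is installed on the driver and using hard-coded sample rows in
place of the load_xls placeholder:

import xlsxwriter
from pyspark import SparkContext

sc = SparkContext("local[2]", "ExcelDemo")

# Plain Python rows; in practice these would come from reading the input file.
input_rows = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
result_rows = sc.parallelize(input_rows).map(lambda r: [v.upper() for v in r]).collect()

# All file I/O happens here on the driver, after collect().
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()
for row_idx, row in enumerate(result_rows):
    for col_idx, value in enumerate(row):
        worksheet.write(row_idx, col_idx, value)
workbook.close()
sc.stop()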



Hope that helps,
-Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello there,

 On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
 workbook = xlsxwriter.Workbook('output_excel.xlsx')
 worksheet = workbook.add_worksheet()

 data = sc.textFile("xyz.txt")
 # xyz.txt is a file where each line contains strings delimited by spaces

 row = 0

 def mapperFunc(x):
     global row  # update the module-level counter
     for i in range(0, 4):
         worksheet.write(row, i, x.split(" ")[i])
     row += 1
     return len(x.split())

 data2 = data.map(mapperFunc)

 Is using row in 'mapperFunc' like this a correct way? Will it
 increment row each time?

 No. mapperFunc will be executed somewhere else, not in the same
 process running this script. I'm not familiar with how serializing
 closures works in Spark/Python, but you'll most certainly be updating
 the local copy of row in the executor, and your driver's copy will
 remain at 0.

 In general, in a distributed execution environment like Spark you want
 to avoid as much as possible using state. row in your code is state,
 so to do what you want you'd have to use other means (like Spark's
 accumulators). But those are generally expensive in a distributed
 system, and to be avoided if possible.

 Is writing to the Excel file using worksheet.write() inside the
 mapper function a correct way?

 No, for the same reasons. Your executor will have a copy of your
 workbook variable. So the write() will happen locally to the
 executor, and after the mapperFunc() returns, that will be discarded -
 so your driver won't see anything.

 As a rule of thumb, your closures should try to use only their
 arguments as input, or at most use local variables as read-only, and
 only produce output in the form of return values. There are cases
 where you might want to break these rules, of course, but in general
 that's the mindset you should be in.

 Also note that you're not actually executing anything here.
 data.map() is a transformation, so you're just building the
 execution graph for the computation. You need to execute an action
 (like collect() or take()) if you want the computation to actually
 occur.

 --
 Marcelo


Trouble with EC2

2014-05-30 Thread PJ$
Hey Folks,

I'm really having quite a bit of trouble getting Spark running on EC2. I'm
not using the scripts at https://github.com/apache/spark/tree/master/ec2
because I'd like to know how everything works. But I'm going a little
crazy. I think that something about the networking configuration must be
messed up, but I'm at a loss. Shortly after starting the cluster, I get a
lot of this:

14/05/30 18:03:22 INFO master.Master: Registering worker
ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:22 INFO master.Master: Registering worker
ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker
ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker
ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:05:54 INFO master.Master:
akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
removing it.
14/05/30 18:05:54 INFO actor.LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
was not delivered. [5] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/05/30 18:05:54 INFO master.Master:
akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
removing it.
14/05/30 18:05:54 INFO master.Master:
akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association
failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
]
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association
failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
]
14/05/30 18:05:54 INFO master.Master:
akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
removing it.
14/05/30 18:05:54 INFO master.Master:
akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association
failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485


Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-30 Thread Andrei
Thanks, Stephen. I have eventually decided to go with assembly, but to leave
the Spark and Hadoop jars out and instead use `spark-submit` to provide these
dependencies automatically. This way no resource conflicts arise and
mergeStrategy needs no modification. To record this stable setup and also
share it with the community, I've crafted a project [1] with a minimal working
config. It is an SBT project with the assembly plugin, Spark 1.0, and
Cloudera's Hadoop client. I hope it will help somebody get a Spark setup going
quicker.

Though I'm fine with this setup for final builds, I'm still looking for a
more interactive dev setup - something that doesn't require a full rebuild.

[1]: https://github.com/faithlessfriend/sample-spark-project

Thanks and have a good weekend,
Andrei

On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch java...@gmail.com wrote:


 The MergeStrategy combined with sbt assembly did work for me.  This is not
 painless: some trial and error and the assembly may take multiple minutes.

 You will likely want to filter out some additional classes from the
 generated jar file.  Here is an SOF answer explaining that, with (IMHO)
 the best answer snippet included here (in this case the OP understandably
 did not want to include javax.servlet.Servlet):

 http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar


 mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
   ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
 }

 There is a setting to not include the project files in the assembly but I
 do not recall it at this moment.



 2014-05-29 10:13 GMT-07:00 Andrei faithlessfri...@gmail.com:

 Thanks, Jordi, your gist looks pretty much like what I have in my project
 currently (with a few exceptions that I'm going to borrow).

 I like the idea of using sbt package, since it doesn't require third-party
 plugins and, most importantly, doesn't create a mess of classes and
 resources. But in this case I'll have to handle the jar list manually via the
 Spark context. Is there a way to automate this process? E.g. when I was a
 Clojure guy, I could run lein deps (lein is a build tool similar to sbt) to
 download all the dependencies and then just enumerate them from my app. Maybe
 you have heard of something like that for Spark/SBT?

 Thanks,
 Andrei


 On Thu, May 29, 2014 at 3:48 PM, jaranda jordi.ara...@bsc.es wrote:

 Hi Andrei,

 I think the preferred way to deploy Spark jobs is by using the sbt package
 task instead of the sbt assembly plugin. In any case, as you comment, the
 mergeStrategy in combination with some dependency exclusions should fix
 your problems. Have a look at this gist
 https://gist.github.com/JordiAranda/bdbad58d128c14277a05 for further
 details (I just followed some recommendations from the sbt assembly
 plugin documentation).

 Up to now I haven't found a proper way to combine my development/deployment
 phases, although I must say my experience with Spark is pretty limited (it
 really depends on your deployment requirements as well). In this case, I think
 someone else could give you some further insights.

 Best,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-uberjar-a-recommended-way-of-running-Spark-Scala-applications-tp6518p6520.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Local file being referenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Thanks Marcelo,

It actually made a few of my concepts clear. (y)


-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Re: Local file being referenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Thanks, Jey,

It was helpful.


-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Failed to remove RDD error

2014-05-30 Thread Michael Chang
I'm running some Kafka streaming Spark contexts (on 0.9.1), and they seem
to be dying after 10 or so minutes with a lot of these errors.  I can't
really tell what's going on here, except that maybe the driver is
unresponsive somehow?  Has anyone seen this before?

14/05/31 01:13:30 ERROR BlockManagerMaster: Failed to remove RDD 12635

akka.pattern.AskTimeoutException: Timed out

at
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)

at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)

at
scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:691)

at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:688)

at
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)

at
akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)

at
akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)

at
akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)

at java.lang.Thread.run(Thread.java:744)

Thanks,

Mike


possible typos in spark 1.0 documentation

2014-05-30 Thread Yadid Ayzenberg

Congrats on the new 1.0 release. Amazing work !

It looks like there may be some typos in the latest 
http://spark.apache.org/docs/latest/sql-programming-guide.html


In the Running SQL on RDDs section, when choosing the Java example:

1. ctx is an instance of JavaSQLContext but the textFile method is 
called as a member of ctx.
According to the API, JavaSQLContext does not have such a member, so I'm 
guessing this should be sc instead.


2. In that same code example the object sqlCtx is referenced, but it is 
never instantiated in the code.

Should this be ctx?

Cheers,

Yadid



Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
Hi Jeremy,

That's interesting, I don't think anyone has ever reported an issue running
these scripts due to Python incompatibility, but they may require Python
2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a
good place to start. But if there is a straightforward way to make them
compatible with 2.6 we should do that.
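
If the incompatibility turns out to be something like subprocess.check_output
(which only exists in Python 2.7+) - just a guess, not a confirmed cause - a
2.6-safe fallback is straightforward:

import subprocess

def check_output(cmd):
    # Rough equivalent of subprocess.check_output() that also works on Python 2.6.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return out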

For r3.large, we can add that to the script. It's a newer type. Any
interest in contributing this?

- Patrick

On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:


 Hi there! I'm relatively new to the list, so sorry if this is a repeat:

 I just wanted to mention there are still problems with the EC2 scripts.
 Basically, they don't work.

 First, if you run the scripts on Amazon's own suggested version of Linux,
 they break because Amazon installs Python 2.6.9, and the scripts use a
 couple of Python 2.7 commands. I have to sudo yum install python27, and
 then edit the spark-ec2 shell script to use that specific version.
 Annoying, but minor.

 (the base python command isn't upgraded to 2.7 on many systems,
 apparently because it would break yum)

 The second minor problem is that the script doesn't know about the
 r3.large servers... also easily fixed by adding to the spark_ec2.py
 script. Minor.

 The big problem is that after the EC2 cluster is provisioned, installed,
 set up, and everything, it fails to start up the webserver on the master.
 Here's the tail of the log:

 Starting GANGLIA gmond:[  OK  ]
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmetad:  [FAILED]
 Starting GANGLIA gmetad:   [  OK  ]
 Stopping httpd:[FAILED]
 Starting httpd: httpd: Syntax error on line 153 of
 /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into
 server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object
 file: No such file or directory
[FAILED]

 Basically, the AMI you have chosen does not seem to have a full install
 of Apache, and is missing several modules that are referred to in the
 httpd.conf file that is installed. The full list of missing modules is:

 authn_alias_module modules/mod_authn_alias.so
 authn_default_module modules/mod_authn_default.so
 authz_default_module modules/mod_authz_default.so
 ldap_module modules/mod_ldap.so
 authnz_ldap_module modules/mod_authnz_ldap.so
 disk_cache_module modules/mod_disk_cache.so

 Alas, even if these modules are commented out, the server still fails to
 start.

 root@ip-172-31-11-193 ~]$ service httpd start
 Starting httpd: AH00534: httpd: Configuration error: No MPM loaded.

 That means Spark 1.0.0 clusters on EC2 are Dead-On-Arrival when run
 according to the instructions. Sorry.

 Any suggestions on how to proceed? I'll keep trying to fix the webserver,
 but (a) changes to httpd.conf get blown away by resume, and (b) anything
 I do has to be redone every time I provision another cluster. Ugh.

 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers