How to convert a non-rdd data to rdd.

2014-10-12 Thread rapelly kartheek
Hi,

I am trying to write a String that is not an RDD to HDFS. The data is a
variable in the Spark scheduler code, so none of Spark's file operations
work because my data is not an RDD.

So I tried SparkContext.parallelize(data), but it throws this error:

[error]
/home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265:
not found: value SparkContext
[error]  SparkContext.parallelize(result)
[error]  ^
[error] one error found

I realize that this data lives inside the scheduler, so the SparkContext
may not have been created yet.

Any help with writing this scheduler variable to HDFS is appreciated!

-Karthik
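Since the value is a plain in-JVM object and no SparkContext may exist yet at
that point in the scheduler, one option is to skip RDDs entirely and write the
string with the Hadoop FileSystem API. A minimal sketch, assuming the Hadoop
configuration on the classpath points at the target HDFS cluster; the object
name and output path are illustrative:

import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DebugDump {
  // Write an arbitrary in-memory string straight to HDFS; no RDD involved.
  def writeString(data: String, dest: String): Unit = {
    val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)
    val out = fs.create(new Path(dest), true)   // overwrite if the file already exists
    try {
      out.write(data.getBytes(StandardCharsets.UTF_8))
    } finally {
      out.close()
    }
  }
}

// e.g. DebugDump.writeString(result.toString, "/tmp/scheduler-debug.txt")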


Re: How to convert a non-rdd data to rdd.

2014-10-12 Thread @Sanjiv Singh
Hi Karthik,

Can you provide more detail about the dataset data that you want to
parallelize with the call below?

SparkContext.parallelize(data)




Regards,
Sanjiv Singh
Mob: +091 9990-447-339


Re: How to convert a non-rdd data to rdd.

2014-10-12 Thread rapelly kartheek
It's a variable in the spark-1.0.0/*/storage/BlockManagerMaster.scala class:
the value returned by the askDriverWithReply() call for the getPeers() method.

Basically, it is a Seq[ArrayBuffer]:

ArraySeq(ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s1,
34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0,
s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0),
BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006,
0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(driver,
karthik, 51051, 0), BlockManagerId(1, s1, 47006, 0)))
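If the goal is just to persist this structure for inspection, it can be
flattened to plain text first and then written with the Hadoop FileSystem
sketch shown after the first message. renderPeers below is an illustrative
helper, not part of Spark:

// Flatten the Seq[ArrayBuffer[BlockManagerId]] into one readable line per peer list.
def renderPeers[T](peers: Seq[Seq[T]]): String =
  peers.zipWithIndex
    .map { case (buf, i) => i + ": " + buf.mkString(", ") }
    .mkString("\n")

// e.g. DebugDump.writeString(renderPeers(result), "/tmp/getPeers-dump.txt")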




Re: How to convert a non-rdd data to rdd.

2014-10-12 Thread Kartheek.R
Hi Sean,
I tried with sc as well: sc.parallelize(data). But I get the error: value
sc not found.

On Sun, Oct 12, 2014 at 1:47 PM, sowen wrote:

 It is a method of the class, not a static method of the object. Since a
 SparkContext is available as sc in the shell, or you have perhaps created
 one similarly in your app, write sc.parallelize(...)
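For reference, a minimal sketch of the distinction Sean describes:
parallelize is available only on a live SparkContext instance, for example in
an application's own driver program, not inside BlockManagerMaster where no
such value is in scope. The app name and output path are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parallelize-example")
    val sc = new SparkContext(conf)              // the instance that owns parallelize()

    // SparkContext.parallelize(data) does not compile: there is no such static method.
    val rdd = sc.parallelize(Seq("a", "b", "c"))
    rdd.saveAsTextFile("/tmp/parallelize-example")

    sc.stop()
  }
}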

RE: Spark SQL parser bug?

2014-10-12 Thread Cheng, Hao
Hi, I couldn’t reproduce the bug with the latest master branch. Which version
are you using? Can you also list the data in the table “x”?

case class T(a: String, ts: java.sql.Timestamp)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val data = sc.parallelize(1 :: 2 :: Nil).map(i => T(i.toString, new
java.sql.Timestamp(i)))
data.registerTempTable("x")
val s = sqlContext.sql("select a from x where ts >= '1970-01-01 00:00:00'")
s.collect

output:
res1: Array[org.apache.spark.sql.Row] = Array([1], [2])

Cheng Hao

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Sunday, October 12, 2014 12:06 AM
To: Cheng Lian; user@spark.apache.org
Subject: RE: Spark SQL parser bug?

I tried even without the “T” and it still returns an empty result:

scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01
00:00:00'")
sRdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[35] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
ExistingRdd [a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at 
basicOperators.scala:208

scala> sRdd.collect
res10: Array[org.apache.spark.sql.Row] = Array()


Mohammed

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, October 10, 2014 10:14 PM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: Spark SQL parser bug?


Hmm, there is a “T” in the timestamp string, which makes the string not a valid 
timestamp string representation. Internally Spark SQL uses 
java.sql.Timestamp.valueOf to cast a string to a timestamp.
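The behaviour of java.sql.Timestamp.valueOf is easy to check in isolation. A
quick illustration (not from the original thread) of why the ISO-style "T"
separator is rejected while the JDBC escape format is accepted:

import java.sql.Timestamp

object TimestampFormatCheck {
  def main(args: Array[String]): Unit = {
    // The JDBC escape format "yyyy-[m]m-[d]d hh:mm:ss[.f...]" parses fine.
    println(Timestamp.valueOf("2012-01-01 00:00:00"))

    // The ISO 8601 "T" separator is not accepted and throws.
    try {
      Timestamp.valueOf("2012-01-01T00:00:00")
    } catch {
      case e: IllegalArgumentException => println("rejected: " + e.getMessage)
    }
  }
}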

On 10/11/14 2:08 AM, Mohammed Guller wrote:
scala> rdd.registerTempTable("x")

scala> val sRdd = sqlContext.sql("select a from x where ts >=
'2012-01-01T00:00:00'")
sRdd: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
ExistingRdd [a#0,ts#1], MapPartitionsRDD[6] at mapPartitions at 
basicOperators.scala:208

scala> sRdd.collect
res2: Array[org.apache.spark.sql.Row] = Array()


Re: How to convert a non-rdd data to rdd.

2014-10-12 Thread Kartheek.R
Does a SparkContext exist when this part of the scheduler code
(askDriverWithReply()) gets executed?


Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
Dear Sparkers,

As promised, I've just updated the repo with a new name (for the sake of
clarity) and default branch, but especially with a dedicated README containing:

* explanations on how to launch and use it
* an intro on each feature like Spark, Classpaths, SQL, Dynamic update, ...
* pictures showing results

There is a notebook for each feature, so it's easier to try out!

Here is the repo:
https://github.com/andypetrella/spark-notebook/

HTH and PRs are more than welcome ;-).


aℕdy ℙetrella
about.me/noootsab

On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com wrote:

 Hi Andy,

 This sounds awesome. Please keep us posted. Meanwhile, can you share a
 link to your project? I wasn't able to find it.

 Cheers,

 Michael

 On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote:

 Heya

 You can check Zeppelin or my fork of the Scala Notebook.
 I'm going to push some effort on the doc this weekend, because it
 supports realtime graphing, Scala, SQL, dynamic loading of dependencies,
 and I started a widget this morning to track the progress of the jobs.
 I'm quite happy with it so far; I've used it with GraphX, MLlib, ADAM and
 the Cassandra connector.
 However, its major drawback is that it is a one-man (best) effort ftm! :-S
  On 8 Oct 2014 11:16, Dai, Kevin yun...@ebay.com wrote:

  Hi, All



 We need an interactive interface tool for Spark in which we can run Spark
 jobs and plot graphs to explore the data interactively.

 IPython Notebook is good, but it only supports Python (we want one
 supporting Scala)…



 BR,

 Kevin.









ClassNotFoundException was thrown while trying to save rdd

2014-10-12 Thread Tao Xiao
Hi all,

I'm using CDH 5.0.1 (Spark 0.9)  and submitting a job in Spark Standalone
Cluster mode.

The job is quite simple as follows:

object HBaseApp {
  def main(args: Array[String]) {
    testHBase("student", "/test/xt/saveRDD")
  }

  def testHBase(tableName: String, outFile: String) {
    val sparkConf = new SparkConf()
      .setAppName("-- Test HBase --")
      .set("spark.executor.memory", "2g")
      .set("spark.cores.max", "16")

    val sparkContext = new SparkContext(sparkConf)

    val rdd = sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)

    val c = rdd.count // successful
    println("\n\n\n " + c + " \n\n\n")

    rdd.saveAsTextFile(outFile)  // This line will throw
                                 // java.lang.ClassNotFoundException: com.xt.scala.HBaseApp$$anonfun$testHBase$1

    println("\n  down  \n")
  }
}

I submitted this job using the following script:

#!/bin/bash

HBASE_CLASSPATH=$(hbase classpath)
APP_JAR=/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar
SPARK_ASSEMBLY_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar
SPARK_MASTER=spark://b02.jsepc.com:7077

CLASSPATH=$CLASSPATH:$APP_JAR:$SPARK_ASSEMBLY_JAR:$HBASE_CLASSPATH
export SPARK_CLASSPATH=/usr/lib/hbase/lib/*

CONFIG_OPTS=-Dspark.master=$SPARK_MASTER

java -cp $CLASSPATH $CONFIG_OPTS com.xt.scala.HBaseApp $@

After I submitted the job, the count of rdd could be computed successfully,
but that rdd could not be saved into HDFS and the following exception was
thrown:

14/10/11 16:09:33 WARN scheduler.TaskSetManager: Loss was due to
java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: com.xt.scala.HBaseApp$$anonfun$testHBase$1
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
 at
org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
 at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
 at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
 at
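A frequent cause of this pattern, where count succeeds but saveAsTextFile
fails with a ClassNotFoundException for an application $$anonfun$ class, is
that the application jar is only on the driver's classpath (via java -cp) and
is never shipped to the executors, so they cannot deserialize closures that
reference classes from it. A hedged sketch of one way to rule that out:
register the jar on the SparkConf so the executors fetch it from the driver
(the path is the one used in the submit script above):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: same job configuration, but the application jar is registered
// with the context so executors can resolve com.xt.scala.HBaseApp$$anonfun$... classes.
val sparkConf = new SparkConf()
  .setAppName("-- Test HBase --")
  .set("spark.executor.memory", "2g")
  .set("spark.cores.max", "16")
  .setJars(Seq("/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar"))

val sparkContext = new SparkContext(sparkConf)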

Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-12 Thread Ted Yu
Your app is named scala.HBaseApp.
Does it read from / write to HBase?

Just curious.


setting heap space

2014-10-12 Thread Chengi Liu
Hi,
  I am trying to use Spark but I am having a hard time configuring the
SparkConf. My current conf is

conf = SparkConf().set("spark.executor.memory", "10g").set("spark.akka.frameSize",
"1").set("spark.driver.memory", "16g")

but I still see the Java heap space error:
14/10/12 09:54:50 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:332)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34)
at
com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readO


What's the right way to turn these knobs, and what other knobs can I play
with?
Thanks
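One general observation, not necessarily the resolution of this thread:
executor-side settings such as spark.executor.memory and spark.akka.frameSize
can be set on the conf before the context is created, but spark.driver.memory
normally has no effect when set programmatically, because the driver JVM is
already running by the time the conf is read; it usually has to be supplied at
launch time, e.g. spark-submit --driver-memory 16g. A Scala sketch of the
in-code part, with illustrative values:

import org.apache.spark.{SparkConf, SparkContext}

// Executor-side settings can go on the conf before the context exists.
val conf = new SparkConf()
  .setAppName("heap-tuning-sketch")       // illustrative name
  .set("spark.executor.memory", "10g")
  .set("spark.akka.frameSize", "100")     // MB; illustrative value
  // spark.driver.memory set here is usually too late: the driver JVM is
  // already running, so pass --driver-memory to spark-submit instead.

val sc = new SparkContext(conf)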


Re: Interactive interface tool for spark

2014-10-12 Thread Jaonary Rabarisoa
And what about Hue http://gethue.com ?



Re: Interactive interface tool for spark

2014-10-12 Thread andy petrella
Yeah, if it allows crafting some Scala/Spark code in a shareable manner, it
is another good option!

thx for sharing

aℕdy ℙetrella
about.me/noootsab




NullPointerException when deploying JAR to standalone cluster..

2014-10-12 Thread Jorge Simão
Hi, everybody!

I'm trying to deploy a simple app on a Spark standalone cluster with a single
node (the localhost).
Unfortunately, something goes wrong while processing the JAR file and a
NullPointerException is thrown.
I'm running everything on a single machine with Windows 8.
Check the details below.
Please help with suggestions on what is missing to make it work; I'm really
looking forward to working with Spark on a cluster.
The problem shows up both with my own little programs and with the Spark
examples (e.g. WordCount).
The problem also shows up both when running with my custom driver and when
using spark-submit or run-example (which calls spark-submit).


(I also compiled Hadoop from source for Windows, but it is not really being
used.)


Driver code:

SparkConf conf = new SparkConf().setAppName("SimpleTests")
        .setJars(new String[]{"file:///myworkspace/spark-tests.jar"})
        .setMaster("spark://mymachine:7077")
        .setSparkHome("/mysparkhome/spark-1.1.0-bin-hadoop2.4");
JavaSparkContext sc = new JavaSparkContext(conf);

The streaming code is trivial and the usual.

I get this output and error:

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/C:/Users/JorgePaulo/tmp/hadoop/hadoop-2.4.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/C:/Users/JorgePaulo/tmp/spark/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/10/12 20:15:00 INFO SecurityManager: Changing view acls to: JorgePaulo,
14/10/12 20:15:00 INFO SecurityManager: Changing modify acls to: JorgePaulo,
14/10/12 20:15:00 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(JorgePaulo, );
users with modify permissions: Set(JorgePaulo, )
14/10/12 20:15:01 INFO Slf4jLogger: Slf4jLogger started
14/10/12 20:15:02 INFO Remoting: Starting remoting
14/10/12 20:15:02 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@jsimao71-acer:4279]
14/10/12 20:15:02 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkDriver@jsimao71-acer:4279]
14/10/12 20:15:02 INFO Utils: Successfully started service 'sparkDriver' on
port 4279.
14/10/12 20:15:02 INFO SparkEnv: Registering MapOutputTracker
14/10/12 20:15:02 INFO SparkEnv: Registering BlockManagerMaster
14/10/12 20:15:02 INFO DiskBlockManager: Created local directory at
C:\Users\JORGEP~1\AppData\Local\Temp\spark-local-20141012201502-723f
14/10/12 20:15:02 INFO Utils: Successfully started service 'Connection
manager for block manager' on port 4282.
14/10/12 20:15:02 INFO ConnectionManager: Bound socket to port 4282 with id
= ConnectionManagerId(jsimao71-acer,4282)
14/10/12 20:15:02 INFO MemoryStore: MemoryStore started with capacity 669.3
MB
14/10/12 20:15:02 INFO BlockManagerMaster: Trying to register BlockManager
14/10/12 20:15:02 INFO BlockManagerMasterActor: Registering block manager
jsimao71-acer:4282 with 669.3 MB RAM
14/10/12 20:15:02 INFO BlockManagerMaster: Registered BlockManager
14/10/12 20:15:02 INFO HttpFileServer: HTTP File server directory is
C:\Users\JORGEP~1\AppData\Local\Temp\spark-4771bfb8-e4f4-43d2-a437-6d55ee7c88b4
14/10/12 20:15:02 INFO HttpServer: Starting HTTP Server
14/10/12 20:15:03 INFO Utils: Successfully started service 'HTTP file
server' on port 4283.
14/10/12 20:15:03 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
14/10/12 20:15:03 INFO SparkUI: Started SparkUI at http://jsimao71-acer:4040
14/10/12 20:15:10 INFO SparkContext: Added JAR
file:///Users/JorgePaulo/workspace/spark-tests.jar at
http://192.168.179.1:4283/jars/spark-tests.jar with timestamp 1413141310617
14/10/12 20:15:10 INFO AppClient$ClientActor: Connecting to master
spark://jsimao71-acer:7077...
14/10/12 20:15:10 INFO SparkDeploySchedulerBackend: SchedulerBackend is
ready for scheduling beginning after reached minRegisteredResourcesRatio:
0.0
14/10/12 20:15:11 INFO MemoryStore: ensureFreeSpace(159118) called with
curMem=0, maxMem=701843374
14/10/12 20:15:11 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 155.4 KB, free 669.2 MB)
14/10/12 20:15:11 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20141012201511-0014
14/10/12 20:15:11 INFO AppClient$ClientActor: Executor added:
app-20141012201511-0014/0 on worker-20141012171633-jsimao71-acer-1970
(jsimao71-acer:1970) with 4 cores
14/10/12 20:15:11 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20141012201511-0014/0 on hostPort jsimao71-acer:1970 with 4 cores,
512.0 MB RAM
14/10/12 20:15:11 INFO AppClient$ClientActor: Executor updated:
app-20141012201511-0014/0 is now 

Spark in cluster and errors

2014-10-12 Thread Morbious
Hi,

Can anyone explain to me how Spark works?
Why is it trying to connect from master port A to master port ABCD in
cluster mode with 6 workers?

14/10/09 19:37:19 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@...:7078] -> [akka.tcp://sparkExecutor@...:53757]:
Error [Association failed with [akka.tcp://sparkExecutor@...:53757]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@...:53757]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: spark-master1.domain.org/10.0.6.228:53757
]
14/10/09 19:37:19 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkWorker@...:7078] -> [akka.tcp://sparkExecutor@...:53757]:
Error [Association failed with [akka.tcp://sparkExecutor@...:53757]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@...:53757]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: spark-master1.domain.org/10.0.6.228:53757
]

I've spent almost a week trying to find a solution, and the more I dug, the
more of the same problems I found.

Best regards,

Morbious






Nested Query using SparkSQL 1.1.0

2014-10-12 Thread shahab
Hi,

Apparently it is possible to query nested JSON using Spark SQL but, mainly
due to the lack of proper documentation/examples, I did not manage to make
it work. I would appreciate it if you could point me to any example or help
with this issue.

Here is my code:

  val anotherPeopleRDD = sc.parallelize(
    """{
      "attributes": [
        {
          "data": { "gender": "woman" },
          "section": "Economy",
          "collectApp": "web",
          "id": 1409064792512
        }
      ]
    }""" :: Nil)

  val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)

  anotherPeople.registerTempTable("people")

  val query_people = sqlContext.sql("select attributes[0].collectApp from
people")

  query_people.foreach(println)

But instead of getting "web" as the printout, I am getting the following:

[[web,[woman],1409064792512, Economy]]



thanks,

/shahab
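One way to see what is going on (a suggestion, not taken from the original
thread) is to print the schema Spark SQL inferred from the JSON before
querying; the element type of the attributes array and the order of its
fields become visible, which helps check whether attributes[0].collectApp
resolves to the field you expect:

// Inspect the schema Spark SQL inferred for the JSON (SchemaRDD API, Spark SQL 1.1).
anotherPeople.printSchema()

// Then re-run the projection and compare the result against the printed field order.
sqlContext.sql("select attributes[0].collectApp from people").collect().foreach(println)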


Re: Spark in cluster and errors

2014-10-12 Thread Jorge Simão
You have a connection refused error.
You need to check:
- That the master is listening on the specified host/port.
- That no firewall is blocking access.
- That your config is pointing to the master host/port. Check the host
name from the web console.

Send more details about the cluster layout for more specific help.

Hope it helps.

Jorge.



Re: Spark job doesn't clean after itself

2014-10-12 Thread Rohit Pujari
Reviving this... any thoughts, experts?

On Thu, Oct 9, 2014 at 3:47 PM, Rohit Pujari rpuj...@hortonworks.com
wrote:

 Hello Folks:

 I'm running a Spark job on YARN. After execution, I would expect the
 Spark job to clean up the staging area, but it seems every run creates a
 new staging directory. Is there a way to force the Spark job to clean up
 after itself?

 Thanks,
 Rohit




-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-12 Thread Tao Xiao
In the beginning I tried to read from HBase and found that the exception was
thrown, so I started to debug the app. I removed the code that reads HBase
and tried to save an RDD containing a list, and the exception was still
thrown. So I'm sure that the exception was not caused by reading HBase.

While debugging I did not change the object name or the file name.



Re: small bug in pyspark

2014-10-12 Thread Josh Rosen
Hi Andy,

You may be interested in https://github.com/apache/spark/pull/2651, a
recent pull request of mine which cleans up / simplifies the configuration
of PySpark's Python executables.  For instance, it makes it much easier to
control which Python options are passed when launching the PySpark drivers
and workers.

- Josh

On Fri, Oct 10, 2014 at 5:24 PM, Andy Davidson 
a...@santacruzintegration.com wrote:

 Hi

 I am running spark on an ec2 cluster. I need to update python to 2.7. I
 have been following the directions on
 http://nbviewer.ipython.org/gist/JoshRosen/6856670

 https://issues.apache.org/jira/browse/SPARK-922


 I noticed that when I started a shell using pyspark, I correctly got
 python2.7; however, when I tried to start a notebook I got python2.6.



 change

 exec ipython $IPYTHON_OPTS

 to

  exec ipython2 $IPYTHON_OPTS


 One clean way to resolve this would be to add another environment
 variable like PYSPARK_PYTHON.


 Andy



 P.S. Matplotlib does not upgrade because of dependency problems. I'll let
 you know once I get this resolved.






Re: What if I port Spark from TCP/IP to RDMA?

2014-10-12 Thread Josh Rosen
Hi Theo,

Check out *spark-perf*, a suite of performance benchmarks for Spark:
https://github.com/databricks/spark-perf.

- Josh

On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote:

 Hi,

 Let's say that I managed to port Spark from TCP/IP to RDMA.
 What tool or benchmark can I use to test the performance improvement?

 BR,
 Theo
