How to convert non-RDD data to an RDD.
Hi, I am trying to write a String that is not an RDD to HDFS. This data is a variable in the Spark scheduler code. None of Spark's file operations work because my data is not an RDD. So I tried using SparkContext.parallelize(data), but it throws an error:

[error] /home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265: not found: value SparkContext
[error]     SparkContext.parallelize(result)
[error]     ^
[error] one error found

I realized that this data is part of the scheduler, so the SparkContext would not have been created yet. Any help in writing scheduler variable data to HDFS is appreciated!

-Karthik
Re: How to convert non-RDD data to an RDD.
Hi Karthik, Can you give us more detail about the dataset you want to parallelize with SparkContext.parallelize(data)?

Regards, Sanjiv Singh
Mob : +091 9990-447-339
Re: How to convert non-RDD data to an RDD.
It's a variable in the spark-1.0.0/*/storage/BlockManagerMaster.scala class: the value returned by the askDriverWithReply() method when it is called for getPeers(). Basically, it is a Seq[ArrayBuffer]:

ArraySeq(ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s1, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(driver, karthik, 51051, 0), BlockManagerId(1, s1, 47006, 0)))
Re: How to convert non-RDD data to an RDD.
Hi Sean, I tried even with sc, as sc.parallelize(data), but I get the error: value sc not found.

On Sun, Oct 12, 2014 at 1:47 PM, sowen wrote: It is a method of the class, not a static method of the object. Since a SparkContext is available as sc in the shell, or you have perhaps created one similarly in your app, write sc.parallelize(...)
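For reference, parallelize is called on a SparkContext instance rather than on the SparkContext companion object. A minimal sketch of what Sean describes, assuming the code runs in a driver program where a context can be created (the names and output path here are illustrative, not from the thread):

  import org.apache.spark.{SparkConf, SparkContext}

  object WriteStringExample {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("write-string-example")
      val sc = new SparkContext(conf)            // an instance, not the companion object
      val data = "some scheduler output"
      val rdd = sc.parallelize(Seq(data))        // wrap the single String in a one-element RDD
      rdd.saveAsTextFile("hdfs:///tmp/scheduler-output")  // hypothetical output path
      sc.stop()
    }
  }

This only works where a SparkContext is actually available, which is not the case inside BlockManagerMaster, as discussed below.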
RE: Spark SQL parser bug?
Hi, I couldn't reproduce the bug with the latest master branch. Which version are you using? Can you also list the data in the table "x"?

case class T(a: String, ts: java.sql.Timestamp)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val data = sc.parallelize(1 :: 2 :: Nil).map(i => T(i.toString, new java.sql.Timestamp(i)))
data.registerTempTable("x")
val s = sqlContext.sql("select a from x where ts >= '1970-01-01 00:00:00';")
s.collect

output:
res1: Array[org.apache.spark.sql.Row] = Array([1], [2])

Cheng Hao

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Sunday, October 12, 2014 12:06 AM
To: Cheng Lian; user@spark.apache.org
Subject: RE: Spark SQL parser bug?

I tried even without the "T" and it still returns an empty result:

scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01 00:00:00';")
sRdd: org.apache.spark.sql.SchemaRDD = SchemaRDD[35] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
 ExistingRdd [a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at basicOperators.scala:208

scala> sRdd.collect
res10: Array[org.apache.spark.sql.Row] = Array()

Mohammed

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Friday, October 10, 2014 10:14 PM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: Spark SQL parser bug?

Hmm, there is a "T" in the timestamp string, which makes the string not a valid timestamp string representation. Internally Spark SQL uses java.sql.Timestamp.valueOf to cast a string to a timestamp.

On 10/11/14 2:08 AM, Mohammed Guller wrote:
scala> rdd.registerTempTable("x")
scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01T00:00:00';")
sRdd: org.apache.spark.sql.SchemaRDD = SchemaRDD[4] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
Project [a#0]
 ExistingRdd [a#0,ts#1], MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:208

scala> sRdd.collect
res2: Array[org.apache.spark.sql.Row] = Array()
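Cheng Lian's point about java.sql.Timestamp.valueOf can be checked directly. A small illustration (not from the thread), showing that the ISO-8601 "T" separator is rejected while the JDBC escape format parses:

  // Why the 'T' separator matters for Spark SQL's string-to-timestamp cast.
  val ok = java.sql.Timestamp.valueOf("2012-01-01 00:00:00")   // parses: format "yyyy-mm-dd hh:mm:ss[.f...]"
  // java.sql.Timestamp.valueOf("2012-01-01T00:00:00")         // throws IllegalArgumentException
  println(ok)                                                  // 2012-01-01 00:00:00.0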
Re: How to convert non-RDD data to an RDD.
Does a SparkContext exist when this part (askDriverWithReply()) of the scheduler code gets executed?
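Since no SparkContext is available inside BlockManagerMaster, one option (an assumption, not a confirmed answer from this thread) is to skip RDDs entirely and write the value to HDFS with the Hadoop FileSystem API. A minimal sketch, assuming the Hadoop configuration files are on the classpath; the destination path is hypothetical:

  import java.io.PrintWriter
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Write an arbitrary in-memory value to HDFS without a SparkContext.
  def writeToHdfs(text: String, dest: String): Unit = {
    val fs = FileSystem.get(new Configuration())          // picks up core-site.xml / hdfs-site.xml
    val out = new PrintWriter(fs.create(new Path(dest)))
    try out.println(text) finally out.close()
  }

  // e.g. writeToHdfs(result.toString, "hdfs://namenode:8020/tmp/peers.txt")  // hypothetical path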
Re: Interactive interface tool for spark
Dear Sparkers,

As promised, I've just updated the repo with a new name (for the sake of clarity) and default branch, but especially with a dedicated README containing:
* explanations on how to launch and use it
* an intro on each feature like Spark, classpaths, SQL, dynamic update, ...
* pictures showing results

There is a notebook for each feature, so it's easier to try out! Here is the repo: https://github.com/andypetrella/spark-notebook/

HTH, and PRs are more than welcome ;-).

aℕdy ℙetrella
about.me/noootsab

On Wed, Oct 8, 2014 at 4:57 PM, Michael Allman mich...@videoamp.com wrote: Hi Andy, This sounds awesome. Please keep us posted. Meanwhile, can you share a link to your project? I wasn't able to find it. Cheers, Michael

On Oct 8, 2014, at 3:38 AM, andy petrella andy.petre...@gmail.com wrote: Heya. You can check Zeppelin or my fork of the Scala Notebook. I'm going to push some effort on the doc this weekend, because it supports realtime graphing, Scala, SQL, dynamic loading of dependencies, and I started this morning a widget to track the progress of the jobs. I'm quite happy with it so far; I have used it with GraphX, MLlib, ADAM and the Cassandra connector. However, its major drawback is that it is a one-man (best) effort ftm! :-S

On 8 Oct 2014 at 11:16, Dai, Kevin yun...@ebay.com wrote: Hi, All. We need an interactive interface tool for Spark in which we can run Spark jobs and plot graphs to explore the data interactively. IPython Notebook is good, but it only supports Python (we want one supporting Scala)... BR, Kevin.
ClassNotFoundException was thrown while trying to save an RDD
Hi all,

I'm using CDH 5.0.1 (Spark 0.9) and submitting a job in Spark Standalone Cluster mode. The job is quite simple, as follows:

object HBaseApp {
  def main(args: Array[String]) {
    testHBase("student", "/test/xt/saveRDD")
  }

  def testHBase(tableName: String, outFile: String) {
    val sparkConf = new SparkConf()
      .setAppName("-- Test HBase --")
      .set("spark.executor.memory", "2g")
      .set("spark.cores.max", "16")
    val sparkContext = new SparkContext(sparkConf)
    val rdd = sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
    val c = rdd.count // successful
    println("\n\n\n" + c + "\n\n\n")
    rdd.saveAsTextFile(outFile) // This line will throw java.lang.ClassNotFoundException: com.xt.scala.HBaseApp$$anonfun$testHBase$1
    println("\n down \n")
  }
}

I submitted this job using the following script:

#!/bin/bash
HBASE_CLASSPATH=$(hbase classpath)
APP_JAR=/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar
SPARK_ASSEMBLY_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar
SPARK_MASTER=spark://b02.jsepc.com:7077
CLASSPATH=$CLASSPATH:$APP_JAR:$SPARK_ASSEMBLY_JAR:$HBASE_CLASSPATH
export SPARK_CLASSPATH=/usr/lib/hbase/lib/*
CONFIG_OPTS="-Dspark.master=$SPARK_MASTER"
java -cp $CLASSPATH $CONFIG_OPTS com.xt.scala.HBaseApp "$@"

After I submitted the job, the count of the RDD could be computed successfully, but that RDD could not be saved into HDFS and the following exception was thrown:

14/10/11 16:09:33 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassNotFoundException java.lang.ClassNotFoundException: com.xt.scala.HBaseApp$$anonfun$testHBase$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at
Re: ClassNotFoundException was thrown while trying to save an RDD
Your app is named scala.HBaseApp. Does it read / write to HBase? Just curious.
setting heap space
Hi, I am trying to use Spark but I am having a hard time configuring the SparkConf... My current conf is

conf = SparkConf().set("spark.executor.memory", "10g").set("spark.akka.frameSize", "1").set("spark.driver.memory", "16g")

but I still see the Java heap space error:

14/10/12 09:54:50 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3) java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:332) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:34) at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:21) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoDeserializationStream.readO

What's the right way to turn these knobs, and what other knobs can I play with? Thanks
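Two things worth noting here (stated as general Spark behaviour, not a confirmed diagnosis of this job): settings like spark.executor.memory only take effect if the SparkConf is passed to the SparkContext before it is created, and spark.driver.memory in particular cannot be raised from inside an already-running driver JVM; it has to be supplied when the JVM is launched, e.g. via spark-submit's --driver-memory flag. A minimal sketch in Scala (the original snippet appears to be PySpark; values here are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // Executor memory and the Akka frame size are applied when the SparkContext is created;
  // driver memory must be set at JVM launch (e.g. spark-submit --driver-memory 16g),
  // not from inside the running driver.
  val conf = new SparkConf()
    .setAppName("heap-tuning-example")
    .set("spark.executor.memory", "10g")
    .set("spark.akka.frameSize", "128")   // in MB in Spark 1.x; illustrative value
  val sc = new SparkContext(conf)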
Re: Interactive interface tool for spark
And what about Hue http://gethue.com ?
Re: Interactive interface tool for spark
Yeah, if it allows crafting some Scala/Spark code in a shareable manner, it's another good option! Thanks for sharing.

aℕdy ℙetrella
about.me/noootsab
NullPointerException when deploying JAR to standalone cluster..
Hi, everybody!

I'm trying to deploy a simple app in a Spark standalone cluster with a single node (the localhost). Unfortunately, something goes wrong while processing the JAR file and a NullPointerException is thrown. I'm running everything on a single machine with Windows 8. Details below. Please help with suggestions on what is missing to make it work - really looking forward to working with Spark in a cluster.

The problem shows up both with my own little programs and with the Spark examples (e.g. WordCount). The problem also shows up both when running with my custom driver and when using spark-submit or run-examples (which calls spark-submit). (Hadoop I also compiled from source for Windows - but it is not really being used.)

Driver code:

SparkConf conf = new SparkConf().setAppName("SimpleTests")
    .setJars(new String[]{"file:///myworkspace/spark-tests.jar"})
    .setMaster("spark://mymachine:7077")
    .setSparkHome("/mysparkhome/spark-1.1.0-bin-hadoop2.4");
JavaSparkContext sc = new JavaSparkContext(conf);

The streaming code is trivial and the usual. I get this output and error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/JorgePaulo/tmp/hadoop/hadoop-2.4.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/JorgePaulo/tmp/spark/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/10/12 20:15:00 INFO SecurityManager: Changing view acls to: JorgePaulo,
14/10/12 20:15:00 INFO SecurityManager: Changing modify acls to: JorgePaulo,
14/10/12 20:15:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(JorgePaulo, ); users with modify permissions: Set(JorgePaulo, )
14/10/12 20:15:01 INFO Slf4jLogger: Slf4jLogger started
14/10/12 20:15:02 INFO Remoting: Starting remoting
14/10/12 20:15:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@jsimao71-acer:4279]
14/10/12 20:15:02 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@jsimao71-acer:4279]
14/10/12 20:15:02 INFO Utils: Successfully started service 'sparkDriver' on port 4279.
14/10/12 20:15:02 INFO SparkEnv: Registering MapOutputTracker
14/10/12 20:15:02 INFO SparkEnv: Registering BlockManagerMaster
14/10/12 20:15:02 INFO DiskBlockManager: Created local directory at C:\Users\JORGEP~1\AppData\Local\Temp\spark-local-20141012201502-723f
14/10/12 20:15:02 INFO Utils: Successfully started service 'Connection manager for block manager' on port 4282.
14/10/12 20:15:02 INFO ConnectionManager: Bound socket to port 4282 with id = ConnectionManagerId(jsimao71-acer,4282) 14/10/12 20:15:02 INFO MemoryStore: MemoryStore started with capacity 669.3 MB 14/10/12 20:15:02 INFO BlockManagerMaster: Trying to register BlockManager 14/10/12 20:15:02 INFO BlockManagerMasterActor: Registering block manager jsimao71-acer:4282 with 669.3 MB RAM 14/10/12 20:15:02 INFO BlockManagerMaster: Registered BlockManager 14/10/12 20:15:02 INFO HttpFileServer: HTTP File server directory is C:\Users\JORGEP~1\AppData\Local\Temp\spark-4771bfb8-e4f4-43d2-a437-6d55ee7c88b4 14/10/12 20:15:02 INFO HttpServer: Starting HTTP Server 14/10/12 20:15:03 INFO Utils: Successfully started service 'HTTP file server' on port 4283. 14/10/12 20:15:03 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/10/12 20:15:03 INFO SparkUI: Started SparkUI at http://jsimao71-acer:4040 14/10/12 20:15:10 INFO SparkContext: Added JAR file:///Users/JorgePaulo/workspace/spark-tests.jar at http://192.168.179.1:4283/jars/spark-tests.jar with timestamp 1413141310617 14/10/12 20:15:10 INFO AppClient$ClientActor: Connecting to master spark://jsimao71-acer:7077... 14/10/12 20:15:10 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 14/10/12 20:15:11 INFO MemoryStore: ensureFreeSpace(159118) called with curMem=0, maxMem=701843374 14/10/12 20:15:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 669.2 MB) 14/10/12 20:15:11 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141012201511-0014 14/10/12 20:15:11 INFO AppClient$ClientActor: Executor added: app-20141012201511-0014/0 on worker-20141012171633-jsimao71-acer-1970 (jsimao71-acer:1970) with 4 cores 14/10/12 20:15:11 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141012201511-0014/0 on hostPort jsimao71-acer:1970 with 4 cores, 512.0 MB RAM 14/10/12 20:15:11 INFO AppClient$ClientActor: Executor updated: app-20141012201511-0014/0 is now
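The digest does not include a resolution for this one. One thing worth double-checking (an assumption on my part, not a confirmed fix) is that the JAR path handed to setJars actually resolves on Windows, since a path the driver cannot open is a common way to hit a NullPointerException when the context tries to add the JAR. A sketch in Scala with a hypothetical Windows location:

  import java.io.File
  import org.apache.spark.{SparkConf, SparkContext}

  // Verify the application JAR resolves before handing it to Spark, and pass it as a full file: URI.
  val jar = new File("C:/myworkspace/spark-tests.jar")   // hypothetical location
  require(jar.exists, s"JAR not found: $jar")
  val conf = new SparkConf()
    .setAppName("SimpleTests")
    .setMaster("spark://mymachine:7077")
    .setJars(Seq(jar.toURI.toString))                    // e.g. file:/C:/myworkspace/spark-tests.jar
  val sc = new SparkContext(conf)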
Spark in cluster and errors
Hi,

Can anyone point me to how Spark works here? Why is it trying to connect from master port A to master port ABCD in cluster mode with 6 workers?

14/10/09 19:37:19 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@...:7078] - [akka.tcp://sparkExecutor@...:53757]: Error [Association failed with [akka.tcp://sparkExecutor@...:53757]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@...:53757] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark-master1.domain.org/10.0.6.228:53757 ]
14/10/09 19:37:19 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@...:7078] - [akka.tcp://sparkExecutor@...:53757]: Error [Association failed with [akka.tcp://sparkExecutor@...:53757]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@...:53757] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark-master1.domain.org/10.0.6.228:53757 ]

I've spent almost a week trying to find a solution, and the more I dug, the more of the same problems I found.

Best regards,
Morbious
Nested Query using SparkSQL 1.1.0
Hi,

Apparently it is possible to query nested JSON using Spark SQL but, mainly due to lack of proper documentation/examples, I did not manage to make it work. I would appreciate it if you could point me to any example or help with this issue. Here is my code:

val anotherPeopleRDD = sc.parallelize(
  """{"attributes": [ { "data": { "gender": "woman" }, "section": "Economy", "collectApp": "web", "id": 1409064792512 } ] }""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
anotherPeople.registerTempTable("people")
val query_people = sqlContext.sql("select attributes[0].collectApp from people")
query_people.foreach(println)

But instead of getting "web" printed out, I am getting the following:

[[web,[woman],1409064792512, Economy]]

thanks,
/shahab
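No answer appears in this digest. The output suggests the query returned the whole struct in attributes[0] rather than the single nested field. One workaround (an assumption, not a confirmed fix) is to flatten the array with HiveQL's LATERAL VIEW explode through a HiveContext and then select the nested field:

  // Sketch only: flatten the attributes array with a HiveContext, which understands
  // LATERAL VIEW explode, then pick the nested field.
  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  hiveContext.jsonRDD(anotherPeopleRDD).registerTempTable("people")
  val apps = hiveContext.sql(
    "SELECT attr.collectApp FROM people LATERAL VIEW explode(attributes) t AS attr")
  apps.collect().foreach(println)   // expected: [web]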
Re: Spark in cluster and errors
You have a connection refused error. You need to check:
- that the master is listening on the specified host:port,
- that no firewall is blocking access,
- that your config is pointing to the master's host:port (check the host name from the web console).

Send more details about the cluster layout if you need more help.

Hope it helps.
Jorge.
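The log shows a worker failing to associate with an executor endpoint, which often comes down to hostname and port resolution between nodes. A sketch of pinning the addresses the driver advertises so that workers and executors can reach it (an assumption, not something confirmed in this thread; host names and ports are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // Pin the host/port the driver advertises so every node can reach it, e.g. through a firewall.
  val conf = new SparkConf()
    .setAppName("cluster-connectivity-check")
    .setMaster("spark://spark-master1.domain.org:7077")
    .set("spark.driver.host", "driver-host.domain.org")  // must be resolvable from the workers
    .set("spark.driver.port", "51000")                    // fixed port instead of a random one
  val sc = new SparkContext(conf)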
Re: Spark job doesn't clean after itself
Reviving this... any thoughts, experts?

On Thu, Oct 9, 2014 at 3:47 PM, Rohit Pujari rpuj...@hortonworks.com wrote: Hello folks: I'm running a Spark job on YARN. After the execution, I would expect the Spark job to clean the staging area, but it seems every run creates a new staging directory. Is there a way to force the Spark job to clean up after itself? Thanks, Rohit

Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899
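This one is not answered in the digest. As far as I know, the staging directory (.sparkStaging/<appId> under the submitting user's HDFS home) is removed when the application finishes cleanly, and the spark.yarn.preserve.staging.files property controls whether staged files are kept; directories left behind by crashed or killed applications may need manual cleanup. A hedged one-line sketch of making sure preservation is off:

  import org.apache.spark.SparkConf

  // Assumption, not from the thread: ensure staged files are not preserved so the
  // .sparkStaging/<appId> directory can be removed when the application ends cleanly.
  val conf = new SparkConf()
    .set("spark.yarn.preserve.staging.files", "false")  // default is false; shown for clarity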
Re: ClassNotFoundException was thrown while trying to save an RDD
In the beginning I tried to read HBase and found that the exception was thrown, then I started to debug the app. I removed the code reading HBase and tried to save an RDD containing a list, and the exception was still thrown. So I'm sure the exception was not caused by reading HBase. While debugging I did not change the object name or file name.

2014-10-13 0:00 GMT+08:00 Ted Yu yuzhih...@gmail.com: Your app is named scala.HBaseApp. Does it read / write to HBase? Just curious.
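The thread ends here without a resolution. One hedged observation: the job is launched with a plain java -cp command rather than through spark-submit or an explicit jars setting, so the application JAR may never be shipped to the executors. That would explain why the driver-side count succeeds while the anonymous function used by saveAsTextFile cannot be deserialized on the workers. A minimal sketch of shipping the JAR, reusing the path from the submit script (an assumption, not a confirmed fix):

  import org.apache.spark.{SparkConf, SparkContext}

  // Ship the application JAR to the executors so the closure class can be loaded on the worker side.
  val sparkConf = new SparkConf()
    .setAppName("-- Test HBase --")
    .setJars(Seq("/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar"))  // path taken from the submit script
  val sparkContext = new SparkContext(sparkConf)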
Re: small bug in pyspark
Hi Andy,

You may be interested in https://github.com/apache/spark/pull/2651, a recent pull request of mine which cleans up / simplifies the configuration of PySpark's Python executables. For instance, it makes it much easier to control which Python options are passed when launching the PySpark drivers and workers.

- Josh

On Fri, Oct 10, 2014 at 5:24 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi, I am running Spark on an EC2 cluster. I need to update Python to 2.7. I have been following the directions on http://nbviewer.ipython.org/gist/JoshRosen/6856670 and https://issues.apache.org/jira/browse/SPARK-922. I noticed that when I start a shell using pyspark, I correctly got python2.7; however, when I tried to start a notebook I got python2.6. The fix was to change exec ipython $IPYTHON_OPTS to exec ipython2 $IPYTHON_OPTS. One clean way to resolve this would be to add another environment variable like PYSPARK_PYTHON. Andy. P.S. Matplotlib does not upgrade because of dependency problems. I'll let you know once I get this resolved.
Re: What if I port Spark from TCP/IP to RDMA?
Hi Theo,

Check out *spark-perf*, a suite of performance benchmarks for Spark: https://github.com/databricks/spark-perf.

- Josh

On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote: Hi, Let's say that I managed to port Spark from TCP/IP to RDMA. What tool or benchmark can I use to test the performance improvement? BR, Theo