Re: Spark SQL
Hi,

I'm not an expert on Spark SQL, but from what I understand, the problem is that you're trying to run a query inside an RDD operation, here:

    val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate ...")))

RDDs can't be serialized inside other RDD tasks, which is why you're getting the NullPointerException. More specifically, you are trying to generate a SchemaRDD inside an RDD transformation, which you can't do.

If `file` isn't huge, you can call .collect() to turn the RDD into an array and then use .map() on the array. If the file is huge, you can do number 3 first, join the two RDDs using 'txCurCode' as a key, and then do the filtering operations, etc.

Best,
Burak

----- Original Message -----
From: rkishore999 rkishore...@yahoo.com
To: u...@spark.incubator.apache.org
Sent: Saturday, September 13, 2014 10:29:26 PM
Subject: Spark SQL

    val file = sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt")

1.

    val xyz = file.map(line => extractCurRate(sqlContext.sql(
      "select rate from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) +
      "' and fxCurCode = '" + fxCurCodesMap(line.substring(77,82)) +
      "' and effectiveDate = '" + line.substring(221,229) +
      "' order by effectiveDate desc")))

2.

    val xyz = file.map(line => sqlContext.sql(
      "select rate, txCurCode, fxCurCode, effectiveDate from CurrencyCodeRates " +
      "where txCurCode = 'USD' and fxCurCode = 'CSD' and effectiveDate = '20140901' " +
      "order by effectiveDate desc"))

3.

    val xyz = sqlContext.sql(
      "select rate, txCurCode, fxCurCode, effectiveDate from CurrencyCodeRates " +
      "where txCurCode = 'USD' and fxCurCode = 'CSD' and effectiveDate = '20140901' " +
      "order by effectiveDate desc")

    xyz.saveAsTextFile("/user/output")

In statements 1 and 2 I'm getting a NullPointerException, but statement 3 is fine. I'm guessing the Spark context and the SQL context are not going together well. Any suggestions regarding how I can achieve this?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp14183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
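A minimal sketch of the join approach Burak describes, assuming the fixed-width offsets from statement 1 and Spark 1.x SchemaRDD row accessors (the variable names and the rate column type are illustrative, not a verified implementation):

    // Run the SQL once (as in statement 3), then key both sides by txCurCode.
    val rates = sqlContext.sql("select rate, txCurCode from CurrencyCodeRates")
      .map(row => (row.getString(1), row.getDouble(0)))          // (txCurCode, rate)

    val lines = file.map(line => (line.substring(202, 205), line)) // (txCurCode, line)

    // An ordinary RDD join -- no SQL query runs inside a task.
    val joined = lines.join(rates)                                 // (txCurCode, (line, rate))

Effective-date filtering can then be done with a plain .filter() on the joined RDD.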
Broadcast error
Hi,

I am trying to create an RDD out of a large matrix. sc.parallelize suggests using a broadcast variable, but when I do sc.broadcast(data) I get this error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370, in broadcast
        pickled = pickleSer.dumps(value)
      File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line 279, in dumps
        def dumps(self, obj): return cPickle.dumps(obj, 2)
    SystemError: error return without exception set

Help?
Re: Broadcast error
Specifically, the error I see when I try to operate on an RDD created by the sc.parallelize method is:

    org.apache.spark.SparkException: Job aborted due to stage failure:
    Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
    (10485760 bytes). Consider using broadcast variables for large values.

On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi, I am trying to create an RDD out of a large matrix. sc.parallelize suggests using a broadcast variable, but when I do sc.broadcast(data) I get "SystemError: error return without exception set". Help?
File operations on spark
Hi,

I am trying to perform read/write file operations in Spark by creating a Writable object, but I am not able to write to a file. The data in question is not an RDD. Can someone please tell me how to perform read/write file operations on non-RDD data in Spark?

Regards,
karthik
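For driver-side (non-RDD) data, one common approach is to use the Hadoop FileSystem API directly rather than going through Spark at all. A minimal sketch, assuming an HDFS deployment; the output path is a hypothetical example:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write a small, non-RDD value from the driver program.
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/user/karthik/summary.txt"))  // illustrative path
    out.write("some non-RDD result\n".getBytes("UTF-8"))
    out.close()

Plain java.io works the same way if the target is the local filesystem rather than HDFS.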
Driver fail with out of memory exception
Hi,

I've written a job (not very complicated, I think -- only one reduceByKey), but the driver JVM always hangs with an OOM, killing the worker of course. How can I know what is running on the driver and what is running on the worker, and how do I debug the memory problem? I've already used the --driver-memory 4g parameter to give it more memory, but nothing helps; it always fails.

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Driver-fail-with-out-of-memory-exception-tp14188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Driver fail with out of memory exception
Try increasing the number of partitions when doing the reduceByKey():
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD

Thanks
Best Regards

On Sun, Sep 14, 2014 at 5:11 PM, richiesgr richie...@gmail.com wrote:
Hi, I've written a job (not very complicated, I think -- only one reduceByKey), but the driver JVM always hangs with an OOM, killing the worker of course. I've already used --driver-memory 4g but nothing helps; it always fails. Thanks
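A minimal sketch of that suggestion (the pair RDD name and the partition count are illustrative):

    // reduceByKey takes an optional numPartitions argument; more, smaller
    // partitions reduce per-task memory pressure during the shuffle.
    val counts = pairs.reduceByKey((a, b) => a + b, 200)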
object hbase is not a member of package org.apache.hadoop
Hi,

I have tried to run HBaseTest.scala, but I got the following errors. Any ideas how to fix them?

Q1)

    scala> package org.apache.spark.examples
    <console>:1: error: illegal start of definition
           package org.apache.spark.examples

Q2)

    scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    <console>:31: error: object hbase is not a member of package org.apache.hadoop
           import org.apache.hadoop.hbase.mapreduce.TableInputFormat

Regards,
Arthur
Re: object hbase is not a member of package org.apache.hadoop
The Spark examples build against HBase 0.94 by default. If you want to run against 0.98, see SPARK-1297:
https://issues.apache.org/jira/browse/SPARK-1297

Cheers

On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com wrote:
Hi, I have tried to run HBaseTest.scala, but I got the following errors ("illegal start of definition" for the package statement, and "object hbase is not a member of package org.apache.hadoop" for the HBase import). Regards, Arthur
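Per the profiles added by that patch (quoted later in this thread), the HBase 0.98/hadoop2 build is selected through the hbase.profile property. A sketch of the resulting build invocation; the other flags are assumptions for a Hadoop 2.3 build, not taken from the JIRA:

    mvn -Dhbase.profile=hadoop2 -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests package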
Re: object hbase is not a member of package org.apache.hadoop
Hi,

Thanks!! I tried to apply the patches. Both spark-1297-v2.txt and spark-1297-v4.txt are good here, but not spark-1297-v5.txt:

    $ patch -p1 -i spark-1297-v4.txt
    patching file examples/pom.xml

    $ patch -p1 -i spark-1297-v5.txt
    can't find file to patch at input line 5
    Perhaps you used the wrong -p or --strip option?
    The text leading up to this was:
    --------------------------
    |diff --git docs/building-with-maven.md docs/building-with-maven.md
    |index 672d0ef..f8bcd2b 100644
    |--- docs/building-with-maven.md
    |+++ docs/building-with-maven.md
    --------------------------
    File to patch:

[The remainder of the message was the pasted contents of spark-1297-v5.txt, which arrived RTF-mangled: a diff against docs/building-with-maven.md and examples/pom.xml that adds hbase-hadoop1/hbase-hadoop2 profiles (selected via the hbase.profile property, with hbase.version 0.98.4-hadoop1 / 0.98.4-hadoop2) and replaces the single hbase dependency with hbase-testing-util, hbase-protocol, hbase-common, hbase-client and hbase-server, carrying netty, jruby and hadoop exclusions.]
Re: object hbase is not a member of package org.apache.hadoop
spark-1297-v5.txt is a level-0 patch; please apply spark-1297-v5.txt with -p0.

Cheers

On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com wrote:
Hi, Thanks!! I tried to apply the patches. Both spark-1297-v2.txt and spark-1297-v4.txt are good here, but not spark-1297-v5.txt: "patch -p1 -i spark-1297-v5.txt" fails with "can't find file to patch at input line 5. Perhaps you used the wrong -p or --strip option?" Please advise. Regards, Arthur
Re: object hbase is not a member of package org.apache.hadoop
Hi,

Thanks!

    $ patch -p0 -i spark-1297-v5.txt
    patching file docs/building-with-maven.md
    patching file examples/pom.xml
    Hunk #1 FAILED at 45.
    Hunk #2 FAILED at 110.
    2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

Still got errors.

Regards,
Arthur

On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:
spark-1297-v5.txt is a level-0 patch; please apply spark-1297-v5.txt with -p0. Cheers
Re: object hbase is not a member of package org.apache.hadoop
Hi,

My bad. Tried again; it worked:

    $ patch -p0 -i spark-1297-v5.txt
    patching file docs/building-with-maven.md
    patching file examples/pom.xml

Thanks!
Arthur

On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com wrote:
Hi, Thanks! patch -p0 -i spark-1297-v5.txt ... 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej. Still got errors. Regards, Arthur
Re: object hbase is not a member of package org.apache.hadoop
I applied the patch on the master branch without rejects. If you use Spark 1.0.2, use the pom.xml attached to the JIRA.

On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com wrote:
Hi, Thanks! patch -p0 -i spark-1297-v5.txt ... 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej. Still got errors. Regards, Arthur
Re: object hbase is not a member of package org.apache.hadoop
Hi,

I applied the patch.

1) Patched:

    $ patch -p0 -i spark-1297-v5.txt
    patching file docs/building-with-maven.md
    patching file examples/pom.xml

2) Compilation result:

    [INFO] Reactor Summary:
    [INFO] Spark Project Parent POM .......... SUCCESS [1.550s]
    [INFO] Spark Project Core ................ SUCCESS [1:32.175s]
    [INFO] Spark Project Bagel ............... SUCCESS [10.809s]
    [INFO] Spark Project GraphX .............. SUCCESS [31.435s]
    [INFO] Spark Project Streaming ........... SUCCESS [44.518s]
    [INFO] Spark Project ML Library .......... SUCCESS [48.992s]
    [INFO] Spark Project Tools ............... SUCCESS [7.028s]
    [INFO] Spark Project Catalyst ............ SUCCESS [40.365s]
    [INFO] Spark Project SQL ................. SUCCESS [43.305s]
    [INFO] Spark Project Hive ................ SUCCESS [36.464s]
    [INFO] Spark Project REPL ................ SUCCESS [20.319s]
    [INFO] Spark Project YARN Parent POM ..... SUCCESS [1.032s]
    [INFO] Spark Project YARN Stable API ..... SUCCESS [19.379s]
    [INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s]
    [INFO] Spark Project Assembly ............ SUCCESS [13.822s]
    [INFO] Spark Project External Twitter .... SUCCESS [9.566s]
    [INFO] Spark Project External Kafka ...... SUCCESS [12.848s]
    [INFO] Spark Project External Flume Sink . SUCCESS [10.437s]
    [INFO] Spark Project External Flume ...... SUCCESS [14.554s]
    [INFO] Spark Project External ZeroMQ ..... SUCCESS [9.994s]
    [INFO] Spark Project External MQTT ....... SUCCESS [8.684s]
    [INFO] Spark Project Examples ............ SUCCESS [1:31.610s]
    [INFO] BUILD SUCCESS
    [INFO] Total time: 9:41.700s
    [INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
    [INFO] Final Memory: 83M/1071M

3) Testing:

    scala> package org.apache.spark.examples
    <console>:1: error: illegal start of definition
           package org.apache.spark.examples
           ^

    scala> import org.apache.hadoop.hbase.client.HBaseAdmin
    import org.apache.hadoop.hbase.client.HBaseAdmin

    scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
    import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

    scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    scala> import org.apache.spark._
    import org.apache.spark._

    scala> object HBaseTest {
         |   def main(args: Array[String]) {
         |     val sparkConf = new SparkConf().setAppName("HBaseTest")
         |     val sc = new SparkContext(sparkConf)
         |     val conf = HBaseConfiguration.create()
         |     // Other options for configuring scan behavior are available. More information available at
         |     // http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
         |     conf.set(TableInputFormat.INPUT_TABLE, args(0))
         |     // Initialize hBase table if necessary
         |     val admin = new HBaseAdmin(conf)
         |     if (!admin.isTableAvailable(args(0))) {
         |       val tableDesc = new HTableDescriptor(args(0))
         |       admin.createTable(tableDesc)
         |     }
         |     val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
         |       classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
         |       classOf[org.apache.hadoop.hbase.client.Result])
         |     hBaseRDD.count()
         |     sc.stop()
         |   }
         | }
    warning: there were 1 deprecation warning(s); re-run with -deprecation for details
    defined module HBaseTest

So the HBase imports work now; I only get an error when trying to run "package org.apache.spark.examples".

Please advise.

Regards,
Arthur

On 14 Sep, 2014, at 11:41 pm, Ted Yu yuzhih...@gmail.com wrote:
I applied the patch on the master branch without rejects. If you use Spark 1.0.2, use the pom.xml attached to the JIRA.
Re: object hbase is not a member of package org.apache.hadoop
Take a look at bin/run-example.

Cheers

On Sun, Sep 14, 2014 at 9:15 AM, arthur.hk.c...@gmail.com wrote:
Hi, I applied the patch, the build succeeded, and the HBase imports now work in the shell. I only get an error when trying to run "package org.apache.spark.examples". Please advise. Regards, Arthur
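A sketch of that suggestion: a package declaration is not legal inside the scala> REPL, so the compiled example is run through the helper script instead (the table name here is a placeholder):

    ./bin/run-example HBaseTest my_table

Older releases may require the fully qualified class name, e.g. ./bin/run-example org.apache.spark.examples.HBaseTest my_table.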
Re: Dependency Problem with Spark / ScalaTest / SBT
Can you post your whole SBT build file(s)?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler sp...@tbonline.de wrote:
Hi, I just called: test or run. Thorsten

On 10.09.2014 at 13:38, arthur.hk.c...@gmail.com wrote:
Hi, What is your SBT command and the parameters? Arthur

On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:

Hello,

I am writing a Spark app which is already working so far. Now I have started to build some unit tests, but I am running into some dependency problems and cannot find a solution right now. Perhaps someone could help me.

I build my Spark project with SBT and it seems to be configured well, because compiling, assembling and running the built jar with spark-submit are all working.

Now I have started with the unit tests, which I located under /src/test/scala. When I call "test" in sbt, I get the following:

    14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered BlockManager
    14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
    [trace] Stack trace suppressed: run "last test:test" for the full output.
    [error] Could not run test test.scala.SetSuite: java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
    [info] Run completed in 626 milliseconds.
    [info] Total number of tests run: 0
    [info] Suites: completed 0, aborted 0
    [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [error] Error during tests:
    [error]   test.scala.SetSuite
    [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
    [error] Total time: 3 s, completed 10.09.2014 12:22:06

"last test:test" gives me the following:

    [debug] Running TaskDef(test.scala.SetSuite, org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
    java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
        at org.apache.spark.HttpServer.start(HttpServer.scala:54)
        at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
        at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
        at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
        at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
        at org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
        at test.scala.SetSuite.<init>(SparkTest.scala:16)

I also noticed right now that "sbt run" is also not working:

    14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
    [error] (run-main-2) java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
    java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
        at org.apache.spark.HttpServer.start(HttpServer.scala:54)
        at org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
        at org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
        at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
        at org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
        at org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
        at main.scala.PartialDuplicateScanner$.main(PartialDuplicateScanner.scala:29)
        at main.scala.PartialDuplicateScanner.main(PartialDuplicateScanner.scala)

Here is my Testprojekt.sbt file:

    name := "Testprojekt"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies ++= {
      Seq(
        "org.apache.lucene" % "lucene-core" % "4.9.0",
        "org.apache.lucene" % "lucene-analyzers-common" % "4.9.0",
        "org.apache.lucene" % "lucene-queryparser" % "4.9.0",
        ("org.apache.spark" %% "spark-core" % "1.0.2").
          exclude("org.mortbay.jetty", "servlet-api").
          exclude("commons-beanutils", "commons-beanutils-core").
          exclude("commons-collections", "commons-collections").
          exclude("commons-collections", "commons-collections").
          exclude("com.esotericsoftware.minlog", "minlog").
          exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
          exclude("org.eclipse.jetty.orbit", "javax.transaction").
          exclude("org.eclipse.jetty.orbit", "javax.servlet")
      )
    }

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Re: Dependency Problem with Spark / ScalaTest / SBT
Sorry, I meant any *other* SBT files. However, what happens if you remove this line?

    exclude("org.eclipse.jetty.orbit", "javax.servlet")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sun, Sep 14, 2014 at 11:53 AM, Dean Wampler deanwamp...@gmail.com wrote:
Can you post your whole SBT build file(s)?

On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler sp...@tbonline.de wrote:
Hi, I just called: test or run. Thorsten
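If removing the exclusion alone doesn't resolve it, a commonly cited workaround for this NoClassDefFoundError is to put the servlet API back on the classpath explicitly. A sketch for build.sbt; the artifact and version here are assumptions, not something verified against this particular project:

    // Restore the javax.servlet classes that the exclusion removed.
    libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"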
failed to run SimpleApp locally on macbook
Hello,

I'm new to Spark and I couldn't make SimpleApp run on my MacBook. I feel it's related to network configuration. Could anyone take a look? Thanks.

    14/09/14 10:10:36 INFO Utils: Fetching http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to /var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp1014702023347580837.tmp
    14/09/14 10:11:36 INFO Executor: Fetching http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar with timestamp 1410714636103
    14/09/14 10:11:36 INFO Utils: Fetching http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to /var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp4432132500879081005.tmp
    14/09/14 10:11:36 ERROR Executor: Exception in task ID 1
    java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
        at sun.net.www.http.HttpClient.New(HttpClient.java:308)
        at sun.net.www.http.HttpClient.New(HttpClient.java:326)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:348)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:328)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    14/09/14 10:11:36 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
    14/09/14 10:11:36 WARN TaskSetManager: Loss was due to java.net.SocketTimeoutException
    java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        ... (same stack trace as above)
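Since the executor is timing out fetching the jar from the driver's HTTP server at 10.63.93.115, one thing worth trying (an assumption about the cause, not a confirmed fix) is pinning Spark to the loopback address when running locally:

    export SPARK_LOCAL_IP=127.0.0.1

Equivalently, spark.driver.host can be set on the SparkConf before the SparkContext is created.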
Re: HBase 0.96+ with Spark 1.0+
I did actually try Sean's suggestion just before I posted for the first time in this thread. I got an error when doing this and thought that I was not understanding what Sean was suggesting.

Now I re-attempted your suggestions with the spark 1.0.0-cdh5.1.0, hbase 0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh5.1.0 I am currently using. I used the following:

    val mortbayEnforce = "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
    val mortbayExclusion = ExclusionRule(organization = "org.mortbay.jetty", name = "servlet-api-2.5")

and applied this to the hadoop and hbase dependencies, e.g. like this:

    val hbase = Seq(HBase.server, HBase.common, HBase.compat, HBase.compat2, HBase.protocol, mortbayEnforce)
      .map(_.excludeAll(HBase.exclusions: _*))

    private object HBase {
      val server = "org.apache.hbase" % "hbase-server" % Version.HBase
      ...
      val exclusions = Seq(ExclusionRule("org.apache.ant"), mortbayExclusion)
    }

I still get the error I got the last time I tried this experiment:

    14/09/14 18:28:09 ERROR metrics.MetricsSystem: Sink class org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized
    java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
        at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:84)
        at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
        at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply$mcV$sp(SimpleTicketTextSimilaritySparkJobSpec.scala:29)
        at d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
        at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
        at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1714)
        at org.scalatest.Suite$class.run(Suite.scala:1424)
        at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
        ... (remaining ScalaTest frames as in the original report)
Re: Broadcast error
How? Example, please. Also, if I am running this in the pyspark shell, how do I configure spark.akka.frameSize?

On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
When the data size is huge, you are better off using the TorrentBroadcastFactory.

Thanks
Best Regards

On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com wrote:
Specifically, the error I see when I try to operate on an RDD created by the sc.parallelize method: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for large values.
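A sketch of both settings in conf/spark-defaults.conf (the frameSize value is illustrative; in Spark 1.x it is given in MB):

    spark.akka.frameSize       100
    spark.broadcast.factory    org.apache.spark.broadcast.TorrentBroadcastFactory

Because the pyspark shell creates its SparkContext at startup, setting these in spark-defaults.conf before launching the shell is the practical route; in a standalone script they can equally be set on the SparkConf before the context is created.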
Re: compiling spark source code
I've seen the "file name too long" error when compiling on an encrypted Linux file system -- some of them have a limit on file name lengths. If you're on Linux, can you try compiling inside /tmp instead?

Matei

On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote:
Can you try sbt/sbt clean first?

On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. [error] File name too long
It is not clear which file(s) loadfiles was loading. Is the filename in an earlier part of the output? Cheers

On Sat, Sep 13, 2014 at 10:58 AM, kkptninja kkptni...@gmail.com wrote:
Hi Ted, Thanks for the prompt reply :) Please find details of the issue at this url: http://pastebin.com/Xt0hZ38q
Kind Regards

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
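Another commonly cited workaround for encrypted filesystems (an assumption here, not something verified against this particular build) is capping the length of the class file names scalac generates, e.g. in an sbt build:

    // Keep generated .class file names short enough for eCryptfs-style limits.
    scalacOptions ++= Seq("-Xmax-classfile-name", "128")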
Re: Broadcast error
And when I use the spark-submit script, I get the following error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o26.trainKMeansModel.
    : org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My spark-submit code is:

    conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000")
    sc = SparkContext(conf=conf)
    rdd = sc.parallelize(matrix, 5)

    from pyspark.mllib.clustering import KMeans
    from math import sqrt

    clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2, initializationMode="random")

    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print "Within Set Sum of Squared Error = " + str(WSSSE)

which is executed as follows:

    spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G

On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com wrote:
How? Example, please. Also, if I am running this in the pyspark shell, how do I configure spark.akka.frameSize?

On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
When the data size is huge, you are better off using the TorrentBroadcastFactory.
Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?
Hi Andrew,

I agree with Nicholas. That was a nice, concise summary of the meaning of the locality customization options, the indicators and the default Spark behaviors. I haven't combed through the documentation end-to-end in a while, but I'm also not sure that information is presently represented somewhere, and it would be great to persist it somewhere besides the mailing list.

best,
-Brad

On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Andrew, this email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick

On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

The locality is how close the data is to the code that's processing it. PROCESS_LOCAL means the data is in the same JVM as the code that's running, so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same node, or in another executor on the same node, so it is a little slower because the data has to travel across an IPC connection. RACK_LOCAL is even slower -- the data is on a different server, so it needs to be sent over the network.

Spark switches to lower locality levels when there's no unprocessed data on a node that has idle CPUs. In that situation you have two options: wait until the busy CPUs free up so you can start another task that uses data on that server, or start a new task on a farther-away server that needs to bring data from that remote place. What Spark typically does is wait a bit in the hope that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU.

The main tunable option is how long the scheduler waits before starting to move data rather than code. Those are the spark.locality.* settings here: http://spark.apache.org/docs/latest/configuration.html

If you want to prevent this from happening entirely, you can set the values to ridiculously high numbers. The documentation also mentions that 0 has a special meaning, so you can try that as well.

Good luck!
Andrew

On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When these happen, things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening (barring the machine actually going down; I'd rather stick with the same process)?
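A sketch of the tuning Andrew describes (the wait value is illustrative; in Spark 1.x it is given in milliseconds):

    // Make the scheduler hold out longer for PROCESS_LOCAL/NODE_LOCAL slots
    // before falling back to less-local task placement.
    val conf = new SparkConf().set("spark.locality.wait", "10000")

There are also per-level variants (spark.locality.wait.process, .node, .rack) if only one of the transitions needs adjusting.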
Re: spark-1.1.0 with make-distribution.sh problem
Yeah, that issue has been fixed by adding better docs; it just didn't make it in time for the release:
https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54

On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com wrote:

Resolved:

    ./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

This code is a bit misleading.

--
Zhanfeng Huo

From: Zhanfeng Huo huozhanf...@gmail.com
Date: 2014-09-12 14:13
To: user user@spark.apache.org
Subject: spark-1.1.0 with make-distribution.sh problem

Hi,

I compiled Spark with the command

    bash -x make-distribution.sh -Pyarn -Phive --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0

and it errors out. How do I use it correctly?

Message:

    + set -o pipefail
    + set -e
    +++ dirname make-distribution.sh
    ++ cd .
    ++ pwd
    + FWDIR=/home/syn/spark/spark-1.1.0
    + DISTDIR=/home/syn/spark/spark-1.1.0/dist
    + SPARK_TACHYON=false
    + MAKE_TGZ=false
    + NAME=none
    + (( 7 ))
    + case $1 in
    + break
    + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
    + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
    + which git
    ++ git rev-parse --short HEAD
    + GITREV=5f6f219
    + '[' '!' -z 5f6f219 ']'
    + GITREVSTRING=' (git revision 5f6f219)'
    + unset GITREV
    + which mvn
    ++ mvn help:evaluate -Dexpression=project.version
    ++ grep -v INFO
    ++ tail -n 1
    + VERSION=1.1.0
    ++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0
    ++ grep -v INFO
    ++ tail -n 1
    + SPARK_HADOOP_VERSION=' -X,--debug    Produce execution debug output'

Best Regards,
Zhanfeng Huo
Alternative to spark.executor.extraClassPath ?
Hi,

In the Spark configuration document, spark.executor.extraClassPath is described as a backwards-compatibility option; it also says that users typically should not need to set it.

Now, I must add a classpath to the executor environment (as well as to the driver in the future, but for now I'm running in YARN-client mode). Its value is '/usr/lib/hbase/lib/*'. (I'm trying to use HBase classes.) How can I add that to the executor environment without using spark.executor.extraClassPath?

BTW, spark.executor.extraClassPath 'prepends' the classpath to the CLASSPATH environment variable instead of appending it, and that seems to cause a few problems for my application (I've investigated launch_container.sh). Is there a way to make it 'append' rather than 'prepend'?

I use Spark version 1.0.0.

Thanks.
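One possible workaround, as a sketch rather than a verified answer (whether the jars end up appended or prepended may differ by version and deploy mode): distribute the HBase jars with the job instead of pointing at the node-local directory, e.g.

    spark-submit --jars $(echo /usr/lib/hbase/lib/*.jar | tr ' ' ',') ...

Jars passed this way are shipped to the executors and added to their classloaders at runtime rather than through the launch-time CLASSPATH.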
Re: Re: spark-1.1.0 with make-distribution.sh problem
Thank you very much. It is helpful for end users. Zhanfeng Huo

From: Patrick Wendell Date: 2014-09-15 10:19 To: Zhanfeng Huo CC: user Subject: Re: spark-1.1.0 with make-distribution.sh problem Yeah, that issue has been fixed by adding better docs; it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54
PathFilter for newAPIHadoopFile?
Hi, I have a directory structure with parquet+avro data in it. There are a couple of administrative files (.foo and/or _foo) that I need to ignore when processing this data; otherwise Spark tries to read them as containing parquet content, which they do not. How can I set a PathFilter on the FileInputFormat used to construct an RDD?
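For concreteness, a sketch of one way to do this with the new-API input formats. This is untested against parquet+avro specifically; the filter class name and path are illustrative, TextInputFormat stands in for the actual parquet input format, and sc is assumed to be an existing SparkContext (e.g. the shell's).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical filter that skips the administrative files (.foo / _foo).
class DataFilesOnly extends PathFilter {
  override def accept(p: Path): Boolean = {
    val name = p.getName
    !name.startsWith(".") && !name.startsWith("_")
  }
}

// Clone the context's Hadoop conf and set the key that the new-API
// FileInputFormat.setInputPathFilter would set.
val conf = new Configuration(sc.hadoopConfiguration)
conf.setClass("mapreduce.input.pathFilter.class",
  classOf[DataFilesOnly], classOf[PathFilter])

val rdd = sc.newAPIHadoopFile("/data/dir", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], conf)
```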
Re: Broadcast error
Hey Chengi, What's the version of Spark you are using? There are big improvements to broadcast in 1.1 -- could you try it?

On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com wrote: Any suggestions.. I am really blocked on this one

On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu chengi.liu...@gmail.com wrote: And when I use the spark-submit script, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My spark-submit code is:

conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf=conf)
rdd = sc.parallelize(matrix, 5)
from pyspark.mllib.clustering import KMeans
from math import sqrt
clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2, initializationMode="random")
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)

which is executed as follows:

spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G

On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com wrote: How? Example please.. Also, if I am running this in the pyspark shell, how do I configure spark.akka.frameSize?

On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com wrote: When the data size is huge, you are better off using the TorrentBroadcastFactory.
Thanks Best Regards

On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com wrote: Specifically, the error I see when I try to operate on an rdd created by the sc.parallelize method: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for large values.
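For later readers, a sketch of the two settings suggested in this thread. The frame-size value is illustrative (the setting is in MB in Spark 1.x, default 10), and note that in the pyspark shell the context already exists at startup, so these have to be supplied when the shell is launched (e.g. via spark-defaults.conf) rather than set afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the Akka frame size and switch to the torrent broadcast
// implementation before the context is created.
val conf = new SparkConf()
  .set("spark.akka.frameSize", "100")
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)
```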
Re: Broadcast error
I am using Spark 1.0.2. This is my work cluster, so I can't set up a new version readily. But right now, I am not using broadcast:

conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf=conf)
rdd = sc.parallelize(matrix, 5)
from pyspark.mllib.clustering import KMeans
from math import sqrt
clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2, initializationMode="random")
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)

executed by:

spark-submit --master $SPARKURL clustering_example.py --executor-memory 32G --driver-memory 60G

and the error I see is:

py4j.protocol.Py4JJavaError: An error occurred while calling o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

and:

14/09/14 20:43:30 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@hostname:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@hostname:7077]

Any suggestions?

On Sun, Sep 14, 2014 at 8:39 PM, Davies Liu dav...@databricks.com wrote: Hey Chengi, What's the version of Spark you are using? There are big improvements to broadcast in 1.1 -- could you try it?
Re: Broadcast error
And the thing is, the code runs just fine if I reduce the number of rows in my data?

On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu chengi.liu...@gmail.com wrote: I am using Spark 1.0.2. This is my work cluster, so I can't set up a new version readily. But right now, I am not using broadcast.
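One workaround sketch suggested by the frame-size error earlier in this thread (mine, not advice from the original posters; all numbers are illustrative): sc.parallelize ships each partition inside its task, so the per-task payload has to fit within spark.akka.frameSize, and using many more slices keeps each task small.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

val sc = new SparkContext(new SparkConf().setAppName("parallelize-sketch"))

// A 100k x 50 double matrix is roughly 40 MB -- far too big to fit in a
// 10 MB Akka frame when split across only 5 tasks.
val matrix = Array.fill(100000, 50)(Random.nextDouble())

val frameBytes = 10L * 1024 * 1024                // default spark.akka.frameSize
val bytesPerRow = 50 * 8L                         // rough serialized size per row
val rowsPerTask = (frameBytes / 2) / bytesPerRow  // stay well under the frame
val slices = math.max((matrix.length / rowsPerTask).toInt + 1,
                      sc.defaultParallelism)
val rdd = sc.parallelize(matrix, slices)          // many small tasks, not 5 huge ones
println(s"using $slices slices")
```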
Re: Use Case of mutable RDD - any ideas around will help.
SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this:

newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use table(), it returns the cached representation, so the union executes much faster. However, there is some unknown slowdown -- it's not quite as fast as you would expect.

On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian lian.cs@gmail.com wrote: Ah, I see. So basically what you need is something like cache write-through support, which exists in Shark but isn't implemented in Spark SQL yet. In Shark, when inserting data into a table that has already been cached, the newly inserted data is automatically cached and "union"-ed with the existing table content. SPARK-1671 was created to track this feature. We'll work on that. Currently, as a workaround, instead of doing the union at the RDD level, you may try caching the new table, unioning it with the old table, and then querying the union-ed table. The drawbacks are higher code complexity and that you end up with lots of temporary tables. But the performance should be reasonable.

On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur archit279tha...@gmail.com wrote: Little code snippet:

line1: cacheTable("existingRDDTableName")
line2: // some operations which will materialize the existingRDD dataset.
line3: existingRDD.union(newRDD).registerAsTable("new_existingRDDTableName")
line4: cacheTable("new_existingRDDTableName")
line5: // some operation that will materialize the new existingRDD.

Now, what we expect is that in line4, rather than caching both existingRDDTableName and new_existingRDDTableName, it should cache only new_existingRDDTableName. But we cannot explicitly uncache existingRDDTableName, because we want the union to use the cached existingRDDTableName. Since it is lazy, new_existingRDDTableName could be materialized later, and by then we can't lose existingRDDTableName from the cache. What if we keep the same name for the new table:

cacheTable("existingRDDTableName")
existingRDD.union(newRDD).registerAsTable("existingRDDTableName")
cacheTable("existingRDDTableName") // might not be needed again.

Would both our cases be satisfied: the union uses existingRDDTableName from the cache, and the data is not duplicated in the cache but is somehow appended to the older cached table? Thanks and Regards, Archit Thakur. Sr Software Developer, Guavus, Inc.

On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora pankajarora.n...@gmail.com wrote: I think I should elaborate on the use case a little more. We have a UI dashboard whose response time is quite fast because all the data is cached. Users query data based on a time range, and there is always new data coming into the system at a predefined frequency, let's say every 1 hour. As you said, I can uncache tables, but that will basically drop all data from memory. I cannot afford to lose my cache even for a short interval, since all queries from the UI will be slow until the cache loads again. The UI response time needs to be predictable and should be fast enough that users don't get irritated. Also, I cannot keep two copies of the data in memory (until the new RDD materializes), as that would exceed the total memory available in the system. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
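Putting the workaround together, a minimal sketch that extends the snippet above to the dashboard use case; the "dashboard_data" name and the COUNT(*) query are illustrative, and newAdditionRdd is assumed to be a SchemaRDD.

```scala
// Cache only the new slice; the already-cached table stays cached.
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")

// Both inputs come back in their cached form, so the union scans memory.
val unioned = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

// Re-register under the name the dashboard queries, without ever
// un-caching "table1" -- so UI queries never hit a cold cache.
unioned.registerTempTable("dashboard_data")
sqlContext.sql("SELECT COUNT(*) FROM dashboard_data").collect()
```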