Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi,

I'm not a master of Spark SQL, but from what I understand, the problem is that
you're trying to access an RDD inside another RDD, here:
val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate ...
and here:
xyz = file.map(line => extractCurRate(sqlContext.sql("select rate ...
RDDs can't be serialized inside the tasks of other RDDs, which is why you're getting the
NullPointerException.

More specifically, you are trying to generate a SchemaRDD inside an RDD, which 
you can't do.

If the file isn't huge, you can call .collect() to turn the RDD into an array
and then use .map() on that array.

If the file is huge, then you can run query 3 first, join the two RDDs using
'txCurCode' as the key, and then do the filtering operations, etc.
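
For that second route, a rough sketch (it reuses file, sqlContext, the column layout and
the substring offsets from the original post below; the row accessor index and everything
else here is an assumption, not tested code):

import org.apache.spark.SparkContext._   // pair-RDD operations such as join (Spark 1.x)

// Run the SQL once on the driver side to get one plain RDD of rates, keyed by txCurCode.
val rates = sqlContext.sql(
    "select rate, txCurCode, fxCurCode, effectiveDate from CurrencyCodeRates")
  .map(row => (row.getString(1), row))          // assumes txCurCode is the 2nd selected column

// Key each event line by its txCurCode (substring offsets taken from the original post).
val events = file.map(line => (line.substring(202, 205), line))

// Join on txCurCode; filtering on fxCurCode / effectiveDate can be done on the joined RDD.
val joined = events.join(rates)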

Best,
Burak

- Original Message -
From: rkishore999 rkishore...@yahoo.com
To: u...@spark.incubator.apache.org
Sent: Saturday, September 13, 2014 10:29:26 PM
Subject: Spark SQL

val file =
sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt")

1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate
from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "'
and fxCurCode = '" + fxCurCodesMap(line.substring(77,82)) + "' and
effectiveDate = '" + line.substring(221,229) + "' order by effectiveDate
desc")))

2. val xyz = file.map(line => sqlContext.sql("select rate, txCurCode,
fxCurCode, effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and
fxCurCode = 'CSD' and effectiveDate = '20140901' order by effectiveDate
desc"))

3. val xyz = sqlContext.sql("select rate, txCurCode, fxCurCode,
effectiveDate from CurrencyCodeRates where txCurCode = 'USD' and fxCurCode =
'CSD' and effectiveDate = '20140901' order by effectiveDate desc")

xyz.saveAsTextFile("/user/output")

In statements 1 and 2 I'm getting a NullPointerException, but statement 3 is
fine. I'm guessing the Spark context and the SQL context are not playing
together well.

Any suggestions regarding how I can achieve this?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-tp14183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Broadcast error

2014-09-14 Thread Chengi Liu
Hi,
   I am trying to create an RDD out of a large matrix; sc.parallelize
suggests using broadcast.
But when I do

sc.broadcast(data)
I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370,
in broadcast
    pickled = pickleSer.dumps(value)
  File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line
279, in dumps
    def dumps(self, obj): return cPickle.dumps(obj, 2)
SystemError: error return without exception set
Help?


Re: Broadcast error

2014-09-14 Thread Chengi Liu
Specifically, this is the error I see when I try to operate on the RDD created by
the sc.parallelize method:
org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.

On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com wrote:

 Hi,
I am trying to create an rdd out of large matrix sc.parallelize
 suggest to use broadcast
 But when I do

 sc.broadcast(data)
 I get this error:

 Traceback (most recent call last):
   File stdin, line 1, in module
   File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line 370,
 in broadcast
 pickled = pickleSer.dumps(value)
   File /usr/common/usg/spark/1.0.2/python/pyspark/serializers.py, line
 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 SystemError: error return without exception set
 Help?




File operations on spark

2014-09-14 Thread rapelly kartheek
Hi

I am trying to perform read/write file operations in Spark by creating a
Writable object, but I am not able to write to a file. The data concerned is
not an RDD.

Can someone please tell me how to perform read/write file operations on
non-RDD data in Spark?

Regards
karthik
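
One possible approach (a sketch, not from this thread): non-RDD data lives on the
driver, so it can be written with ordinary JVM I/O or, for HDFS, with the Hadoop
FileSystem API. The path and the summary value below are made up for illustration.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DriverSideIO"))

// A plain driver-side value, not an RDD.
val summary = "some driver-side result"

// Write it to HDFS using the same Hadoop configuration Spark uses.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/user/karthik/summary.txt"))
out.write(summary.getBytes("UTF-8"))
out.close()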


Driver fail with out of memory exception

2014-09-14 Thread richiesgr
Hi

I've written a job (I think not very complicated, only one reduceByKey), but the
driver JVM always hangs with an OOM, killing the worker of course. How can I know
what is running on the driver and what is running on the worker, and how do I
debug the memory problem?
I've already used the --driver-memory 4g param to give it more memory, but nothing
helps; it always fails.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Driver-fail-with-out-of-memory-exception-tp14188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Driver fail with out of memory exception

2014-09-14 Thread Akhil Das
Try increasing the number of partitions while doing a reduceByKey()
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
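
For example, a minimal sketch (the input path, the key function and the partition
count of 200 are all illustrative, not from the thread):

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ReduceByKeyPartitions"))
val pairs = sc.textFile("hdfs:///some/input").map(line => (line.take(8), 1))

// Passing an explicit partition count spreads the shuffle over more, smaller tasks
// instead of relying on the default parallelism.
val counts = pairs.reduceByKey(_ + _, 200)
counts.saveAsTextFile("hdfs:///some/output")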

Thanks
Best Regards

On Sun, Sep 14, 2014 at 5:11 PM, richiesgr richie...@gmail.com wrote:

 Hi

 I've written a job (I think not very complicated only 1 reduceByKey) the
 driver JVM always hang with OOM killing the worker of course. How can I
 know
 what is running on the driver and what is running on the worker how to
 debug
 the memory problem.
 I've already used --driver-memory 4g params to give more memory ut nothing
 help it always fail

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Driver-fail-with-out-of-memory-exception-tp14188.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, 

I have tried to run HBaseTest.scala, but I got the following errors; any ideas
how to fix them?

Q1)
scala> package org.apache.spark.examples
<console>:1: error: illegal start of definition
   package org.apache.spark.examples


Q2)
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
<console>:31: error: object hbase is not a member of package org.apache.hadoop
   import org.apache.hadoop.hbase.mapreduce.TableInputFormat



Regards
Arthur

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
The Spark examples module builds against HBase 0.94 by default.

If you want to run against 0.98, see:
SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

Cheers

On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,

 I have tried to to run *HBaseTest.scala, *but I  got following errors,
 any ideas to how to fix them?

 Q1)
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples


 Q2)
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat



 Regards
 Arthur



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!!

I tried to apply the patches. Both spark-1297-v2.txt and spark-1297-v4.txt are
good here, but not spark-1297-v5.txt:

$ patch -p1 -i spark-1297-v4.txt
patching file examples/pom.xml

$ patch -p1 -i spark-1297-v5.txt
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- docs/building-with-maven.md
|+++ docs/building-with-maven.md
--
File to patch:

[The message ended with a pasted copy of spark-1297-v5.txt: a git diff that adds
hbase-hadoop1 (default) and hbase-hadoop2 profiles with hbase.version set to
0.98.4-hadoop1 / 0.98.4-hadoop2 to examples/pom.xml, replaces the single hbase
dependency there with hbase-testing-util, hbase-protocol, hbase-common,
hbase-client and hbase-server (each with exclusions), and documents the new
profiles in docs/building-with-maven.md. The paste is truncated in the archive.]

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
spark-1297-v5.txt is a level-0 patch.

Please use spark-1297-v5.txt

Cheers

On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,

 Thanks!!

 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt
 are good here,  but not spark-1297-v5.txt:


 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml

 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch:






 Please advise.
 Regards
 Arthur



 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:

 Spark examples builds against hbase 0.94 by default.

 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

 Cheers

 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi,

 I have tried to to run *HBaseTest.scala, *but I  got following errors,
 any ideas to how to fix them?

 Q1)
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples


 Q2)
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat



 Regards
 Arthur







Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!

patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

Still got errors.

Regards
Arthur

On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:

 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
 are good here,  but not spark-1297-v5.txt:
 
 
 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml
 
 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch: 
 
 
 
 
 
 
 Please advise.
 Regards
 Arthur
 
 
 
 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 Spark examples builds against hbase 0.94 by default.
 
 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
 
 Cheers
 
 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi, 
 
 I have tried to to run HBaseTest.scala, but I  got following errors, any 
 ideas to how to fix them?
 
 Q1) 
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples
 
 
 Q2) 
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package 
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
 
 
 Regards
 Arthur
 
 
 
 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

My bad.  Tried again, worked.


patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


Thanks!
Arthur

On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,
 
 Thanks!
 
 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 Still got errors.
 
 Regards
 Arthur
 
 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
 are good here,  but not spark-1297-v5.txt:
 
 
 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml
 
 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch: 
 
 
 
 
 
 
 Please advise.
 Regards
 Arthur
 
 
 
 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 Spark examples builds against hbase 0.94 by default.
 
 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
 
 Cheers
 
 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi, 
 
 I have tried to to run HBaseTest.scala, but I  got following errors, any 
 ideas to how to fix them?
 
 Q1) 
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples
 
 
 Q2) 
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package 
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 
 
 
 Regards
 Arthur
 
 
 
 
 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
I applied the patch on master branch without rejects.

If you use spark 1.0.2, use pom.xml attached to the JIRA.

On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,

 Thanks!

 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

 Still got errors.

 Regards
 Arthur

 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:

 spark-1297-v5.txt is level 0 patch

 Please use spark-1297-v5.txt

 Cheers

 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi,

 Thanks!!

 I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt
 are good here,  but not spark-1297-v5.txt:


 $ patch -p1 -i spark-1297-v4.txt
 patching file examples/pom.xml

 $ patch -p1 -i spark-1297-v5.txt
 can't find file to patch at input line 5
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 |diff --git docs/building-with-maven.md docs/building-with-maven.md
 |index 672d0ef..f8bcd2b 100644
 |--- docs/building-with-maven.md
 |+++ docs/building-with-maven.md
 --
 File to patch:






 Please advise.
 Regards
 Arthur



 On 14 Sep, 2014, at 10:48 pm, Ted Yu yuzhih...@gmail.com wrote:

 Spark examples builds against hbase 0.94 by default.

 If you want to run against 0.98, see:
 SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

 Cheers

 On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi,

 I have tried to to run *HBaseTest.scala, *but I  got following errors,
 any ideas to how to fix them?

 Q1)
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples


 Q2)
 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 console:31: error: object hbase is not a member of package
 org.apache.hadoop
import org.apache.hadoop.hbase.mapreduce.TableInputFormat



 Regards
 Arthur









Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

I applied the patch.

1) patched

$ patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


2) Compilation result
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.550s]
[INFO] Spark Project Core  SUCCESS [1:32.175s]
[INFO] Spark Project Bagel ... SUCCESS [10.809s]
[INFO] Spark Project GraphX .. SUCCESS [31.435s]
[INFO] Spark Project Streaming ... SUCCESS [44.518s]
[INFO] Spark Project ML Library .. SUCCESS [48.992s]
[INFO] Spark Project Tools ... SUCCESS [7.028s]
[INFO] Spark Project Catalyst  SUCCESS [40.365s]
[INFO] Spark Project SQL . SUCCESS [43.305s]
[INFO] Spark Project Hive  SUCCESS [36.464s]
[INFO] Spark Project REPL  SUCCESS [20.319s]
[INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
[INFO] Spark Project YARN Stable API . SUCCESS [19.379s]
[INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s]
[INFO] Spark Project Assembly  SUCCESS [13.822s]
[INFO] Spark Project External Twitter  SUCCESS [9.566s]
[INFO] Spark Project External Kafka .. SUCCESS [12.848s]
[INFO] Spark Project External Flume Sink . SUCCESS [10.437s]
[INFO] Spark Project External Flume .. SUCCESS [14.554s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
[INFO] Spark Project External MQTT ... SUCCESS [8.684s]
[INFO] Spark Project Examples  SUCCESS [1:31.610s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 9:41.700s
[INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
[INFO] Final Memory: 83M/1071M
[INFO] 



3) testing:
scala> package org.apache.spark.examples
<console>:1: error: illegal start of definition
   package org.apache.spark.examples
   ^


scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.spark._
import org.apache.spark._

scala> object HBaseTest {
 | def main(args: Array[String]) {
 | val sparkConf = new SparkConf().setAppName("HBaseTest")
 | val sc = new SparkContext(sparkConf)
 | val conf = HBaseConfiguration.create()
 | // Other options for configuring scan behavior are available. More
information available at
 | //
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
 | conf.set(TableInputFormat.INPUT_TABLE, args(0))
 | // Initialize hBase table if necessary
 | val admin = new HBaseAdmin(conf)
 | if (!admin.isTableAvailable(args(0))) {
 | val tableDesc = new HTableDescriptor(args(0))
 | admin.createTable(tableDesc)
 | }
 | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
 | classOf[org.apache.hadoop.hbase.client.Result])
 | hBaseRDD.count()
 | sc.stop()
 | }
 | }
warning: there were 1 deprecation warning(s); re-run with -deprecation for
details
defined module HBaseTest



Now I only get an error when trying to run "package org.apache.spark.examples".

Please advise.
Regards
Arthur



On 14 Sep, 2014, at 11:41 pm, Ted Yu yuzhih...@gmail.com wrote:

 I applied the patch on master branch without rejects.
 
 If you use spark 1.0.2, use pom.xml attached to the JIRA.
 
 On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!
 
 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
 
 Still got errors.
 
 Regards
 Arthur
 
 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:
 
 spark-1297-v5.txt is level 0 patch
 
 Please use spark-1297-v5.txt
 
 Cheers
 
 On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:
 Hi,
 
 Thanks!!
 
 I tried to apply the patches, both spark-1297-v2.txt and 

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
Take a look at bin/run-example

Cheers

On Sun, Sep 14, 2014 at 9:15 AM, arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.com wrote:

 Hi,

 I applied the patch.

 1) patched

 $ patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml


 2) Compilation result
 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS [1.550s]
 [INFO] Spark Project Core  SUCCESS
 [1:32.175s]
 [INFO] Spark Project Bagel ... SUCCESS
 [10.809s]
 [INFO] Spark Project GraphX .. SUCCESS
 [31.435s]
 [INFO] Spark Project Streaming ... SUCCESS
 [44.518s]
 [INFO] Spark Project ML Library .. SUCCESS
 [48.992s]
 [INFO] Spark Project Tools ... SUCCESS [7.028s]
 [INFO] Spark Project Catalyst  SUCCESS
 [40.365s]
 [INFO] Spark Project SQL . SUCCESS
 [43.305s]
 [INFO] Spark Project Hive  SUCCESS
 [36.464s]
 [INFO] Spark Project REPL  SUCCESS
 [20.319s]
 [INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
 [INFO] Spark Project YARN Stable API . SUCCESS
 [19.379s]
 [INFO] Spark Project Hive Thrift Server .. SUCCESS
 [12.470s]
 [INFO] Spark Project Assembly  SUCCESS
 [13.822s]
 [INFO] Spark Project External Twitter  SUCCESS [9.566s]
 [INFO] Spark Project External Kafka .. SUCCESS
 [12.848s]
 [INFO] Spark Project External Flume Sink . SUCCESS
 [10.437s]
 [INFO] Spark Project External Flume .. SUCCESS
 [14.554s]
 [INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
 [INFO] Spark Project External MQTT ... SUCCESS [8.684s]
 [INFO] Spark Project Examples  SUCCESS
 [1:31.610s]
 [INFO]
 
 [INFO] BUILD SUCCESS
 [INFO]
 
 [INFO] Total time: 9:41.700s
 [INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
 [INFO] Final Memory: 83M/1071M
 [INFO]
 



 3) testing:
 scala package org.apache.spark.examples
 console:1: error: illegal start of definition
package org.apache.spark.examples
^


 scala import org.apache.hadoop.hbase.client.HBaseAdmin
 import org.apache.hadoop.hbase.client.HBaseAdmin

 scala import org.apache.hadoop.hbase.{HBaseConfiguration,
 HTableDescriptor}
 import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

 scala import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 import org.apache.hadoop.hbase.mapreduce.TableInputFormat

 scala import org.apache.spark._
 import org.apache.spark._

 scala object HBaseTest {
  | def main(args: Array[String]) {
  | val sparkConf = new SparkConf().setAppName(HBaseTest)
  | val sc = new SparkContext(sparkConf)
  | val conf = HBaseConfiguration.create()
  | // Other options for configuring scan behavior are available. More
 information available at
  | //
 http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
  | conf.set(TableInputFormat.INPUT_TABLE, args(0))
  | // Initialize hBase table if necessary
  | val admin = new HBaseAdmin(conf)
  | if (!admin.isTableAvailable(args(0))) {
  | val tableDesc = new HTableDescriptor(args(0))
  | admin.createTable(tableDesc)
  | }
  | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  | classOf[org.apache.hadoop.hbase.client.Result])
  | hBaseRDD.count()
  | sc.stop()
  | }
  | }
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 defined module HBaseTest



 Now only got error when trying to run package org.apache.spark.examples”

 Please advise.
 Regards
 Arthur



 On 14 Sep, 2014, at 11:41 pm, Ted Yu yuzhih...@gmail.com wrote:

 I applied the patch on master branch without rejects.

 If you use spark 1.0.2, use pom.xml attached to the JIRA.

 On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
 arthur.hk.c...@gmail.com wrote:

 Hi,

 Thanks!

 patch -p0 -i spark-1297-v5.txt
 patching file docs/building-with-maven.md
 patching file examples/pom.xml
 Hunk #1 FAILED at 45.
 Hunk #2 FAILED at 110.
 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

 Still got errors.

 Regards
 Arthur

 On 14 Sep, 2014, at 11:33 pm, Ted Yu yuzhih...@gmail.com wrote:

 spark-1297-v5.txt is level 0 patch

 

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Can you post your whole SBT build file(s)?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler sp...@tbonline.de wrote:

 Hi,

 I just called:

  test

 or

  run

 Thorsten


 Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com:

  Hi,

 What is your SBT command and the parameters?

 Arthur


 On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:

  Hello,

 I am writing a Spark app which is already working so far.
 Now I have started building some unit tests as well, but I am running into some
 dependency problems and I cannot find a solution right now. Perhaps someone
 could help me.

 I build my Spark Project with SBT and it seems to be configured well,
 because compiling, assembling and running the built jar with spark-submit
 are working well.

 Now I started with the UnitTests, which I located under /src/test/scala.

 When I call test in sbt, I get the following:

 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
 BlockManager
 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
 [trace] Stack trace suppressed: run last test:test for the full output.
 [error] Could not run test test.scala.SetSuite: 
 java.lang.NoClassDefFoundError:
 javax/servlet/http/HttpServletResponse
 [info] Run completed in 626 milliseconds.
 [info] Total number of tests run: 0
 [info] Suites: completed 0, aborted 0
 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
 [info] All tests passed.
 [error] Error during tests:
 [error] test.scala.SetSuite
 [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
 [error] Total time: 3 s, completed 10.09.2014 12:22:06

 last test:test gives me the following:

  last test:test

 [debug] Running TaskDef(test.scala.SetSuite,
 org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.init(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at test.scala.SetSuite.init(SparkTest.scala:16)

 I also noticed right now, that sbt run is also not working:

 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
 [error] (run-main-2) java.lang.NoClassDefFoundError: javax/servlet/http/
 HttpServletResponse
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.init(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at main.scala.PartialDuplicateScanner$.main(
 PartialDuplicateScanner.scala:29)
 at main.scala.PartialDuplicateScanner.main(
 PartialDuplicateScanner.scala)

 Here is my Testprojekt.sbt file:

 name := "Testprojekt"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= {
   Seq(
     "org.apache.lucene" % "lucene-core" % "4.9.0",
     "org.apache.lucene" % "lucene-analyzers-common" % "4.9.0",
     "org.apache.lucene" % "lucene-queryparser" % "4.9.0",
     ("org.apache.spark" %% "spark-core" % "1.0.2").
       exclude("org.mortbay.jetty", "servlet-api").
       exclude("commons-beanutils", "commons-beanutils-core").
       exclude("commons-collections", "commons-collections").
       exclude("commons-collections", "commons-collections").
       exclude("com.esotericsoftware.minlog", "minlog").
       exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
       exclude("org.eclipse.jetty.orbit", "javax.transaction").
       exclude("org.eclipse.jetty.orbit", "javax.servlet")
   )
 }

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"







 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


 

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Sorry, I meant any *other* SBT files.

However, what happens if you remove the line:

exclude("org.eclipse.jetty.orbit", "javax.servlet")


dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
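
In terms of Thorsten's Testprojekt.sbt, that suggestion would look something like
this (a sketch; only the last exclusion is dropped, the rest of the dependency
stays as posted):

// Hypothetical revision of the spark-core dependency with the javax.servlet
// exclusion removed, so the servlet API classes stay on the sbt test classpath.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.0.2").
  exclude("org.mortbay.jetty", "servlet-api").
  exclude("commons-beanutils", "commons-beanutils-core").
  exclude("commons-collections", "commons-collections").
  exclude("com.esotericsoftware.minlog", "minlog").
  exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
  exclude("org.eclipse.jetty.orbit", "javax.transaction")
  // exclude("org.eclipse.jetty.orbit", "javax.servlet") removed, per the suggestion above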

On Sun, Sep 14, 2014 at 11:53 AM, Dean Wampler deanwamp...@gmail.com
wrote:

 Can you post your whole SBT build file(s)?

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler sp...@tbonline.de
 wrote:

 Hi,

 I just called:

  test

 or

  run

 Thorsten


 Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com:

  Hi,

 What is your SBT command and the parameters?

 Arthur


 On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:

  Hello,

 I am writing a Spark App which is already working so far.
 Now I started to build also some UnitTests, but I am running into some
 dependecy problems and I cannot find a solution right now. Perhaps someone
 could help me.

 I build my Spark Project with SBT and it seems to be configured well,
 because compiling, assembling and running the built jar with spark-submit
 are working well.

 Now I started with the UnitTests, which I located under /src/test/scala.

 When I call test in sbt, I get the following:

 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
 BlockManager
 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
 [trace] Stack trace suppressed: run last test:test for the full output.
 [error] Could not run test test.scala.SetSuite: 
 java.lang.NoClassDefFoundError:
 javax/servlet/http/HttpServletResponse
 [info] Run completed in 626 milliseconds.
 [info] Total number of tests run: 0
 [info] Suites: completed 0, aborted 0
 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
 [info] All tests passed.
 [error] Error during tests:
 [error] test.scala.SetSuite
 [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
 [error] Total time: 3 s, completed 10.09.2014 12:22:06

 last test:test gives me the following:

  last test:test

 [debug] Running TaskDef(test.scala.SetSuite,
 org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.init(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at test.scala.SetSuite.init(SparkTest.scala:16)

 I also noticed right now, that sbt run is also not working:

 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
 [error] (run-main-2) java.lang.NoClassDefFoundError:
 javax/servlet/http/HttpServletResponse
 java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
 at org.apache.spark.HttpServer.start(HttpServer.scala:54)
 at org.apache.spark.broadcast.HttpBroadcast$.createServer(
 HttpBroadcast.scala:156)
 at org.apache.spark.broadcast.HttpBroadcast$.initialize(
 HttpBroadcast.scala:127)
 at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
 HttpBroadcastFactory.scala:31)
 at org.apache.spark.broadcast.BroadcastManager.initialize(
 BroadcastManager.scala:48)
 at org.apache.spark.broadcast.BroadcastManager.init(
 BroadcastManager.scala:35)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
 at org.apache.spark.SparkContext.init(SparkContext.scala:202)
 at main.scala.PartialDuplicateScanner$.main(
 PartialDuplicateScanner.scala:29)
 at main.scala.PartialDuplicateScanner.main(
 PartialDuplicateScanner.scala)

 Here is my Testprojekt.sbt file:

 name := Testprojekt

 version := 1.0

 scalaVersion := 2.10.4

 libraryDependencies ++= {
   Seq(
 org.apache.lucene % lucene-core % 4.9.0,
 org.apache.lucene % lucene-analyzers-common % 4.9.0,
 org.apache.lucene % lucene-queryparser % 4.9.0,
 (org.apache.spark %% spark-core % 1.0.2).
 exclude(org.mortbay.jetty, servlet-api).
 exclude(commons-beanutils, commons-beanutils-core).
 exclude(commons-collections, commons-collections).
 exclude(commons-collections, commons-collections).
 exclude(com.esotericsoftware.minlog, minlog).
 

failed to run SimpleApp locally on macbook

2014-09-14 Thread Gary Zhao
Hello

I'm new to Spark and I couldn't make the SimpleApp run on my macbook. I
feel it's related to network configuration. Could anyone take a look?
Thanks.

14/09/14 10:10:36 INFO Utils: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to
/var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp1014702023347580837.tmp
14/09/14 10:11:36 INFO Executor: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar with timestamp
1410714636103
14/09/14 10:11:36 INFO Utils: Fetching
http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to
/var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4ftc9n6/T/fetchFileTemp4432132500879081005.tmp
14/09/14 10:11:36 ERROR Executor: Exception in task ID 1
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.init(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:348)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org
$apache$spark$executor$Executor$$updateDependencies(Executor.scala:328)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/09/14 10:11:36 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/09/14 10:11:36 WARN TaskSetManager: Loss was due to
java.net.SocketTimeoutException
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.init(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:348)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at 

Re: HBase 0.96+ with Spark 1.0+

2014-09-14 Thread Reinis Vicups
I did actually try Sean's suggestion just before I posted for the first
time in this thread. I got an error when doing so and thought that I
was not understanding what Sean was suggesting.


Now I have re-attempted your suggestions with the spark 1.0.0-cdh5.1.0, hbase
0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh5.1.0 versions I am currently using.


I used the following:

  val mortbayEnforce = "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
  val mortbayExclusion = ExclusionRule(organization =
    "org.mortbay.jetty", name = "servlet-api-2.5")


and applied this to the hadoop and hbase dependencies, e.g. like this:

val hbase = Seq(HBase.server, HBase.common, HBase.compat, HBase.compat2,
HBase.protocol, mortbayEnforce).map(_.excludeAll(HBase.exclusions: _*))


private object HBase {
  val server = "org.apache.hbase" % "hbase-server" % Version.HBase
  ...
  val exclusions = Seq(ExclusionRule("org.apache.ant"), mortbayExclusion)
}

I still get the error I got the last time I tried this experiment:

14/09/14 18:28:09 ERROR metrics.MetricsSystem: Sink class 
org.apache.spark.metrics.sink.MetricsServlet cannot be instantialized

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:136)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:130)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:130)
at 
org.apache.spark.metrics.MetricsSystem.init(MetricsSystem.scala:84)
at 
org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:167)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:230)
at org.apache.spark.SparkContext.init(SparkContext.scala:202)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply$mcV$sp(SimpleTicketTextSimilaritySparkJobSpec.scala:29)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at 
d.s.f.s.t.SimpleTicketTextSimilaritySparkJobSpec$$anonfun$1.apply(SimpleTicketTextSimilaritySparkJobSpec.scala:21)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)

at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)

at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)

at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)

at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1714)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1683)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FlatSpec.org$scalatest$FlatSpecLike$$super$run(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)
at 
org.scalatest.FlatSpecLike$$anonfun$run$1.apply(FlatSpecLike.scala:1760)

at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  

Re: Broadcast error

2014-09-14 Thread Chengi Liu
How? An example, please.
Also, if I am running this in the pyspark shell, how do I configure
spark.akka.frameSize?


On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 When the data size is huge, you are better off using the TorrentBroadcastFactory.

 Thanks
 Best Regards

 On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Specifically the error I see when I try to operate on rdd created by
 sc.parallelize method
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
 (10485760 bytes). Consider using broadcast variables for large values.

 On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Hi,
I am trying to create an rdd out of large matrix sc.parallelize
 suggest to use broadcast
 But when I do

 sc.broadcast(data)
 I get this error:

 Traceback (most recent call last):
   File stdin, line 1, in module
   File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line
 370, in broadcast
 pickled = pickleSer.dumps(value)
   File /usr/common/usg/spark/1.0.2/python/pyspark/serializers.py, line
 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 SystemError: error return without exception set
 Help?






Re: compiling spark source code

2014-09-14 Thread Matei Zaharia
I've seen the file name too long error when compiling on an encrypted Linux 
file system -- some of them have a limit on file name lengths. If you're on 
Linux, can you try compiling inside /tmp instead?

Matei

On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote:

Can you try sbt/sbt clean first?

On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. [error] File name too long

It is not clear which file(s) loadfiles was loading.
Is the filename in earlier part of the output ?

Cheers

On Sat, Sep 13, 2014 at 10:58 AM, kkptninja kkptni...@gmail.com wrote:
Hi Ted,

Thanks for the prompt reply :)

please find details of the issue at this url  http://pastebin.com/Xt0hZ38q

Kind Regards




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Broadcast error

2014-09-14 Thread Chengi Liu
And when I use the spark-submit script, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling
o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


My spark submit code is

conf = SparkConf().set("spark.executor.memory",
"32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf = conf)
rdd = sc.parallelize(matrix,5)

from pyspark.mllib.clustering import KMeans
from math import sqrt
clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2,
initializationMode="random")
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)

Which is executed as follows:
spark-submit --master $SPARKURL clustering_example.py  --executor-memory
32G  --driver-memory 60G

On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com
wrote:

 How? Example please..
 Also, if I am running this in pyspark shell.. how do i configure
 spark.akka.frameSize ??


 On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 When the data size is huge, you better of use the torrentBroadcastFactory.

 Thanks
 Best Regards

 On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Specifically the error I see when I try to operate on rdd created by
 sc.parallelize method
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
 (10485760 bytes). Consider using broadcast variables for large values.

 On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Hi,
I am trying to create an rdd out of large matrix sc.parallelize
 suggest to use broadcast
 But when I do

 sc.broadcast(data)
 I get this error:

 Traceback (most recent call last):
   File stdin, line 1, in module
   File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line
 370, in broadcast
 pickled = pickleSer.dumps(value)
   File /usr/common/usg/spark/1.0.2/python/pyspark/serializers.py,
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 SystemError: error return without exception set
 Help?







Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew,

I agree with Nicholas.  That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors.  I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently represented somewhere and it would be great to persist it
somewhere besides the mailing list.

best,
-Brad

On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Andrew,

 This email was pretty helpful. I feel like this stuff should be summarized
 in the docs somewhere, or perhaps in a blog post.

 Do you know if it is?

 Nick


 On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

 The locality is how close the data is to the code that's processing it.
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
 same node, or in another executor on the same node, so is a little slower
 because the data has to travel across an IPC connection.  RACK_LOCAL is even
 slower -- data is on a different server so needs to be sent over the
 network.

 Spark switches to lower locality levels when there's no unprocessed data
 on a node that has idle CPUs.  In that situation you have two options: wait
 until the busy CPUs free up so you can start another task that uses data on
 that server, or start a new task on a farther away server that needs to
 bring data from that remote place.  What Spark typically does is wait a bit
 in the hopes that a busy CPU frees up.  Once that timeout expires, it starts
 moving the data from far away to the free CPU.

 The main tunable option is how long the scheduler waits before
 starting to move data rather than code.  Those are the spark.locality.*
 settings here: http://spark.apache.org/docs/latest/configuration.html

 If you want to prevent this from happening entirely, you can set the
 values to ridiculously high numbers.  The documentation also mentions that
 0 has special meaning, so you can try that as well.
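
For concreteness, a minimal sketch of those settings (the values are illustrative
and not from Andrew's message; in Spark 1.x they are given in milliseconds):

import org.apache.spark.{SparkConf, SparkContext}

// Wait longer for a data-local slot before falling back to NODE_LOCAL / RACK_LOCAL / ANY.
val conf = new SparkConf()
  .setAppName("LocalityTuning")
  .set("spark.locality.wait", "10000")        // default is 3000 (3 seconds)
  .set("spark.locality.wait.node", "10000")   // per-level override, same units
val sc = new SparkContext(conf)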

 Good luck!
 Andrew


 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
 wrote:

 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
 assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

 When these happen things get extremely slow.

 Does this mean that the executor got terminated and restarted?

 Is there a way to prevent this from happening (barring the machine
 actually going down, I'd rather stick with the same process)?




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah, that issue has been fixed by adding better docs; it just didn't make
it in time for the release:

https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54


On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com
wrote:

 resolved:

 ./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon
 -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

 This code is a bit misleading


 --
 Zhanfeng Huo


 *From:* Zhanfeng Huo huozhanf...@gmail.com
 *Date:* 2014-09-12 14:13
 *To:* user user@spark.apache.org
 *Subject:* spark-1.1.0 with make-distribution.sh problem
 Hi,

 I compile Spark with the command  bash -x make-distribution.sh -Pyarn -Phive
 --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
 -Phadoop.version=2.3.0, but it errors out.

 How do I use it correctly?

message:
 + set -o pipefail
 + set -e
 +++ dirname make-distribution.sh
 ++ cd .
 ++ pwd
 + FWDIR=/home/syn/spark/spark-1.1.0
 + DISTDIR=/home/syn/spark/spark-1.1.0/dist
 + SPARK_TACHYON=false
 + MAKE_TGZ=false
 + NAME=none
 + (( 7 ))
 + case $1 in
 + break
 + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
 + '[' -z /home/syn/usr/jdk1.7.0_55 ']'
 + which git
 ++ git rev-parse --short HEAD
 + GITREV=5f6f219
 + '[' '!' -z 5f6f219 ']'
 + GITREVSTRING=' (git revision 5f6f219)'
 + unset GITREV
 + which mvn
 ++ mvn help:evaluate -Dexpression=project.version
 ++ grep -v INFO
 ++ tail -n 1
 + VERSION=1.1.0
 ++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive
 --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
 -Phadoop.version=2.3.0
 ++ grep -v INFO
 ++ tail -n 1
 + SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

 Best Regards
 --
 Zhanfeng Huo




Alternative to spark.executor.extraClassPath ?

2014-09-14 Thread innowireless TaeYun Kim
Hi,

 

In the Spark configuration document, spark.executor.extraClassPath is described
as a backwards-compatibility option, and it says that users typically
should not need to set it.

 

Now, I must add a classpath to the executor environment (as well as to the
driver in the future, but for now I'm running YARN-client mode).

Its value is '/usr/lib/hbase/lib/*'. (I'm trying to use HBase classes.)

How can I add that to the executor environment without using
spark.executor.extraClassPath?

 

BTW, spark.executor.extraClassPath 'prepends' the classpath to the CLASSPATH
environment variable instead of appending it, and this seems to cause a few
problems for my application (I've investigated launch_container.sh). Is there
a way to make it 'append' rather than 'prepend'?

 

I use Spark version 1.0.0.

 

Thanks.

 

 



Re: Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Zhanfeng Huo
Thank you very much. 

It is helpful for end users.



Zhanfeng Huo
 
From: Patrick Wendell
Date: 2014-09-15 10:19
To: Zhanfeng Huo
CC: user
Subject: Re: spark-1.1.0 with make-distribution.sh problem
Yeah that issue has been fixed by adding better docs, it just didn't make it in 
time for the release:

https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54


On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com wrote:
resolved:

./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn 
-Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

This code is a bit misleading




Zhanfeng Huo
 
From: Zhanfeng Huo
Date: 2014-09-12 14:13
To: user
Subject: spark-1.1.0 with make-distribution.sh problem
Hi,

I compile Spark with the command  bash -x make-distribution.sh -Pyarn -Phive 
--skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 
-Phadoop.version=2.3.0, but it errors out.

How do I use it correctly?

   message:
+ set -o pipefail 
+ set -e 
+++ dirname make-distribution.sh 
++ cd . 
++ pwd 
+ FWDIR=/home/syn/spark/spark-1.1.0 
+ DISTDIR=/home/syn/spark/spark-1.1.0/dist 
+ SPARK_TACHYON=false 
+ MAKE_TGZ=false 
+ NAME=none 
+ (( 7 )) 
+ case $1 in 
+ break 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ which git 
++ git rev-parse --short HEAD 
+ GITREV=5f6f219 
+ '[' '!' -z 5f6f219 ']' 
+ GITREVSTRING=' (git revision 5f6f219)' 
+ unset GITREV 
+ which mvn 
++ mvn help:evaluate -Dexpression=project.version 
++ grep -v INFO 
++ tail -n 1 
+ VERSION=1.1.0 
++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test 
--with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0 
++ grep -v INFO 
++ tail -n 1 
+ SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

Best Regards


Zhanfeng Huo



PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi,

I have a directory structure with parquet+avro data in it. There are a
couple of administrative files (.foo and/or _foo) that I need to ignore
when processing this data; otherwise Spark tries to read them as containing
parquet content, which they do not.

How can I set a PathFilter on the FileInputFormat used to construct an RDD?
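
One possible direction, as an untested sketch rather than a confirmed answer: the
new-API FileInputFormat reads the Hadoop key mapreduce.input.pathFilter.class, and
PySpark's newAPIHadoopFile accepts a conf dict that is passed into the Hadoop job
configuration. The filter class, path, input format and key/value classes below are
placeholders, not the actual parquet+avro ones, and the filter itself would have to
be a JVM class on the executor classpath.

from pyspark import SparkContext

sc = SparkContext(appName="path-filter-sketch")

hadoop_conf = {
    # hypothetical JVM class implementing org.apache.hadoop.fs.PathFilter
    "mapreduce.input.pathFilter.class": "com.example.SkipDotAndUnderscoreFiles",
}

rdd = sc.newAPIHadoopFile(
    "hdfs:///data/events",                                    # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",  # placeholder format
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=hadoop_conf)
print(rdd.take(1))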


Re: Broadcast error

2014-09-14 Thread Davies Liu
Hey Chengi,

What's the version of Spark you are using? There are big improvements
to broadcast in 1.1; could you try it?
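
For reference, a minimal sketch of the pattern that frameSize error message hints
at: broadcast the big local object once and only ship small indices through the
scheduler. The matrix below is a stand-in, not the original data.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("broadcast-rows-sketch")
sc = SparkContext(conf=conf)

matrix = [[float(i + j) for j in range(10)] for i in range(100000)]  # stand-in data

bcast = sc.broadcast(matrix)                      # data shipped once per executor
row_ids = sc.parallelize(range(len(matrix)), 5)   # only small ints go into tasks
rows = row_ids.map(lambda i: bcast.value[i])      # rebuild rows on the executors
print(rows.count())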

On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com wrote:
 Any suggestions.. I am really blocked on this one

 On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu chengi.liu...@gmail.com wrote:

 And when I use sparksubmit script, I get the following error:

 py4j.protocol.Py4JJavaError: An error occurred while calling
 o26.trainKMeansModel.
 : org.apache.spark.SparkException: Job aborted due to stage failure: All
 masters are unresponsive! Giving up.
 at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 My spark submit code is

 conf = SparkConf().set("spark.executor.memory",
 "32G").set("spark.akka.frameSize", "1000")
 sc = SparkContext(conf=conf)
 rdd = sc.parallelize(matrix, 5)

 from pyspark.mllib.clustering import KMeans
 from math import sqrt
 clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2,
 initializationMode="random")
 def error(point):
     center = clusters.centers[clusters.predict(point)]
     return sqrt(sum([x**2 for x in (point - center)]))

 WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
 print "Within Set Sum of Squared Error = " + str(WSSSE)

 Which is executed as follows:
 spark-submit --master $SPARKURL clustering_example.py  --executor-memory
 32G  --driver-memory 60G

 On Sun, Sep 14, 2014 at 10:47 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 How? Example please..
 Also, if I am running this in pyspark shell.. how do i configure
 spark.akka.frameSize ??


 On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 When the data size is huge, you are better off using the
 TorrentBroadcastFactory.

 Thanks
 Best Regards

 On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Specifically the error I see when I try to operate on rdd created by
 sc.parallelize method
 : org.apache.spark.SparkException: Job aborted due to stage failure:
 Serialized task 12:12 was 12062263 bytes which exceeds 
 spark.akka.frameSize
 (10485760 bytes). Consider using broadcast variables for large values.

 On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu chengi.liu...@gmail.com
 wrote:

 Hi,
I am trying to create an rdd out of large matrix sc.parallelize
 suggest to use broadcast
 But when I do

 sc.broadcast(data)
 I get this error:

 Traceback (most recent call last):
   File stdin, line 1, in module
   File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line
 370, in broadcast
 pickled = pickleSer.dumps(value)
   File /usr/common/usg/spark/1.0.2/python/pyspark/serializers.py,
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 SystemError: error return without exception set
 Help?







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Broadcast error

2014-09-14 Thread Chengi Liu
I am using Spark 1.0.2.
This is my work cluster, so I can't set up a new version readily...
But right now, I am not using broadcast:


conf = SparkConf().set("spark.executor.memory",
    "32G").set("spark.akka.frameSize", "1000")
sc = SparkContext(conf=conf)
rdd = sc.parallelize(matrix, 5)

from pyspark.mllib.clustering import KMeans
from math import sqrt
clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2,
    initializationMode="random")
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print "Within Set Sum of Squared Error = " + str(WSSSE)


executed by
spark-submit --master $SPARKURL clustering_example.py  --executor-memory
32G  --driver-memory 60G

and the error I see
py4j.protocol.Py4JJavaError: An error occurred while calling
o26.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


and
14/09/14 20:43:30 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@hostname:7077:
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkMaster@ hostname:7077]

??
Any suggestions??


On Sun, Sep 14, 2014 at 8:39 PM, Davies Liu dav...@databricks.com wrote:

 Hey Chengi,

 What's the version of Spark you are using? It have big improvements
 about broadcast in 1.1, could you try it?

 On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  Any suggestions.. I am really blocked on this one
 
  On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
 
  And when I use sparksubmit script, I get the following error:
 
  py4j.protocol.Py4JJavaError: An error occurred while calling
  o26.trainKMeansModel.
  : org.apache.spark.SparkException: Job aborted due to stage failure: All
  masters are unresponsive! Giving up.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
  at
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
  at scala.Option.foreach(Option.scala:236)
  at
 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
  at
 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at
 
 

Re: Broadcast error

2014-09-14 Thread Chengi Liu
And the thing is, the code runs just fine if I reduce the number of rows in my
data.

On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu chengi.liu...@gmail.com wrote:

 I am using Spark 1.0.2.
 This is my work cluster, so I can't set up a new version readily...
 But right now, I am not using broadcast:


 conf = SparkConf().set("spark.executor.memory",
 "32G").set("spark.akka.frameSize", "1000")
 sc = SparkContext(conf=conf)
 rdd = sc.parallelize(matrix, 5)

 from pyspark.mllib.clustering import KMeans
 from math import sqrt
 clusters = KMeans.train(rdd, 5, maxIterations=2, runs=2,
 initializationMode="random")
 def error(point):
     center = clusters.centers[clusters.predict(point)]
     return sqrt(sum([x**2 for x in (point - center)]))

 WSSSE = rdd.map(lambda point: error(point)).reduce(lambda x, y: x + y)
 print "Within Set Sum of Squared Error = " + str(WSSSE)


 executed by
 spark-submit --master $SPARKURL clustering_example.py  --executor-memory
 32G  --driver-memory 60G

 and the error I see
 py4j.protocol.Py4JJavaError: An error occurred while calling
 o26.trainKMeansModel.
 : org.apache.spark.SparkException: Job aborted due to stage failure: All
 masters are unresponsive! Giving up.
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 and
 14/09/14 20:43:30 WARN AppClient$ClientActor: Could not connect to
 akka.tcp://sparkMaster@hostname:7077:
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkMaster@ hostname:7077]

 ??
 Any suggestions??


 On Sun, Sep 14, 2014 at 8:39 PM, Davies Liu dav...@databricks.com wrote:

 Hey Chengi,

 What's the version of Spark you are using? It have big improvements
 about broadcast in 1.1, could you try it?

 On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
  Any suggestions.. I am really blocked on this one
 
  On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu chengi.liu...@gmail.com
 wrote:
 
  And when I use sparksubmit script, I get the following error:
 
  py4j.protocol.Py4JJavaError: An error occurred while calling
  o26.trainKMeansModel.
  : org.apache.spark.SparkException: Job aborted due to stage failure:
 All
  masters are unresponsive! Giving up.
  at
  org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
  at
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
  at
 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
  at scala.Option.foreach(Option.scala:236)
  at
 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
  at
 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at 

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising.

Note that even right now, you don't need to un-cache the existing
table.   You can do something like this:

newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

When you use table(), it returns the cached representation, so
that the union executes much faster.

However, there is some unknown slowdown; it's not quite as fast as
you would expect.

On Fri, Sep 12, 2014 at 2:09 PM, Cheng Lian lian.cs@gmail.com wrote:
 Ah, I see. So basically what you need is something like cache write-through
 support, which exists in Shark but is not implemented in Spark SQL yet. In
 Shark, when inserting data into a table that has already been cached, the
 newly inserted data is automatically cached and "union"-ed with the
 existing table content. SPARK-1671 was created to track this feature. We'll
 work on that.

 Currently, as a workaround, instead of doing the union at the RDD level, you may
 try caching the new table, unioning it with the old table, and then querying the
 union-ed table. The drawback is higher code complexity and you end up with
 lots of temporary tables, but the performance should be reasonable.
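
A rough PySpark sketch of that workaround, with made-up table names, assuming the
PySpark SQLContext exposes jsonRDD, cacheTable and sql like the Scala API:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-union-sketch")
sqlContext = SQLContext(sc)

# pretend these are the existing cached table and the newly arrived batch
old_batch = sqlContext.jsonRDD(sc.parallelize(['{"id": 1}', '{"id": 2}']))
old_batch.registerTempTable("events_cached")
sqlContext.cacheTable("events_cached")

new_batch = sqlContext.jsonRDD(sc.parallelize(['{"id": 3}']))
new_batch.registerTempTable("events_new")
sqlContext.cacheTable("events_new")

# query the union of the two cached tables
unioned = sqlContext.sql(
    "SELECT id FROM events_cached UNION ALL SELECT id FROM events_new")
unioned.registerTempTable("events_all")
print(unioned.collect())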


 On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur archit279tha...@gmail.com
 wrote:

 A little code snippet:

 line1: cacheTable("existingRDDTableName")
 line2: // some operations which will materialize the existingRDD dataset
 line3: existingRDD.union(newRDD).registerAsTable("new_existingRDDTableName")
 line4: cacheTable("new_existingRDDTableName")
 line5: // some operation that will materialize the new existingRDD dataset

 Now, what we expect is that in line4, rather than caching both
 existingRDDTableName and new_existingRDDTableName, it should cache only
 new_existingRDDTableName. But we cannot explicitly uncache
 existingRDDTableName, because we want the union to use the cached
 existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName could be
 materialized later, and until then we can't lose existingRDDTableName from the cache.

 What if we keep the same name for the new table?

 so, cacheTable("existingRDDTableName")
 existingRDD.union(newRDD).registerAsTable("existingRDDTableName")
 cacheTable("existingRDDTableName") // might not be needed again

 Would both our cases be satisfied, i.e. the union uses existingRDDTableName from
 the cache and the data is not duplicated in the cache, but is somehow appended
 to the older cached table?

 Thanks and Regards,


 Archit Thakur.
 Sr Software Developer,
 Guavus, Inc.

 On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora
 pankajarora.n...@gmail.com wrote:

 I think I should elaborate on the use case a little more.

 So we have a UI dashboard whose response time is quite fast, as all the data is
 cached. Users query data based on time range, and there is always new
 data coming into the system at a predefined frequency, let's say every hour.

 As you said, I can uncache tables, but that will basically drop all the data from
 memory. I cannot afford to lose my cache even for a short interval, as all queries
 from the UI will get slow until the cache loads again. UI response time needs to
 be predictable and should be fast enough that the user does not get irritated.

 Also, I cannot keep two copies of the data in memory (until the new RDD
 materializes), as it would surpass the total available memory in the system.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org