Re: spark run issue

2014-05-04 Thread Tathagata Das
All the stuff in lib_managed is what gets downloaded by sbt/maven when you
compile. Those jars are necessary for running Spark, Spark Streaming, etc. But
you should not have to add all of that to the classpath individually and manually
when running Spark programs. If you are trying to run your Spark program
locally, you should use sbt or maven to compile your project with Spark as
a dependency, and sbt/maven will take care of putting all the necessary
jars on the classpath (when you run your program with sbt/maven). If
you are trying to run your Spark program on a cluster, then refer to the
deployment guide: http://spark.apache.org/docs/latest/cluster-overview.html. To run
a Spark standalone cluster, you just have to compile Spark and place the
whole Spark directory on the worker nodes. For other deploy modes like YARN
and Mesos, you should just compile Spark into a big all-inclusive jar and
supply that when you launch your program on YARN/Mesos. See the guide for
more details.
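
For example, a minimal build.sbt for such a project might look like the sketch
below (the project name, version, and the optional streaming dependency are
placeholders to adjust for your own build):

    name := "my-spark-app"

    version := "0.1"

    scalaVersion := "2.10.3"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

    // Only needed if you use Spark Streaming.
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "0.9.1"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"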

TD


On Sat, May 3, 2014 at 7:24 PM, Weide Zhang weo...@gmail.com wrote:

 Hi Tathagata,

 I actually have a separate question. What's the purpose of the lib_managed
 folder inside the Spark source folder? Are those the libraries required for
 Spark Streaming to run? Do they need to be added to the Spark classpath when
 starting a Spark cluster?

 Weide


 On Sat, May 3, 2014 at 7:08 PM, Weide Zhang weo...@gmail.com wrote:

 Hi Tathagata,

 I figured out the reason. I was adding a wrong Kafka lib alongside the
 version Spark uses. Sorry for spamming.

 Weide


 On Sat, May 3, 2014 at 7:04 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 I am a little confused about the version of Spark you are using. Are you
 using Spark 0.9.1, which uses Scala 2.10.3?

 TD


 On Sat, May 3, 2014 at 6:16 PM, Weide Zhang weo...@gmail.com wrote:

 Hi, I'm trying to run the kafka-word-count example in spark2.9.1. I
 encountered an exception when initializing the Kafka consumer/producer config.
 I'm using Scala 2.10.3 and used the Maven build of the Spark Streaming Kafka
 library that comes with spark2.9.1. Has anyone seen this exception before?

 Thanks,

 producer:
  [java] The args attribute is deprecated. Please use nested arg
 elements.
  [java] Exception in thread main java.lang.NoClassDefFoundError:
 scala/Tuple2$mcLL$sp
  [java] at
 kafka.producer.ProducerConfig.init(ProducerConfig.scala:56)
  [java] at
 com.turn.apache.KafkaWordCountProducer$.main(HelloWorld.scala:89)
  [java] at
 com.turn.apache.KafkaWordCountProducer.main(HelloWorld.scala)
  [java] Caused by: java.lang.ClassNotFoundException:
 scala.Tuple2$mcLL$sp
  [java] at
 java.net.URLClassLoader$1.run(URLClassLoader.java:202)
  [java] at java.security.AccessController.doPrivileged(Native
 Method)
  [java] at
 java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  [java] at
 sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  [java] ... 3 more
  [java] Java Result: 1







different in spark on yarn mode and standalone mode

2014-05-04 Thread Sophia
Hey guys,
What is the difference between Spark on YARN mode and standalone mode in terms
of resource scheduling?
Wishing you happiness every day.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Crazy Kryo Exception

2014-05-04 Thread Soren Macbeth
Does this perhaps have to do with the spark.closure.serializer?
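
For reference, the setup I'm describing is roughly the following sketch (the
custom class, registrator, and app names are placeholders, not my real ones):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator

    // Placeholder for the custom class being serialized.
    class MyCustomClass(val id: Long)

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyCustomClass])
      }
    }

    object KryoSetup {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setMaster("local")
          .setAppName("kryo-setup")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // The setting in question: also use Kryo for closure serialization.
          .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "MyRegistrator")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).map(i => new MyCustomClass(i.toLong)).count())
        sc.stop()
      }
    }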


On Sat, May 3, 2014 at 7:50 AM, Soren Macbeth so...@yieldbot.com wrote:

 Poking around in the bowels of Scala, it seems like this has something to
 do with implicit Scala-to-Java collection munging. Why would it be doing
 this, and where? The stack trace given is entirely unhelpful to me. Is there
 a better one buried in my task logs? None of my tasks actually failed, so
 it seems that it is dying while trying to fetch results from my tasks to
 return them to the driver.

 Am I close?


 On Fri, May 2, 2014 at 3:35 PM, Soren Macbeth so...@yieldbot.com wrote:

 Hallo,

 I've been getting this rather crazy Kryo exception trying to run my Spark job:

 Exception in thread main org.apache.spark.SparkException: Job aborted:
 Exception while deserializing and fetching task:
 com.esotericsoftware.kryo.KryoException:
 java.lang.IllegalArgumentException: Can not set final
 scala.collection.convert.Wrappers field
 scala.collection.convert.Wrappers$SeqWrapper.$outer to my.custom.class
 Serialization trace:
 $outer (scala.collection.convert.Wrappers$SeqWrapper)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 I have a kryo serializer for my.custom.class and I've registered it using
 a custom registrator on my context object. I've tested the custom
 serializer and the registrator locally and they both function as expected.
 This job is running Spark 0.9.1 under Mesos in fine-grained mode.

 Please help!





using kryo for spark.closure.serializer with a registrator doesn't work

2014-05-04 Thread Soren Macbeth
Is this supposed to be supported? It doesn't work, at least in Mesos fine-grained
mode. First it fails a bunch of times because it can't find my registrator class,
since my assembly jar hasn't been fetched yet, like so:

java.lang.ClassNotFoundException: pickles.kryo.PicklesRegistrator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:63)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:61)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:61)
at 
org.apache.spark.serializer.KryoSerializerInstance.init(KryoSerializer.scala:116)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:79)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:180)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

After that it does finally fetch my jar, but then it fails with the following
exception:

14/05/04 04:23:57 ERROR executor.Executor: Exception in task ID 79
java.nio.ReadOnlyBufferException
at java.nio.ByteBuffer.array(ByteBuffer.java:961)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:136)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)


Should I file a bug?


Re: cache not work as expected for iteration?

2014-05-04 Thread Andrea Esposito
Maybe your memory isn't enough to hold the current RDD together with all the
past ones?
RDDs that are cached or persisted have to be unpersisted explicitly; there is no
auto-unpersist (maybe that will change in the 1.0 version?).
Be careful: calling cache() or persist() doesn't imply the RDD will be
materialised.
I personally found this usage pattern to be the simpler one:

 val mwzNew = mwz.mapPartitions(...).cache
 mwzNew.count() or mwzNew.foreach(x => {}) // Force evaluation of the new
 RDD in order to have it materialized
 mwz.unpersist() // Drop from memory and disk the old, no longer used,
 RDD
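
In an iterative job that pattern would look roughly like the sketch below (run
in the spark-shell, so sc is available; the names and the transformation are
just placeholders):

    var current = sc.parallelize(1 to 1000).cache()
    current.count()                 // materialize the first RDD

    for (i <- 1 to 10) {
      val next = current.map(_ + 1).cache()
      next.count()                  // force evaluation so 'next' is actually cached
      current.unpersist()           // now it is safe to drop the previous iteration's RDD
      current = next
    }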





2014-05-04 5:16 GMT+02:00 Earthson earthson...@gmail.com:

 I'm using Spark for an LDA implementation. I need to cache an RDD for the next
 step of Gibbs sampling; once the new result is cached, the previous cache could
 be dropped. Something like an LRU cache should delete the previous cache because
 it is never used again, but the caching gets confused:

 Here is the code:)
 
 https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala#L99
 

 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache1.png
 

 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache2.png
 





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread Tathagata Das
Can you tell which version of Spark you are using? Spark 1.0 RC3, or
something intermediate?
And do you call sparkContext.stop at the end of your application? If so,
does this error occur before or after the stop()?

TD


On Sun, May 4, 2014 at 2:40 AM, wxhsdp wxh...@gmail.com wrote:

 Hi, all

 I use sbt to run my Spark application; after the app completes, this error
 occurs:

 14/05/04 17:32:28 INFO network.ConnectionManager: Selector thread was
 interrupted!
 14/05/04 17:32:28 ERROR spark.ContextCleaner: Error in cleaning thread
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
 at
 org.apache.spark.ContextCleaner.org
 $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:116)
 at
 org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:64)

 Has anyone seen this before?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: sbt/sbt run command returns a JVM problem

2014-05-04 Thread Carter
Hi Michael,

The log after I typed "last" is as below:
 last
scala.tools.nsc.MissingRequirementError: object scala not found.
at
scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
at
scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
at
scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
at
scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
at
scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
at
scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
at scala.tools.nsc.Global$Run.init(Global.scala:697)
at sbt.compiler.Eval$$anon$1.init(Eval.scala:53)
at sbt.compiler.Eval.run$1(Eval.scala:53)
at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
at sbt.compiler.Eval.eval(Eval.scala:62)
at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$.process(Command.scala:90)
at 
sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at 
sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.State$$anon$2.process(State.scala:171)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.MainLoop$.next(MainLoop.scala:71)
at sbt.MainLoop$.run(MainLoop.scala:64)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
at sbt.Using.apply(Using.scala:25)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
at sbt.MainLoop$.runLogged(MainLoop.scala:13)
at sbt.xMain.run(Main.scala:26)
at xsbt.boot.Launch$.run(Launch.scala:55)
at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
at xsbt.boot.Launch$.launch(Launch.scala:60)
at xsbt.boot.Launch$.apply(Launch.scala:16)
at xsbt.boot.Boot$.runImpl(Boot.scala:31)
at xsbt.boot.Boot$.main(Boot.scala:20)
at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.


And my sbt file is like below (my sbt launcher is sbt-launch-0.12.4.jar in
the same folder):
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf Our attempt to 

Re: SparkException: env SPARK_YARN_APP_JAR is not set

2014-05-04 Thread phoenix bai
According to the code, SPARK_YARN_APP_JAR is retrieved from system
variables, and the key-value pairs you pass to JavaSparkContext are isolated
from system variables.
So you could try setting it through System.setProperty() instead.

thanks


On Wed, Apr 23, 2014 at 6:05 PM, 肥肥 19934...@qq.com wrote:

 I have a small program, which I can launch successfully via the YARN client
 in yarn-standalone mode.

 The commands look like this:
 (javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java)
 (jar cvf loadtest.jar LoadTest.class)
 SPARK_JAR=assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar
 ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar
 /opt/mytest/loadtest.jar --class LoadTest --args yarn-standalone
 --num-workers 2 --master-memory 2g --worker-memory 2g --worker-cores 1

 the program LoadTest.java:
 public class LoadTest {
     static final String USER = "root";
     public static void main(String[] args) {
         System.setProperty("user.name", USER);
         System.setProperty("HADOOP_USER_NAME", USER);
         System.setProperty("spark.executor.memory", "7g");
         JavaSparkContext sc = new JavaSparkContext(args[0], "LoadTest",
             System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(LoadTest.class));
         String file = "file:/opt/mytest/123.data";
         JavaRDD<String> data1 = sc.textFile(file, 2);
         long c1 = data1.count();
         System.out.println(1 + c1);
     }
 }

 BUT due to another program's needs, I must run it with the java command. So
 I added an "environment" parameter to JavaSparkContext(). The following is
 the ERROR I get:
 Exception in thread main org.apache.spark.SparkException: env
 SPARK_YARN_APP_JAR is not set
 at
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
 at
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:125)
 at org.apache.spark.SparkContext.init(SparkContext.scala:200)
 at org.apache.spark.SparkContext.init(SparkContext.scala:100)
 at
 org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:93)
 at LoadTest.main(LoadTest.java:37)

 the program LoadTest.java:
 public class LoadTest {

     static final String USER = "root";
     public static void main(String[] args) {
         System.setProperty("user.name", USER);
         System.setProperty("HADOOP_USER_NAME", USER);
         System.setProperty("spark.executor.memory", "7g");

         Map<String, String> env = new HashMap<String, String>();
         env.put("SPARK_YARN_APP_JAR", "file:/opt/mytest/loadtest.jar");
         env.put("SPARK_WORKER_INSTANCES", "2");
         env.put("SPARK_WORKER_CORES", "1");
         env.put("SPARK_WORKER_MEMORY", "2G");
         env.put("SPARK_MASTER_MEMORY", "2G");
         env.put("SPARK_YARN_APP_NAME", "LoadTest");
         env.put("SPARK_YARN_DIST_ARCHIVES",
             "file:/opt/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar");
         JavaSparkContext sc = new JavaSparkContext("yarn-client",
             "LoadTest", System.getenv("SPARK_HOME"),
             JavaSparkContext.jarOfClass(LoadTest.class), env);
         String file = "file:/opt/mytest/123.dna";
         JavaRDD<String> data1 = sc.textFile(file, 2); //.cache();

         long c1 = data1.count();
         System.out.println(1 + c1);
     }
 }

 the command:
 javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java
 jar cvf loadtest.jar LoadTest.class
 nohup java -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest
 > loadTest.log 2>&1 &

 What did I miss? Or did I do it the wrong way?



Re: cache not work as expected for iteration?

2014-05-04 Thread Earthson
Thanks for the help, unpersist is exactly what I want :)

I see that Spark will remove some cached data automatically when memory is full;
it would be much more helpful if the eviction rule were something like LRU.



It seems that persist and cache are somewhat lazy?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292p5308.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread wxhsdp
Hi TD,

Actually, I'm not entirely sure which Spark version I have. I checked it out from
https://github.com/apache/spark/trunk on Apr 30.

Please tell me where you get the Spark 1.0 RC3 version from.

I did not call sparkContext.stop before; now I have added it to the end of my code.

Here's the log:
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/metrics/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/static,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/executors/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/executors,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/environment/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/environment,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/rdd,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/storage,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/pool/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/pool,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/stage,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped
o.e.j.s.ServletContextHandler{/stages,null}
14/05/04 18:48:21 INFO ui.SparkUI: Stopped Spark web UI at
http://ubuntu.local:4040
14/05/04 18:48:21 INFO scheduler.DAGScheduler: Stopping DAGScheduler
14/05/04 18:48:22 INFO spark.MapOutputTrackerMasterActor:
MapOutputTrackerActor stopped!
14/05/04 18:48:23 INFO network.ConnectionManager: Selector thread was
interrupted!
14/05/04 18:48:23 INFO network.ConnectionManager: ConnectionManager stopped
14/05/04 18:48:23 INFO storage.MemoryStore: MemoryStore cleared
14/05/04 18:48:23 INFO storage.BlockManager: BlockManager stopped
14/05/04 18:48:23 INFO storage.BlockManagerMasterActor: Stopping
BlockManagerMaster
14/05/04 18:48:23 INFO storage.BlockManagerMaster: BlockManagerMaster
stopped
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remote daemon shut down; proceeding with flushing remote transports.
14/05/04 18:48:23 INFO spark.SparkContext: Successfully stopped SparkContext
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #3
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #1
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #2
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #3
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #1
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #2
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #6
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #4
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #5
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #6
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #4
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for
resources to be released #5
14/05/04 18:48:23 INFO Remoting: Remoting shut down
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remoting shut down.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5309.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi,
  I'm trying to use the Breeze linalg library for matrix operations in my Spark
code. I already added a dependency on Breeze in my build.sbt and packaged my
code successfully.

  When I run in local mode ("sbt run local ..."), everything is OK,

  but when I turn to standalone mode ("sbt run spark://127.0.0.1:7077 ..."), an
error occurs:

14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to
java.lang.NoSuchMethodError
java.lang.NoSuchMethodError:
breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

  In my opinion, everything needed is packaged into the jar file, isn't it?
  And has anyone used Breeze before? Is it good for matrix operations?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


unsubscribe

2014-05-04 Thread Nabeel Memon
unsubscribe


Re: Lease Exception hadoop 2.4

2014-05-04 Thread Andre Kuhnen
Thanks Mayur, the only thing my code is doing is:

read from S3, and saveAsTextFile on HDFS. Like I said, everything is
written correctly, but at the end of the job there is this warning.
I will try to compile with Hadoop 2.4.
thanks
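
(For context, the job is essentially just the following sketch, run from the
spark-shell; the bucket and output path below are made up:)

    val logs = sc.textFile("s3n://some-bucket/input/")   // hypothetical S3 bucket
    logs.saveAsTextFile("hdfs:///user/andre/output")     // hypothetical HDFS path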




2014-05-04 11:17 GMT-03:00 Mayur Rustagi mayur.rust...@gmail.com:

 You should compile Spark against every Hadoop version you use. I am surprised
 it's working otherwise, as HDFS breaks compatibility quite often.
 As for this error: it comes up when your code writes to/reads from a file that
 has already been deleted. Are you trying to update a single file in multiple
 mappers/reduce partitioners?


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sun, May 4, 2014 at 5:30 PM, Andre Kuhnen andrekuh...@gmail.comwrote:

 Please, can anyone give a feedback?  thanks

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks


 2014-05-03 13:09 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks






Re: cache not work as expected for iteration?

2014-05-04 Thread Nicholas Chammas
Yes, persist/cache will cache an RDD only when an action is applied to it.
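
A quick illustration, e.g. in the spark-shell:

    // cache() only marks the RDD for caching; nothing is materialized yet.
    val data = sc.parallelize(1 to 1000000).map(_ * 2).cache()

    // The first action both computes the RDD and populates the cache.
    data.count()

    // Subsequent actions read from the cache instead of recomputing.
    data.reduce(_ + _)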


On Sun, May 4, 2014 at 6:32 AM, Earthson earthson...@gmail.com wrote:

 thx for the help, unpersist is excatly what I want:)

 I see that spark will remove some cache automatically when memory is full,
 it is much more helpful if the rule satisfy something like LRU

 

 It seems that persist and cache is some kind of lazy?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292p5308.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Reading multiple S3 objects, transforming, writing back one

2014-05-04 Thread Nicholas Chammas
Chris,

To use s3distcp in this case, are you suggesting saving the RDD to
local/ephemeral HDFS and then copying it up to S3 using this tool?


On Sat, May 3, 2014 at 7:14 PM, Chris Fregly ch...@fregly.com wrote:

 not sure if this directly addresses your issue, peter, but it's worth
 mentioning a handy AWS EMR utility called s3distcp that can upload a single
 HDFS file - in parallel - to a single, concatenated S3 file once all the
 partitions are uploaded.  kinda cool.

 here's some info:


 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html


 s3distcp is an extension of the familiar hadoop distcp, of course.


 On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 The fastest way to save to S3 should be to leave the RDD with many
 partitions, because all partitions will be written out in parallel.

 Then, once the various parts are in S3, somehow concatenate the files
 together into one file.

 If this can be done within S3 (I don't know if this is possible), then
 you get the best of both worlds: a highly parallelized write to S3, and a
 single cleanly named output file.


 On Thu, May 1, 2014 at 12:52 PM, Peter thenephili...@yahoo.com wrote:

 Thank you Patrick.

 I took a quick stab at it:

 val s3Client = new AmazonS3Client(...)
 val copyObjectResult = s3Client.copyObject(upload, outputPrefix +
 "/part-0", "rolled-up-logs", "2014-04-28.csv")
 val objectListing = s3Client.listObjects(upload, outputPrefix)
 s3Client.deleteObjects(new
 DeleteObjectsRequest(upload).withKeys(objectListing.getObjectSummaries.asScala.map(s
 => new KeyVersion(s.getKey)).asJava))

  Using a 3GB object I achieved about 33MB/s between buckets in the same
 AZ.

 This is a workable solution for the short term but not ideal for the
 longer term as data size increases. I understand it's a limitation of the
 Hadoop API but ultimately it must be possible to dump a RDD to a single S3
 object :)

   On Wednesday, April 30, 2014 7:01 PM, Patrick Wendell 
 pwend...@gmail.com wrote:
  This is a consequence of the way the Hadoop files API works. However,
 you can (fairly easily) add code to just rename the file because it
 will always produce the same filename.

 (heavy use of pseudo code)

 dir = /some/dir
 rdd.coalesce(1).saveAsTextFile(dir)
 f = new File(dir + part-0)
 f.moveTo(somewhere else)
 dir.remove()

 It might be cool to add a utility called `saveAsSingleFile` or
 something that does this for you. In fact probably we should have
 called saveAsTextfile saveAsTextFiles to make it more clear...
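
  (A runnable sketch of that pseudo code using the Hadoop FileSystem API; the
  paths and the output file name below are hypothetical:)

      import org.apache.spark.SparkContext
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object SaveAsSingleFile {
        def main(args: Array[String]) {
          val sc = new SparkContext("local", "SaveAsSingleFile")
          val rdd = sc.parallelize(1 to 1000).map(_.toString)

          // Write to a temporary directory as a single partition.
          val dir = "hdfs:///tmp/rolled-up-logs"
          rdd.coalesce(1).saveAsTextFile(dir)

          // Hadoop names the lone part file "part-00000"; rename it, then drop the directory.
          val fs = FileSystem.get(new java.net.URI(dir), new Configuration())
          fs.rename(new Path(dir + "/part-00000"), new Path("hdfs:///tmp/2014-04-28.csv"))
          fs.delete(new Path(dir), true)
          sc.stop()
        }
      }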

 On Wed, Apr 30, 2014 at 2:00 PM, Peter thenephili...@yahoo.com wrote:
  Thanks Nicholas, this is a bit of a shame, not very practical for log
 roll
  up for example when every output needs to be in it's own directory.
  On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
  Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
  coalesce(1), you move everything in the RDD to a single partition,
 which
  then gives you 1 output file.
  It will still be called part-0 or something like that because
 that's
  defined by the Hadoop API that Spark uses for reading to/writing from
 S3. I
  don't know of a way to change that.
 
 
  On Wed, Apr 30, 2014 at 2:47 PM, Peter thenephili...@yahoo.com
 wrote:
 
  Ah, looks like RDD.coalesce(1) solves one part of the problem.
  On Wednesday, April 30, 2014 11:15 AM, Peter thenephili...@yahoo.com
  wrote:
  Hi
 
  Playing around with Spark  S3, I'm opening multiple objects (CSV
 files)
  with:
 
 val hfile = sc.textFile("s3n://bucket/2014-04-28/")
 
  so hfile is a RDD representing 10 objects that were underneath
 2014-04-28.
  After I've sorted and otherwise transformed the content, I'm trying to
 write
  it back to a single object:
 
 
 
 sortedMap.values.map(_.mkString(",")).saveAsTextFile("s3n://bucket/concatted.csv")
 
  unfortunately this results in a folder named concatted.csv with 10
 objects
  underneath, part-0 .. part-00010, corresponding to the 10 original
  objects loaded.
 
  How can I achieve the desired behaviour of putting a single object
 named
  concatted.csv ?
 
  I've tried 0.9.1 and 1.0.0-RC3.
 
  Thanks!
  Peter
 
 
 
 
 
 
 







Re: Reading multiple S3 objects, transforming, writing back one

2014-05-04 Thread Peter
Thank you Chris, I am familiar with s3distcp; I'm trying to replicate some of
that functionality and combine it with my log post-processing in one step
instead of yet another step.
On Saturday, May 3, 2014 4:15 PM, Chris Fregly ch...@fregly.com wrote:
 
not sure if this directly addresses your issue, peter, but it's worth mentioning
a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in
parallel - to a single, concatenated S3 file once all the partitions are
uploaded.  kinda cool.

here's some info:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
 

s3distcp is an extension of the familiar hadoop distcp, of course.



On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

The fastest way to save to S3 should be to leave the RDD with many partitions, 
because all partitions will be written out in parallel. 


Then, once the various parts are in S3, somehow concatenate the files together 
into one file. 


If this can be done within S3 (I don't know if this is possible), then you get 
the best of both worlds: a highly parallelized write to S3, and a single 
cleanly named output file.



On Thu, May 1, 2014 at 12:52 PM, Peter thenephili...@yahoo.com wrote:

Thank you Patrick. 


I took a quick stab at it:


    val s3Client = new AmazonS3Client(...)
    val copyObjectResult = s3Client.copyObject(upload, outputPrefix +
"/part-0", "rolled-up-logs", "2014-04-28.csv")
    val objectListing = s3Client.listObjects(upload, outputPrefix)
    s3Client.deleteObjects(new
DeleteObjectsRequest(upload).withKeys(objectListing.getObjectSummaries.asScala.map(s
 => new KeyVersion(s.getKey)).asJava))


Using a 3GB object I achieved about 33MB/s between buckets in the same AZ. 


This is a workable solution for the short term but not ideal for the longer 
term as data size increases. I understand it's a limitation of the Hadoop API 
but ultimately it must be possible to dump a RDD to a single S3 object :) 


On Wednesday, April 30, 2014 7:01 PM, Patrick Wendell pwend...@gmail.com 
wrote:
 
This is a consequence of the way the Hadoop files API works. However,
you can (fairly easily) add code to just rename the file because it
will always produce the same filename.

(heavy use of pseudo code)

dir = /some/dir
rdd.coalesce(1).saveAsTextFile(dir)
f = new File(dir + part-0)
f.moveTo(somewhere else)
dir.remove()

It might be cool to add a utility called `saveAsSingleFile` or
something that does this for you. In fact probably we should have
called saveAsTextfile saveAsTextFiles to make it more clear...


On Wed, Apr 30, 2014 at 2:00 PM, Peter thenephili...@yahoo.com wrote:
 Thanks Nicholas, this is a bit of a shame, not very practical for log
 roll
 up for example when every output needs to be in it's own directory.
 On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
 coalesce(1), you move everything in the RDD to a single partition, which
 then gives you 1 output file.
 It will still be called part-0 or something like that because that's
 defined by the Hadoop API that Spark uses for reading to/writing from S3. I
 don't know of a way to change that.


 On Wed, Apr 30, 2014 at 2:47 PM, Peter thenephili...@yahoo.com wrote:

 Ah, looks like RDD.coalesce(1) solves one part of the problem.
 On Wednesday, April 30, 2014 11:15 AM, Peter thenephili...@yahoo.com
 wrote:
 Hi

 Playing around with Spark  S3, I'm opening multiple objects
 (CSV files)
 with:

     val hfile = sc.textFile("s3n://bucket/2014-04-28/")

 so hfile is a RDD representing 10 objects that were underneath 2014-04-28.
 After I've sorted and otherwise transformed the content, I'm trying to write
 it back to a single object:


 sortedMap.values.map(_.mkString(",")).saveAsTextFile("s3n://bucket/concatted.csv")

 unfortunately this results in a folder named concatted.csv with 10 objects
 underneath, part-0 ..
 part-00010, corresponding to the 10 original
 objects loaded.

 How can I achieve the desired behaviour of putting a single object named
 concatted.csv ?

 I've tried 0.9.1 and 1.0.0-RC3.

 Thanks!
 Peter












Re: Reading multiple S3 objects, transforming, writing back one

2014-05-04 Thread Peter
Hi Patrick

I should probably explain my use case in a bit more detail. I have hundreds of
thousands to millions of clients uploading events to my pipeline. These are
batched periodically (every 60 seconds at the moment) into logs which are dumped
into S3 (and uploaded into a data warehouse). I need to post-process these at
another set interval (say hourly), mainly dedup & global sort, and then roll them
up into a single file. I will repeat this process daily (I need to experiment with
the granularity here, but daily feels appropriate) to yield a single file for
the day's data.
I'm hoping to use Spark for these roll ups; it seems so much easier than using
Hadoop.
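
(A minimal sketch of what I have in mind for the hourly roll up, with made-up
bucket names and paths, and assuming the dedup key is the whole line:)

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object HourlyRollUp {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "HourlyRollUp")
        val events = sc.textFile("s3n://bucket/2014-04-28-13/")  // many small 60-second logs

        events
          .distinct()                    // dedup
          .map(line => (line, 1))
          .sortByKey()                   // global sort by line
          .map(_._1)
          .coalesce(1)                   // single partition -> single part file
          .saveAsTextFile("s3n://bucket/rolled-up/2014-04-28-13")

        sc.stop()
      }
    }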

The roll up is to avoid the Hadoop small files problem; here's one article
discussing it: "The Small Files Problem" on blog.cloudera.com. As it puts it,
small files are a big problem in Hadoop, or at least they are if the number of
questions on the user list on this topic is anything to go by.

Inspecting my dataset should be much more efficient and manageable with larger 
100s or maybe low 1000s of partitions rather than 1,000,000s. 

Hope that makes some sense :)

Thanks
Peter
On Saturday, May 3, 2014 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
 
Hi Peter,

If your dataset is large (3GB) then why not just chunk it into
multiple files? You'll need to do this anyways if you want to
read/write this from S3 quickly, because S3's throughput per file is
limited (as you've seen).

This is exactly why the Hadoop API lets you save datasets with many
partitions, since often there are bottlenecks at the granularity of a
file.

Is there a reason you need this to be exactly one file?

- Patrick


On Sat, May 3, 2014 at 4:14 PM, Chris Fregly ch...@fregly.com wrote:
 not sure if this directly addresses your issue, peter, but it's worth
 mentioning a handy AWS EMR utility called s3distcp that can upload a single
 HDFS file - in parallel - to a single, concatenated S3 file once all the
 partitions are uploaded.  kinda cool.

 here's some info:

 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

 s3distcp is an extension of the familiar hadoop distcp, of course.


 On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 The fastest way to save to S3 should be to leave the RDD with many
 partitions, because all partitions will be written out in parallel.

 Then, once the various parts are in S3, somehow concatenate the files
 together into one file.

 If this can be done within S3 (I don't know if this is possible), then you
 get the best of both worlds: a highly parallelized write to S3, and a single
 cleanly named output file.


 On Thu, May 1, 2014 at 12:52 PM, Peter thenephili...@yahoo.com wrote:

 Thank you Patrick.

 I took a quick stab at it:

     val s3Client = new AmazonS3Client(...)
     val copyObjectResult = s3Client.copyObject(upload, outputPrefix +
 "/part-0", "rolled-up-logs", "2014-04-28.csv")
     val objectListing = s3Client.listObjects(upload, outputPrefix)
     s3Client.deleteObjects(new
 DeleteObjectsRequest(upload).withKeys(objectListing.getObjectSummaries.asScala.map(s
 => new KeyVersion(s.getKey)).asJava))

 Using a 3GB object I achieved about 33MB/s between buckets in the same
 AZ.

 This is a workable solution for the short term but not ideal for the
 longer term as data size increases. I understand it's a limitation of the
 Hadoop API but ultimately it must be possible to dump a RDD to a single S3
 object :)

 On Wednesday, April 30, 2014 7:01 PM, Patrick Wendell
 pwend...@gmail.com wrote:
 This is a consequence of the way the Hadoop files API works. However,
 you can (fairly easily) add code to just rename the file because it
 will always produce the same filename.

 (heavy use of pseudo code)

 dir = /some/dir
 rdd.coalesce(1).saveAsTextFile(dir)
 f = new File(dir + part-0)
 f.moveTo(somewhere else)
 dir.remove()

 It might be cool to add a utility called `saveAsSingleFile` or
 something that does this for you. In fact probably we should have
 called saveAsTextfile saveAsTextFiles to make it more clear...

 On Wed, Apr 30, 2014 at 2:00 PM, Peter thenephili...@yahoo.com wrote:
  Thanks Nicholas, this is a bit of a shame, not very practical for log
  roll
  up for example when every output needs to be in it's own directory.
  On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
  Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
  coalesce(1), you move everything in the RDD to a single partition,
  which
  then gives you 1 output file.
  It will still be called part-0 or something like that because
  that's
  defined by the Hadoop API that Spark uses for reading to/writing from
  S3. I
  don't know of a way to change that.
 
 
  On Wed, Apr 30, 2014 at 2:47 PM, Peter thenephili...@yahoo.com wrote:
 
  Ah, looks like RDD.coalesce(1) solves one part of the 

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread DB Tsai
If you only add the Breeze dependency in your build.sbt, it will not be
available to all the workers.

There are a couple of options: 1) use sbt assembly to package Breeze into your
application jar; 2) manually copy the Breeze jar onto all the nodes and put it
on the classpath; 3) Spark 1.0 has the Breeze jar in the Spark fat
assembly jar, so you don't need to add the Breeze dependency yourself.
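
For option 1, a rough sketch (the plugin and library versions below are just
examples to adjust for your own build):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt
    import AssemblyKeys._

    assemblySettings

    // Breeze gets bundled into the application jar.
    libraryDependencies += "org.scalanlp" %% "breeze" % "0.7"

    // Spark itself is marked "provided" so it is not bundled.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"

Then running "sbt assembly" produces a single jar you can ship with your job.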


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh...@gmail.com wrote:

 Hi,
   i'am trying to use breeze linalg library for matrix operation in my spark
 code. i already add dependency
   on breeze in my build.sbt, and package my code sucessfully.

   when i run on local mode, sbt run local..., everything is ok

   but when turn to standalone mode, sbt run spark://127.0.0.1:7077...,
 error occurs

 14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.NoSuchMethodError
 java.lang.NoSuchMethodError:

 breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

   in my opinion, everything needed is packaged to the jar file, isn't it?
   and does anyone used breeze before? is it good for matrix operation?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread Yadid Ayzenberg
An additional option 4) Use SparkContext.addJar() and have the 
application ship your jar to all the nodes.
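
e.g., right after creating the context (the jar path below is hypothetical):

    // Ship the Breeze jar to every executor at runtime.
    sc.addJar("/path/to/breeze_2.10-0.7.jar")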



Yadid

On 5/4/14, 4:07 PM, DB Tsai wrote:
If you add the breeze dependency in your build.sbt project, it will 
not be available to all the workers.


There are couple options, 1) use sbt assembly to package breeze into 
your application jar. 2) manually copy breeze jar into all the nodes, 
and have them in the classpath. 3) spark 1.0 has breeze jar in the 
spark flat assembly jar, so you don't need to add breeze dependency 
yourself.



Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh...@gmail.com 
mailto:wxh...@gmail.com wrote:


Hi,
  i'am trying to use breeze linalg library for matrix operation in
my spark
code. i already add dependency
  on breeze in my build.sbt, and package my code sucessfully.

  when i run on local mode, sbt run local..., everything is ok

  but when turn to standalone mode, sbt run
spark://127.0.0.1:7077...,
error occurs

14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to
java.lang.NoSuchMethodError
java.lang.NoSuchMethodError:

breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

  in my opinion, everything needed is packaged to the jar file,
isn't it?
  and does anyone used breeze before? is it good for matrix operation?



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.






spark ec2 error

2014-05-04 Thread Jeremy Freeman
Hi all,

A heads up in case others hit this and are confused… This nice addition
(https://github.com/apache/spark/pull/612) causes an error if running the
spark-ec2.py deploy script from a version other than master (e.g. 0.8.0).

The error occurs during launch, here:

...
Creating local config files...
configuring /etc/ganglia/gmond.conf
Traceback (most recent call last):
  File "./deploy_templates.py", line 89, in <module>
    text = text.replace("{{" + key + "}}", template_vars[key])
TypeError: expected a character buffer object
Deploying Spark config files...
chmod: cannot access `/root/spark/conf/spark-env.sh': No such file or
directory
...

And then several more errors because of missing variables (though the
cluster itself launches, there are several configuration problems, e.g. with
HDFS). 

deploy_templates fails because the new SPARK_MASTER_OPTS and
SPARK_WORKER_INSTANCES don't exist, and earlier versions of spark-ec2.py
still use deploy_templates from https://github.com/mesos/spark-ec2.git -b
v2, which has the new variables.

Using the updated spark-ec2.py from master works fine.

-- Jeremy

-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Initial job has not accepted any resources

2014-05-04 Thread pedro
I have been working on a Spark program and completed it, but I have spent the
past few hours trying to run it on EC2 without any luck. I am hoping I can
comprehensively describe my problem and what I have done, but I am pretty
stuck.

My code uses the following lines to configure the SparkContext, which are
taken from the standalone app example found here:
https://spark.apache.org/docs/0.9.0/quick-start.html
And combined with ampcamps code found here:
http://spark-summit.org/2013/exercises/machine-learning-with-spark.html
To give the following code:
http://pastebin.com/zDYkk1T8

Launch spark cluster with:
./spark-ec2 -k plda -i ~/plda.pem -s 1 --instance-type=t1.micro
--region=us-west-2 start lda
When logged in, launch my job with sbt via this, while in my projects
directory
$/root/bin/sbt run

This results in the following log, indicating the problem in my subject
line:
http://pastebin.com/DiQCj6jQ

Following this, I got advice to set my conf/spark-env.sh so it exports
MASTER and SPARK_MASTER_IP:
There's an inconsistency in the way the master addresses itself. The Spark
master uses the internal (ip-*.internal) address, but the driver is trying
to connect using the external (ec2-*.compute-1.amazonaws.com) address. The
solution is to set the Spark master URL to the external address in the
spark-env.sh file.

Your conf/spark-env.sh is probably empty. It should set MASTER and
SPARK_MASTER_IP to the external URL, as the EC2 launch script does:
https://github.com/.../templ.../root/spark/conf/spark-env.sh;

My spark-env.sh looks like this:
#!/usr/bin/env bash

export MASTER=`cat /root/spark-ec2/cluster-url`
export SPARK_MASTER_IP=ec2-54-186-178-145.us-west-2.compute.amazonaws.com
export SPARK_WORKER_MEM=128m

At this point, I did a test
1. Remove my spark-env.sh variables
2. Run spark-shell
3. Run: sc.parallelize(1 to 1000).count()
4. This works as expected
5. Reset my spark-env.sh variables
6. Run prior spark-shell and commands
7. I get the same error as reported above.

Hence, something is wrong with how I am setting my master/slave
configuration. Any help would be greatly appreciated.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


spark streaming question

2014-05-04 Thread Weide Zhang
Hi ,

It might be a very general question to ask here, but I'm curious to know why
Spark Streaming can achieve better throughput than Storm, as claimed in the
Spark Streaming paper. Does it depend on certain use cases and/or data
sources? What drives the better performance in the Spark Streaming case, or, put
another way, what makes Storm not as performant as Spark Streaming?

Also, in order to guarantee exactly-once semantics when node failures happen,
Spark makes replicas of RDDs and checkpoints so that data can be recomputed
on the fly, while in the Trident case a transactional object is used to persist
the state and result. It's not obvious to me which approach is more costly, and
why. Can anyone share some experience here?

Thanks a lot,

Weide


Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Hey Jeremy,

This is actually a big problem - thanks for reporting it, I'm going to
revert this change until we can make sure it is backwards compatible.

- Patrick

On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
 Hi all,

 A heads up in case others hit this and are confused...  This nice  addition
 https://github.com/apache/spark/pull/612   causes an error if running the
 spark-ec2.py deploy script from a version other than master (e.g. 0.8.0).

 The error occurs during launch, here:

 ...
 Creating local config files...
 configuring /etc/ganglia/gmond.conf
 Traceback (most recent call last):
   File ./deploy_templates.py, line 89, in module
 text = text.replace({{ + key + }}, template_vars[key])
 TypeError: expected a character buffer object
 Deploying Spark config files...
 chmod: cannot access `/root/spark/conf/spark-env.sh': No such file or
 directory
 ...

 And then several more errors because of missing variables (though the
 cluster itself launches, there are several configuration problems, e.g. with
 HDFS).

 deploy_templates fails because the new SPARK_MASTER_OPTS and
 SPARK_WORKER_INSTANCES don't exist, and earlier versions of spark-ec2.py
 still use deploy_templates from https://github.com/mesos/spark-ec2.git -b
 v2, which has the new variables.

 Using the updated spark-ec2.py from master works fine.

 -- Jeremy

 -
 Jeremy Freeman, PhD
 Neuroscientist
 @thefreemanlab



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Okay I just went ahead and fixed this to make it backwards-compatible
(was a simple fix). I launched a cluster successfully with Spark
0.8.1.

Jeremy - if you could try again and let me know if there are any
issues, that would be great. Thanks again for reporting this.

On Sun, May 4, 2014 at 3:41 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Jeremy,

 This is actually a big problem - thanks for reporting it, I'm going to
 revert this change until we can make sure it is backwards compatible.

 - Patrick

 On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com 
 wrote:
 Hi all,

 A heads up in case others hit this and are confused...  This nice  addition
 https://github.com/apache/spark/pull/612   causes an error if running the
 spark-ec2.py deploy script from a version other than master (e.g. 0.8.0).

 The error occurs during launch, here:

 ...
 Creating local config files...
 configuring /etc/ganglia/gmond.conf
 Traceback (most recent call last):
   File ./deploy_templates.py, line 89, in module
 text = text.replace({{ + key + }}, template_vars[key])
 TypeError: expected a character buffer object
 Deploying Spark config files...
 chmod: cannot access `/root/spark/conf/spark-env.sh': No such file or
 directory
 ...

 And then several more errors because of missing variables (though the
 cluster itself launches, there are several configuration problems, e.g. with
 HDFS).

 deploy_templates fails because the new SPARK_MASTER_OPTS and
 SPARK_WORKER_INSTANCES don't exist, and earlier versions of spark-ec2.py
 still use deploy_templates from https://github.com/mesos/spark-ec2.git -b
 v2, which has the new variables.

 Using the updated spark-ec2.py from master works fine.

 -- Jeremy

 -
 Jeremy Freeman, PhD
 Neuroscientist
 @thefreemanlab



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark streaming question

2014-05-04 Thread Chris Fregly
great questions, weide.  in addition, i'd also like to hear more about how
to horizontally scale a spark-streaming cluster.

i've gone through the samples (standalone mode) and read the documentation,
but it's still not clear to me how to scale this puppy out under high load.
 i assume i add more receivers (kinesis, flume, etc), but physically how
does this work?
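
for example, the pattern i've seen suggested is to create several input streams
and union them, roughly like the sketch below (the kafka host, group, and topic
are made up) - but i'm not sure that's the whole story:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext("local[4]", "scaling-sketch", Seconds(2))

    // One receiver per input stream; more streams -> more parallel ingestion
    // (assuming there are enough cores/executors to host them).
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
    }
    val unified = ssc.union(kafkaStreams)

    unified.count().print()
    ssc.start()
    ssc.awaitTermination()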

@TD:  can you comment?

thanks!

-chris


On Sun, May 4, 2014 at 2:10 PM, Weide Zhang weo...@gmail.com wrote:

 Hi ,

 It might be a very general question to ask here but I'm curious to know
 why spark streaming can achieve better throughput than storm as claimed in
 the spark streaming paper. Does it depend on certain use cases and/or data
 source ? What drives better performance in spark streaming case or in other
 ways, what makes storm not as performant as spark streaming ?

 Also, in order to guarantee exact-once semantics when node failure
 happens,  spark makes replicas of RDDs and checkpoints so that data can be
 recomputed on the fly while on Trident case, they use transactional object
 to persist the state and result but it's not obvious to me which approach
 is more costly and why ? Any one can provide some experience here ?

 Thanks a lot,

 Weide



Re: spark ec2 error

2014-05-04 Thread Jeremy Freeman
Cool, glad to help! I just tested with 0.8.1 and 0.9.0 and both worked
perfectly, so seems to all be good.

-- Jeremy



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323p5329.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Lease Exception hadoop 2.4

2014-05-04 Thread Andre Kuhnen
I compiled spark with SPARK_HADOOP_VERSION=2.4.0 sbt/sbt assembly, fixed
the s3 dependencies,  but I am still getting the same error...
14/05/05 00:32:33 WARN TaskSetManager: Loss was due to
org.apache.hadoop.ipc.RemoteException
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on

Any ideas?

thanks



2014-05-04 11:53 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:

 Thanks Mayur, the only thing that my code is doing is:

 read from S3, and saveAsTextFile on HDFS. Like I said, everything is
 written correctly, but at the end of the job there is this warning.
 I will try to compile with hadoop 2.4.
 thanks




 2014-05-04 11:17 GMT-03:00 Mayur Rustagi mayur.rust...@gmail.com:

 You should compile Spark with every hadoop version you use. I am surprised
 it's working otherwise, as HDFS breaks compatibility quite often.
 As for this error, it comes when your code writes/reads from a file that has
 already been deleted. Are you trying to update a single file in multiple
 mappers/reduce partitioners?


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sun, May 4, 2014 at 5:30 PM, Andre Kuhnen andrekuh...@gmail.comwrote:

 Please, can anyone give a feedback?  thanks

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks


 2014-05-03 13:09 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks







unsubscribe

2014-05-04 Thread ZHANG Jun


 ---- Original Message ----
Subject: unsubscribe
From: Nabeel Memon nm3...@gmail.com
To: user@spark.apache.org
Cc:

unsubscribe

Error starting EC2 cluster

2014-05-04 Thread Aliaksei Litouka
I am using Spark 0.9.1. When I try to start an EC2 cluster with the
spark-ec2 script, an error occurs and the following message is issued:
AttributeError: 'module' object has no attribute 'check_output'. By this
time, EC2 instances are up and running but Spark doesn't seem to be
installed on them. Any ideas how to fix it?

$ ./spark-ec2 -k my_key -i /home/alitouka/my_key.pem -s 1
--region=us-east-1 --instance-type=m3.medium launch test_cluster
Setting up security groups...
Searching for existing cluster test_cluster...
Don't recognize m3.medium, assuming type is pvm
Spark AMI: ami-5bb18832
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-
Launched master in us-east-1c, regid = r-
Waiting for instances to start up...
Waiting 120 more seconds...
Generating cluster's SSH key on master...
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa &&
\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa &&
\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22:
Connection refused
Error executing remote command, retrying after 30 seconds: Command '['ssh',
'-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t',
'-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', "\n  [ -f
~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa &&
\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
returned non-zero exit status 255
Warning: Permanently added
'ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com,54.227.205.82'
(RSA) to the list of known hosts.
Connection to ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com closed.
Traceback (most recent call last):
  File "./spark_ec2.py", line 806, in <module>
    main()
  File "./spark_ec2.py", line 799, in main
    real_main()
  File "./spark_ec2.py", line 684, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "./spark_ec2.py", line 419, in setup_cluster
    dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "./spark_ec2.py", line 624, in ssh_read
    return subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'


RE: different in spark on yarn mode and standalone mode

2014-05-04 Thread Liu, Raymond
At the core, they are not that different.
In standalone mode, you have the Spark master and Spark workers, which allocate the driver 
and executors for your Spark app.
In YARN mode, the YARN resource manager and node managers do this work.
Once the driver and executors have been launched, the rest of the resource 
scheduling goes through the same process, i.e. between the driver and executors through 
Akka actors.
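
For illustration, a minimal sketch of how this looks from the application side; the host name and the "yarn-client" master string are assumptions for the Spark of this era, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: a single application creates exactly one SparkContext;
// the visible difference between the two modes is mostly the master URL,
// while the resource allocation behind it is done by the Spark master/workers
// (standalone) or by the YARN ResourceManager/NodeManagers.
def makeContext(onYarn: Boolean): SparkContext = {
  val master =
    if (onYarn) "yarn-client"            // YARN RM / NodeManagers launch the executors
    else "spark://master-host:7077"      // the standalone master / workers launch them
  new SparkContext(new SparkConf().setAppName("demo").setMaster(master))
}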

Best Regards,
Raymond Liu


-Original Message-
From: Sophia [mailto:sln-1...@163.com] 

Hey you guys,
What is the different in spark on yarn mode and standalone mode about resource 
schedule?
Wish you happy everyday.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Initial job has not accepted any resources

2014-05-04 Thread Jeremy Freeman
Hey Pedro,

From which version of Spark were you running the spark-ec2.py script? You
might have run into the problem described here
(http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html),
which Patrick just fixed up to ensure backwards compatibility.

With the bug, it would successfully complete deployment but prevent the
correct setting of various variables, so may have caused the errors you were
seeing, though I'm not positive.

I'd definitely try re-running the spark-ec2 script now.

-- Jeremy



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5335.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Initial job has not accepted any resources

2014-05-04 Thread pedro
Hi Jeremy,

I am running from the most recent release, 0.9. I just fixed the problem, and 
it was indeed the setting of variables during deployment.

Once I had the cluster I wanted running, I began to suspect that master was not 
responding. So I killed a worker, then recreated it, and found it could not 
connect to master. So I killed master and remade it, then remade the worker. 
Odd things happened, but it seemed like I was on the right track.

So I stopped all, then restarted all, then tried again and it began to work. So 
now, after probably around 6 hours of debugging, I have EC2 working (yay).

Next thing that we are trying to debug is that when we build our project on 
EC2, it can’t find breeze, and when we try to include it as a spark example 
(under the correct directory), it also can’t find it. The first occurs as a 
runtime error and is a NoClassDefFoundError for breeze/linalg/DenseVector which 
makes me think that breeze isn’t on the cluster and also isn’t being built with 
our project (and sbt assembly is acting strange). The second is a compile time 
error, which is odd because other examples in the same directory use breeze and 
I don’t find anything that specifies their dependencies.

Thanks
-- 
Pedro Rodriguez
UCBerkeley 2014 | Computer Science
BSU Cryosphere Science Research
SnowGeek Founder

snowgeek.org
pedro-rodriguez.com
ski.rodrig...@gmail.com
208-340-1703

On May 4, 2014 at 6:51:56 PM, Jeremy Freeman [via Apache Spark User List] 
(ml-node+s1001560n5335...@n3.nabble.com) wrote:

Hey Pedro,

From which version of Spark were you running the spark-ec2.py script? You 
might have run into the problem described here 
(http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html),
 which Patrick just fixed up to ensure backwards compatibility.

With the bug, it would successfully complete deployment but prevent the correct 
setting of various variables, so may have caused the errors you were seeing, 
though I'm not positive.

I'd definitely try re-running the spark-ec2 script now.

-- Jeremy




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5336.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Lease Exception hadoop 2.4

2014-05-04 Thread Andre Kuhnen
I think I forgot to rsync the slaves with the new compiled jar. I will
give it a try as soon as possible.
On 04/05/2014 21:35, Andre Kuhnen andrekuh...@gmail.com wrote:

 I compiled spark with SPARK_HADOOP_VERSION=2.4.0 sbt/sbt assembly, fixed
 the s3 dependencies,  but I am still getting the same error...
 14/05/05 00:32:33 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on

 Any ideas?

 thanks



 2014-05-04 11:53 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:

 Thanks Mayur, the only thing that my code is doing is:

 read from S3, and saveAsTextFile on HDFS. Like I said, everything is
 written correctly, but at the end of the job there is this warning.
 I will try to compile with hadoop 2.4.
 thanks




 2014-05-04 11:17 GMT-03:00 Mayur Rustagi mayur.rust...@gmail.com:

 You should compile Spark with every hadoop version you use. I am
 surprised it's working otherwise, as HDFS breaks compatibility quite often.
 As for this error, it comes when your code writes/reads from a file that
 has already been deleted. Are you trying to update a single file in multiple
 mappers/reduce partitioners?


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sun, May 4, 2014 at 5:30 PM, Andre Kuhnen andrekuh...@gmail.comwrote:

 Please, can anyone give a feedback?  thanks

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks


 2014-05-03 13:09 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:

 Hello, I am getting this warning after upgrading Hadoop 2.4, when I try
 to write something to the HDFS.   The content is written correctly, but I
 do not like this warning.

 DO I have to compile SPARK with hadoop 2.4?

 WARN TaskSetManager: Loss was due to
 org.apache.hadoop.ipc.RemoteException

 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

 thanks








Re: sbt/sbt run command returns a JVM problem

2014-05-04 Thread phoenix bai
The total memory of your machine is 2G, right?
Then how much memory is left free? Wouldn't Ubuntu take up quite a big
portion of the 2G?

Just a guess!


On Sat, May 3, 2014 at 8:15 PM, Carter gyz...@hotmail.com wrote:

 Hi, thanks for all your help.
 I tried your setting in the sbt file, but the problem is still there.

 The Java setting in my sbt file is:
 java \
   -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \
   -jar ${JAR} \
   "$@"

 I have tried to set these 3 parameters bigger and smaller, but nothing
 works. Did I change the right thing?

 Thank you very much.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5267.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-04 Thread Cheney Sun
Hi Nan, 

Have you found a way to fix the issue? Now I run into the same problem with
version 0.9.1.

Thanks,
Cheney



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5341.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: ClassNotFoundException

2014-05-04 Thread pedro
I just ran into the same problem. I will respond if I find out how to fix it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-tp5182p5342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Initial job has not accepted any resources

2014-05-04 Thread pedro
Since it appears breeze is going to be included by default in Spark in 1.0,
and I ran into the issue here:
http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-td5182.html

And since it seems like the issues I had were recently introduced, I am cloning
Spark and checking out the 1.0 branch. Maybe this makes my problem(s) worse,
but I am going to give it a try. I am rapidly running out of time to get our code
fully working on EC2.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark 0.9.1: ClassNotFoundException

2014-05-04 Thread phoenix bai
check if the jar file that includes your example code is under
examples/target/scala-2.10/.



On Sat, May 3, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote:

 I am using Spark 0.9.1 in standalone mode. In the
 SPARK_HOME/examples/src/main/scala/org/apache/spark/ folder, I created my
 directory called mycode in which I have placed some standalone scala
 code.
 I was able to compile. I ran the code using:

 ./bin/run-example org.apache.spark.mycode.MyClass local

 However, I get a ClassNotFound exception, although I do see the compiled
 classes in
 examples/target/scala-2.10/classes/org/apache/spark/mycode

 When I place the same code in the same folder structure in the spark 0.9.0
 version, I am able to run it. Where should I place my standalone code with
  respect to SPARK_HOME in Spark 0.9.1 so that the classes can be found?

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-0-9-1-ClassNotFoundException-tp5256.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-04 Thread Andrew Lee
Hi Jacob,
Taking both concerns into account, I'm actually thinking about using a separate 
subnet to isolate the Spark Workers, but I need to look into how to bind the 
process onto the correct interface first. This may require some code change. A 
separate subnet doesn't limit itself to a port range, so port exhaustion should 
rarely happen, and it won't impact performance.
Opening up all ports between 32768-61000 is effectively the same as having no 
firewall. This exposes some security concerns, but I need more information on 
whether that is critical or not.
The bottom line is that the driver needs to talk to the Workers. How users 
access the driver should be easier to solve, such as launching the Spark (shell) 
driver on a specific interface.
Likewise, if you find out any interesting solutions, please let me know. I'll 
share my solution once I have something up and running. Currently, it is 
running OK with iptables off, but I still need to figure out how to 
productionize the security part.
Subject: RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication
To: user@spark.apache.org
From: jeis...@us.ibm.com
Date: Fri, 2 May 2014 16:07:50 -0500


Howdy Andrew,



I think I am running into the same issue [1] as you.  It appears that Spark 
opens up dynamic / ephemeral [2] ports for each job on the shell and the 
workers.  As you are finding out, this makes securing and managing the network 
for Spark very difficult.



 Any idea how to restrict the 'Workers' port range?

The port range can be found by running:

$ sysctl net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = 32768 61000


With that being said, a couple of avenues you may try:

- Limit the dynamic ports [3] to a more reasonable number and open all of these 
ports on your firewall; obviously, this might have unintended consequences like 
port exhaustion.
- Secure the network another way, like through a private VPN; this may reduce 
Spark's performance.


If you have other workarounds, I am all ears --- please let me know!

Jacob



[1] 
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html

[2] http://en.wikipedia.org/wiki/Ephemeral_port

[3] 
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html



Jacob D. Eisinger

IBM Emerging Technologies

jeis...@us.ibm.com - (512) 286-6075






From:   Andrew Lee alee...@hotmail.com

To: user@spark.apache.org user@spark.apache.org

Date:   05/02/2014 03:15 PM

Subject:    RE: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication







Hi Yana, 



I did. I configured the port in spark-env.sh; the problem is not the driver 
port, which is fixed.

It's the Workers' ports that are dynamic every time they are launched in the 
YARN containers. :-(



Any idea how to restrict the 'Workers' port range?



Date: Fri, 2 May 2014 14:49:23 -0400

Subject: Re: spark-shell driver interacting with Workers in YARN mode - 
firewall blocking communication

From: yana.kadiy...@gmail.com

To: user@spark.apache.org



I think what you want to do is set spark.driver.port to a fixed port.
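
For illustration, a minimal sketch of pinning the driver side with the standard spark.driver.host / spark.driver.port properties; the address and port values are placeholders, and the executor-side ports discussed above still come up on ephemeral ports in this era:

import org.apache.spark.{SparkConf, SparkContext}

// Fix the driver's advertised host and listening port so a firewall rule
// can be written for them; the concrete values below are placeholders.
val conf = new SparkConf()
  .setAppName("firewalled-shell")
  .setMaster("yarn-client")               // as in the yarn-client setup discussed above
  .set("spark.driver.host", "10.0.0.5")   // interface the Workers should reach
  .set("spark.driver.port", "51000")      // any fixed, firewall-approved port
val sc = new SparkContext(conf)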





On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:
Hi All,



I encountered this problem when the firewall is enabled between the spark-shell 
and the Workers.



When I launch spark-shell in yarn-client mode, I notice that Workers on the 
YARN containers are trying to talk to the driver (spark-shell), however, the 
firewall is not opened and caused timeout.



For the Workers, it tries to open listening ports on 54xxx for each Worker? Is 
the port random in such case?

What will be the better way to predict the ports so I can configure the 
firewall correctly between the driver (spark-shell) and the Workers? Is there a 
range of ports we can specify in the firewall/iptables?



Any ideas?

  

Re: pySpark memory usage

2014-05-04 Thread Aaron Davidson
I'd just like to update this thread by pointing to the PR based on our
initial design: https://github.com/apache/spark/pull/640

This solution is a little more general and avoids catching IOException
altogether. Long live exception propagation!


On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Jim,

 This IOException thing is a general issue that we need to fix, and your
 observation is spot-on. There is actually a JIRA for it that I created a
 few days ago:
 https://issues.apache.org/jira/browse/SPARK-1579

 Aaron is assigned on that one but not actively working on it, so we'd
 welcome a PR from you on this if you are interested.

 The first thought we had was to set a volatile flag when the reader sees
 an exception (indicating there was a failure in the task) and avoid
 swallowing the IOException in the writer if this happens. But I think there
 is a race here where the writer sees the error first before the reader
 knows what is going on.

 Anyways maybe if you have a simpler solution you could sketch it out in
 the JIRA and we could talk over there. The current proposal in the JIRA is
 somewhat complicated...

 - Patrick






 On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl...@gmail.com wrote:

 FYI, it looks like this "stdin writer to Python finished early" error was
 caused by a break in the connection to S3, from which the data was being
 pulled. A recent commit to PythonRDD
 (https://github.com/apache/spark/commit/a967b005c8937a3053e215c952d2172ee3dc300d#commitcomment-6114780)
 noted that the current exception catching can potentially mask an exception
 from the data source, and that is indeed what I see happening. The
 underlying libraries (jets3t and httpclient) do have retry capabilities,
 but I don't see a great way of setting them through Spark code. Instead I
 added the patch below, which kills the worker on the exception. This allows
 me to completely load the data source after a few worker retries.

 Unfortunately, java.net.SocketException is the same error that is
 sometimes expected from the client when using methods like take().  One
 approach around this conflation is to create a new locally scoped exception
 class, eg. WriterException, catch java.net.SocketException during output
 writing, then re-throw the new exception.  The worker thread could then
 distinguish between the reasons java.net.SocketException might be thrown.
  Perhaps there is a more elegant way to do this in Scala, though?
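
 A rough sketch of that idea, with illustrative names only (WriterException is not an existing Spark class, and the method below is a stand-in for the writer thread's output loop):

 import java.io.OutputStream
 import java.net.SocketException

 // Illustrative only: wrap SocketExceptions raised while writing to the Python
 // worker so the reader side can tell "the data source broke" apart from the
 // expected early-exit case (e.g. take()).
 class WriterException(cause: SocketException)
   extends RuntimeException("error while writing to Python worker", cause)

 def writeElements(out: OutputStream, elems: Iterator[Array[Byte]]): Unit = {
   try {
     elems.foreach(bytes => out.write(bytes))
   } catch {
     case e: SocketException => throw new WriterException(e)
   }
 }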

 Let me know if I should open a ticket or discuss this on the developers
 list instead.  Best,

 Jim

 diff --git
 a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
 b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
 index 0d71fdb..f31158c 100644
 --- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
 +++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
  @@ -95,6 +95,12 @@ private[spark] class PythonRDD[T: ClassTag](
           readerException = e
           Try(worker.shutdownOutput()) // kill Python worker process

 +      case e: java.net.SocketException =>
 +        // This can happen if a connection to the datasource, eg S3, resets
 +        // or is otherwise broken
 +        readerException = e
 +        Try(worker.shutdownOutput()) // kill Python worker process
 +
         case e: IOException =>
           // This can happen for legitimate reasons if the Python code stops returning data
           // before we are done passing elements through, e.g., for take(). Just log a message to


 On Wed, Apr 9, 2014 at 7:04 PM, Jim Blomo jim.bl...@gmail.com wrote:

 This dataset is uncompressed text at ~54GB. stats() returns (count:
 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min:
 343)

 On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Okay, thanks. Do you have any info on how large your records and data
 file are? I'd like to reproduce and fix this.
 
  Matei
 
  On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote:
 
  Hi Matei, thanks for working with me to find these issues.
 
  To summarize, the issues I've seen are:
  0.9.0:
  - https://issues.apache.org/jira/browse/SPARK-1323
 
  SNAPSHOT 2014-03-18:
   - When persist() is used and batchSize=1, java.lang.OutOfMemoryError:
   Java heap space.  To me this indicates a memory leak since Spark
   should simply be counting records of size < 3MB
   - Without persist(), "stdin writer to Python finished early" hangs the
   application, unknown root cause
 
  I've recently rebuilt another SNAPSHOT, git commit 16b8308 with
  debugging turned on.  This gives me the stacktrace on the new stdin
  problem:
 
  14/04/09 22:22:45 DEBUG PythonRDD: stdin writer to Python finished
 early
  java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:196)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
   

Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
A new broadcast object will be generated for every iteration step; it may eat up
the memory and make persist fail.

The broadcast object should not be removed, because the RDD may be recomputed.
And since I am trying to prevent recomputing the RDD, I need the old broadcasts to
release some memory.

I've tried to set spark.cleaner.ttl, but my task runs into an error (broadcast
object not found), so I think the task is recomputed. I don't think this is a good
idea anyway, as it makes my code depend on my environment much more.

So I changed the persist level of my RDD to MEMORY_AND_DISK, but it runs into
the error (broadcast object not found) too. And I removed the setting of
spark.cleaner.ttl, finally.

I think support for caching should be more friendly, and broadcast objects
should be cached, too. Flushing the old object to disk is much more essential
than doing so for the new one.
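
For reference, a minimal sketch of keeping only the newest broadcast alive per iteration; it assumes a Spark version that exposes Broadcast.unpersist (added after 0.9), and the data/model pieces are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: materialize the new RDD before releasing the previous RDD and
// the previous broadcast, so nothing that is still needed gets recomputed.
object IterateWithBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-iteration"))
    var model = Array.fill(10)(0.0)                       // placeholder shared state
    var data: RDD[Double] = sc.parallelize(1 to 1000000)
      .map(_.toDouble)
      .persist(StorageLevel.MEMORY_AND_DISK)

    for (step <- 1 to 20) {
      val bc = sc.broadcast(model)
      val next = data.map(x => x + bc.value.sum)          // placeholder update step
        .persist(StorageLevel.MEMORY_AND_DISK)
      next.count()                                        // force materialization
      data.unpersist(blocking = false)                    // old RDD no longer needed
      bc.unpersist(blocking = false)                      // release the old broadcast
      data = next
      model = model.map(_ + 1.0)                          // placeholder model update
    }
    sc.stop()
  }
}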



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
Code Here
https://github.com/Earthson/sparklda/blob/dev/src/main/scala/net/earthson/nlp/lda/lda.scala#L121
  

Finally, iteration still runs into recomputing...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5351.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-04 Thread 诺铁
Hello ZhangYi,

I found Ooyala's open-sourced spark-jobserver:
https://github.com/ooyala/spark-jobserver
It seems that they are also using Akka, Spray, and Spark, so it may be helpful for
you.




On Mon, May 5, 2014 at 11:37 AM, ZhangYi yizh...@thoughtworks.com wrote:

  Hi all,

 Currently, our project is planning to adopt Spark as its big data platform.
 For the client side, we have decided to expose a REST API based on Spray. Our domain
 is the communications field, processing data analysis and statistics for 3G and 4G
 users. Spark + Spray is brand new for us, and we
 can't find any best practices via Google.

 In our opinion, an event-driven architecture may be a good choice for our project.
 However, more ideas are welcome. Thanks.

 --
 ZhangYi (张逸)
 Developer
 tel: 15023157626
 blog: agiledon.github.com
 weibo: tw张逸
 Sent with Sparrow http://www.sparrowmailapp.com/?sig




Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
I tried using serialization instead of broadcast, and my program exits with
an error (beyond physical memory limits).

Can the large object not be released by GC because it is needed for
recomputing? So what is the recommended way to solve this problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5354.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-04 Thread wxhsdp
Hi DB, I think it's something related to sbt publishLocal.

If I remove the breeze dependency in my sbt file, breeze cannot be found:

[error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not
found: object breeze
[error] import breeze.linalg._
[error]^

here's my sbt file:

name := "Build Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

I ran sbt publishLocal on the Spark tree.

But if I manually put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in the /lib
directory, sbt package is
OK, and I can run my app on the workers without addJar.

What's the difference between adding the dependency in sbt after sbt publishLocal
and manually putting spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in the /lib
directory?

Why can I run my app on the workers without addJar this time?


DB Tsai-2 wrote
 If you add the breeze dependency in your build.sbt project, it will not be
 available to all the workers.
 
 There are a couple of options: 1) use sbt assembly to package breeze into your
 application jar; 2) manually copy the breeze jar onto all the nodes, and have
 it on the classpath; 3) Spark 1.0 has the breeze jar in the Spark flat
 assembly jar, so you don't need to add the breeze dependency yourself.
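 
 For option 1, a rough sketch of what such a build file could look like; the breeze and sbt-assembly versions here are assumptions for the 0.9/1.0 era, not values from this thread:
 
 // build.sbt -- declare breeze explicitly and build a fat jar with sbt-assembly,
 // so the classes reach every worker at runtime.
 name := "Build Project"
 
 version := "1.0"
 
 scalaVersion := "2.10.4"
 
 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT" % "provided",
   "org.scalanlp"     %% "breeze"     % "0.7"
 )
 
 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 
 // project/plugins.sbt would also need something like:
 // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")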
 
 
 Sincerely,
 
 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
 On Sun, May 4, 2014 at 4:07 AM, wxhsdp <wxhsdp@...> wrote:
 
 Hi,
   I'm trying to use the breeze linalg library for matrix operations in my Spark
 code. I already added a dependency
   on breeze in my build.sbt, and packaged my code successfully.

   When I run in local mode, sbt run local..., everything is OK.

   But when I turn to standalone mode, sbt run spark://127.0.0.1:7077...,
 an error occurs:

 14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.NoSuchMethodError
 java.lang.NoSuchMethodError:

 breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

   In my opinion, everything needed is packaged into the jar file, isn't it?
   And has anyone used breeze before? Is it good for matrix operations?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310p5355.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.