Re: spark run issue
All the stuff in lib_managed is what gets downloaded by sbt/maven when you compile. Those are necessary for running Spark, Spark Streaming, etc. But you should not have to add all of that to the classpath individually and manually when running Spark programs. If you are trying to run your Spark program locally, you should use sbt or maven to compile your project with Spark as a dependency, and sbt/maven will take care of putting all the necessary jars on the classpath (when you run your program with sbt/maven). If you are trying to run your Spark program on a cluster, then refer to the deployment guide http://spark.apache.org/docs/latest/cluster-overview.html. To run a Spark standalone cluster, you just have to compile Spark and place the whole Spark directory on the worker nodes. For other deploy modes like YARN and Mesos, you should just compile Spark into a big all-inclusive jar and supply that when you launch your program on YARN/Mesos. See the guide for more details.

TD

On Sat, May 3, 2014 at 7:24 PM, Weide Zhang weo...@gmail.com wrote:
Hi Tathagata, I actually have a separate question. What's the usage of the lib_managed folder inside the Spark source folder? Are those the libraries required for Spark Streaming to run? Do they need to be added to the Spark classpath when starting a Spark cluster? Weide

On Sat, May 3, 2014 at 7:08 PM, Weide Zhang weo...@gmail.com wrote:
Hi Tathagata, I figured out the reason. I was adding a wrong kafka lib alongside the version spark uses. Sorry for spamming. Weide

On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote:
I am a little confused about the version of Spark you are using. Are you using Spark 0.9.1, which uses Scala 2.10.3? TD

On Sat, May 3, 2014 at 6:16 PM, Weide Zhang weo...@gmail.com wrote:
Hi, I'm trying to run the kafka-word-count example in spark2.9.1. I encountered some exceptions when initializing the kafka consumer/producer config. I'm using Scala 2.10.3 and used the maven build inside the spark streaming kafka library that comes with spark2.9.1. Has anyone seen this exception before? Thanks,

producer:
[java] The args attribute is deprecated. Please use nested arg elements.
[java] Exception in thread "main" java.lang.NoClassDefFoundError: scala/Tuple2$mcLL$sp
[java] at kafka.producer.ProducerConfig.<init>(ProducerConfig.scala:56)
[java] at com.turn.apache.KafkaWordCountProducer$.main(HelloWorld.scala:89)
[java] at com.turn.apache.KafkaWordCountProducer.main(HelloWorld.scala)
[java] Caused by: java.lang.ClassNotFoundException: scala.Tuple2$mcLL$sp
[java] at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
[java] at java.security.AccessController.doPrivileged(Native Method)
[java] at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
[java] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
[java] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
[java] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
[java] ... 3 more
[java] Java Result: 1
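For reference, a minimal build.sbt along the lines TD describes, declaring Spark as a dependency so sbt assembles the classpath itself (a sketch only; the project name is illustrative, and the versions match the 0.9.1 / Scala 2.10 setup discussed above):

    // build.sbt -- minimal sketch; sbt run then resolves spark-core,
    // spark-streaming-kafka, and the matching kafka jars automatically.
    name := "kafka-word-count"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "0.9.1",
      "org.apache.spark" %% "spark-streaming-kafka" % "0.9.1"
    )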
different in spark on yarn mode and standalone mode
Hey you guys, What is the difference between Spark on YARN mode and standalone mode with respect to resource scheduling? Wish you happiness every day. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Crazy Kryo Exception
Does this perhaps have to do with the spark.closure.serializer?

On Sat, May 3, 2014 at 7:50 AM, Soren Macbeth so...@yieldbot.com wrote:
Poking around in the bowels of scala, it seems like this has something to do with implicit scala <-> java collection munging. Why would it be doing this, and where? The stack trace given is entirely unhelpful to me. Is there a better one buried in my task logs? None of my tasks actually failed, so it seems that it's dying while trying to fetch results from my tasks to return back to the driver. Am I close?

On Fri, May 2, 2014 at 3:35 PM, Soren Macbeth so...@yieldbot.com wrote:
Hallo, I've been getting this rather crazy kryo exception trying to run my spark job:

Exception in thread "main" org.apache.spark.SparkException: Job aborted: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.convert.Wrappers field scala.collection.convert.Wrappers$SeqWrapper.$outer to my.custom.class
Serialization trace:
$outer (scala.collection.convert.Wrappers$SeqWrapper)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I have a kryo serializer for my.custom.class and I've registered it using a custom registrator on my context object. I've tested the custom serializer and the registrator locally and they both function as expected. This job is running spark 0.9.1 under mesos in fine grained mode. Please help!
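For context, the registrator pattern being discussed looks roughly like this (a minimal sketch against the Spark 0.9-era API; the Event class is a stand-in for my.custom.class from the trace above):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Stand-in for my.custom.class.
    case class Event(id: Long, tags: Seq[String])

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Registering the class itself is not always enough: values that come
        // back wrapped in scala.collection.convert.Wrappers (from implicit
        // Java<->Scala collection conversions) can still trip Kryo, as the
        // SeqWrapper.$outer exception above shows.
        kryo.register(classOf[Event])
      }
    }

    // Enabled via system properties before the SparkContext is created:
    // System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // System.setProperty("spark.kryo.registrator", "MyRegistrator")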
using kryo for spark.closure.serializer with a registrator doesn't work
Is this supposed to be supported? It doesn't work, at least in mesos fine grained mode. First it fails a bunch of times because it can't find my registrator class, because my assembly jar hasn't been fetched, like so:

java.lang.ClassNotFoundException: pickles.kryo.PicklesRegistrator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:63)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$2.apply(KryoSerializer.scala:61)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:61)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:116)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:79)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:180)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

after that it does finally fetch my jar, then it fails with the following exception:

14/05/04 04:23:57 ERROR executor.Executor: Exception in task ID 79
java.nio.ReadOnlyBufferException
at java.nio.ByteBuffer.array(ByteBuffer.java:961)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:136)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

Should I file a bug?
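For readers following along, the configuration under discussion is roughly this (a sketch using the Spark 0.9-era system properties and the registrator class named in the trace above; whether this combination is supported at all is exactly the open question of this thread):

    // Switch the *closure* serializer to Kryo -- the combination this
    // thread reports as broken in Mesos fine-grained mode. The default
    // Java closure serializer is the safe fallback.
    System.setProperty("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "pickles.kryo.PicklesRegistrator")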
Re: cache not work as expected for iteration?
Maybe your memory isn't enough to contain the current RDD plus all the past ones? RDDs that are cached or persisted have to be unpersisted explicitly; no auto-unpersist exists (maybe that will change in the 1.0 version?). Be careful: calling cache() or persist() doesn't imply the RDD will be materialized. I personally found this pattern of usage to be the simpler one:

val mwzNew = mwz.mapPartitions(...).cache() // or .persist()
mwzNew.count() // or mwzNew.foreach(x => {}) -- force evaluation of the new RDD in order to have it materialized
mwz.unpersist() // Drop from memory and disk the old, no longer used, RDD

2014-05-04 5:16 GMT+02:00 Earthson earthson...@gmail.com:
I'm using spark for an LDA implementation. I need to cache an RDD for the next step of Gibbs sampling, and cache the result, while the previous cache could be uncached. Something like an LRU cache should delete the previous cache because it is never used afterwards, but the cache runs into confusion. Here is the code:)
https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala#L99
http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache1.png
http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache2.png
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
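Spelled out as a loop, the pattern looks like this (a self-contained sketch; the update step is a placeholder for the Gibbs-sampling logic in the linked code):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object IterativeCache {
      def run(sc: SparkContext): RDD[Double] = {
        var mwz: RDD[Double] = sc.parallelize(1 to 1000).map(_.toDouble).cache()
        for (_ <- 1 to 10) {
          val mwzNew = mwz.mapPartitions(_.map(_ * 0.9)).cache() // placeholder update step
          mwzNew.foreach(_ => ()) // force evaluation: cache()/persist() are lazy
          mwz.unpersist()         // explicitly drop the previous iteration's cache
          mwz = mwzNew
        }
        mwz
      }
    }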
Re: sbt run with spark.ContextCleaner ERROR
Can you tell us which version of Spark you are using? Spark 1.0 RC3, or something intermediate? And do you call sparkContext.stop at the end of your application? If so, does this error occur before or after the stop()? TD

On Sun, May 4, 2014 at 2:40 AM, wxhsdp wxh...@gmail.com wrote:
Hi all, I use sbt to run my spark application; after the app completes, this error occurs:

14/05/04 17:32:28 INFO network.ConnectionManager: Selector thread was interrupted!
14/05/04 17:32:28 ERROR spark.ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:116)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:64)

Has anyone met this before?
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
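The stop() call TD is asking about, in sketch form (app name and workload are illustrative):

    import org.apache.spark.SparkContext

    object MyApp {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "MyApp")
        try {
          sc.parallelize(1 to 100).count()
        } finally {
          sc.stop() // shuts down the cleaner thread, web UI, and block manager cleanly
        }
      }
    }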
Re: sbt/sbt run command returns a JVM problem
Hi Michael, The log after I typed "last" is as below:

> last
scala.tools.nsc.MissingRequirementError: object scala not found.
at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
at scala.tools.nsc.symtab.Definitions$definitions$.getModule(Definitions.scala:605)
at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackage(Definitions.scala:145)
at scala.tools.nsc.symtab.Definitions$definitions$.ScalaPackageClass(Definitions.scala:146)
at scala.tools.nsc.symtab.Definitions$definitions$.AnyClass(Definitions.scala:176)
at scala.tools.nsc.symtab.Definitions$definitions$.init(Definitions.scala:814)
at scala.tools.nsc.Global$Run.<init>(Global.scala:697)
at sbt.compiler.Eval$$anon$1.<init>(Eval.scala:53)
at sbt.compiler.Eval.run$1(Eval.scala:53)
at sbt.compiler.Eval.unlinkAll$1(Eval.scala:56)
at sbt.compiler.Eval.eval(Eval.scala:62)
at sbt.EvaluateConfigurations$.evaluateSetting(Build.scala:104)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:212)
at sbt.BuiltinCommands$$anonfun$set$1.apply(Main.scala:209)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$.process(Command.scala:90)
at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1$$anonfun$apply$1.apply(MainLoop.scala:71)
at sbt.State$$anon$2.process(State.scala:171)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$next$1.apply(MainLoop.scala:71)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.MainLoop$.next(MainLoop.scala:71)
at sbt.MainLoop$.run(MainLoop.scala:64)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:53)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:50)
at sbt.Using.apply(Using.scala:25)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:50)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:33)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:17)
at sbt.MainLoop$.runLogged(MainLoop.scala:13)
at sbt.xMain.run(Main.scala:26)
at xsbt.boot.Launch$.run(Launch.scala:55)
at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:45)
at xsbt.boot.Launch$.launch(Launch.scala:60)
at xsbt.boot.Launch$.apply(Launch.scala:16)
at xsbt.boot.Boot$.runImpl(Boot.scala:31)
at xsbt.boot.Boot$.main(Boot.scala:20)
at xsbt.boot.Boot.main(Boot.scala)
[error] scala.tools.nsc.MissingRequirementError: object scala not found.
[error] Use 'last' for the full log.

And my sbt file is like below (my sbt launcher is sbt-launch-0.12.4.jar in the same folder):

#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script launches sbt for this project. If present it uses the system
# version of sbt. If there is no system version of sbt it attempts to download
# sbt locally.
SBT_VERSION=`awk -F "=" '/sbt\.version/ {print $2}' ./project/build.properties`
URL1=http://typesafe.artifactoryonline.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
URL2=http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar
JAR=sbt/sbt-launch-${SBT_VERSION}.jar

# Download sbt launch jar if it hasn't been downloaded yet
if [ ! -f ${JAR} ]; then
  # Download
  printf "Attempting to fetch sbt\n"
  if hash curl 2>/dev/null; then
    curl --progress-bar ${URL1} > ${JAR} || curl --progress-bar ${URL2} > ${JAR}
  elif hash wget 2>/dev/null; then
    wget --progress=bar ${URL1} -O ${JAR} || wget --progress=bar ${URL2} -O ${JAR}
  else
    printf "You do not have curl or wget installed, please install sbt manually from http://www.scala-sbt.org/\n"
    exit -1
  fi
fi
if [ ! -f ${JAR} ]; then
  # We failed to download
  printf "Our attempt to
Re: SparkException: env SPARK_YARN_APP_JAR is not set
According to the code, SPARK_YARN_APP_JAR is retrieved from system variables, and the key-value pairs you pass through to JavaSparkContext are isolated from system variables. So maybe you should try setting it through System.setProperty(). thanks

On Wed, Apr 23, 2014 at 6:05 PM, 肥肥 19934...@qq.com wrote:
I have a small program which I can launch successfully via the yarn client in yarn-standalone mode. The commands look like this:

javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java
jar cvf loadtest.jar LoadTest.class
SPARK_JAR=assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client --jar /opt/mytest/loadtest.jar --class LoadTest --args yarn-standalone --num-workers 2 --master-memory 2g --worker-memory 2g --worker-cores 1

the program LoadTest.java:

public class LoadTest {
    static final String USER = "root";
    public static void main(String[] args) {
        System.setProperty("user.name", USER);
        System.setProperty("HADOOP_USER_NAME", USER);
        System.setProperty("spark.executor.memory", "7g");
        JavaSparkContext sc = new JavaSparkContext(args[0], "LoadTest", System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(LoadTest.class));
        String file = "file:/opt/mytest/123.data";
        JavaRDD<String> data1 = sc.textFile(file, 2);
        long c1 = data1.count();
        System.out.println("1" + c1);
    }
}

BUT due to my other program's needs, I must run it with the java command. So I added an "environment" parameter to JavaSparkContext(). The following is the ERROR I get:

Exception in thread "main" org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:125)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:200)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:93)
at LoadTest.main(LoadTest.java:37)

the program LoadTest.java:

public class LoadTest {
    static final String USER = "root";
    public static void main(String[] args) {
        System.setProperty("user.name", USER);
        System.setProperty("HADOOP_USER_NAME", USER);
        System.setProperty("spark.executor.memory", "7g");
        Map<String, String> env = new HashMap<String, String>();
        env.put("SPARK_YARN_APP_JAR", "file:/opt/mytest/loadtest.jar");
        env.put("SPARK_WORKER_INSTANCES", "2");
        env.put("SPARK_WORKER_CORES", "1");
        env.put("SPARK_WORKER_MEMORY", "2G");
        env.put("SPARK_MASTER_MEMORY", "2G");
        env.put("SPARK_YARN_APP_NAME", "LoadTest");
        env.put("SPARK_YARN_DIST_ARCHIVES", "file:/opt/test/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar");
        JavaSparkContext sc = new JavaSparkContext("yarn-client", "LoadTest", System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(LoadTest.class), env);
        String file = "file:/opt/mytest/123.dna";
        JavaRDD<String> data1 = sc.textFile(file, 2); //.cache();
        long c1 = data1.count();
        System.out.println("1" + c1);
    }
}

the commands:

javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java
jar cvf loadtest.jar LoadTest.class
nohup java -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest > loadTest.log 2>&1 &

What did I miss? Or did I do it the wrong way?
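In code form, the suggestion above amounts to something like the following sketch (Scala, for consistency with the other examples in this digest; whether 0.9.1's YARN client actually picks the value up from JVM system properties rather than the process environment is the reply's claim, not verified here):

    import org.apache.spark.api.java.JavaSparkContext

    object LoadTestLauncher {
      def main(args: Array[String]) {
        // Set before the context is constructed; the env map passed to
        // JavaSparkContext is not consulted for this variable.
        System.setProperty("SPARK_YARN_APP_JAR", "file:/opt/mytest/loadtest.jar")
        val sc = new JavaSparkContext("yarn-client", "LoadTest",
          System.getenv("SPARK_HOME"), Array("file:/opt/mytest/loadtest.jar"))
        println(sc.textFile("file:/opt/mytest/123.data", 2).count())
        sc.stop()
      }
    }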
Re: cache not work as expected for iteration?
thx for the help, unpersist is exactly what I want :) I see that spark will remove some cache automatically when memory is full; it would be much more helpful if the rule satisfied something like LRU. It seems that persist and cache are some kind of lazy? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292p5308.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: sbt run with spark.ContextCleaner ERROR
Hi TD, actually I'm not very clear about my spark version: I checked out from https://github.com/apache/spark/trunk on Apr 30. Please tell me where you get the version "Spark 1.0 RC3" from. I did not call sparkContext.stop; now I've added it to the end of my code. Here's the log:

14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
14/05/04 18:48:21 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
14/05/04 18:48:21 INFO ui.SparkUI: Stopped Spark web UI at http://ubuntu.local:4040
14/05/04 18:48:21 INFO scheduler.DAGScheduler: Stopping DAGScheduler
14/05/04 18:48:22 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/04 18:48:23 INFO network.ConnectionManager: Selector thread was interrupted!
14/05/04 18:48:23 INFO network.ConnectionManager: ConnectionManager stopped
14/05/04 18:48:23 INFO storage.MemoryStore: MemoryStore cleared
14/05/04 18:48:23 INFO storage.BlockManager: BlockManager stopped
14/05/04 18:48:23 INFO storage.BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/04 18:48:23 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/05/04 18:48:23 INFO spark.SparkContext: Successfully stopped SparkContext
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #3
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #1
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #2
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #3
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #1
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #2
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #6
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #4
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #5
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #6
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #4
14/05/04 18:48:23 ERROR nio.AbstractNioSelector: Interrupted while wait for resources to be released #5
14/05/04 18:48:23 INFO Remoting: Remoting shut down
14/05/04 18:48:23 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5309.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
NoSuchMethodError: breeze.linalg.DenseMatrix
Hi, I'm trying to use the breeze linalg library for matrix operations in my spark code. I already added a dependency on breeze in my build.sbt, and packaged my code successfully. When I run in local mode (sbt "run local ..."), everything is ok, but when I turn to standalone mode (sbt "run spark://127.0.0.1:7077 ..."), an error occurs:

14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoSuchMethodError
java.lang.NoSuchMethodError: breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

In my opinion, everything needed is packaged into the jar file, isn't it? And has anyone used breeze before? Is it good for matrix operations? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Lease Exception hadoop 2.4
Thanks Mayur. The only thing my code is doing is: read from s3, and saveAsTextFile on hdfs. Like I said, everything is written correctly, but at the end of the job there is this warning. I will try to compile with hadoop 2.4. thanks

2014-05-04 11:17 GMT-03:00 Mayur Rustagi mayur.rust...@gmail.com:
You should compile Spark with every hadoop version you use. I am surprised it's working otherwise, as HDFS breaks compatibility quite often. As for this error, it comes when your code writes/reads a file that has already been deleted. Are you trying to update a single file in multiple mappers/reduce partitioners? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Sun, May 4, 2014 at 5:30 PM, Andre Kuhnen andrekuh...@gmail.com wrote:
Please, can anyone give feedback? thanks. Hello, I am getting this warning after upgrading to Hadoop 2.4, when I try to write something to the HDFS. The content is written correctly, but I do not like this warning. Do I have to compile SPARK with hadoop 2.4?

WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

thanks

2014-05-03 13:09 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:
Hello, I am getting this warning after upgrading to Hadoop 2.4, when I try to write something to the HDFS. The content is written correctly, but I do not like this warning. Do I have to compile SPARK with hadoop 2.4? WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException) thanks
Re: cache not work as expected for iteration?
Yes, persist/cache will cache an RDD only when an action is applied to it.

On Sun, May 4, 2014 at 6:32 AM, Earthson earthson...@gmail.com wrote:
thx for the help, unpersist is exactly what I want :) I see that spark will remove some cache automatically when memory is full; it would be much more helpful if the rule satisfied something like LRU. It seems that persist and cache are some kind of lazy? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cache-not-work-as-expected-for-iteration-tp5292p5308.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Reading multiple S3 objects, transforming, writing back one
Chris, to use s3distcp in this case, are you suggesting saving the RDD to local/ephemeral HDFS and then copying it up to S3 using this tool?

On Sat, May 3, 2014 at 7:14 PM, Chris Fregly ch...@fregly.com wrote:
not sure if this directly addresses your issue, peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded. kinda cool. here's some info: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html s3distcp is an extension of the familiar hadoop distcp, of course.

On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
The fastest way to save to S3 should be to leave the RDD with many partitions, because all partitions will be written out in parallel. Then, once the various parts are in S3, somehow concatenate the files together into one file. If this can be done within S3 (I don't know if this is possible), then you get the best of both worlds: a highly parallelized write to S3, and a single cleanly named output file.

On Thu, May 1, 2014 at 12:52 PM, Peter thenephili...@yahoo.com wrote:
Thank you Patrick. I took a quick stab at it:

val s3Client = new AmazonS3Client(...)
val copyObjectResult = s3Client.copyObject("upload", outputPrefix + "/part-00000", "rolled-up-logs", "2014-04-28.csv")
val objectListing = s3Client.listObjects("upload", outputPrefix)
s3Client.deleteObjects(new DeleteObjectsRequest("upload").withKeys(objectListing.getObjectSummaries.asScala.map(s => new KeyVersion(s.getKey)).asJava))

Using a 3GB object I achieved about 33MB/s between buckets in the same AZ. This is a workable solution for the short term, but not ideal for the longer term as data size increases. I understand it's a limitation of the Hadoop API, but ultimately it must be possible to dump an RDD to a single S3 object :)

On Wednesday, April 30, 2014 7:01 PM, Patrick Wendell pwend...@gmail.com wrote:
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file, because it will always produce the same filename. (heavy use of pseudo code)

dir = "/some/dir"
rdd.coalesce(1).saveAsTextFile(dir)
f = new File(dir + "/part-00000")
f.moveTo(somewhere else)
dir.remove()

It might be cool to add a utility called `saveAsSingleFile` or something that does this for you. In fact, probably we should have called saveAsTextFile saveAsTextFiles to make it more clear...

On Wed, Apr 30, 2014 at 2:00 PM, Peter thenephili...@yahoo.com wrote:
Thanks Nicholas, this is a bit of a shame; not very practical for log roll-up, for example, when every output needs to be in its own directory.

On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you coalesce(1), you move everything in the RDD to a single partition, which then gives you 1 output file. It will still be called part-00000 or something like that, because that's defined by the Hadoop API that Spark uses for reading to/writing from S3. I don't know of a way to change that.

On Wed, Apr 30, 2014 at 2:47 PM, Peter thenephili...@yahoo.com wrote:
Ah, looks like RDD.coalesce(1) solves one part of the problem.

On Wednesday, April 30, 2014 11:15 AM, Peter thenephili...@yahoo.com wrote:
Hi, playing around with Spark and S3, I'm opening multiple objects (CSV files) with:

val hfile = sc.textFile("s3n://bucket/2014-04-28/")

so hfile is an RDD representing 10 objects that were underneath 2014-04-28. After I've sorted and otherwise transformed the content, I'm trying to write it back to a single object:

sortedMap.values.map(_.mkString(",")).saveAsTextFile("s3n://bucket/concatted.csv")

unfortunately this results in a folder named concatted.csv with 10 objects underneath, part-00000 .. part-00010, corresponding to the 10 original objects loaded. How can I achieve the desired behaviour of putting a single object named concatted.csv? I've tried 0.9.1 and 1.0.0-RC3. Thanks! Peter
Re: Reading multiple S3 objects, transforming, writing back one
Thank you Chris. I am familiar with s3distcp; I'm trying to replicate some of that functionality and combine it with my log post-processing in one step, instead of yet another step.

On Saturday, May 3, 2014 4:15 PM, Chris Fregly ch...@fregly.com wrote:
not sure if this directly addresses your issue, peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded. kinda cool. here's some info: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html s3distcp is an extension of the familiar hadoop distcp, of course.
[earlier quoted messages trimmed; see the thread above]
Re: Reading multiple S3 objects, transforming, writing back one
Hi Patrick, I should probably explain my use case in a bit more detail. I have hundreds of thousands to millions of clients uploading events to my pipeline. These are batched periodically (every 60 seconds atm) into logs which are dumped into S3 (and uploaded into a data warehouse). I need to post-process these at another set interval (say hourly) - mainly dedup, global sort, and then roll them up into a single file. I will repeat this process daily (need to experiment with the granularity here, but daily feels appropriate) to yield a single file for the day's data. I'm hoping to use Spark for these roll-ups; it seems so much easier than using Hadoop. The roll-up is to avoid the Hadoop small-files problem; here's one article discussing it: "The Small Files Problem" on blog.cloudera.com. Inspecting my dataset should be much more efficient and manageable with 100s or maybe low 1000s of larger partitions rather than 1,000,000s. Hope that makes some sense :) Thanks, Peter

On Saturday, May 3, 2014 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
Hi Peter, If your dataset is large (3GB) then why not just chunk it into multiple files? You'll need to do this anyways if you want to read/write this from S3 quickly, because S3's throughput per file is limited (as you've seen). This is exactly why the Hadoop API lets you save datasets with many partitions, since often there are bottlenecks at the granularity of a file. Is there a reason you need this to be exactly one file? - Patrick

On Sat, May 3, 2014 at 4:14 PM, Chris Fregly ch...@fregly.com wrote:
not sure if this directly addresses your issue, peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded.
[earlier quoted messages trimmed; see the thread above]
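Patrick's pseudocode earlier in this thread, spelled out against the Hadoop FileSystem API (a sketch only; the paths are illustrative, and the write is still funneled through a single partition, with the throughput limits discussed above):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD

    // Write the RDD as one part file, rename it to the desired key, then
    // remove the now-redundant output directory.
    def saveAsSingleFile(rdd: RDD[String], tmpDir: String, dest: String) {
      rdd.coalesce(1).saveAsTextFile(tmpDir)
      val fs = FileSystem.get(new URI(tmpDir), new Configuration())
      fs.rename(new Path(tmpDir + "/part-00000"), new Path(dest))
      fs.delete(new Path(tmpDir), true)
    }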
Re: NoSuchMethodError: breeze.linalg.DenseMatrix
If you add the breeze dependency in your build.sbt project, it will not be available to all the workers. There are a couple of options: 1) use sbt assembly to package breeze into your application jar; 2) manually copy the breeze jar to all the nodes, and have it in the classpath; 3) spark 1.0 has the breeze jar in the spark fat assembly jar, so you don't need to add the breeze dependency yourself. Sincerely, DB Tsai --------------------------------------- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh...@gmail.com wrote:
Hi, I'm trying to use the breeze linalg library for matrix operations in my spark code. I already added a dependency on breeze in my build.sbt, and packaged my code successfully. When I run in local mode (sbt "run local ..."), everything is ok, but when I turn to standalone mode (sbt "run spark://127.0.0.1:7077 ..."), an error occurs:

14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoSuchMethodError
java.lang.NoSuchMethodError: breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

In my opinion, everything needed is packaged into the jar file, isn't it? And has anyone used breeze before? Is it good for matrix operations? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NoSuchMethodError: breeze.linalg.DenseMatrix
An additional option: 4) Use SparkContext.addJar() and have the application ship your jar to all the nodes. Yadid

On 5/4/14, 4:07 PM, DB Tsai wrote:
If you add the breeze dependency in your build.sbt project, it will not be available to all the workers. There are a couple of options: 1) use sbt assembly to package breeze into your application jar; 2) manually copy the breeze jar to all the nodes, and have it in the classpath; 3) spark 1.0 has the breeze jar in the spark fat assembly jar, so you don't need to add the breeze dependency yourself. Sincerely, DB Tsai --------------------------------------- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, May 4, 2014 at 4:07 AM, wxhsdp wxh...@gmail.com wrote:
Hi, I'm trying to use the breeze linalg library for matrix operations in my spark code. I already added a dependency on breeze in my build.sbt, and packaged my code successfully. When I run in local mode (sbt "run local ..."), everything is ok, but when I turn to standalone mode (sbt "run spark://127.0.0.1:7077 ..."), an error occurs:

14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoSuchMethodError
java.lang.NoSuchMethodError: breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$;

In my opinion, everything needed is packaged into the jar file, isn't it? And has anyone used breeze before? Is it good for matrix operations? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
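A sketch of option 1), using the sbt-assembly plugin of that era (plugin and breeze versions are illustrative; marking Spark itself as "provided" keeps it out of the application jar, since the cluster already ships it):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt
    import AssemblyKeys._

    assemblySettings

    name := "my-spark-app"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.1" % "provided",
      "org.scalanlp"     %% "breeze"     % "0.7"
    )

Running sbt assembly then produces one jar with breeze inside, which can be shipped to the workers (e.g. via the jars argument to SparkContext).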
spark ec2 error
Hi all, A heads up in case others hit this and are confused... This nice addition https://github.com/apache/spark/pull/612 causes an error if running the spark-ec2.py deploy script from a version other than master (e.g. 0.8.0). The error occurs during launch, here:

...
Creating local config files...
configuring /etc/ganglia/gmond.conf
Traceback (most recent call last):
  File "./deploy_templates.py", line 89, in <module>
    text = text.replace("{{" + key + "}}", template_vars[key])
TypeError: expected a character buffer object
Deploying Spark config files...
chmod: cannot access `/root/spark/conf/spark-env.sh': No such file or directory
...

And then several more errors because of missing variables (though the cluster itself launches, there are several configuration problems, e.g. with HDFS). deploy_templates fails because the new SPARK_MASTER_OPTS and SPARK_WORKER_INSTANCES don't exist, and earlier versions of spark-ec2.py still use deploy_templates from https://github.com/mesos/spark-ec2.git -b v2, which has the new variables. Using the updated spark-ec2.py from master works fine.

-- Jeremy

-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Initial job has not accepted any resources
I have been working on a Spark program, completed it, but have spent the past few hours trying to run it on EC2 without any luck. I am hoping I can comprehensively describe my problem and what I have done, but I am pretty stuck.

My code uses the following lines to configure the SparkContext, which are taken from the standalone app example found here: https://spark.apache.org/docs/0.9.0/quick-start.html
and combined with the AMP Camp code found here: http://spark-summit.org/2013/exercises/machine-learning-with-spark.html
to give the following code: http://pastebin.com/zDYkk1T8

I launch the spark cluster with:
./spark-ec2 -k plda -i ~/plda.pem -s 1 --instance-type=t1.micro --region=us-west-2 start lda

When logged in, I launch my job with sbt via this, while in my project's directory:
$ /root/bin/sbt run

This results in the following log, indicating the problem in my subject line: http://pastebin.com/DiQCj6jQ

Following this, I got advice to set my conf/spark-env.sh so it exports MASTER and SPARK_MASTER_IP: "There's an inconsistency in the way the master addresses itself. The Spark master uses the internal (ip-*.internal) address, but the driver is trying to connect using the external (ec2-*.compute-1.amazonaws.com) address. The solution is to set the Spark master URL to the external address in the spark-env.sh file. Your conf/spark-env.sh is probably empty. It should set MASTER and SPARK_MASTER_IP to the external URL, as the EC2 launch script does: https://github.com/.../templ.../root/spark/conf/spark-env.sh"

My spark-env.sh looks like this:

#!/usr/bin/env bash
export MASTER=`cat /root/spark-ec2/cluster-url`
export SPARK_MASTER_IP=ec2-54-186-178-145.us-west-2.compute.amazonaws.com
export SPARK_WORKER_MEM=128m

At this point, I did a test:
1. Remove my spark-env.sh variables
2. Run spark-shell
3. Run: sc.parallelize(1 to 1000).count()
4. This works as expected
5. Reset my spark-env.sh variables
6. Run the prior spark-shell and commands
7. I get the same error as reported above

Hence, there is something wrong with how I am setting my master/slave configuration. Any help would be greatly appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
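One detail worth flagging in the file above (an observation, not something raised in the thread): the standalone scripts read SPARK_WORKER_MEMORY, not SPARK_WORKER_MEM, so that last export is likely ignored. A spark-env.sh following the quoted advice would look roughly like this sketch, letting the EC2 scripts supply the addresses instead of hand-mixing internal and external ones:

    #!/usr/bin/env bash
    # Sketch only: cluster-url and masters are files the spark-ec2 scripts
    # write on the master node.
    export MASTER=`cat /root/spark-ec2/cluster-url`
    export SPARK_MASTER_IP=`cat /root/spark-ec2/masters`
    export SPARK_WORKER_MEMORY=128m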
spark streaming question
Hi, It might be a very general question to ask here, but I'm curious to know why spark streaming can achieve better throughput than storm, as claimed in the spark streaming paper. Does it depend on certain use cases and/or data sources? What drives better performance in the spark streaming case, or, put another way, what makes storm not as performant as spark streaming?

Also, in order to guarantee exactly-once semantics when node failures happen, spark makes replicas of RDDs and checkpoints so that data can be recomputed on the fly, while in the Trident case they use a transactional object to persist state and results; but it's not obvious to me which approach is more costly, and why. Can anyone provide some experience here?

Thanks a lot, Weide
Re: spark ec2 error
Hey Jeremy, This is actually a big problem - thanks for reporting it. I'm going to revert this change until we can make sure it is backwards compatible. - Patrick

On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi all, A heads up in case others hit this and are confused...
[quoted message trimmed; see above]
Re: spark ec2 error
Okay, I just went ahead and fixed this to make it backwards-compatible (it was a simple fix). I launched a cluster successfully with Spark 0.8.1. Jeremy - if you could try again and let me know if there are any issues, that would be great. Thanks again for reporting this.

On Sun, May 4, 2014 at 3:41 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Jeremy, This is actually a big problem - thanks for reporting it. I'm going to revert this change until we can make sure it is backwards compatible. - Patrick
[earlier quoted message trimmed; see above]
Re: spark streaming question
great questions, weide. in addition, i'd also like to hear more about how to horizontally scale a spark-streaming cluster. i've gone through the samples (standalone mode) and read the documentation, but it's still not clear to me how to scale this puppy out under high load. i assume i add more receivers (kinesis, flume, etc), but physically how does this work? @TD: can you comment? thanks! -chris

On Sun, May 4, 2014 at 2:10 PM, Weide Zhang weo...@gmail.com wrote:
Hi, It might be a very general question to ask here, but I'm curious to know why spark streaming can achieve better throughput than storm...
[quoted message trimmed; see above]
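On the scaling question, the usual pattern of that era was to create several input DStreams (one receiver each, so they land on different workers) and union them into one stream; a sketch, with the host, port, receiver count, and master URL all illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MultiReceiver {
      def main(args: Array[String]) {
        val ssc = new StreamingContext("spark://master:7077", "MultiReceiver", Seconds(2))
        // Four receivers pulling from the same source in parallel;
        // each receiver occupies one worker core.
        val streams = (1 to 4).map(_ => ssc.socketTextStream("collector-host", 9999))
        val unified = ssc.union(streams)
        unified.count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }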
Re: spark ec2 error
Cool, glad to help! I just tested with 0.8.1 and 0.9.0 and both worked perfectly, so it all seems to be good. -- Jeremy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-tp5323p5329.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Lease Exception hadoop 2.4
I compiled spark with SPARK_HADOOP_VERSION=2.4.0 sbt/sbt assembly and fixed the s3 dependencies, but I am still getting the same error:

14/05/05 00:32:33 WARN TaskSetManager: Loss was due to org.apache.hadoop.ipc.RemoteException
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on

Any ideas? thanks

2014-05-04 11:53 GMT-03:00 Andre Kuhnen andrekuh...@gmail.com:
Thanks Mayur. The only thing my code is doing is: read from s3, and saveAsTextFile on hdfs. Like I said, everything is written correctly, but at the end of the job there is this warning. I will try to compile with hadoop 2.4. thanks
[earlier quoted messages trimmed; see the thread above]
unsubscribe
Original message: Subject: unsubscribe From: Nabeel Memon nm3...@gmail.com To: user@spark.apache.org Cc: unsubscribe
Error starting EC2 cluster
I am using Spark 0.9.1. When I try to start an EC2 cluster with the spark-ec2 script, an error occurs and the following message is issued: AttributeError: 'module' object has no attribute 'check_output'. By this time, EC2 instances are up and running, but Spark doesn't seem to be installed on them. Any ideas how to fix it? $ ./spark-ec2 -k my_key -i /home/alitouka/my_key.pem -s 1 --region=us-east-1 --instance-type=m3.medium launch test_cluster Setting up security groups... Searching for existing cluster test_cluster... Don't recognize m3.medium, assuming type is pvm Spark AMI: ami-5bb18832 Launching instances... Launched 1 slaves in us-east-1c, regid = r- Launched master in us-east-1c, regid = r- Waiting for instances to start up... Waiting 120 more seconds... Generating cluster's SSH key on master... ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22: Connection refused Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t', '-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', \n [ -f ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa \n cat ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys)\n]' returned non-zero exit status 255 ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22: Connection refused Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t', '-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', \n [ -f ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa \n cat ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys)\n]' returned non-zero exit status 255 ssh: connect to host ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com port 22: Connection refused Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/alitouka/my_key.pem', '-t', '-t', u'r...@ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com', \n [ -f ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa \n cat ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys)\n]' returned non-zero exit status 255 Warning: Permanently added 'ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com,54.227.205.82' (RSA) to the list of known hosts. Connection to ec2-XX-XXX-XXX-XX.compute-1.amazonaws.com closed.

Traceback (most recent call last):
  File "./spark_ec2.py", line 806, in <module>
    main()
  File "./spark_ec2.py", line 799, in main
    real_main()
  File "./spark_ec2.py", line 684, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "./spark_ec2.py", line 419, in setup_cluster
    dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "./spark_ec2.py", line 624, in ssh_read
    return subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
RE: different in spark on yarn mode and standalone mode
At their core, they are not that different. In standalone mode, the Spark Master and Spark Workers allocate the driver and executors for your Spark app, while in YARN mode the YARN ResourceManager and NodeManagers do this work. Once the driver and executors have been launched, the rest of the resource scheduling goes through the same process in both modes, i.e. between driver and executors via Akka actors. Best Regards, Raymond Liu -----Original Message----- From: Sophia [mailto:sln-1...@163.com] Hey you guys, What is the difference between Spark on YARN mode and standalone mode with regard to resource scheduling? Wish you a happy every day. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/different-in-spark-on-yarn-mode-and-standalone-mode-tp5300.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
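Raymond's point can be seen in code: the application itself is identical in both modes, and only the master URL decides who allocates the executors. A sketch with placeholder host names:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master-host:7077") // standalone: Master + Workers launch executors
      // .setMaster("yarn-client")           // YARN: ResourceManager + NodeManagers launch containers
    val sc = new SparkContext(conf)
    // Once the executors are registered, task scheduling proceeds identically
    // in both modes over the driver <-> executor channel.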
Re: Initial job has not accepted any resources
Hey Pedro, From which version of Spark were you running the spark-ec2.py script? You might have run into the problem described here (http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html), which Patrick just fixed up to ensure backwards compatibility. With the bug, it would successfully complete deployment but prevent the correct setting of various variables, so it may have caused the errors you were seeing, though I'm not positive. I'd definitely try re-running the spark-ec2 script now. -- Jeremy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5335.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Initial job has not accepted any resources
Hi Jeremy, I am running from the most recent release, 0.9. I just fixed the problem, and it was indeed the setting of variables during deployment. Once I had the cluster I wanted running, I began to suspect that the master was not responding. So I killed a worker, then recreated it, and found it could not connect to the master. So I killed the master and remade it, then remade the worker. Odd things happened, but it seemed like I was on the right track. So I stopped all, then restarted all, then tried again, and it began to work. So now, after probably around 6 hours of debugging, I have EC2 working (yay). The next thing we are trying to debug is that when we build our project on EC2, it can’t find breeze, and when we try to include it as a Spark example (under the correct directory), it also can’t find it. The first occurs as a runtime error, a NoClassDefFoundError for breeze/linalg/DenseVector, which makes me think that breeze isn’t on the cluster and also isn’t being built with our project (and sbt assembly is acting strange). The second is a compile-time error, which is odd because other examples in the same directory use breeze and I don’t find anything that specifies their dependencies. Thanks -- Pedro Rodriguez UC Berkeley 2014 | Computer Science BSU Cryosphere Science Research SnowGeek Founder snowgeek.org pedro-rodriguez.com ski.rodrig...@gmail.com 208-340-1703 On May 4, 2014 at 6:51:56 PM, Jeremy Freeman [via Apache Spark User List] (ml-node+s1001560n5335...@n3.nabble.com) wrote: Hey Pedro, From which version of Spark were you running the spark-ec2.py script? You might have run into the problem described here (http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html), which Patrick just fixed up to ensure backwards compatibility. With the bug, it would successfully complete deployment but prevent the correct setting of various variables, so it may have caused the errors you were seeing, though I'm not positive. I'd definitely try re-running the spark-ec2 script now. -- Jeremy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5336.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Lease Exception hadoop 2.4
I think I forgot to rsync the slaves with the new compiled jar, I will give it a try as soon as possible. On 04/05/2014 21:35, Andre Kuhnen andrekuh...@gmail.com wrote: I compiled spark with SPARK_HADOOP_VERSION=2.4.0 sbt/sbt assembly and fixed the s3 dependencies, but I am still getting the same error... (rest of the thread quoted above)
Re: sbt/sbt run command returns a JVM problem
The total memory of your machine is 2G, right? Then how much memory is left free? Wouldn't Ubuntu take up quite a big portion of the 2G? Just a guess! On Sat, May 3, 2014 at 8:15 PM, Carter gyz...@hotmail.com wrote: Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is: java \ -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \ -jar ${JAR} \ "$@" I have tried setting these 3 parameters bigger and smaller, but nothing works. Did I change the right thing? Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p5267.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: master attempted to re-register the worker and then took all workers as unregistered
Hi Nan, Have you found a way to fix the issue? Now I run into the same problem with version 0.9.1. Thanks, Cheney -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5341.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: ClassNotFoundException
I just ran into the same problem. I will respond if I find out how to fix it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-tp5182p5342.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Initial job has not accepted any resources
Since it appears breeze is going to be included by default in Spark 1.0, and the issues I ran into (see http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFoundException-td5182.html) seem to have been recently introduced, I am cloning Spark and checking out the 1.0 branch. Maybe this makes my problem(s) worse, but I am going to give it a try. We are rapidly running out of time to get our code fully working on EC2. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-tp5322p5344.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark 0.9.1: ClassNotFoundException
Check if the jar file that includes your example code is under examples/target/scala-2.10/. On Sat, May 3, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote: I am using Spark 0.9.1 in standalone mode. In the SPARK_HOME/examples/src/main/scala/org/apache/spark/ folder, I created a directory called mycode, in which I placed some standalone scala code. I was able to compile. I ran the code using: ./bin/run-example org.apache.spark.mycode.MyClass local However, I get a ClassNotFound exception, although I do see the compiled classes in examples/target/scala-2.10/classes/org/apache/spark/mycode. When I place the same code in the same folder structure in the spark 0.9.0 version, I am able to run it. Where should I place my standalone code with respect to SPARK_HOME in spark 0.9.1 so that the classes can be found? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-0-9-1-ClassNotFoundException-tp5256.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication
Hi Jacob, Taking both concerns into account, I'm actually thinking about using a separate subnet to isolate the Spark Workers, but I need to look into how to bind the process onto the correct interface first. This may require some code change. A separate subnet isn't limited to a port range, so port exhaustion should rarely happen, and it won't impact performance. Opening up all ports between 32768 and 61000 is effectively the same as having no firewall; this exposes some security concerns, but I need more information on whether that is critical or not. The bottom line is that the driver needs to talk to the Workers. How users access the driver should be easier to solve, e.g. by launching the Spark (shell) driver on a specific interface. Likewise, if you find out any interesting solutions, please let me know. I'll share the solution once I have something up and running. Currently, it is running OK with iptables off, but I still need to figure out how to productionize the security part. Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication To: user@spark.apache.org From: jeis...@us.ibm.com Date: Fri, 2 May 2014 16:07:50 -0500 Howdy Andrew, I think I am running into the same issue [1] as you. It appears that Spark opens up dynamic / ephemeral [2] ports for each job on the shell and the workers. As you are finding out, this makes securing and managing the network for Spark very difficult. Any idea how to restrict the Workers' port range? The port range can be found by running: $ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_port_range = 32768 61000 With that being said, a couple of avenues you may try: Limit the dynamic ports [3] to a more reasonable number and open all of these ports on your firewall; obviously, this might have unintended consequences like port exhaustion. Secure the network another way, e.g. through a private VPN; this may reduce Spark's performance. If you have other workarounds, I am all ears; please let me know! Jacob [1] http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html [2] http://en.wikipedia.org/wiki/Ephemeral_port [3] http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Andrew Lee alee...@hotmail.com To: user@spark.apache.org Date: 05/02/2014 03:15 PM Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication Hi Yana, I did. I configured the port in spark-env.sh; the problem is not the driver port, which is fixed. It's the Workers' ports, which are dynamic every time they are launched in the YARN containers. :-( Any idea how to restrict the Workers' port range? Date: Fri, 2 May 2014 14:49:23 -0400 Subject: Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication From: yana.kadiy...@gmail.com To: user@spark.apache.org I think what you want to do is set spark.driver.port to a fixed port. On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote: Hi All, I encountered this problem when the firewall is enabled between the spark-shell and the Workers.
When I launch spark-shell in yarn-client mode, I notice that the Workers in the YARN containers are trying to talk to the driver (spark-shell); however, the firewall is not open, which causes timeouts. Each Worker seems to open a listening port in the 54xxx range. Is the port random in that case? What would be a better way to predict the ports so I can configure the firewall correctly between the driver (spark-shell) and the Workers? Is there a range of ports we can specify in the firewall/iptables? Any ideas?
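Yana's suggestion above pins down the one port that can be fixed, the driver port. A minimal sketch of that setting; note that only spark.driver.port is mentioned in this thread, and whether your Spark version exposes knobs for the other dynamically chosen worker-side ports should be checked against its configuration docs:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("firewalled-app")
      .set("spark.driver.port", "51000") // pick a port your firewall already allows
    val sc = new SparkContext(conf)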
Re: pySpark memory usage
I'd just like to update this thread by pointing to the PR based on our initial design: https://github.com/apache/spark/pull/640 This solution is a little more general and avoids catching IOException altogether. Long live exception propagation! On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jim, This IOException thing is a general issue that we need to fix, and your observation is spot-on. There is actually a JIRA for it that I created a few days ago: https://issues.apache.org/jira/browse/SPARK-1579 Aaron is assigned on that one but not actively working on it, so we'd welcome a PR from you on this if you are interested. The first thought we had was to set a volatile flag when the reader sees an exception (indicating there was a failure in the task) and avoid swallowing the IOException in the writer if this happens. But I think there is a race here where the writer sees the error first, before the reader knows what is going on. Anyways, maybe if you have a simpler solution you could sketch it out in the JIRA and we could talk over there. The current proposal in the JIRA is somewhat complicated... - Patrick On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl...@gmail.com wrote: FYI, it looks like this "stdin writer to Python finished early" error was caused by a break in the connection to S3, from which the data was being pulled. A recent commit to PythonRDD (https://github.com/apache/spark/commit/a967b005c8937a3053e215c952d2172ee3dc300d#commitcomment-6114780) noted that the current exception catching can potentially mask an exception from the data source, and that is indeed what I see happening. The underlying libraries (jets3t and httpclient) do have retry capabilities, but I don't see a great way of setting them through Spark code. Instead I added the patch below, which kills the worker on the exception. This allows me to completely load the data source after a few worker retries. Unfortunately, java.net.SocketException is the same error that is sometimes expected from the client when using methods like take(). One approach around this conflation is to create a new locally scoped exception class, e.g. WriterException, catch java.net.SocketException during output writing, then re-throw the new exception. The worker thread could then distinguish between the reasons java.net.SocketException might be thrown. Perhaps there is a more elegant way to do this in Scala, though? Let me know if I should open a ticket or discuss this on the developers list instead. Best, Jim

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
index 0d71fdb..f31158c 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
@@ -95,6 +95,12 @@ private[spark] class PythonRDD[T: ClassTag](
         readerException = e
         Try(worker.shutdownOutput()) // kill Python worker process
+      case e: java.net.SocketException =>
+        // This can happen if a connection to the datasource, eg S3, resets
+        // or is otherwise broken
+        readerException = e
+        Try(worker.shutdownOutput()) // kill Python worker process
+
       case e: IOException =>
         // This can happen for legitimate reasons if the Python code stops returning data
         // before we are done passing elements through, e.g., for take(). Just log a message to

On Wed, Apr 9, 2014 at 7:04 PM, Jim Blomo jim.bl...@gmail.com wrote: This dataset is uncompressed text at ~54GB.
stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Okay, thanks. Do you have any info on how large your records and data file are? I'd like to reproduce and fix this. Matei On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote: Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are:

0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323

SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1: java.lang.OutOfMemoryError: Java heap space. To me this indicates a memory leak, since Spark should simply be counting records of size 3MB
- Without persist(): "stdin writer to Python finished early" hangs the application; root cause unknown

I've recently rebuilt another SNAPSHOT, git commit 16b8308, with debugging turned on. This gives me the stack trace on the new stdin problem:

14/04/09 22:22:45 DEBUG PythonRDD: stdin writer to Python finished early
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
Cache issue for iteration with broadcast
A new broadcast object is generated at every iteration step; these may eat up memory and make persist fail. The broadcast objects cannot simply be removed, because an RDD may be recomputed. I am trying to prevent RDDs from being recomputed, but that requires the old broadcasts to release some memory. I tried setting spark.cleaner.ttl, but then my task runs into an error (broadcast object not found); I think the task is being recomputed. I don't think this is a good idea anyway: it makes my code depend much more on my environment. So I changed the persist level of my RDD to MEMORY_AND_DISK, but it runs into the error (broadcast object not found) too. And I finally removed the spark.cleaner.ttl setting. I think support for caching should be more friendly, and broadcast objects should be cached too; flushing an old object to disk is much more essential than flushing a new one. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
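One hedged way to keep memory bounded, assuming a Spark version whose Broadcast exposes unpersist() (newer than the 0.9-era code discussed here): re-broadcast each iteration and drop the previous value explicitly instead of relying on spark.cleaner.ttl. The toy model and update rule below are placeholders, not from this thread:

    val data = sc.parallelize(1 to 1000)
    var model = sc.broadcast(0.0)
    for (i <- 1 to 10) {
      val m = model // capture the current broadcast in a val for the closure
      val next = data.map(x => x * m.value + 1).sum() / data.count()
      // Caveat from this very thread: dropping the old broadcast is only safe
      // if no RDD that read it will be recomputed later.
      model.unpersist()
      model = sc.broadcast(next)
    }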
Re: Cache issue for iteration with broadcast
Code here: https://github.com/Earthson/sparklda/blob/dev/src/main/scala/net/earthson/nlp/lda/lda.scala#L121 In the end, the iteration still runs into recomputation... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5351.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Is any idea on architecture based on Spark + Spray + Akka
Hello ZhangYi, I found Ooyala's open-sourced spark-jobserver, https://github.com/ooyala/spark-jobserver It seems they are also using Akka, Spray, and Spark, so it may be helpful for you. On Mon, May 5, 2014 at 11:37 AM, ZhangYi yizh...@thoughtworks.com wrote: Hi all, Currently, our project is planning to adopt Spark as its big data platform. For the client side, we have decided to expose a REST API based on Spray. Our domain is the communications field, doing data analysis and statistics for 3G and 4G users. Spark + Spray is brand new for us, and we can't find any best practices via Google. In our opinion, an event-driven architecture may be a good choice for our project; however, more ideas are welcome. Thanks. -- ZhangYi (张逸) Developer tel: 15023157626 blog: agiledon.github.com weibo: tw张逸
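For a flavor of the suggested architecture, a minimal sketch (spray-routing 1.x style) of the spark-jobserver idea: one long-lived SparkContext shared behind a REST endpoint. Everything here is illustrative, not taken from either project:

    import akka.actor.ActorSystem
    import org.apache.spark.SparkContext
    import spray.routing.SimpleRoutingApp

    object RestServer extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("rest")
      // One context reused across requests; creating one per request is far too slow.
      val sc = new SparkContext("local[4]", "rest-demo")

      startServer(interface = "0.0.0.0", port = 8080) {
        path("sum") {
          get {
            complete {
              sc.parallelize(1 to 1000).sum().toString // toy job per request
            }
          }
        }
      }
    }

The key design point, which spark-jobserver also makes, is sharing the single context: SparkContext is heavyweight and one per JVM, so the REST layer should route requests onto it rather than spawn new ones.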
Re: Cache issue for iteration with broadcast
I tried using serialization instead of broadcast, and my program exits with an error (beyond physical memory limits). Can the large object not be released by GC, because it is needed for recomputation? So what is the recommended way to solve this problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5354.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NoSuchMethodError: breeze.linalg.DenseMatrix
Hi DB, I think it's something related to sbt publishLocal. If I remove the breeze dependency in my sbt file, breeze cannot be found: [error] /home/wxhsdp/spark/example/test/src/main/scala/test.scala:5: not found: object breeze [error] import breeze.linalg._ [error] ^ Here's my sbt file:

    name := "Build Project"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

I ran sbt publishLocal on the Spark tree. But if I manually put spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in the /lib directory, sbt package works, and I can run my app on the workers without addJar. What's the difference between adding the dependency in sbt after sbt publishLocal and manually putting spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar in the /lib directory? Why can I run my app on the workers without addJar this time? DB Tsai-2 wrote: If you add the breeze dependency in your build.sbt project, it will not be available to all the workers. There are a couple of options: 1) use sbt assembly to package breeze into your application jar; 2) manually copy the breeze jars onto all the nodes and have them in the classpath; 3) Spark 1.0 has the breeze jar in the Spark fat assembly jar, so you don't need to add the breeze dependency yourself. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 4, 2014 at 4:07 AM, wxhsdp <wxhsdp@...> wrote: Hi, I'm trying to use the breeze linalg library for matrix operations in my Spark code. I already added a dependency on breeze in my build.sbt and packaged my code successfully. When I run in local mode, sbt "run local...", everything is OK, but when I turn to standalone mode, sbt "run spark://127.0.0.1:7077...", an error occurs: 14/05/04 18:56:29 WARN scheduler.TaskSetManager: Loss was due to java.lang.NoSuchMethodError java.lang.NoSuchMethodError: breeze.linalg.DenseMatrix$.implOpMulMatrix_DMD_DMD_eq_DMD()Lbreeze/linalg/operators/DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$; In my opinion, everything needed is packaged into the jar file, isn't it? Has anyone used breeze before? Is it good for matrix operations? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-breeze-linalg-DenseMatrix-tp5310p5355.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
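To make option (1) from DB Tsai's reply concrete, a hedged sketch of bundling breeze into the application jar with sbt-assembly; the plugin and library versions are illustrative, not from this thread:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt (old-style sbt-assembly settings)
    import AssemblyKeys._

    assemblySettings

    libraryDependencies += "org.scalanlp" %% "breeze" % "0.7"

    // marked "provided" so Spark itself is not bundled into the fat jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT" % "provided"

Running sbt assembly then produces a single fat jar containing breeze, so the workers see it without any addJar or manual copying.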