Re: Setting queue for spark job on yarn
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron

Sent from my iPad

On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi Sandy, Is there a programmatic way? We're building a platform as a service and need to assign jobs to different queues that can provide different scheduler approaches. Thanks, Ron

Sent from my iPhone

On May 20, 2014, at 1:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Ron, What version are you using? For 0.9, you need to set it outside your code with the SPARK_YARN_QUEUE environment variable. -Sandy

On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, How does one submit a Spark job to YARN and specify a queue? The code that successfully submits to YARN is:

val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
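For anyone landing on this thread later, a minimal sketch of both approaches. The spark.yarn.queue property name below is an assumption and may not be honored by every release; 1.0's spark-submit also exposes a --queue flag for YARN as a more reliable route:

import org.apache.spark.{SparkConf, SparkContext}

// 0.9.x: the queue can only be set outside the code, e.g.
//   SPARK_YARN_QUEUE=research ./bin/spark-shell
// Later releases (hedged): a conf property set before the context is created.
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("Simple App")
  .set("spark.yarn.queue", "research") // assumed property name
val sc = new SparkContext(conf)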
MLlib ALS-- Errors communicating with MapOutputTracker
Hello, I am currently using MLlib ALS to process a large volume of data, about 1.2 billion Rating(userId, productId, rate) triples. The dataset was separated into 4000 partitions for parallelized computation on our YARN clusters. I encountered the error "Error communicating with MapOutputTracker" when trying to get the prediction rates [model.predict(usersProducts)] after the iterations.

val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}

I tried to separate the iteration process from the calculation of the prediction rates by storing the two feature matrices in the file system first, and then loading them for prediction. This time, the error occurred at the stage of loading userFeatures.

userfData: userId:[0.3,0.5,0.002,...]

val userfTuple = userfData.map { case line =>
  val arr = line.split(splitmark_1)
  val featureArr = arr(1).split(splitmark_2)
  (arr(0), featureArr)
}

Here is part of the log:

14-05-21 14:37:17 WARN [Result resolver thread-0] TaskSetManager: Loss was due to org.apache.spark.SparkException
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:79)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:126)
at org.apache.spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:43)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:244)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:235)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:90)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:89)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:727)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:723)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:220)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

---

I have tried several methods to work around this problem: one was to decrease the number of partitions (from 4000 to 3000), another was to increase the memory of the masters. Both worked, but it is still vital to track down the underlying cause, right? Could anyone help me check this problem? Thanks a lot. Sue Cai
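A note for readers hitting the same error: the askTracker call at the top of the trace is an Akka ask that can time out when the driver is under GC pressure or load, and the serialized map-output statuses grow with the partition count, which fits the observation that fewer partitions and more memory both helped. A hedged sketch of the 0.9-era knobs (values are illustrative, not verified in this thread):

import org.apache.spark.SparkConf

// Raise the Akka ask timeout and frame size before creating the context.
val conf = new SparkConf()
  .set("spark.akka.askTimeout", "120") // seconds
  .set("spark.akka.frameSize", "64")   // MB; map-output statuses scale with partition count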
Re: any way to control memory usage when streaming input's speed is faster than the speed at which it is handled by Spark Streaming?
Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the corresponding receiver and making your own version of the Flume receiver / Kafka receiver. Alternatively, there is an open JIRA (https://issues.apache.org/jira/browse/SPARK-1341) about implementing this functionality. You could give it a shot at implementing this in a generic way so that it can be used for all receivers ;)

On Tue, May 20, 2014 at 8:31 PM, Francis.Hu francis...@reachjunction.com wrote: sparkers, Is there a better way to control memory usage when streaming input's speed is faster than the speed at which it is handled by Spark Streaming? Thanks, Francis.Hu
Re: any way to control memory usage when streaming input's speed is faster than the speed at which it is handled by Spark Streaming?
Apologies for the premature send. Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the corresponding receiver and making your own version of the Flume receiver / Kafka receiver. Alternatively, there is an open JIRA (https://issues.apache.org/jira/browse/SPARK-1341) about implementing this functionality. You could give it a shot at implementing this in a generic way so that it can be used for all receivers ;) TD

On Wed, May 21, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the corresponding receiver and making your own version of it. Alternatively, there is an open JIRA (https://issues.apache.org/jira/browse/SPARK-1341) about implementing this functionality. You could give it a shot at implementing this in a generic way so that it can be used for all receivers ;)

On Tue, May 20, 2014 at 8:31 PM, Francis.Hu francis...@reachjunction.com wrote: sparkers, Is there a better way to control memory usage when streaming input's speed is faster than the speed at which it is handled by Spark Streaming? Thanks, Francis.Hu
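To make the custom-receiver suggestion concrete, here is a minimal sketch assuming the 1.0 receiver API (org.apache.spark.streaming.receiver.Receiver); fetchNextRecord() is a hypothetical stand-in for the actual source read:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Caps ingestion at maxPerSec records per second.
class ThrottledReceiver(maxPerSec: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    new Thread("Throttled Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {} // source cleanup would go here

  private def receive(): Unit = {
    while (!isStopped) {
      val start = System.currentTimeMillis()
      var sent = 0
      while (sent < maxPerSec && !isStopped) {
        store(fetchNextRecord()) // hand one record to Spark
        sent += 1
      }
      // Sleep out the remainder of the second to enforce the cap.
      val elapsed = System.currentTimeMillis() - start
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
  }

  private def fetchNextRecord(): String = ??? // hypothetical source read
}

It would be wired up with ssc.receiverStream(new ThrottledReceiver(10000)).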
Re: advice on maintaining a production spark cluster?
if you saw some exception message like the JIRA https://issues.apache.org/jira/browse/SPARK-1886 mentioned in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827

On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote: Aaron: I see this in the Master's logs:

14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

There was an executor launched that did fail, such as:

14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED

... but other executors on other machines also failed without permanently disassociating. There are these messages which I don't know if they are related:

14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote: Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time). Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception?

On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote: This isn't helpful of me to say, but, I see the same sorts of problems and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later.

On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote: So, for example, I have two disassociated worker machines at the moment.
The last messages in the spark logs are akka association error messages, like the following:

14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288 ]

On the master side, there are lots and lots of messages of the form:

14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

--j

- Best Regards
ClassNotFoundException with Spark/Mesos (spark-shell works fine)
Hi, I have set up a cluster with Mesos (backed by Zookeeper) with three master and three slave instances. I set up Spark (git HEAD) for use with Mesos according to this manual: http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html Using the spark-shell, I can connect to this cluster and do simple RDD operations, but the same code in a Scala class and executed via sbt run-main works only partially. (That is, count() works, count() after flatMap() does not.) Here is my code: https://gist.github.com/tgpfeiffer/7d20a4d59ee6e0088f91 The file SparkExamplesScript.scala, when pasted into spark-shell, outputs the correct count() for the parallelized list comprehension, as well as for the flatMapped RDD. The file SparkExamplesMinimal.scala contains exactly the same code, and also the MASTER configuration and the Spark Executor are the same. However, while the count() for the parallelized list is displayed correctly, I receive the following error when asking for the count() of the flatMapped RDD: - 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting Stage 1 (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34), which has no missing parents 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 1 (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34) 14/05/21 09:47:49 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 8 tasks 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 8 on executor 20140520-102159-2154735808-5050-1108-1: mesos9-1 (PROCESS_LOCAL) 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 1779147 bytes in 37 ms 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Lost TID 8 (task 1.0:0) 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassNotFoundException java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) - Can anyone explain to me where this comes from or how I might further track the problem down? Thanks, Tobias
Re: Ignoring S3 0 files exception
No one has any idea? It's really troublesome; it seems like I have no way to catch errors while an action is being processed and just ignore them. Here's a bit more detail on what I'm doing:

JavaRDD<String> a = sc.textFile("s3n://" + missingFilenamePattern);
JavaRDD<String> b = sc.textFile("s3n://" + existingFilenamePattern);
JavaRDD<String> aPlusB = a.union(b);
aPlusB.reduceByKey(MyReducer); // <-- This throws the error

I'd like to ignore the exception caused by a and process b without trouble. Thanks
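One possible workaround, sketched in Scala for brevity (assuming a SparkContext named sc; the pattern string is an invented example): expand the glob yourself and only hand textFile() patterns that actually match something, so the zero-file case never reaches the input format.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val pattern = "s3n://bucket/logs/2014-05-*" // assumed example pattern
val fs = FileSystem.get(new URI(pattern), new Configuration())
// globStatus may return null or an empty array when nothing matches.
val matched = Option(fs.globStatus(new Path(pattern))).getOrElse(Array())
val a =
  if (matched.nonEmpty) sc.textFile(matched.map(_.getPath.toString).mkString(","))
  else sc.parallelize(Seq.empty[String]) // empty stand-in RDD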
RDD union of a window in DStream
Hi, I want to do a union of all RDDs in each window of a DStream. I found DStream.union but haven't seen anything like DStream.windowRDDUnion. Is there any way around it? I want to find the mean and SD of all values that fall under each sliding window, for which I need to union all the RDDs in each window. This is not a running mean and SD. Regards, Laeeq
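One thing worth noting: DStream.window() already does exactly this union; each RDD of the windowed stream is the union of all parent RDDs that fall inside the window. A minimal sketch, assuming a DStream[Double] named values (window lengths are arbitrary):

import org.apache.spark.SparkContext._ // brings in stats() on RDD[Double]
import org.apache.spark.streaming.Seconds

val windowed = values.window(Seconds(30), Seconds(10))
windowed.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val s = rdd.stats() // StatCounter: count, mean, stdev, ...
    println(s"window mean=${s.mean} stdev=${s.stdev}")
  }
}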
Log analysis
I am new to Spark and we are developing a data science pipeline based on Spark on EC2. So far we have been using Python on a Spark standalone cluster. However, being a newbie, I would like to know more about how I can do program-level debugging from the Spark logs (is it stderr?). I find it a bit difficult to debug since Spark itself writes many messages there. Any ideas or suggestions regarding configuration changes to facilitate this would be highly appreciated!!
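A common approach (hedged, since logging setups vary): application output does land in each executor's stderr file under the worker's work directory, and it stands out much more if Spark's own INFO chatter is quieted. With PySpark that usually means copying conf/log4j.properties.template to conf/log4j.properties and lowering the root level; from a JVM driver the same can be done programmatically:

import org.apache.log4j.{Level, Logger}

// Quiet Spark and Akka INFO logging so application output stands out.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)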
Re: count()-ing gz files gives java.io.IOException: incorrect header check
Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah
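A quick way to find such a mislabeled file, as a sketch (assuming the files are reachable from one machine): real gzip streams start with the magic bytes 0x1f 0x8b, so anything with a .gz extension that fails this check is the kind of file Madhu describes.

import java.io.{DataInputStream, FileInputStream}

def looksLikeGzip(path: String): Boolean = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val b1 = in.read(); val b2 = in.read()
    b1 == 0x1f && b2 == 0x8b // gzip magic number
  } finally in.close()
}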
pyspark.rdd.ResultIterable?
Hi, I'm noticing a difference between two installations of Spark. I'm pretty sure both are version 0.9.1. One is able to import pyspark.rdd.ResultIterable and the other isn't. Is this an environment problem or do we actually have two different versions of Spark? To be clear, on one box, one can do:

In [1]: import pyspark
In [2]: pyspark.rdd.ResultIterable
Out[2]: pyspark.resultiterable.ResultIterable

while on the other, pyspark.rdd.ResultIterable is not found. Any ideas? -T.J.
Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)
Hi Tobias, Regarding my comment on closure serialization: I was discussing it with my fellow Sparkers here and I totally overlooked the fact that you need the class files to de-serialize the closures (or whatever) on the workers, so you always need the jar file delivered to the workers in order for it to work. The Spark REPL works differently: it uses some dark magic to send the working session to the workers. -kr, Gerard.

On Wed, May 21, 2014 at 2:47 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi Tobias, I was curious about this issue and tried to run your example on my local Mesos. I was able to reproduce your issue using your current config:

[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times (most recent failure: Exception failure: java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2) org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times (most recent failure: Exception failure: java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)

Creating a simple jar from the job and providing it through the configuration seems to solve it:

val conf = new SparkConf()
  .setMaster("mesos://my_ip:5050/")
  .setJars(Seq("/sparkexample/target/scala-2.10/sparkexample_2.10-0.1.jar"))
  .setAppName("SparkExamplesMinimal")

Resulting in:

14/05/21 12:03:45 INFO scheduler.DAGScheduler: Completed ResultTask(1, 1)
14/05/21 12:03:45 INFO scheduler.DAGScheduler: Stage 1 (count at SparkExamplesMinimal.scala:50) finished in 1.120 s
14/05/21 12:03:45 INFO spark.SparkContext: Job finished: count at SparkExamplesMinimal.scala:50, took 1.177091435 s
count: 100

Why the closure serialization does not work with Mesos is beyond my current knowledge. Would be great to hear from the experts (cross-posting to dev for that) -kr, Gerard.

On Wed, May 21, 2014 at 11:51 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I have set up a cluster with Mesos (backed by Zookeeper) with three master and three slave instances. I set up Spark (git HEAD) for use with Mesos according to this manual: http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html Using the spark-shell, I can connect to this cluster and do simple RDD operations, but the same code in a Scala class and executed via sbt run-main works only partially. (That is, count() works, count() after flatMap() does not.) Here is my code: https://gist.github.com/tgpfeiffer/7d20a4d59ee6e0088f91 The file SparkExamplesScript.scala, when pasted into spark-shell, outputs the correct count() for the parallelized list comprehension, as well as for the flatMapped RDD. The file SparkExamplesMinimal.scala contains exactly the same code, and also the MASTER configuration and the Spark Executor are the same.
However, while the count() for the parallelized list is displayed correctly, I receive the following error when asking for the count() of the flatMapped RDD: - 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting Stage 1 (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34), which has no missing parents 14/05/21 09:47:49 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 1 (FlatMappedRDD[1] at flatMap at SparkExamplesMinimal.scala:34) 14/05/21 09:47:49 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 8 tasks 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Starting task 1.0:0 as TID 8 on executor 20140520-102159-2154735808-5050-1108-1: mesos9-1 (PROCESS_LOCAL) 14/05/21 09:47:49 INFO scheduler.TaskSetManager: Serialized task 1.0:0 as 1779147 bytes in 37 ms 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Lost TID 8 (task 1.0:0) 14/05/21 09:47:49 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassNotFoundException java.lang.ClassNotFoundException: spark.SparkExamplesMinimal$$anonfun$2 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
Re: Python, Spark and HBase
Thanks Nick and Matei. I'll take a look at the patch and keep you updated. Tommer
Re: advice on maintaining a production spark cluster?
After the several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shut down cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to fix in the next couple of maintenance releases.

On Wed, May 21, 2014 at 8:58 AM, Han JU ju.han.fe...@gmail.com wrote: I've also seen worker loss, and that's why I asked a question about worker re-spawn. My typical case is that some job gets an OOM exception; then on the master UI some worker's state becomes DEAD. In the master's log, there's an error like:

```
14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkmas...@ec2-23-20-189-111.compute-1.amazonaws.com:7077] -> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]: Error [Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572 ]
14/05/21 15:38:02 INFO master.Master: akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572 got disassociated, removing it.
```

On the DEAD worker machine, there are 2 Spark processes, the worker and the executor backend:

16280 org.apache.spark.deploy.worker.Worker
25989 org.apache.spark.executor.CoarseGrainedExecutorBackend

The bad thing is that in this case, sbin/stop-all.sh and sbin/start-all.sh cannot bring back the DEAD worker, since the worker process cannot be terminated (maybe due to the executor backend). I have to log in and kill -9 both the worker process and the executor backend. I'm on 0.9.1 and using the ec2 script.

2014-05-21 11:42 GMT+02:00 sagi zhpeng...@gmail.com: if you saw some exception message like the JIRA https://issues.apache.org/jira/browse/SPARK-1886 mentioned in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827

On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote: Aaron: I see this in the Master's logs:

14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

There was an executor launched that did fail, such as:

14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED

... but other executors on other machines also failed without permanently disassociating. There are these messages which I don't know if they are related:

14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote: Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time). Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception?

On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote: This isn't helpful of me to say, but, I see the same sorts of problems and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later.

On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote: So, for
Re: count()-ing gz files gives java.io.IOException: incorrect header check
Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4: all my gzip files were valid and tested, yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail: https://issues.apache.org/jira/browse/HADOOP-8900. It affects all Hadoop releases prior to 1.2.X.

MC

Michael Cutler
Founder, CTO
Mobile: +44 789 990 7847
Email: mich...@tumra.com
Web: tumra.com

On 21 May 2014 14:26, Madhu ma...@madhu.com wrote: Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah
Re: File list read into single RDD
Thanks, this really helps. As long as I stick to HDFS paths and files I'm good. I do know that code a bit, but have never used it to, say, take input from one cluster via "hdfs://server:port/path" and output to another via "hdfs://another-server:another-port/path". This seems to be supported by Spark, so I'll have to go back and look at how to do this in the HDFS API. Specifically, I'll need to examine the directory/file structure on one cluster, then check some things on what is potentially another cluster before output. I have usually assumed only one HDFS instance, so it may just be a matter of me being more careful and preserving full URIs. In the past I may have made assumptions that output is to the same dir tree as the input. Maybe it's a matter of being more scrupulous about that assumption. It's a bit hard to test this case since I have never really had access to two clusters, so I'll have to develop some new habits at least.

On May 18, 2014, at 11:13 AM, Andrew Ash and...@andrewash.com wrote: Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths() call. There is no alternate storage system; Spark just delegates to Hadoop for the .textFile() call. Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you can use Spark on data in S3 using s3:/// just the same as you would with HDFS. See Apache's documentation on S3 for more details. As far as interacting with a FileSystem (HDFS or other) to list files, delete files, navigate paths, etc. from your driver program, you should be able to just instantiate a FileSystem object and use the normal Hadoop APIs from there. The Apache getting-started docs on reading/writing from Hadoop DFS should work the same for non-HDFS examples too. I do think we could use a little recipe in our documentation to make interacting with HDFS a bit more straightforward. Pat, if you get something that covers your case that you don't mind sharing, we can format it for inclusion in future Spark docs. Cheers! Andrew

On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote: Doesn't using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I'm unclear about how much to assume about using the Hadoop file system APIs and conventions. Concretely, if I pass a pattern in with a HTTPS file system, will the pattern work? How does Spark implement its storage system? This seems to be an abstraction level beyond what is available in HDFS. In order to preserve that flexibility, what APIs should I be using? It would be easy to say HDFS only and use HDFS APIs, but that would seem to limit things, especially where you would like to read from one cluster and write to another. This is not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge. If I can stick to passing URIs to sc.textFile() I'm OK, but if I need to examine the structure of the file system, I'm unclear how I should do it without sacrificing Spark's flexibility.

On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi, You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 e.g.: sc.textFile('{/path/to/file1,/path/to/file2}') Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote: Not that I know of. We were discussing it on another thread and it came up.
I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html But that's not obvious. Nick

On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote: Perfect. BTW just so I know where to look next time, was that in some docs?

On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example: sc.textFile('/path/to/file1,/path/to/file2') So once you have your list of files, concatenate their paths like that and pass the single string to textFile(). Nick

On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: sc.textFile(URI) supports reading multiple files in parallel, but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I'd like to read them into a single RDD in parallel. I understand these could go into separate RDDs, then a union RDD can be created. Is there a way to create a single RDD from a URI list?
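Putting the pieces of this thread together, a minimal sketch of the walk-then-concatenate recipe (the root path and regex below are assumed examples, and a SparkContext named sc is in scope):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Walk a directory tree with the Hadoop FileSystem API, keep paths
// matching a regex, and hand the comma-joined list to sc.textFile().
val fs = FileSystem.get(new Configuration()) // uses the default FS
def walk(dir: Path): Seq[Path] =
  fs.listStatus(dir).toSeq.flatMap { st =>
    if (st.isDir) walk(st.getPath) else Seq(st.getPath)
  }
val wanted = walk(new Path("/data/input")) // assumed root
  .map(_.toString)
  .filter(_.matches(""".*part-\d+\.txt""")) // assumed pattern
val rdd = sc.textFile(wanted.mkString(","))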
Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?
Does anyone know if:

./bin/spark-shell --master yarn

is running yarn-cluster or yarn-client by default? Based on the source code (./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala):

if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
}
if (args.deployMode != "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-client"
}

It looks like the answer is yarn-cluster mode. I want to confirm this with the community, thanks.
Job Processing Large Data Set Got Stuck
I ran the PageRank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours, and then I killed it at last. But I did not find any information explaining why it was stuck. Any suggestions? Thanks. Spark_OK_48_pagerank.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log
Re: Job Processing Large Data Set Got Stuck
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui

On Wed, May 21, 2014 at 11:23 AM, yxzhao yxz...@ualr.edu wrote: I ran the PageRank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours, and then I killed it at last. But I did not find any information explaining why it was stuck. Any suggestions? Thanks. Spark_OK_48_pagerank.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log
Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)
Hi Tobias,

On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote: first, thanks for your explanations regarding the jar files!

No prob :-)

On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com wrote: I was discussing it with my fellow Sparkers here and I totally overlooked the fact that you need the class files to de-serialize the closures (or whatever) on the workers, so you always need the jar file delivered to the workers in order for it to work.

So the closure as a function is serialized, sent across the wire, deserialized there, and *still* you need the class files? (I am not sure I understand what is actually sent over the network then. Does that serialization only contain the values that I close over?)

I also had that mental lapse. Serialization refers to converting object (not class) state (current values) into a byte stream, and de-serialization restores the bytes from the wire into a seemingly identical object at the receiving side (except for transient variables). For that, it requires the class definition of that object to know what it needs to instantiate. So yes, the compiled classes need to be given to the Spark driver, and it will take care of dispatching them to the workers (much better than in the old RMI days ;-)

If I understand correctly what you are saying, then the documentation at https://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html (list item 8) needs to be extended quite a bit, right?

The Mesos docs have been recently updated here: https://github.com/apache/spark/pull/756/files Don't know where the latest version from master is built/available. -kr, Gerard.
Re: Job Processing Large Data Set Got Stuck
Thanks Xiangrui, How to check and make sure the data is distributed evenly? Thanks again.

On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n6187...@n3.nabble.com wrote: Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui

On Wed, May 21, 2014 at 11:23 AM, yxzhao [hidden email] wrote: I ran the PageRank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours, and then I killed it at last. But I did not find any information explaining why it was stuck. Any suggestions? Thanks. Spark_OK_48_pagerank.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n6185/Spark_OK_48_pagerank.log
Re: Job Processing Large Data Set Got Stuck
If the RDD is cached, you can check its storage information in the Storage tab of the Web UI.

On Wed, May 21, 2014 at 12:31 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui, How to check and make sure the data is distributed evenly? Thanks again.

On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User List] [hidden email] wrote: Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui
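Besides the Storage tab, a quick programmatic check for skew, as a sketch (rdd stands in for whatever dataset feeds the job):

// Count records per partition and print the heaviest ones.
val counts = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect()
counts.sortBy(-_._2).take(5).foreach { case (i, n) =>
  println(s"partition $i holds $n records")
}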
Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)
Here's the 1.0.0rc9 version of the docs: https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html I refreshed them with the goal of steering users more towards prebuilt packages than relying on compiling from source, plus improving overall formatting and clarity, but not otherwise modifying the content. I don't expect any changes for rc10.

It does seem like an issue, though, that classpath problems are preventing that from running. Just to check, have you given the exact same jar a shot when running against a standalone cluster? If it works in standalone, I think that's good evidence that there's an issue with the Mesos classloaders in master.

I'm running into a similar issue with classpaths failing on Mesos but working in standalone, but I haven't coherently written up my observations yet, so I haven't sent them to this list. I'd almost gotten to the point where I thought that my custom code needed to be included in the SPARK_EXECUTOR_URI, but that can't possibly be correct. The Spark workers that are launched on Mesos slaves should start with the Spark core jars and then transparently get classes from custom code over the network, or at least that's how I thought it should work. For those who have been using Mesos in previous releases, you've never had to do that before, have you?

On Wed, May 21, 2014 at 3:30 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi Tobias, On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote: first, thanks for your explanations regarding the jar files! No prob :-) On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com wrote: I was discussing it with my fellow Sparkers here and I totally overlooked the fact that you need the class files to de-serialize the closures (or whatever) on the workers, so you always need the jar file delivered to the workers in order for it to work. So the closure as a function is serialized, sent across the wire, deserialized there, and *still* you need the class files? (I am not sure I understand what is actually sent over the network then. Does that serialization only contain the values that I close over?) I also had that mental lapse. Serialization refers to converting object (not class) state (current values) into a byte stream, and de-serialization restores the bytes from the wire into a seemingly identical object at the receiving side (except for transient variables). For that, it requires the class definition of that object to know what it needs to instantiate. So yes, the compiled classes need to be given to the Spark driver, and it will take care of dispatching them to the workers (much better than in the old RMI days ;-) If I understand correctly what you are saying, then the documentation at https://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html (list item 8) needs to be extended quite a bit, right? The Mesos docs have been recently updated here: https://github.com/apache/spark/pull/756/files Don't know where the latest version from master is built/available. -kr, Gerard.
Re: unsubscribe
Does anyone know how to configure the digest mailing list? For example, I want to receive a daily digest, not every 10 messages. Thanks!

On Mon, May 19, 2014 at 4:29 PM, Shangyu Luo lsy...@gmail.com wrote: Hi Andrew and Madhu, Thank you for your help here! Will unsubscribe through another address and may subscribe to the digest instead! Best, Shangyu

On Sun, May 18, 2014 at 3:49 PM, Andrew Ash and...@andrewash.com wrote: Hi Shangyu (and everyone else looking to unsubscribe!), If you'd like to get off this mailing list, please send an email to user-unsubscribe@spark.apache.org, not the regular user@spark.apache.org list. How to use the Apache mailing list infrastructure is documented here: https://www.apache.org/foundation/mailinglists.html And the Spark User list specifically can be found here: http://mail-archives.apache.org/mod_mbox/spark-user/ Thanks! Andrew

On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo lsy...@gmail.com wrote: Thanks!

--
Shangyu, Luo
Department of Computer Science
Rice University
-- Not Just Think About It, But Do It!
-- Success is never final.
-- Losers always whine about their best
Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)
Hi Andrew, Thanks for the current doc.

I'd almost gotten to the point where I thought that my custom code needed to be included in the SPARK_EXECUTOR_URI, but that can't possibly be correct. The Spark workers that are launched on Mesos slaves should start with the Spark core jars and then transparently get classes from custom code over the network, or at least that's how I thought it should work. For those who have been using Mesos in previous releases, you've never had to do that before, have you?

Regarding the delivery of the custom job code to Mesos, we have been using ADD_JARS (on the command line) or SparkConf.setJars(Seq[String]) with a fat jar packing all dependencies. That works as well on the Spark standalone cluster, but we deploy mostly on Mesos, so I couldn't say about classloading differences between the two. -greetz, Gerard.
Re: Using Spark to analyze complex JSON
Looking forward to that update! Given a table of JSON objects like this one:

{"name": "Nick", "location": {"x": 241.6, "y": -22.5}, "likes": ["ice cream", "dogs", "Vanilla Ice"]}

It would be SUPER COOL if we could query that table in a way that is as natural as follows:

SELECT DISTINCT name FROM json_table;

SELECT MAX(location.x) FROM json_table;

SELECT likes[2] -- Ice Ice Baby
FROM json_table
WHERE name = 'Nick';

Of course, this is just a hand-wavy suggestion of how I'd like to be able to query JSON (particularly that last example) using SQL. I'm interested in seeing what y'all come up with. A large part of what my team does is make it easy for analysts to explore and query JSON data using SQL. We have a fairly complex home-grown process to do that and are looking to replace it with something more out of the box. So if you'd like more input on how users might use this feature, I'd be glad to chime in. Nick

On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust mich...@databricks.com wrote: You can already extract fields from JSON data using Hive UDFs. We have an intern working on better native support this summer. We will be sure to post updates once there is a working prototype. Michael

On Tue, May 20, 2014 at 6:46 PM, Nick Chammas nicholas.cham...@gmail.com wrote: The Apache Drill (http://incubator.apache.org/drill/) home page has an interesting heading: "Liberate Nested Data". Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick
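On the Hive UDF route Michael mentions: get_json_object takes a JSONPath-like expression. A hedged sketch against the 1.0 Spark SQL Hive support, assuming a Hive table json_table whose rows hold the raw JSON in a string column named json:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// Pull a nested field out of the JSON string with a path expression.
val xs = hiveCtx.hql("SELECT get_json_object(json, '$.location.x') FROM json_table")
xs.collect().foreach(println)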
tests that run locally fail when run through bamboo
I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine, but when I commit to the repo and run the tests automatically with Bamboo, the test cases fail with these errors. How do I fix this?

21-May-2014 16:33:09 [info] StreamingZigZagSpec:
21-May-2014 16:33:09 [info] - compute zigzag indicator in stream *** FAILED ***
21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.StreamCorruptedException: invalid type code: AC)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
21-May-2014 16:33:09 [info] ...
21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with intermittent empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] Operation timed out after 10042 ms (TestSuiteBase.scala:283) 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or directory)) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) 21-May-2014 16:33:09 [info] ... 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream w notification for each change *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 141.0:0 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: http://10.10.1.9:62793/broadcast_1) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at
Inconsistent RDD Sample size
I have a graph and am trying to take a random sample of vertices without replacement, using the RDD.sample() method. verts are the vertices in the graph:

val verts = graph.vertices

and executing this multiple times in a row:

verts.sample(false, 1.toDouble/v1.count.toDouble, System.currentTimeMillis).count

yields different results roughly each time (albeit +/- a small % of the target). Why does this happen? I looked at PartitionwiseSampledRDD but can't figure it out. Also, is there another method/technique to yield the same result each time? My understanding is that grabbing random indices may not be the best use of the RDD model.
RE: tests that run locally fail when run through bamboo
Just found this at the top of the log:

17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN o.e.j.u.component.AbstractLifeCycle - FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
build 21-May-2014 17:14:41 java.net.BindException: Address already in use

Is there a way to set these connections up so that they don't all start on the same port? That's my guess for the root cause of the issue. (A possible workaround is sketched after the quoted message below.)

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-21-14 4:58 PM
To: u...@spark.incubator.apache.org; user@spark.apache.org
Subject: tests that run locally fail when run through bamboo

I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine, but when I commit to the repo and run the tests automatically with Bamboo, the test cases fail with these errors. How do I fix this?

21-May-2014 16:33:09 [info] StreamingZigZagSpec:
21-May-2014 16:33:09 [info] - compute zigzag indicator in stream *** FAILED ***
21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.StreamCorruptedException: invalid type code: AC)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
21-May-2014 16:33:09 [info] ...
21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with intermittent empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] Operation timed out after 10042 ms (TestSuiteBase.scala:283) 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or directory)) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) 21-May-2014 16:33:09 [info] ... 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream w notification for each change *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 141.0:0 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: http://10.10.1.9:62793/broadcast_1) 21-May-2014 16:33:09 [info] at
Re: Inconsistent RDD Sample size
sample() doesn't guarantee the exact sample size: the fraction is applied to each element independently, so the count varies around the target. If you fix the random seed, it will return the same result every time. -Xiangrui On Wed, May 21, 2014 at 2:05 PM, glxc r.ryan.mcc...@gmail.com wrote: I have a graph and am trying to take a random sample of vertices without replacement, using the RDD.sample() method. verts are the vertices in the graph: val verts = graph.vertices Executing this multiple times in a row: verts.sample(false, 1.toDouble/verts.count.toDouble, System.currentTimeMillis).count yields different results each time (albeit +/- a small % of the target). Why does this happen? I looked at PartitionwiseSampledRDD but can't figure it out. Also, is there another method/technique to yield the same result each time? My understanding is that grabbing random indices may not be the best use of the RDD model. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Inconsistent-RDD-Sample-size-tp6197.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
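A minimal sketch of both options, assuming verts is the vertex RDD from the question (the sample size 10 and seed 42 are arbitrary): fixing the seed makes sample() reproducible, while takeSample() returns an exact number of elements, at the cost of collecting them to the driver as a local array.

val seed = 42

// Reproducible approximate sample: the same seed yields the same elements
// on every run, but the size still varies around fraction * count.
val s1 = verts.sample(false, 1.0 / verts.count(), seed)
val s2 = verts.sample(false, 1.0 / verts.count(), seed)
// s1 and s2 contain identical elements

// Exact sample size: returns exactly 10 elements, as a local Array.
val exact = verts.takeSample(false, 10, seed)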
Re: Problem with loading files: Loss was due to java.io.EOFException java.io.EOFException
The problem is solved after the hadoop-core dependency was added. But I think there is a misunderstanding about local files. I found this one: Note that if you've connected to a Spark master, it's possible that it will attempt to load the file on one of the different machines in the cluster, so make sure it's available on all the cluster machines. In general, in future you will want to put your data in HDFS, S3, or similar file systems to avoid this problem. http://docs.sigmoidanalytics.com/index.php/Using_the_Spark_Shell This means that you can't use local files with Spark. I don't understand why, because after calling addFile() or textFile(), the file can be downloaded by every node on the cluster and becomes accessible. Anyway, if you got Loss was due to java.io.EOFException, you have to make sure that the hadoop libs are available:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>0.9.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>2.0.0-mr1-cdh4.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.0.0-cdh4.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-cdh4.6.0</version>
</dependency>

Cheers! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-loading-files-Loss-was-due-to-java-io-EOFException-java-io-EOFException-tp6090p6201.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
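On the local-files point, a short sketch of the addFile() route mentioned above, under the assumption that the file lives only on the driver machine (the path, file name, and someRdd are hypothetical): addFile() ships the file to every node, and each task resolves its local copy through SparkFiles.get().

import org.apache.spark.SparkFiles

sc.addFile("/local/path/lookup.csv")  // hypothetical driver-local file

// someRdd: RDD[String], hypothetical
val filtered = someRdd.mapPartitions { iter =>
  // Each executor resolves its own local copy of the shipped file.
  val localPath = SparkFiles.get("lookup.csv")
  val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
  iter.filter(lookup.contains)
}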
RE: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode?
Ah, forgot the --verbose option. Thanks Andrew. That is very helpful. Date: Wed, 21 May 2014 11:07:55 -0700 Subject: Re: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster mode or yarn-client mode? From: and...@databricks.com To: user@spark.apache.org The answer is actually yarn-client. A quick way to find out: $ bin/spark-shell --master yarn --verbose From the system properties you can see spark.master is set to yarn-client. From the code, this is because args.deployMode is null, so it's not equal to "cluster", and so it falls into the second if case you mentioned: if (args.deployMode != "cluster" && args.master.startsWith("yarn")) { args.master = "yarn-client" } 2014-05-21 10:57 GMT-07:00 Andrew Lee alee...@hotmail.com: Does anyone know if: ./bin/spark-shell --master yarn is running yarn-cluster or yarn-client by default? Based on the source code: ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala if (args.deployMode == "cluster" && args.master.startsWith("yarn")) { args.master = "yarn-cluster" } if (args.deployMode != "cluster" && args.master.startsWith("yarn")) { args.master = "yarn-client" } It looks like the answer is yarn-cluster mode. I want to confirm this with the community, thanks.
Re: tests that run locally fail when run through bamboo
This does happen sometimes, but it is a warning, because Spark is designed to try successive ports until it succeeds. So unless a crazy number of successive ports are blocked (runaway processes?? insufficient clearing of ports by the OS??), these errors should not be a problem for tests passing. On Wed, May 21, 2014 at 2:31 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Just found this at the top of the log: 17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN o.e.j.u.component.AbstractLifeCycle - FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use build 21-May-2014 17:14:41 java.net.BindException: Address already in use Is there a way to set these connections up so that they don’t all start on the same port? (That’s my guess for the root cause of the issue.) *From:* Adrian Mocanu [mailto:amoc...@verticalscope.com] *Sent:* May-21-14 4:58 PM *To:* u...@spark.incubator.apache.org; user@spark.apache.org *Subject:* tests that run locally fail when run through bamboo I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine, but when I commit to the repo and the tests run automatically through bamboo, they fail with these errors. How can I fix this? 21-May-2014 16:33:09 [info] StreamingZigZagSpec: 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.StreamCorruptedException: invalid type code: AC) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) 21-May-2014 16:33:09 [info] ... 
21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with intermittent empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] Operation timed out after 10042 ms (TestSuiteBase.scala:283) 21-May-2014 16:33:09 [info] - compute zigzag indicator in stream with 3 empty RDDs *** FAILED *** 21-May-2014 16:33:09 [info] org.apache.spark.SparkException: Job aborted: Task 1.0:1 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140521163241-1707/0f/shuffle_1_1_1 (No such file or directory)) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 21-May-2014 16:33:09 [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at scala.Option.foreach(Option.scala:236) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) 21-May-2014 16:33:09 [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) 21-May-2014 16:33:09 [info] ...
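If the build box really does have a long run of blocked ports, one workaround for parallel test JVMs is to move each suite's web UI off the contended default of 4040. A sketch only, assuming a test-scoped SparkConf; whether port 0 (an OS-assigned ephemeral port) is honored depends on the Spark version, so verify locally:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("StreamingZigZagSpec")
  .set("spark.ui.port", "0")  // or a distinct fixed port per build agent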
Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application successfully ran to conclusion but it ultimately failed. There were 2 anomalies... 1. ASM reported only that the application was "ACCEPTED". It never indicated that the application was "RUNNING." 14/05/21 16:06:12 INFO yarn.Client: Application report from ASM: application identifier: application_1400696988985_0007 appId: 7 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: default appMasterRpcPort: -1 appStartTime: 1400709970857 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/ appUser: hduser Furthermore, it started a second container, running two partly overlapping drivers, when it appeared that the application never started. Each container ran to conclusion as explained above, taking twice as long as usual for both to complete. Both instances had the same concluding failure. 2. Each instance failed as indicated by the stderr log, finding that the filesystem was closed when trying to clean up the staging directories. 14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863 14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver 14/05/21 16:08:24 INFO Executor: Finished task ID 1453 14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2) 14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1) 14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool 14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s 14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s 14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250 14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler 14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted! 14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped 14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared 14/05/21 16:08:25 INFO BlockManager: BlockManager stopped 14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster 14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped 14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED 14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal. 14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 
14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371) at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) There is nothing about the staging directories themselves that looks suspicious... drwx------ - hduser supergroup 0 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007 -rw-r--r-- 3 hduser supergroup 92881278 2014-05-21 16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/app.jar -rw-r--r-- 3 hduser supergroup 118900783 2014-05-21 16:06
Re: I want to filter a stream by a subclass.
You could do records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote: Hi. Firstly I'm a newb (to both Scala & Spark). I have a stream that contains multiple types of records, and I would like to create multiple streams based on that. Currently I have it set up as class ALL class Orange extends ALL class Apple extends ALL Now I can easily add a filter a la val records: DStream[ALL] = ...mapper to build the classes off the wire... val orangeRecords = records.filter { _.isInstanceOf[Orange] } but I would like to have that line be a DStream[Orange] instead of a DStream[ALL] (so I can access the unique fields in the subclass). I'm using 0.9.1 if it matters. TIA Ian -- Ian Holsman i...@holsman.com.au PH: +61-3-9028 8133 / +1-(425) 998-7083
Run Apache Spark on Mini Cluster
Hi, I would like to set up the Apache platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender
Re: Run Apache Spark on Mini Cluster
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google Compute Engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware, IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar upent...@gmail.com wrote: Hi, I would like to set up the Apache platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender
Re: A new resource for getting examples of Spark RDD API calls
Great, thanks for that tip. I will update the documents! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529p6210.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: I want to filter a stream by a subclass.
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das tathagata.das1...@gmail.com wrote: records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } I think a Scala-ish way would be records.flatMap(_ match { case i: Int => Some(i); case _ => None })
Re: I want to filter a stream by a subclass.
Thanks Tobias & Tathagata, these are great. On Wed, May 21, 2014 at 8:02 PM, Tobias Pfeiffer t...@preferred.jp wrote: On Thu, May 22, 2014 at 8:07 AM, Tathagata Das tathagata.das1...@gmail.com wrote: records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } I think a Scala-ish way would be records.flatMap(_ match { case i: Int => Some(i); case _ => None }) -- Ian Holsman i...@holsman.com.au PH: +61-3-9028 8133 / +1-(425) 998-7083
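Putting the two suggestions together for the DStream case from the original question, a self-contained sketch that narrows DStream[ALL] to DStream[Orange] in one pass (the class names follow the post; the input stream itself is assumed to exist):

import org.apache.spark.streaming.dstream.DStream

class ALL extends Serializable
class Orange extends ALL
class Apple extends ALL

def oranges(records: DStream[ALL]): DStream[Orange] =
  records.flatMap {
    case o: Orange => Some(o)  // keep, already statically typed as Orange
    case _         => None     // drop Apples and anything else
  }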
yarn-client mode question
In yarn-client mode, will Spark be deployed on the YARN nodes? If it is deployed only on the client, can Spark still submit the job to YARN? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: yarn-client mode question
Hi Sophia, In yarn-client mode, the node that submits the application can be either inside or outside of the cluster. This node also hosts the driver (SparkContext) of the application. All the executors, however, will be launched on nodes inside the YARN cluster. Andrew 2014-05-21 18:17 GMT-07:00 Sophia sln-1...@163.com: In yarn-client mode, will Spark be deployed on the YARN nodes? If it is deployed only on the client, can Spark still submit the job to YARN? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: yarn-client mode question
But I don't understand this point: is it necessary to deploy Spark slave nodes on the YARN nodes? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: RDD union of a window in Dstream
Hi, On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: I want to do union of all RDDs in each window of DStream. A window *is* a union of all RDDs in the respective time interval. The documentation says a DStream is represented as a sequence of RDDs. However, data from a certain time interval will always be contained in *one* RDD, not a sequence of RDDs (AFAIK). Regards Tobias
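To make the point above concrete: each window interval already arrives as a single RDD covering the whole window, so there is nothing to union by hand. A sketch, assuming a DStream named stream and a 10-second batch interval (all names hypothetical):

import org.apache.spark.streaming.Seconds

// Every 10 seconds, produce one RDD holding the last 30 seconds of data.
val windowed = stream.window(Seconds(30), Seconds(10))

windowed.foreachRDD { rdd =>
  // One RDD per window: already the union of all batches in the interval.
  println("window count: " + rdd.count())
}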
Re: Using Spark to analyze complex JSON
Hi, as far as I understand, if you create an RDD with a relational structure from your JSON, you should be able to do much of that already today. For example, take lift-json's deserializer and do something like val json_table: RDD[MyCaseClass] = json_data.flatMap(json => json.extractOpt[MyCaseClass]) then I guess you can use Spark SQL on that. (Something like your likes[2] query won't work, though, I guess.) Regards Tobias On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looking forward to that update! Given a table of JSON objects like this one: { "name": "Nick", "location": { "x": 241.6, "y": -22.5 }, "likes": ["ice cream", "dogs", "Vanilla Ice"] } It would be SUPER COOL if we could query that table in a way that is as natural as follows: SELECT DISTINCT name FROM json_table; SELECT MAX(location.x) FROM json_table; SELECT likes[2] -- "Ice Ice Baby" FROM json_table WHERE name = 'Nick'; Of course, this is just a hand-wavy suggestion of how I’d like to be able to query JSON (particularly that last example) using SQL. I’m interested in seeing what y’all come up with. A large part of what my team does is make it easy for analysts to explore and query JSON data using SQL. We have a fairly complex home-grown process to do that and are looking to replace it with something more out of the box. So if you’d like more input on how users might use this feature, I’d be glad to chime in. Nick On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust mich...@databricks.com wrote: You can already extract fields from json data using Hive UDFs. We have an intern working on better native support this summer. We will be sure to post updates once there is a working prototype. Michael On Tue, May 20, 2014 at 6:46 PM, Nick Chammas nicholas.cham...@gmail.com wrote: The Apache Drill home page has an interesting heading: Liberate Nested Data. Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick View this message in context: Using Spark to analyze complex JSON Sent from the Apache Spark User List mailing list archive at Nabble.com.
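A slightly fuller sketch of that route, assuming Spark 1.0's SQLContext and lift-json; the Person case class, the json_table name, and the json_lines RDD[String] (one JSON object per line) are all hypothetical:

import net.liftweb.json._
import org.apache.spark.sql.SQLContext

case class Person(name: String)        // flattened view of the JSON

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD      // implicit RDD[Product] => SchemaRDD

implicit val formats = DefaultFormats  // required by lift-json's extractOpt

val people = json_lines.flatMap { line =>
  // Skip lines that fail to parse or don't match the case class.
  scala.util.Try(parse(line)).toOption.flatMap(_.extractOpt[Person])
}

people.registerAsTable("json_table")
sqlContext.sql("SELECT DISTINCT name FROM json_table").collect().foreach(println)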
RE: yarn-client mode question
It seems you are asking whether the Spark-related jars need to be deployed to the YARN cluster manually before you launch the application? No, you don't need to, just like with any other YARN application. And it doesn't matter whether it is yarn-client or yarn-cluster mode. Best Regards, Raymond Liu -----Original Message----- From: Sophia [mailto:sln-1...@163.com] Sent: Thursday, May 22, 2014 10:55 AM To: u...@spark.incubator.apache.org Subject: Re: yarn-client mode question But I don't understand this point: is it necessary to deploy Spark slave nodes on the YARN nodes? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
It sounds like something is closing the hdfs filesystem before everyone is really done with it. The filesystem gets cached and is shared so if someone closes it while other threads are still using it you run into this error. Is your application closing the filesystem? Are you using the event logging feature? Could you share the options you are running with? Yarn will retry the application depending on how the Application Master attempt fails (this is a configurable setting as to how many times it retries). That is probably the second driver you are referring to. But they shouldn't have overlapped as far as both being up at the same time. Is that the case you are seeing? Generally you want to look at why the first application attempt fails. Tom On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com wrote: I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application successfully ran to conclusion but it ultimately failed. There were 2 anomalies... 1. ASM reported only that the application was ACCEPTED. It never indicated that the application was RUNNING. 14/05/21 16:06:12 INFO yarn.Client: Application report from ASM: application identifier: application_1400696988985_0007 appId: 7 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: default appMasterRpcPort: -1 appStartTime: 1400709970857 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/ appUser: hduser Furthermore, it started a second container, running two partly overlapping drivers, when it appeared that the application never started. Each container ran to conclusion as explained above, taking twice as long as usual for both to complete. Both instances had the same concluding failure. 2. Each instance failed as indicated by the stderr log, finding that the filesystem was closed when trying to clean up the staging directories. 14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863 14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver 14/05/21 16:08:24 INFO Executor: Finished task ID 1453 14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2) 14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1) 14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool 14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s 14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s 14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250 14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler 14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted! 14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped 14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared 14/05/21 16:08:25 INFO BlockManager: BlockManager stopped 14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster 14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped 14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 
14/05/21 16:08:25 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED 14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal. 14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/05/21 16:08:25 ERROR ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:587) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:371) at org.apache.spark.deploy.yarn.ApplicationMaster$AppMasterShutdownHook.run(ApplicationMaster.scala:386) at
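For anyone hitting the same trace: Hadoop caches FileSystem instances process-wide (one per scheme, authority, and user), so a close() anywhere in application code also closes the instance Spark's shutdown hook later uses for the staging-dir cleanup. A sketch of the pitfall and a workaround under that assumption; the surrounding code is hypothetical, while the config key is standard Hadoop:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// FileSystem.get() returns the shared, cached instance.
val fs = FileSystem.get(sc.hadoopConfiguration)
// fs.close()  // DON'T: the AM's cleanup will then fail with "Filesystem closed"

// If application code must close a FileSystem, take a private, uncached one:
val conf = new Configuration(sc.hadoopConfiguration)
conf.setBoolean("fs.hdfs.impl.disable.cache", true)
val privateFs = FileSystem.get(conf)
privateFs.close()  // safe: only this instance is affected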
Re: ExternalAppendOnlyMap: Spilling in-memory map
Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than a cause of slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark writes the data out to disk in a sorted order and reads it back from disk later on when it's needed. That's the job of the ExternalAppendOnlyMap. I wouldn't normally expect a conversion from Date to a Joda DateTime to take significantly more memory. But since you're using Kryo, and classes should be registered with it, you may have forgotten to register DateTime with Kryo. If you don't register a class, Kryo writes the class name at the beginning of every serialized instance, which for DateTime objects of roughly the size of a single long is a ton of extra space and very inefficient. Can you confirm that DateTime is registered with Kryo? http://spark.apache.org/docs/latest/tuning.html#data-serialization On Wed, May 21, 2014 at 2:35 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I changed my application to use Joda time instead of java.util.Date and I started getting this: WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1 time so far) What does this mean? How can I fix this? Due to this a small job takes forever. Mohit. P.S.: I am using Kryo serialization, have played around with several values of sparkRddMemFraction
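A sketch of the registration being asked about, assuming the Spark 0.9/1.0 KryoRegistrator API (the registrator class and its package are hypothetical):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // A registered class is serialized with a small integer ID instead of
    // its full class name on every instance.
    kryo.register(classOf[org.joda.time.DateTime])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")  // hypothetical FQCN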
Re: count()-ing gz files gives java.io.IOException: incorrect header check
One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4; all my Gzip files were valid and tested, yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop releases prior to 1.2.X MC Michael Cutler, Founder, CTO, tumra.com On 21 May 2014 14:26, Madhu ma...@madhu.com wrote: Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
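To do that check from within Spark rather than pulling every file down by hand, one can probe each path separately so the failing file names itself. A sketch only; the bucket and paths are made up:

// Hypothetical list of the gzipped parts behind the glob.
val paths = Seq(
  "s3n://my-bucket/logs/part-0000.gz",
  "s3n://my-bucket/logs/part-0001.gz"
)

for (p <- paths) {
  try {
    println(s"$p -> ${sc.textFile(p).count()} lines")
  } catch {
    case e: Exception => println(s"$p FAILED: " + e.getMessage)
  }
}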
Re: Using Spark to analyze complex JSON
That's a good idea. So you're saying create a SchemaRDD by applying a function that deserializes the JSON and transforms it into a relational structure, right? The end goal for my team would be to expose some JDBC endpoint for analysts to query from, so once Shark is updated to use Spark SQL that would become possible without having to resort to using Hive at all. On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, as far as I understand, if you create an RDD with a relational structure from your JSON, you should be able to do much of that already today. For example, take lift-json's deserializer and do something like val json_table: RDD[MyCaseClass] = json_data.flatMap(json => json.extractOpt[MyCaseClass]) then I guess you can use Spark SQL on that. (Something like your likes[2] query won't work, though, I guess.) Regards Tobias On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Looking forward to that update! Given a table of JSON objects like this one: { "name": "Nick", "location": { "x": 241.6, "y": -22.5 }, "likes": ["ice cream", "dogs", "Vanilla Ice"] } It would be SUPER COOL if we could query that table in a way that is as natural as follows: SELECT DISTINCT name FROM json_table; SELECT MAX(location.x) FROM json_table; SELECT likes[2] -- "Ice Ice Baby" FROM json_table WHERE name = 'Nick'; Of course, this is just a hand-wavy suggestion of how I’d like to be able to query JSON (particularly that last example) using SQL. I’m interested in seeing what y’all come up with. A large part of what my team does is make it easy for analysts to explore and query JSON data using SQL. We have a fairly complex home-grown process to do that and are looking to replace it with something more out of the box. So if you’d like more input on how users might use this feature, I’d be glad to chime in. Nick On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust mich...@databricks.com wrote: You can already extract fields from json data using Hive UDFs. We have an intern working on better native support this summer. We will be sure to post updates once there is a working prototype. Michael On Tue, May 20, 2014 at 6:46 PM, Nick Chammas nicholas.cham...@gmail.com wrote: The Apache Drill home page has an interesting heading: Liberate Nested Data. Is there any current or planned functionality in Spark SQL or Shark to enable SQL-like querying of complex JSON? Nick View this message in context: Using Spark to analyze complex JSON Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: yarn-client mode question
Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: count()-ing gz files gives java.io.IOException: incorrect header check
Thanks for the suggestions, people. I will try to hone in on which specific gzipped files, if any, are actually corrupt. Michael, I’m using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281 (https://issues.apache.org/jira/browse/HADOOP-5281), affects Hadoop 0.18.0, is fixed in 0.20.0, and is also related to gzip compression. I know there is some funkiness in how Hadoop is versioned, so I’m not sure if this issue is relevant to 1.0.4. Were you able to resolve your issue by changing your version of Hadoop? How did you do that? Nick On Wed, May 21, 2014 at 11:38 PM, Andrew Ash and...@andrewash.com wrote: One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote: Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4; all my Gzip files were valid and tested, yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop releases prior to 1.2.X MC Michael Cutler, Founder, CTO, tumra.com On 21 May 2014 14:26, Madhu ma...@madhu.com wrote: Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaiah -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Run Apache Spark on Mini Cluster
It depends on what stack you want to run. A quick cut:
- Worker machines (DataNode, HBase Region Servers, Spark worker nodes)
  - Dual 6-core CPU
  - 64 to 128 GB RAM
  - 3 x 3TB disks (JBOD)
- Master node (NameNode, HBase Master, Spark master)
  - Dual 6-core CPU
  - 64 to 128 GB RAM
  - 2 x 3TB disks (RAID 1+0)
- Start with a 5-node setup and scale out as needed
- If your load is MapReduce over HDFS, then run YARN
- If your load is HBase over HDFS, scale depending on the computational and storage needs
- If you are running Spark over HDFS, scale appropriately - you might need more memory in the worker nodes
- In any case, have a topology and know the processes that each node would run
As Soumya suggests, you can prototype at an appropriate scale using AWS. Cheers k/. On Wed, May 21, 2014 at 5:14 PM, Upender Nimbekar upent...@gmail.com wrote: Hi, I would like to set up the Apache platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender
Best way to deploy a jar to spark cluster?
Hi, I'm quite new and recently started to try Spark. I've set up a single-node Spark cluster and followed the tutorials in the Quick Start, but I've come across some issues. The thing I was trying to do is to try the Java API and run it on the single-node cluster. I followed the Quick Start / A Standalone App in Java section and successfully ran it using Maven. But when I tried to use ./bin/spark-class org.apache.spark.deploy.Client launch to submit the jar, I found there were both a driver and an app running on the cluster. When running with Maven directly, I only saw the app running. So I was thinking I could build a jar with all the dependencies in order to distribute it and run it using just java -cp my.jar MainClass Arguments. But I came across the Exception in thread main com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version' issue. I then tried to mark org.apache.spark as provided in the pom.xml. I can build the jar, but when executing it with java -cp my.jar, it just reports that it cannot find the Spark dependencies. And using the ./bin/spark-class org.apache.spark.deploy.Client launch method just goes back to having a driver and an app at the same time. So I'm wondering what's the best way to generate a jar with dependencies and submit it to the spark cluster as a single app? Could somebody give me some advice on this? Thank you! Best Regards, Min Li
Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP(?) ? We may be able to reproduce this in that case. TD On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote: It sounds like something is closing the hdfs filesystem before everyone is really done with it. The filesystem gets cached and is shared so if someone closes it while other threads are still using it you run into this error. Is your application closing the filesystem? Are you using the event logging feature? Could you share the options you are running with? Yarn will retry the application depending on how the Application Master attempt fails (this is a configurable setting as to how many times it retries). That is probably the second driver you are referring to. But they shouldn't have overlapped as far as both being up at the same time. Is that the case you are seeing? Generally you want to look at why the first application attempt fails. Tom On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com wrote: I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application successfully ran to conclusion but it ultimately failed. There were 2 anomalies... 1. ASM reported only that the application was ACCEPTED. It never indicated that the application was RUNNING. 14/05/21 16:06:12 INFO yarn.Client: Application report from ASM: application identifier: application_1400696988985_0007 appId: 7 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: default appMasterRpcPort: -1 appStartTime: 1400709970857 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/http://sleepycat:8088/proxy/application_1400696988985_0007/ appUser: hduser Furthermore, it *started a second container*, running two partly *overlapping* drivers, when it appeared that the application never started. Each container ran to conclusion as explained above, taking twice as long as usual for both to complete. Both instances had the same concluding failure. 2. Each instance failed as indicated by the stderr log, finding that the *filesystem was closed* when trying to clean up the staging directories. 14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863 14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver 14/05/21 16:08:24 INFO Executor: Finished task ID 1453 14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2) 14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1) 14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool 14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s 14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s 14/05/21 16:08:24 INFO SparkUI: Stopped Spark web UI at http://dhcp-brm-bl1-215-1e-east-10-135-123-92.usdhcp.oraclecorp.com:42250 14/05/21 16:08:24 INFO DAGScheduler: Stopping DAGScheduler 14/05/21 16:08:25 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/05/21 16:08:25 INFO ConnectionManager: Selector thread was interrupted! 
14/05/21 16:08:25 INFO ConnectionManager: ConnectionManager stopped 14/05/21 16:08:25 INFO MemoryStore: MemoryStore cleared 14/05/21 16:08:25 INFO BlockManager: BlockManager stopped 14/05/21 16:08:25 INFO BlockManagerMasterActor: Stopping BlockManagerMaster 14/05/21 16:08:25 INFO BlockManagerMaster: BlockManagerMaster stopped 14/05/21 16:08:25 INFO SparkContext: Successfully stopped SparkContext 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/05/21 16:08:25 INFO ApplicationMaster: *finishApplicationMaster with SUCCEEDED* 14/05/21 16:08:25 INFO ApplicationMaster: AppMaster received a signal. 14/05/21 16:08:25 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1400696988985_0007 14/05/21 16:08:25 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/05/21 16:08:25 ERROR *ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_1400696988985_0007* *java.io.IOException: Filesystem closed* at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1685) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:587) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at