Re: Lost executors
Just to close the loop: no issues come up when I submit the job using spark-submit, so that the driver process also runs in a container on the YARN cluster. In the runs above, the driver was running on the gateway machine through which the job was submitted, which led to quite a few issues.

On Tue, Nov 18, 2014 at 5:01 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote:
Sandy, Good point - I forgot about the NM logs. [...]
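For anyone finding this thread later, a minimal sketch of that kind of submission (the class name, jar, and resource sizes below are placeholders, not the actual job):

  spark-submit \
    --master yarn-cluster \
    --class com.example.HBaseAggregateJob \
    --num-executors 50 \
    --executor-memory 8g \
    --executor-cores 4 \
    hbase-aggregate-job.jar

With --master yarn-cluster the driver runs inside a YARN container (the ApplicationMaster); with yarn-client, which is what spark-shell uses, the driver stays on the submitting gateway machine.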
Re: Lost executors
Hi Pala,

Do you have access to your YARN NodeManager logs? Are you able to check whether they report killing any containers for exceeding memory limits?

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote:
Hi,

I am using Spark 1.0.1 on YARN 2.5, and doing everything through the Spark shell. I am running a job that essentially reads a bunch of HBase keys, looks up HBase data, and performs some filtering and aggregation. The job works fine on smaller datasets, but when I try to execute it on the full dataset, the job never completes. The symptoms I notice are:

a. The job shows progress for a while and then starts throwing lots of the following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Executor 906 disconnected, so removing it
2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 906 on machine name: remote Akka client disassociated
2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(9186, machine name, 54600, 0) with no recent heart beats: 82313ms exceeds 45000ms

Looking at the logs, the job never recovers from these errors; it keeps reporting lost executors and launching new ones, and this continues for a long time.

Could this be because the executors are running out of memory? In terms of memory usage, the intermediate data could be large (after the HBase lookup), but the partially and fully aggregated data sets should be quite small - essentially a bunch of ids and counts (about 1 million in total).

b. In the Spark UI, I am seeing the following errors (redacted for brevity); not sure if they are transient or a real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out)
...
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

I was trying to get more data to investigate but haven't been able to figure out how to enable logging on the executors. The Spark UI appears stuck, and I only see driver-side logs in the jobhistory directory specified for the job.

Thanks,
pala
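For reference, one way to get at executor-side logs in a YARN setup like this, assuming log aggregation is enabled on the cluster, is to pull the container logs after the application finishes:

  yarn logs -applicationId <application id>

While the job is still running, each container's stdout/stderr is also reachable from the NodeManager web UI linked off the ResourceManager page. This is generic YARN behaviour rather than anything specific to this job.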
Re: Lost executors
Sandy,

Good point - I forgot about the NM logs. When I looked at the NM logs, I only see the following statements that align with the driver-side log about lost executors. Many executors show the same log statement at the same time, so it seems the decision to kill many, if not all, executors happened centrally, and all executors got notified somehow:

14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
14/11/18 00:18:25 INFO Executor: Executor killed task 2013

In general, I also see quite a few instances of the following exception across many executors/nodes:

14/11/17 23:58:00 INFO HadoopRDD: Input split: hdfs dir path/sorted_keys-1020_3-r-00255.deflate:0+415841
14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
java.net.ConnectException: connect(2) error: Connection refused when trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
  at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
  at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
  at org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
  at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
  at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
  at org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
  at org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
  at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
  at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
  at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
  at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
  at java.io.DataInputStream.read(DataInputStream.java:149)
  at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
  at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
  at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
  at java.io.InputStream.read(InputStream.java:101)
  at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
  at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
  at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
  at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:50)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
  at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
  at ...
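As an aside, the BlockReaderLocal warning above concerns HDFS short-circuit local reads: the client cannot reach the DataNode's domain socket at /srv/var/hadoop/runs/hdfs/dn_socket, so it falls back to reading the block over TCP. It is usually harmless noise rather than the cause of the lost executors. The hdfs-site.xml settings that control this (listed here only to show where that path comes from) are roughly:

  dfs.client.read.shortcircuit = true
  dfs.domain.socket.path = /srv/var/hadoop/runs/hdfs/dn_socket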
Re: Lost executors
After a lot of grovelling through logs, I found out that the Nagios monitor process detected that the machine was almost out of memory, and killed the SNAP executor process. So why is the machine running out of memory? Each node has 128GB of RAM, 4 executors, about 40GB of data. It did run out of memory if I tried to cache() the RDD, but I would hope that persist() is implemented so that it would stream to disk without trying to materialize too much data in RAM.

Ravi
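For reference, a minimal spark-shell sketch of the cache()/persist() distinction being asked about (the input path is a placeholder; sc is the shell's SparkContext):

  import org.apache.spark.storage.StorageLevel

  // hypothetical input
  val rdd = sc.textFile("hdfs:///some/input/path")

  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
  // deserialized partitions are kept on the executor JVM heaps.
  val inMemory = rdd.map(_.length).cache()

  // DISK_ONLY keeps computed partitions only on local disk, so reuse
  // streams them back from disk instead of pinning them on the heap.
  val onDisk = rdd.map(_.length).persist(StorageLevel.DISK_ONLY)

Whether the executor still runs out of memory then depends on the heap sizes and what else the job holds in memory, which is where the replies below pick up.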
Re: Lost executors
What is your Spark executor memory set to? (You can see it in Spark's web UI at http://driver:4040 under the executors tab.) One thing to be aware of is that the JVM never really releases memory back to the OS, so it will keep filling up to the maximum heap size you set. Maybe 4 executors with that much heap are taking a lot of the memory. Persist as DISK_ONLY should indeed stream data from disk, so I don't think that will be a problem.

Matei

On August 13, 2014 at 6:49:11 AM, rpandya (r...@iecommerce.com) wrote:
After a lot of grovelling through logs, I found out that the Nagios monitor process detected that the machine was almost out of memory, and killed the SNAP executor process. [...]
Re: Lost executors
If the JVM heap size is close to the memory limit, the OS sometimes kills the process under memory pressure. I've usually found that lowering the executor memory size helps.

Shivaram

On Wed, Aug 13, 2014 at 11:01 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
What is your Spark executor memory set to? [...]
Re: Lost executors
To add to the pile of information we're asking you to provide: what version of Spark are you running?

2014-08-13 11:11 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
If the JVM heap size is close to the memory limit the OS sometimes kills the process under memory pressure. [...]
Re: Lost executors
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size would indeed run out of memory (the machine has 110GB). And in fact they would get repeatedly restarted and killed until eventually Spark gave up. I'll try with a smaller limit, but it'll be a while - somehow my HDFS got seriously corrupted, so I need to rebuild my HDP cluster...

Thanks,
Ravi
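To spell out the arithmetic: 4 executors with 60 GB of heap each is 240 GB of potential JVM footprint on a 110 GB machine, before counting JVM overhead, the OS, and the page cache, so the processes are bound to be killed once the heaps fill. Keeping the combined heaps comfortably under physical RAM is the usual rule of thumb, for example (illustrative numbers only):

  4 executors x 60 GB heap = 240 GB  >  110 GB physical RAM  (will be killed)
  4 executors x 20 GB heap =  80 GB  <  110 GB physical RAM  (about 30 GB of headroom)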
Re: Lost executors
Hi Ravi,

Setting SPARK_MEMORY doesn't do anything. I believe you confused it with SPARK_MEM, which is now deprecated. You should set SPARK_EXECUTOR_MEMORY instead, or spark.executor.memory as a config in conf/spark-defaults.conf. Assuming you haven't set the executor memory through a different mechanism, your executors will quickly run out of memory with the default of 512m. Let me know if setting this does the job. If so, you can even persist the RDDs to memory as well to get better performance, though this depends on your workload.

-Andrew

2014-08-13 11:38 GMT-07:00 rpandya r...@iecommerce.com:
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size would indeed run out of memory (the machine has 110GB). [...]
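For concreteness, the two mechanisms mentioned above look roughly like this; the 20g figure is only an example and should be sized to leave headroom on the node:

  # in conf/spark-defaults.conf
  spark.executor.memory   20g

  # or in conf/spark-env.sh
  export SPARK_EXECUTOR_MEMORY=20g

Either way, it is worth confirming in the Executors tab of the web UI that the value actually took effect, as suggested earlier in the thread.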
Re: Lost executors
Same here, Ravi. See my post on a similar thread. Are you running on YARN client?

On Aug 7, 2014 2:56 PM, rpandya r...@iecommerce.com wrote:
I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose or fix it would be appreciated. There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per machine (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.

The relevant code is:

val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
log.info("SNAP output DebugString:\n" + result.toDebugString)
log.info("produced " + result.count + " reads")

The toDebugString output is:

2014-08-07 18:50:43 INFO SnapInputStage:198 - SNAP output DebugString:
MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
  CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
    ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
      MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10 partitions)
        MapPartitionsRDD[6] at mapPartitionsWithIndex at SnapInputStage.scala:195 (10 partitions)
          MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
            CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10 partitions)
              NewHadoopRDD[2] at newAPIHadoopFile at SnapInputStage.scala:182 (3003 partitions)

The 10-partition stage works fine; it takes about 1.4 hours, reads 40GB, and writes 25GB per task. The next 640-partition stage is where the failures occur. Here are the first few errors from a recent run (sorted by time):

work/hpcraviplvm10/app-20140807185713-/14/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm10/app-20140807185713-/27/stderr: 14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvm1/app-20140807185713-/9/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-/24/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvm2/app-20140807185713-/36/stderr: 14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
work/hpcraviplvma1/app-20140807185713-/26/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/15/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/18/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/23/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
work/hpcraviplvma2/app-20140807185713-/33/stderr: 14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found

Thanks,
Ravi Pandya
Microsoft Research
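A side note on the snippet quoted above, sketched here with the same names from that code (reads and the partition count are its own; nothing else is new): coalesce(n, true) with shuffle enabled is the same operation as repartition(n), and DISK_ONLY_2 additionally replicates every cached partition to a second node through the block manager, which doubles the data pushed between executors. Dropping to single-copy DISK_ONLY while debugging the connection errors is one way to reduce that traffic, at the cost of recomputation if a node is lost:

  import org.apache.spark.storage.StorageLevel

  val numPartitions = numMachines * coresPerMachine * 4

  // equivalent to reads.coalesce(numPartitions, true)
  val repartitioned = reads.repartition(numPartitions)

  // single-copy on-disk storage; DISK_ONLY_2 would also ship a replica of
  // each partition to a second executor via the block manager.
  val result = repartitioned.persist(StorageLevel.DISK_ONLY)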
Re: Lost executors
Hi Avishek,

I'm running on a manual cluster setup, and all the code is Scala. The load averages don't seem high when I see these failures (about 12 on a 16-core machine).

Ravi
Re: Lost executors
Hi Eric,

Have you checked the executor logs? It is possible they died because of some exception, and the message you see is just a side effect.

Andrew

2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com:
I'm using Spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via YARN and I am seeing these errors quite often:

ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka client disassociated

This is in an interactive shell session. I don't know a lot about YARN plumbing and am wondering if there's some constraint in play -- executors can't be idle for too long or they get cleared out. Any insights here?
Re: Lost executors
Hi Andrew,

Thanks for your note. Yes, I see a stack trace now. It seems to be an issue with Python interpreting a function I wish to apply to an RDD. The stack trace is below. The function is a simple factorial:

def f(n):
    if n == 1:
        return 1
    return n * f(n-1)

and I'm trying to use it like this:

tf = sc.textFile(...)
tf.map(lambda line: line and len(line)).map(f).collect()

I get the following error, which does not occur if I use a built-in function like math.sqrt:

TypeError: __import__() argument 1 must be string, not X#

Stack trace follows:

WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
  File "<ipython-input-39-0f0dafaf1ed4>", line 2, in f
TypeError: __import__() argument 1 must be string, not X#
  at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
  at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

On Wed, Jul 23, 2014 at 11:07 AM, Andrew Or and...@databricks.com wrote:
Hi Eric, Have you checked the executor logs? [...]
Re: Lost executors
And... PEBCAK.

I mistakenly believed I had set PYSPARK_PYTHON to a Python 2.7 install, but it was pointing at a Python 2.6 install on the remote nodes, and hence incompatible with what the master was sending. I have set it to point to the correct version everywhere and it works.

Apologies for the false alarm.

On Wed, Jul 23, 2014 at 8:40 PM, Eric Friedman eric.d.fried...@gmail.com wrote:
Hi Andrew, Thanks for your note. Yes, I see a stack trace now. [...]