Re: Lost executors

2014-11-20 Thread Pala M Muthaia
Just to close the loop: no issues pop up when I submit the job using
'spark-submit' so that the driver process also runs in a container on the
YARN cluster.

Previously, the driver was running on the gateway machine through which
the job was submitted, which led to quite a few issues.
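
For reference, here is a minimal sketch of a cluster-mode submission on Spark
1.0.x; the class name, jar, and resource sizes are placeholders rather than the
actual job from this thread:

  ./bin/spark-submit \
    --master yarn-cluster \
    --class com.example.MyJob \
    --num-executors 50 \
    --executor-memory 4g \
    --executor-cores 4 \
    myjob.jar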


Re: Lost executors

2014-11-18 Thread Sandy Ryza
Hi Pala,

Do you have access to your YARN NodeManager logs?  Are you able to check
whether they report killing any containers for exceeding memory limits?

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi,

 I am using Spark 1.0.1 on YARN 2.5, doing everything through the Spark
 shell.

 I am running a job that essentially reads a bunch of HBase keys, looks up
 HBase data, and performs some filtering and aggregation. The job works fine
 on smaller datasets, but when I try to run it on the full dataset, the job
 never completes. The symptoms I notice are:

 a. The job shows progress for a while and then starts throwing lots of the
 following errors:

 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Executor 906 disconnected, so removing it
 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 906 on machine name: remote Akka client disassociated

 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(9186, machine name, 54600, 0) with no recent heart beats: 82313ms exceeds 45000ms

 Looking at the logs, the job never recovers from these errors; it keeps
 reporting lost executors and launching new executors, and this continues
 for a long time.

 Could this be because the executors are running out of memory?

 In terms of memory usage, the intermediate data could be large (after the
 HBase lookup), but the partially and fully aggregated data sets should be
 quite small - essentially a bunch of ids and counts (on the order of 1
 million in total).



 b. In the Spark UI, I am seeing the following errors (redacted for
 brevity); I am not sure whether they are transient or a real issue:

 java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out)
 ...
 org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 ...
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)




 I was trying to get more data to investigate, but I haven't been able to
 figure out how to enable logging on the executors. The Spark UI appears
 stuck, and I only see driver-side logs in the jobhistory directory
 specified for the job.
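
 A couple of things that usually help with executor logs on YARN, as a
 general sketch (assuming log aggregation is enabled; the application id and
 file path below are placeholders):

   # Pull the aggregated container (executor) logs once the app finishes:
   yarn logs -applicationId application_1416268000000_0001

   # Ship a custom log4j.properties to the executors at submit time:
   spark-submit --files /path/to/log4j.properties ...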


 Thanks,
 pala





Re: Lost executors

2014-11-18 Thread Pala M Muthaia
Sandy,

Good point - I forgot about the NM logs.

When I looked at the NM logs, I only see the following statements, which
align with the driver-side log about the lost executor. Many executors show
the same log statement at the same time, so it seems the decision to kill
many, if not all, executors was made centrally and all executors were
notified somehow:

14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
14/11/18 00:18:25 INFO Executor: Executor killed task 2013


In general, I also see quite a few instances of the following exception
across many executors/nodes:

14/11/17 23:58:00 INFO HadoopRDD: Input split: hdfs dir path/sorted_keys-1020_3-r-00255.deflate:0+415841

14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
java.net.ConnectException: connect(2) error: Connection refused when trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
    at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
    at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
    at org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
    at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
    at org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
    at org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
    at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
    at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:51)
    at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:50)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at
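
For context, this warning comes from HDFS short-circuit local reads: the client
could not reach the DataNode's UNIX domain socket and falls back to reading the
block over TCP, which is slower but not fatal by itself. The relevant
hdfs-site.xml settings look roughly like the sketch below (the values simply
mirror the path in the error and are illustrative); the path must match the one
configured on the DataNodes, and its parent directory must be accessible:

  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/srv/var/hadoop/runs/hdfs/dn_socket</value>
  </property>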

Re: Lost executors

2014-08-13 Thread rpandya
After a lot of grovelling through logs, I found out that the Nagios monitor
process detected that the machine was almost out of memory, and killed the
SNAP executor process.

So why is the machine running out of memory? Each node has 128GB of RAM, 4
executors, about 40GB of data. It did run out of memory if I tried to
cache() the RDD, but I would hope that persist() is implemented so that it
would stream to disk without trying to materialize too much data in RAM.
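
For what it's worth, a minimal Scala sketch of the distinction being drawn here
(rdd stands in for any RDD):

  import org.apache.spark.storage.StorageLevel

  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): deserialized
  // partitions are kept on the JVM heap.
  val inMemory = rdd.cache()

  // DISK_ONLY writes computed partitions to local disk instead of holding
  // them on the heap, so it should not by itself inflate executor memory use.
  val onDisk = rdd.persist(StorageLevel.DISK_ONLY)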

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12032.html



Re: Lost executors

2014-08-13 Thread Matei Zaharia
What is your Spark executor memory set to? (You can see it in Spark's web UI at 
http://driver:4040 under the executors tab). One thing to be aware of is that 
the JVM never really releases memory back to the OS, so it will keep filling up 
to the maximum heap size you set. Maybe 4 executors with that much heap are 
taking a lot of the memory.

Persist as DISK_ONLY should indeed stream data from disk, so I don't think that 
will be a problem.

Matei



Re: Lost executors

2014-08-13 Thread Shivaram Venkataraman
If the JVM heap size is close to the memory limit the OS sometimes kills
the process under memory pressure. I've usually found that lowering the
executor memory size helps.

Shivaram




Re: Lost executors

2014-08-13 Thread Andrew Or
To add to the pile of information we're asking you to provide, what version
of Spark are you running?




Re: Lost executors

2014-08-13 Thread rpandya
I'm running Spark 1.0.1 with SPARK_MEMORY=60g, so 4 executors at that size
would indeed run out of memory (the machine has 110GB). And in fact they
would get repeatedly restarted and killed until eventually Spark gave up.

I'll try with a smaller limit, but it'll be a while - somehow my HDFS got
seriously corrupted so I need to rebuild my HDP cluster...

Thanks,

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p12050.html



Re: Lost executors

2014-08-13 Thread Andrew Or
Hi Ravi,

Setting SPARK_MEMORY doesn't do anything. I believe you confused it with
SPARK_MEM, which is now deprecated. You should set SPARK_EXECUTOR_MEMORY
instead, or spark.executor.memory as a config in
conf/spark-defaults.conf. Assuming you haven't set the executor memory
through a different mechanism, your executors will quickly run out of
memory with the default of 512m.
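
For example, any of the following would work (8g is purely illustrative):

  # conf/spark-defaults.conf
  spark.executor.memory   8g

  # or conf/spark-env.sh
  SPARK_EXECUTOR_MEMORY=8g

  # or per job on the command line
  spark-submit --executor-memory 8g ...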

Let me know if setting this does the job. If so, you can even persist the
RDDs to memory as well to get better performance, though this depends on
your workload.

-Andrew





Re: Lost executors

2014-08-08 Thread Avishek Saha
Same here, Ravi. See my post on a similar thread.

Are you running in YARN client mode?
On Aug 7, 2014 2:56 PM, rpandya r...@iecommerce.com wrote:

 I'm running into a problem with executors failing, and it's not clear
 what's causing it. Any suggestions on how to diagnose and fix it would be
 appreciated.

 There are a variety of errors in the logs, and I don't see a consistent
 triggering error. I've tried varying the number of executors per machine
 (1/4/16 per 16-core/128GB machine w/200GB free disk) and it still fails.

 The relevant code is:
  val reads = fastqAsText.mapPartitionsWithIndex(runner.mapReads(_, _, seqDictBcast.value))
  val result = reads.coalesce(numMachines * coresPerMachine * 4, true).persist(StorageLevel.DISK_ONLY_2)
  log.info("SNAP output DebugString:\n" + result.toDebugString)
  log.info("produced " + result.count + " reads")

 The toDebugString output is:
 2014-08-07 18:50:43 INFO  SnapInputStage:198 - SNAP output DebugString:
 MappedRDD[10] at coalesce at SnapInputStage.scala:197 (640 partitions)
   CoalescedRDD[9] at coalesce at SnapInputStage.scala:197 (640 partitions)
 ShuffledRDD[8] at coalesce at SnapInputStage.scala:197 (640 partitions)
   MapPartitionsRDD[7] at coalesce at SnapInputStage.scala:197 (10
 partitions)
 MapPartitionsRDD[6] at mapPartitionsWithIndex at
 SnapInputStage.scala:195 (10 partitions)
   MappedRDD[4] at map at SnapInputStage.scala:188 (10 partitions)
 CoalescedRDD[3] at coalesce at SnapInputStage.scala:188 (10
 partitions)
   NewHadoopRDD[2] at newAPIHadoopFile at
 SnapInputStage.scala:182 (3003 partitions)

 The 10-partition stage works fine, takes about 1.4 hours, reads 40GB and
 writes 25GB per task. The next 640-partition stage is where the failures
 occur.
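
 As an aside on the coalesce call above: with shuffle = true it performs a
 full shuffle into 640 partitions (that is the ShuffledRDD in the debug
 string) and is equivalent to repartition. A minimal sketch, with rdd
 standing in for any RDD:

   // These two are equivalent; both insert a full shuffle stage.
   val a = rdd.coalesce(640, shuffle = true)
   val b = rdd.repartition(640)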

 Here are the first few errors from a recent run (sorted by time):
 work/hpcraviplvm10/app-20140807185713-/14/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvm10/app-20140807185713-/27/stderr:  14/08/07 20:32:18 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(hpcraviplvm1,49545)
 work/hpcraviplvm1/app-20140807185713-/9/stderr:    14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvm2/app-20140807185713-/24/stderr:   14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvm2/app-20140807185713-/36/stderr:   14/08/07 20:32:18 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(hpcraviplvm1,49545)
 work/hpcraviplvma1/app-20140807185713-/26/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvma2/app-20140807185713-/15/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvma2/app-20140807185713-/18/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvma2/app-20140807185713-/23/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
 work/hpcraviplvma2/app-20140807185713-/33/stderr:  14/08/07 20:32:18 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found

 Thanks,

 Ravi Pandya
 Microsoft Research



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722.html




Re: Lost executors

2014-08-08 Thread rpandya
Hi Avishek,

I'm running on a manual cluster setup, and all the code is Scala. The load
averages don't seem high when I see these failures (about 12 on a 16-core
machine).

Ravi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lost-executors-tp11722p11819.html



Re: Lost executors

2014-07-23 Thread Andrew Or
Hi Eric,

Have you checked the executor logs? It is possible they died because of
some exception, and the message you see is just a side effect.

Andrew


2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com:

 I'm using Spark 1.0.1 on quite a large cluster, with gobs of memory, etc.
  Cluster resources are available to me via YARN, and I am seeing these
 errors quite often.

 ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka
 client disassociated


 This is in an interactive shell session.  I don't know a lot about Yarn
 plumbing and am wondering if there's some constraint in play -- executors
 can't be idle for too long or they get cleared out.


 Any insights here?



Re: Lost executors

2014-07-23 Thread Eric Friedman
hi Andrew,

Thanks for your note.  Yes, I see a stack trace now.  It seems to be an
issue with Python interpreting a function I wish to apply to an RDD.  The
stack trace is below.  The function is a simple factorial:

def f(n):
  if n == 1: return 1
  return n * f(n-1)

and I'm trying to use it like this:

tf = sc.textFile(...)
tf.map(lambda line: line and len(line)).map(f).collect()

I get the following error, which does not occur if I use a built-in
function, like math.sqrt

 TypeError: __import__() argument 1 must be string, not X#

stacktrace follows



WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/hadoop/d11/yarn/nm/usercache/eric_d_friedman/filecache/26/spark-assembly-1.0.1-hadoop2.2.0.jar/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
  File "<ipython-input-39-0f0dafaf1ed4>", line 2, in f
TypeError: __import__() argument 1 must be string, not X#

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)





Re: Lost executors

2014-07-23 Thread Eric Friedman
And... PEBCAK

I mistakenly believed I had set PYSPARK_PYTHON to a Python 2.7 install, but
it was pointing at a Python 2.6 install on the remote nodes, and was therefore
incompatible with what the master was sending. I have set it to the correct
version everywhere and it now works.
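
For anyone hitting the same thing, the interpreter can be pinned in
conf/spark-env.sh on every node so the driver and the executors agree (the
path below is illustrative):

  export PYSPARK_PYTHON=/usr/bin/python2.7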

Apologies for the false alarm.

