Those logs you included are from the Spark executor processes, as opposed
to the YARN NodeManager processes.

If you don't have access to the NodeManager logs, I would try setting
spark.yarn.executor.memoryOverhead to something like 1024 or 2048 (MB) and
seeing if that helps.  If it does, it's because YARN was killing the
containers for exceeding their memory limits.
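
If you're still launching through spark-shell the way you showed, that
setting has to be in place before the SparkContext starts, so pass it on
the command line (or put the equivalent line in conf/spark-defaults.conf if
your spark-submit doesn't take --conf); the value is interpreted as MB.
Something like:

spark-shell --master yarn-client --executor-memory 7G --driver-memory 7G \
  --num-executors 3 --conf spark.yarn.executor.memoryOverhead=1024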

-Sandy

On Thu, Oct 2, 2014 at 6:48 AM, Mike Bernico <mike.bern...@gmail.com> wrote:

> Hello Xiangrui and Sandy,
>
> Thanks for jumping in to help.
>
> So, first thing: after my email last night I reran my code using 10
> executors with 2G each, and everything ran okay.  So that's good, but I'm
> still curious as to what I was doing wrong.
>
> For Xiangrui's questions:
>
> My training set is 49174 observations x 61497 terms, stored as sparse
> vectors from Spark's TF/IDF transform.  The number of partitions is 1025,
> which isn't something I've tuned; I'm guessing it comes from the input
> splits.  I've never called coalesce, etc.
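>
> (If it would help, I'm assuming I could check and shrink that in the shell
> with something like the lines below; the 200 is just an illustrative
> target, not something I've actually tried.)
>
> scala> training.partitions.size                  // currently 1025
> scala> val coalesced = training.coalesce(200)    // fewer, larger partitions; no shuffle
> scala> coalesced.partitions.size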
>
> For Sandy's:
>
> I do not see any memory errors in the YARN logs, other than this
> occasionally:
>
> 14/10/01 19:25:54 INFO storage.MemoryStore: Will not store rdd_11_195 as
> it would require dropping another block from the same RDD
> 14/10/01 19:25:54 WARN spark.CacheManager: Not enough space to cache
> partition rdd_11_195 in memory! Free memory is 236314377 bytes.
> 14/10/01 19:25:57 INFO executor.Executor: Finished task 195.0 in stage 2.0
> (TID 1220). 1134 bytes result sent to driver
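>
> (I'm assuming that warning just means some partitions of rdd_11 don't fit
> in the executor's cache and get recomputed later, and that persisting
> whichever RDD that is, presumably my training set, with
> StorageLevel.MEMORY_AND_DISK up front instead of cache() would let them
> spill to local disk; something like the lines below, which I haven't tried
> yet.)
>
> scala> import org.apache.spark.storage.StorageLevel
> scala> // instead of .cache(): partitions that don't fit in memory spill to local disk
> scala> training.persist(StorageLevel.MEMORY_AND_DISK)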
>
> The only other badness I see in those logs is:
>
> 14/10/01 19:40:35 INFO network.SendingConnection: Initiating connection to
> [<hostname removed>:57359]
> 14/10/01 19:40:35 WARN network.SendingConnection: Error finishing
> connection to <hostname removed>:57359
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
>         at
> org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at
> org.apache.spark.network.ConnectionManager$$anon$8.run(ConnectionManager.scala:226)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
>
>
> I'm guessing those are from after the executors have died their mysterious
> death.  I'm happy to send you the entire log if you'd like.
>
> Thanks!
>
>
> On Thu, Oct 2, 2014 at 2:02 AM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
>
>> Hi Mike,
>>
>> Do you have access to your YARN NodeManager logs?  When executors die
>> randomly on YARN, it's often because they use more memory than allowed for
>> their YARN container.  You would see messages to the effect of "container
>> killed because physical memory limits exceeded".
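>>
>> If log aggregation is enabled on your cluster, you can also pull the
>> aggregated container logs once the application exits and grep for that
>> message; the application id is whatever the ResourceManager shows for
>> your spark-shell session.  Something like:
>>
>> yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | grep -i "physical memory"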
>>
>> -Sandy
>>
>> On Wed, Oct 1, 2014 at 8:46 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> The cost depends on the feature dimension, number of instances, number
>>> of classes, and number of partitions. Do you mind sharing those
>>> numbers? -Xiangrui
>>>
>>> On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico <mike.bern...@gmail.com>
>>> wrote:
>>> > Hi Everyone,
>>> >
>>> > I'm working on training MLlib's Naive Bayes to classify TF/IDF
>>> > vectorized docs using Spark 1.1.0.
>>> >
>>> > I've gotten this to work fine on a smaller set of data, but when I
>>> > increase the number of vectorized documents I get hung up on training.
>>> > The only messages I'm seeing are below.  I'm pretty new to Spark and I
>>> > don't really know where to go next to troubleshoot this.
>>> >
>>> > I'm running Spark on YARN like this:
>>> > spark-shell --master yarn-client --executor-memory 7G --driver-memory 7G
>>> > --num-executors 3
>>> >
>>> > I have three workers, each with 64G of RAM and 8 cores.
>>> >
>>> >
>>> >
>>> > scala> val model = NaiveBayes.train(training, lambda = 1.0)
>>> > 14/10/01 19:40:34 ERROR YarnClientClusterScheduler: Lost executor 2 on
>>> > rpl0000001273.<removed>: remote Akka client disassociated
>>> > 14/10/01 19:40:34 WARN TaskSetManager: Lost task 195.0 in stage 5.0
>>> (TID
>>> > 2940, rpl0000001273.<removed>): ExecutorLostFailure (executor lost)
>>> > 14/10/01 19:40:34 WARN TaskSetManager: Lost task 190.0 in stage 5.0
>>> (TID
>>> > 2782, rpl0000001272.<removed>): FetchFailed(BlockManagerId(2,
>>> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=190)
>>> > 14/10/01 19:40:35 WARN TaskSetManager: Lost task 195.1 in stage 5.0
>>> (TID
>>> > 2941, rpl0000001272.<removed>): FetchFailed(BlockManagerId(2,
>>> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=195)
>>> > 14/10/01 19:40:36 WARN TaskSetManager: Lost task 185.0 in stage 5.0
>>> (TID
>>> > 2780, rpl0000001277.<removed>): FetchFailed(BlockManagerId(2,
>>> > rpl0000001273.<removed>, 57359, 0), shuffleId=1, mapId=0, reduceId=185)
>>> > 14/10/01 19:46:24 ERROR YarnClientClusterScheduler: Lost executor 1 on
>>> > rpl0000001272.<removed>: remote Akka client disassociated
>>> > 14/10/01 19:46:24 WARN TaskSetManager: Lost task 78.0 in stage 5.1 (TID
>>> > 3377, rpl0000001272.<removed>): ExecutorLostFailure (executor lost)
>>> > 14/10/01 19:46:25 WARN TaskSetManager: Lost task 79.0 in stage 5.1 (TID
>>> > 3378, rpl0000001273.<removed>): FetchFailed(BlockManagerId(1,
>>> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=5, reduceId=220)
>>> > 14/10/01 19:46:25 WARN TaskSetManager: Lost task 78.1 in stage 5.1 (TID
>>> > 3379, rpl0000001273.<removed>): FetchFailed(BlockManagerId(1,
>>> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=5, reduceId=215)
>>> > 14/10/01 19:46:29 WARN TaskSetManager: Lost task 73.0 in stage 5.1 (TID
>>> > 3372, rpl0000001277.<removed>): FetchFailed(BlockManagerId(1,
>>> > rpl0000001272.<removed>, 60926, 0), shuffleId=1, mapId=9, reduceId=210)
>>> > 14/10/01 19:57:27 ERROR YarnClientClusterScheduler: Lost executor 3 on
>>> > rpl0000001277.<removed>: remote Akka client disassociated
>>> > 14/10/01 19:57:27 WARN TaskSetManager: Lost task 177.0 in stage 5.2
>>> (TID
>>> > 4015, rpl0000001277.<removed>): ExecutorLostFailure (executor lost)
>>> > 14/10/01 19:57:27 ERROR ConnectionManager: Corresponding
>>> SendingConnection
>>> > to ConnectionManagerId(rpl0000001277.<removed>,41425) not found
>>> > 14/10/01 19:57:30 WARN TaskSetManager: Lost task 182.0 in stage 5.2
>>> (TID
>>> > 4020, rpl0000001272.<removed>): FetchFailed(BlockManagerId(3,
>>> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=2, reduceId=340)
>>> > 14/10/01 19:57:30 WARN TaskSetManager: Lost task 177.1 in stage 5.2
>>> (TID
>>> > 4022, rpl0000001272.<removed>): FetchFailed(BlockManagerId(3,
>>> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=2, reduceId=335)
>>> > 14/10/01 19:57:36 WARN TaskSetManager: Lost task 183.0 in stage 5.2
>>> (TID
>>> > 4021, rpl0000001273.<removed>): FetchFailed(BlockManagerId(3,
>>> > rpl0000001277.<removed>, 41425, 0), shuffleId=1, mapId=8, reduceId=345)
>>> > 14/10/01 20:20:22 ERROR YarnClientClusterScheduler: Lost executor 4 on
>>> > rpl0000001273.<removed>: remote Akka client disassociated
>>> > 14/10/01 20:20:22 WARN TaskSetManager: Lost task 527.0 in stage 5.3
>>> (TID
>>> > 5159, rpl0000001273.<removed>): ExecutorLostFailure (executor lost)
>>> > 14/10/01 20:20:23 WARN TaskSetManager: Lost task 517.0 in stage 5.3
>>> (TID
>>> > 5149, rpl0000001272.<removed>): FetchFailed(BlockManagerId(4,
>>> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=6, reduceId=690)
>>> > 14/10/01 20:20:23 WARN TaskSetManager: Lost task 527.1 in stage 5.3
>>> (TID
>>> > 5160, rpl0000001272.<removed>): FetchFailed(BlockManagerId(4,
>>> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=5, reduceId=700)
>>> > 14/10/01 20:20:25 WARN TaskSetManager: Lost task 522.0 in stage 5.3
>>> (TID
>>> > 5154, rpl0000001277.<removed>): FetchFailed(BlockManagerId(4,
>>> > rpl0000001273.<removed>, 51049, 0), shuffleId=1, mapId=5, reduceId=695)
>>>
>>
>
