Hi Ted,

sorry I should have been a bit more consistent in my cut and paste (there are nine nodes +driver) - I was concentrating on S9/6 (these logs are from that box - 10.1.201.165). S1/4 lines are:

15/12/29 18:49:45 INFO CoarseMesosSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-10-1-202-114.ec2.internal:19891) with ID f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4 15/12/29 18:49:45 INFO ExecutorAllocationManager: New executor f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4 has registered (new total is 6) 15/12/29 18:49:45 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-1-202-114.ec2.internal:14257 with 13.8 GB RAM, BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, ip-10-1-202-114.ec2.internal, 14257) 15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID 2149, ip-10-1-200-232.ec2.internal): FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4, ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21, message= org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014, execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4) at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)

I've tried to see how I can increase the idle timeout of the mesosExternalShuffleClient.registerDriverWithShuffleService as thats seems to be the core issue.


On 29/12/2015 21:17, Ted Yu wrote:
Have you searched log for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ?

In the snippet you posted, I don't see registration of this Executor.

Cheers

On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett <adr...@opensignal.com <mailto:adr...@opensignal.com>> wrote:

    We're seeing an "Executor is not registered" error on a Spark
    (1.6.0rc4, mesos-0.26) cluster.  It seems as if the logic in
    MesosExternalShuffleService.scala isn't working for some reason
    (new in 1.6 I believe).

    spark application sees this:
    ...
    15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully
    registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with
    external shuffle service.
    15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully
    registered app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with
    external shuffle service.
    15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered
    executor NettyRpcEndpointRef(null)
    (ip-10-1-201-165.ec2.internal:37660) with ID
    f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
    15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
    f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new
    total is 1)
    15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering
    block manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM,
    BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
    ip-10-1-201-165.ec2.internal, 53854)
    15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering
    block manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM,
    BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1,
    ip-10-1-201-132.ec2.internal, 12793)
    15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
    f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new
    total is 2)
    ...
    15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0
    in memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB,
    free: 13.8GB)
    15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to
    send map output locations for shuffle 1 to
    ip-10-1-202-121.ec2.internal:59734
    15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output
    statuses for shuffle 1 is 1671814 bytes
    15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to
    send map output locations for shuffle 1 to
    ip-10-1-201-165.ec2.internal:37660
    ...
    15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage
    1.0 (TID 2191, ip-10-1-200-232.ec2.internal, partition
    63,PROCESS_LOCAL, 2171 bytes)
    15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0
    (TID 2149, ip-10-1-200-232.ec2.internal):
    FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4,
    ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5,
    reduceId=21, message=
    org.apache.spark.shuffle.FetchFailedException:
    java.lang.RuntimeException: Executor is not registered
    (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
    execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
            at
    
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
    ...
    15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage
    0 (reduceByKey at
    /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48)
    and ShuffleMapStage 1 (reduceByKey at
    /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50)
    due to fetch failure
    15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0
    (TID 2140, ip-10-1-200-232.ec2.internal):
    FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
    ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6,
    reduceId=12, message=
    org.apache.spark.shuffle.FetchFailedException:
    java.lang.RuntimeException: Executor is not registered
    (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
    execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
            at
    
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)


    shuffle service itself (on executor's IP sees:
    15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received
    registration request from app
    a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 (remote address
    /10.1.200.165:37889 <http://10.1.200.165:37889>). (that's the
    driver IP)
    15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown
    /10.1.201.165:52562 <http://10.1.201.165:52562> disconnected.
    (executor IP)
    15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler:
    Application a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected
    (address was /10.1.200.165:37889 <http://10.1.200.165:37889>).
    (driver IP again)
    15/12/29 18:58:07 ERROR TransportRequestHandler: Error while
    invoking RpcHandler#receive() on RPC id 6244044000322436908
    java.lang.RuntimeException: Executor is not registered
    (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
    execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
            at
    
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)

    At first I wondered if reducing
    spark.shuffle.io.numConnectionsPerPeer back down to 1 (from 4)
    would help - maybe it wasn't keeping track of the number of
    connections.    However now with the extra debug I wonder if the
    driver is disconnecting after 2mins and the mesos shuffle service
    takes this as a sign that the whole app has finished and tidies up?

    I couldn't increase the time from 120s to anything higher even
    when I set this on the driver (will retry setting that on the
    shuffle service):
    spark.network.timeout 180s
    spark.shuffle.io.connectionTimeout 240s

    Adrian
-- *Adrian Bridgett*



--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
_____________________________________________________
Office: First Floor, Scriptor Court, 155-157 Farringdon Road, Clerkenwell, London, EC1R 3AD
Phone #: +44 777-377-8251
Skype: abridgett |@adrianbridgett <http://twitter.com/adrianbridgett>| LinkedIn link <https://uk.linkedin.com/in/abridgett>
_____________________________________________________

Reply via email to