Hello spark-users,

I am testing remote job submission against a cluster launched with ec2/spark_ec2.py from 
spark distribution 1.5.2.  I submit SparkPi to a remote EC2 instance with 
spark-submit, using the "standalone mode" (spark://) protocol.  Connecting to 
the master via ssh works, but the submission fails.  The server logs report:

Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed

Use case: run Zeppelin to develop and test, keeping the code on my local machine 
against local Spark, but intermittently connect to an EC2 cluster to scale out.  
Thus ssh-ing to the master first to submit jobs is not acceptable.

Please find below a complete reproduction.

My questions:

1) Is this kind of remote submission over standalone mode port 7077 supported?
2) What is the root cause of the protocol failure?
3) Is there a spark-env.sh or other server-side setting which will make the 
remote submission work?
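
For context on question 3, the only server-side knobs I know of in this area 
are the networking variables in spark-env.sh.  A sketch of the kind of 
override I imagine (SPARK_LOCAL_IP, SPARK_MASTER_IP, and SPARK_PUBLIC_DNS are 
documented spark-env.sh variables; whether any combination of them fixes this 
is exactly what I am asking):

    # on the master, in /root/spark/conf/spark-env.sh (speculative)
        export SPARK_LOCAL_IP=$(hostname -i)    # bind address for the daemons
        export SPARK_MASTER_IP=$(hostname -i)   # address the master advertises
        export SPARK_PUBLIC_DNS=ec2-. . .       # externally visible DNS name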

Regards,


Jeff Henrikson


    # reproduction uses:
        - jq (json query tool)
        - pyenv, virtualenv
        - awscli

    # set configuration

        export VPC=. . .                                   # your VPC
        export SPARK_HOME=. . ./spark-1.5.2-bin-hadoop2.6  # just the binary spark distribution 1.5.2
        export IP4_SOURCE=. . .                            # the IP of the gateway for my internet access
        export KP=. . .                                    # the name of a keypair
        # throughout, cluster is named "cluster2"
        # region is us-west-2
        # keypair given is ~/.ssh/$KP and registered in us-west-2 as $KP
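
    # (optional) sanity-check the configuration; spark-submit --version is a
    # standard flag, the echo just confirms the variables are set
        echo "VPC=$VPC KP=$KP IP4_SOURCE=$IP4_SOURCE"
        $SPARK_HOME/bin/spark-submit --version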

    # setup python/virtualenv

        pushd $SPARK_HOME
        pyenv local 2.7.6
        cd $SPARK_HOME/ec2

        virtualenv ../venv

        ../venv/bin/pip install awscli

    # launch cluster
        ../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP launch cluster2

    # authorize firewall port 7077

        SG_MASTER=$(../venv/bin/aws ec2 describe-security-groups | jq -r '.SecurityGroups[] | select(.["GroupName"] == "cluster2-master") | .GroupId')
        ../venv/bin/aws ec2 authorize-security-group-ingress --group-id $SG_MASTER --protocol tcp --port 7077 --cidr $IP4_SOURCE/32
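
    # (optional, my addition) the same command shape can open the master web
    # UI port 8080, which the scheduler warning further below suggests checking
        ../venv/bin/aws ec2 authorize-security-group-ingress --group-id $SG_MASTER --protocol tcp --port 8080 --cidr $IP4_SOURCE/32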

    # locate ec2 public dns name (needed before the connectivity check)
        export DNS_MASTER=$(../venv/bin/aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | select(.SecurityGroups[].GroupName == "cluster2-master") | .PublicDnsName')

    # verify connectivity to master port 7077
        nc -v $DNS_MASTER 7077
            ec2-. . . 7077 open

    # submit job
        $SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 --driver-memory 1g --executor-memory 1g --executor-cores 1 --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar


    # destroy cluster
        ../venv/bin/python spark_ec2.py --vpc-id=$VPC --region=us-west-2 --instance-type=t2.medium --key-pair=$KP -i ~/.ssh/$KP destroy cluster2


    # actual result:
        16/02/23 12:28:36 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160223201742-0000/21 on hostPort 172.31.13.146:40392 with 1 cores, 1024.0 MB RAM
        16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now LOADING
        16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now LOADING
        16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/20 is now RUNNING
        16/02/23 12:28:36 INFO AppClient$ClientEndpoint: Executor updated: app-20160223201742-0000/21 is now RUNNING
        16/02/23 12:28:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
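
    # (my addition, diagnostic sketch) the disassociation in the server logs
    # further below suggests the executors cannot connect back to the driver;
    # 192.168.0.4 is a private RFC 1918 address, so a reachability test from
    # the master should fail (58498 is the ephemeral port the driver logged)
        ssh -i ~/.ssh/$KP root@$DNS_MASTER nc -vz -w 5 192.168.0.4 58498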

    # expected result:
        Pi is approximately . . .

    # actual result (server-side logs):
        # tail logs on the master

            ssh -i ~/.ssh/$KP root@$DNS_MASTER
            sudo tail -f -n0 /root/spark/logs/*

        16/02/23 20:42:42 INFO Master: 192.168.0.4:58498 got disassociated, removing it.
        16/02/23 20:42:42 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.0.4:58498] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
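
    # (my addition, speculative) a client-side variant I would try next;
    # spark.driver.host and spark.driver.port are documented Spark properties,
    # and pinning them to the public gateway IP plus a fixed, firewall-opened
    # port might let the executors connect back, but I have not confirmed this
    # (51000 is a hypothetical port; tcp/51000 on the gateway would need to be
    # forwarded to this machine)
        $SPARK_HOME/bin/spark-submit --master spark://$DNS_MASTER:7077 \
            --conf spark.driver.host=$IP4_SOURCE \
            --conf spark.driver.port=51000 \
            --class org.apache.spark.examples.SparkPi \
            $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar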

