[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534565#comment-16534565
 ] 

Saisai Shao commented on SPARK-24723:
-------------------------------------

[~mengxr] [~jiangxb1987]

There is one way to handle the password-less SSH problem programmatically for 
all cluster managers. It is borrowed from the MPI on YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In this MPI on YARN framework, before launching an MPI job, the application 
master (master) generates an SSH private/public key pair and propagates the 
public key to all the containers (workers). When a container starts, it writes 
the public key to its local authorized_keys file, so the MPI job started from 
the master node can then ssh to all the containers without a password. After 
the MPI job finishes, every container deletes this public key from its 
authorized_keys file to revert the environment.
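
The key-generation step above can be sketched roughly as follows. This is a 
minimal sketch, not code from mpich2-yarn: the SshKeyGen object and its method 
names are hypothetical, and it assumes the OpenSSH ssh-keygen binary is on the 
PATH and that the target key file does not exist yet.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, Paths}
import scala.sys.process._

// Hypothetical helper for illustration; not part of Spark or mpich2-yarn.
object SshKeyGen {
  // Build the ssh-keygen command line: quiet mode, RSA key, empty passphrase,
  // a comment that identifies the job, and the private key output path.
  def command(privateKey: Path, comment: String): Seq[String] =
    Seq("ssh-keygen", "-q", "-t", "rsa", "-N", "", "-C", comment,
      "-f", privateKey.toString)

  // Run ssh-keygen and return the public key line written to <privateKey>.pub.
  // Assumption: privateKey does not exist yet, otherwise ssh-keygen prompts.
  def generate(privateKey: Path, comment: String): String = {
    val exitCode = command(privateKey, comment).!
    require(exitCode == 0, s"ssh-keygen exited with code $exitCode")
    val publicKey = Paths.get(privateKey.toString + ".pub")
    new String(Files.readAllBytes(publicKey), UTF_8).trim
  }
}
```

The returned public key line is what the master would hand to the workers; the 
private key stays on the node that launches mpirun.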

In our case we could do something similar. Before launching the MPI job, the 
0-th task generates an SSH private/public key pair and propagates the public 
key to all the barrier tasks (maybe through BarrierTaskContext). The other 
tasks receive the public key from the 0-th task (maybe also through 
BarrierTaskContext) and write it to their authorized_keys file. Once this is 
done, password-less ssh is set up and mpirun can be started from the 0-th task 
without a password. After the MPI job finishes, all the barrier tasks delete 
this public key from their authorized_keys file to revert the environment.
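
The install and revert steps on each non-zero task could look roughly like the 
sketch below. The AuthorizedKeys object and its method names are hypothetical, 
and how the key actually reaches the task (for example through 
BarrierTaskContext) is left out because that part is still open.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Hypothetical helper for illustration; not part of Spark or mpich2-yarn.
object AuthorizedKeys {
  // Read the non-empty lines of the file, or Nil if it does not exist.
  private def readLines(file: Path): List[String] =
    if (Files.exists(file))
      new String(Files.readAllBytes(file), UTF_8).split("\n").toList.filter(_.trim.nonEmpty)
    else Nil

  // Rewrite the file with one entry per line.
  private def writeLines(file: Path, lines: List[String]): Unit =
    Files.write(file, lines.map(_ + "\n").mkString.getBytes(UTF_8))

  // Append the job's public key unless the same line is already present.
  def install(file: Path, publicKeyLine: String): Unit = {
    val key = publicKeyLine.trim
    val existing = readLines(file)
    if (!existing.exists(_.trim == key)) writeLines(file, existing :+ key)
  }

  // Remove only the job's public key and keep every other entry.
  def remove(file: Path, publicKeyLine: String): Unit = {
    val key = publicKeyLine.trim
    if (Files.exists(file)) writeLines(file, readLines(file).filterNot(_.trim == key))
  }
}
```

Matching on the exact key line means the revert step leaves the user's own 
entries untouched. A real implementation would also need to keep the file 
permissions that sshd expects and to cope with tasks that fail before they 
reach the cleanup step.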

The example code would look like this:

 
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk.
  ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // Generate and propagate SSH keys.
    // Wait for keys to set up in other tasks.

    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos.
    ???
    // Launch the MPI job by calling mpirun.
    ???
  } else {
    // Get and set up the public key.
    // Notify the 0-th task that the public key is set up.
  }
  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.
  // Collect output and return.
  ???
}
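
For the "set up MPI machine file" placeholder, one possible shape is sketched 
below. The MachineFile object is hypothetical, and the "host:count" line format 
is an assumption based on the MPICH (Hydra) host file convention; Open MPI 
uses "host slots=count" instead, so the format would have to match the MPI 
implementation actually installed.

```scala
// Hypothetical helper for illustration; not part of Spark or mpich2-yarn.
object MachineFile {
  // Render one "host:count" line per distinct host, in first-seen order,
  // where count is the number of barrier tasks running on that host.
  def render(hosts: Seq[String]): String =
    hosts.distinct
      .map(host => s"$host:${hosts.count(_ == host)}\n")
      .mkString
}
```

The 0-th task could write MachineFile.render(hosts) to a local file and pass 
that file to mpirun.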
 

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> --------------------------------------------------------
>
>                 Key: SPARK-24723
>                 URL: https://issues.apache.org/jira/browse/SPARK-24723
>             Project: Spark
>          Issue Type: Story
>          Components: ML, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



