Hi Vinay,

IIRC, YARN will set the host's Hadoop environment in the container launch script by default. In the Submarine case, the user's worker command is used to generate a worker script, which is invoked from that container launch script. If Submarine doesn't override the default Hadoop environment variables, HDFS reads/writes inside the container may fail because the Hadoop location is missing or wrong. So even if a Docker image is built with the correct Hadoop environment, it still needs this override to use the HDFS library inside the container. This is a consequence of YARN's Docker support, and Submarine is working around it here.
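To make it concrete, here is a rough sketch of what the override effectively has to achieve inside the container (the paths come from the example further down in this thread; the actual launch script Submarine generates may look different):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/   # from --env DOCKER_JAVA_HOME=...
export HADOOP_HDFS_HOME=/hadoop-3.1.0                     # from --env DOCKER_HADOOP_HDFS_HOME=...
export PATH=$HADOOP_HDFS_HOME/bin:$JAVA_HOME/bin:$PATH
# quick sanity check that the HDFS client shipped in the image can reach the cluster
hdfs dfs -ls hdfs://default/dataset/cifar-10-data

Without the override, the host's Hadoop locations leak into the container and point at paths that don't exist in the image.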
Submarine is evolving rapidly, so please share your thoughts if anything is inconvenient for you.

Thanks,
Zhankun

On Mon, 25 Feb 2019 at 12:22, Vinay Kashyap <vinu.k...@gmail.com> wrote:

> Hi Zhankun,
> Thanks for the reply.
>
> Regarding Question 1: Okay, I understand. Let me try configuring multiple
> input path placeholders and refer to them in the worker launch command.
>
> Regarding Question 2:
> What I did not understand is why YARN has to set anything related to the
> Hadoop which runs inside the container. The Hadoop environment and the
> worker code that reads it are completely isolated within the Docker
> container, so the worker scripts should already know where HADOOP_HOME is
> inside the container, right? There is another argument,
> *--checkpoint_path*, which acts as the path where all the outputs (models
> or datasets) produced by the worker code inside the Docker container are
> written. Hence, *--input_path* acts as the entry point which will be
> localized and *--checkpoint_path* acts as the exit point, and both of
> these are HDFS paths on the cluster running outside the Docker container.
> So why should YARN know the Hadoop configuration which is inside the
> container?
>
> Thanks and regards
> Vinay Kashyap
>
> On Fri, Feb 22, 2019 at 7:39 PM zhankun tang <tangzhan...@gmail.com>
> wrote:
>
>> Hi Vinay,
>>
>> For question one, IIRC, we cannot set multiple *--input_path* flags at
>> present. "--input_path" was originally designed as a placeholder that
>> stores a path, and that path is then used to replace "%input_path%" in
>> the worker command, like "python worker.sh -input %input_path% ..".
>> So from this perspective, you can directly append the other input paths
>> to your worker command in your own way.
>>
>> For question two, YARN might set a wrong HADOOP_COMMON_HOME by default,
>> so Submarine provides the environment variables to be set in the
>> worker's launch script if the worker wants to access HDFS.
>> And there is no data-plane relation between the outside Hadoop and the
>> container, except that YARN will localize resources for the container.
>>
>> Hope this answers your questions.
>>
>> Best Regards,
>> Zhankun
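To make that question-one workaround concrete, here is a rough sketch of the idea (only --input_path, --worker_launch_cmd and %input_path% are real Submarine pieces; the extra flag name is a placeholder your own worker script would define):

  --input_path hdfs://default/dataset/cifar-10-data \
  --worker_launch_cmd "python worker.sh -input %input_path% -extra_input hdfs://default/dataset/other-data .." \

Submarine only substitutes the single --input_path into the command; any additional HDFS paths passed this way have to be read by your own worker code, which is another reason the HDFS client inside the image needs to work.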
>>
>> On Fri, 22 Feb 2019 at 15:35, Vinay Kashyap <vinu.k...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am using Hadoop 3.2.0 and trying a few examples with Submarine to run
>>> TensorFlow jobs in a Docker container.
>>> I would like to understand a few details regarding reading/writing HDFS
>>> data during/after application launch/execution. I have highlighted the
>>> questions inline.
>>>
>>> When launching an application which reads input from HDFS, we configure
>>> *--input_path* to an HDFS path, as in the standard example:
>>>
>>> yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
>>>   --name tf-job-001 --docker_image <your docker image> \
>>>   --input_path hdfs://default/dataset/cifar-10-data \
>>>   --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
>>>   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>>>   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>>>   --num_workers 2 \
>>>   --worker_resources memory=8G,vcores=2,gpu=1 \
>>>   --worker_launch_cmd "cmd for worker ..." \
>>>   --num_ps 2 \
>>>   --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps" \
>>>
>>> *Question 1:* What if I have more than one dataset in separate HDFS
>>> paths? Can --input_path take multiple paths in any fashion, or is it
>>> expected that all the datasets are maintained under one path?
>>>
>>> Regarding the statements "DOCKER_JAVA_HOME points to JAVA_HOME inside
>>> the Docker image" and "DOCKER_HADOOP_HDFS_HOME points to
>>> HADOOP_HDFS_HOME inside the Docker image":
>>>
>>> *Question 2:* What is the exact expectation here? That is, is there any
>>> relation/connection with the Hadoop running outside the Docker
>>> container? I guess reading HDFS data into the Docker container happens
>>> during container localization, but how does the output data get written
>>> back to the HDFS running outside the Docker container?
>>>
>>> Assume a scenario where Application 1 creates a model and Application 2
>>> performs scoring, with both applications running in separate Docker
>>> containers. I would like to understand how data is read and written
>>> across the applications in this case.
>>> It would be of great help if anyone could guide me in understanding
>>> this or direct me to a blog or write-up which explains the above.
>>>
>>> *Thanks and regards*
>>> *Vinay Kashyap*
>>>
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>
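P.S. On the two-application scenario at the end of the thread: since --checkpoint_path is just an HDFS directory that the first job writes its outputs to, one possible wiring is to hand the same path to the scoring job and let its worker read the model over HDFS. This is only a rough sketch (the job names, script names and their flags are placeholders, not Submarine options):

# Application 1: training, writes the model under its checkpoint path
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-train-001 --docker_image <your docker image> \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --worker_launch_cmd "python train.py -export_dir hdfs://default/tmp/cifar-10-jobdir .." \
  ...

# Application 2: scoring, reads the model that application 1 wrote
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-score-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/scoring-data \
  --worker_launch_cmd "python score.py -input %input_path% -model_dir hdfs://default/tmp/cifar-10-jobdir .." \
  ...

In both jobs the reads and writes go through the HDFS client inside the image, which is why the DOCKER_HADOOP_HDFS_HOME override discussed above matters.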