Hi Vinay,

IIRC, YARN sets the host's Hadoop environment variables in the container launch
script by default. And in the submarine case, the user's worker command is used
to generate a worker script, which is invoked from that container launch script.
If submarine didn't override the default Hadoop environment variables, HDFS
reads/writes in the container might fail because the Hadoop location would be
missing or incorrect.
So even if a Docker image is built with the correct Hadoop environment set, this
override still seems necessary to use the HDFS libraries inside a container. This
seems to be caused by YARN's Docker support, and submarine is working around it
here.
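
To make this concrete, here is a rough illustration of the default case (the
paths below are made up, just to show the shape of the problem):

  # Illustrative only, not the exact launch_container.sh content.
  export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"   # host JDK location
  export HADOOP_HDFS_HOME="/usr/local/hadoop-3.2.0"      # host Hadoop install
  export HADOOP_COMMON_HOME="/usr/local/hadoop-3.2.0"    # host Hadoop install

These host locations usually don't exist inside the Docker image, so HDFS access
from the worker fails until they are overridden.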

Submarine is evolving rapidly, so please share your thoughts if this workaround
is inconvenient for you.

Thanks,
Zhankun

On Mon, 25 Feb 2019 at 12:22, Vinay Kashyap <vinu.k...@gmail.com> wrote:

> Hi Zhankun,
> Thanks for the reply.
>
> Regarding Question 1: Okay, I understand. Let me try configuring multiple
> input path placeholders and referring to them in the worker launch command.
>
> Regarding Question 2:
> What I did not understand is why YARN has to set anything related to the
> Hadoop that runs inside the container. The Hadoop environment and the worker
> code that reads it are completely isolated within the Docker container. In
> that case, the worker scripts should know where HADOOP_HOME is inside the
> container, right? There is another argument called *--checkpoint_path*, which
> acts as the path where all the outputs (models or datasets) produced by the
> worker code inside the Docker container are written. Hence, *--input_path*
> acts as the entry point, which will be localized, and *--checkpoint_path*
> acts as the exit point, and both of these are HDFS paths on the cluster
> running outside the Docker container. So why should YARN know the Hadoop
> configuration that is inside the container?
>
> Thanks and regards
> Vinay Kashyap
>
> On Fri, Feb 22, 2019 at 7:39 PM zhankun tang <tangzhan...@gmail.com>
> wrote:
>
>> Hi Vinay,
>>
>> For question one, IIRC, we cannot set multiple *--input_path* flags at
>> present. "--input_path" was originally designed as a placeholder that stores
>> a path, and that path is then used to replace "%input_path%" in the worker
>> command, like "python worker.sh -input %input_path% ..".
>> So from this perspective, you can directly append the other input paths to
>> your worker command in your own way.
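>>
>> For example, something like this (just a rough sketch; the script name, its
>> flags and the second HDFS path below are made up, while --input_path,
>> --worker_launch_cmd and %input_path% are the real pieces):
>>
>>   ... job run \
>>     --input_path hdfs://default/dataset/cifar-10-data \
>>     --worker_launch_cmd "python train.py \
>>         --data_dir=%input_path% \
>>         --extra_data_dir=hdfs://default/dataset/extra-data ..." \
>>     ...
>>
>> The first dataset goes through the --input_path placeholder, and the second
>> one is written into the worker command directly as a plain HDFS path.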
>>
>> For question two, it is because YARN might set a wrong HADOOP_COMMON_HOME by
>> default. So submarine provides these environment variables to be set in the
>> worker's launch script if the worker wants to access HDFS.
>> And there is no data-plane relationship between the Hadoop outside and the
>> container, except that YARN will localize resources for the container.
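>>
>> Roughly speaking, the override makes the worker's launch environment end up
>> with something like the following (an illustration of the effect, not the
>> exact script submarine generates):
>>
>>   # illustrative sketch only
>>   export JAVA_HOME="${DOCKER_JAVA_HOME}"
>>   export HADOOP_HDFS_HOME="${DOCKER_HADOOP_HDFS_HOME}"
>>   export PATH="${JAVA_HOME}/bin:${HADOOP_HDFS_HOME}/bin:${PATH}"
>>   # so the worker can reach HDFS from inside the container, e.g.
>>   hdfs dfs -ls hdfs://default/dataset/cifar-10-data
>>
>> That way the HDFS client baked into the image is used rather than a
>> non-existent host path.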
>>
>> Hope this can answer your questions.
>>
>> Best Regards,
>> Zhankun
>>
>> On Fri, 22 Feb 2019 at 15:35, Vinay Kashyap <vinu.k...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am using Hadoop 3.2.0. I am trying a few examples using Submarine to run
>>> TensorFlow jobs in a Docker container.
>>> I would like to understand a few details regarding reading/writing HDFS data
>>> during/after application launch/execution. I have highlighted the questions
>>> inline.
>>>
>>> When launching an application which reads input from HDFS, we configure
>>> *--input_path* to an HDFS path, as mentioned in the standard example.
>>>
>>> yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
>>>  --name tf-job-001 --docker_image <your docker image> \
>>>  --input_path hdfs://default/dataset/cifar-10-data \
>>>  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
>>>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>>>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>>>  --num_workers 2 \
>>>  --worker_resources memory=8G,vcores=2,gpu=1 \
>>>  --worker_launch_cmd "cmd for worker ..." \
>>>  --num_ps 2 \
>>>  --ps_resources memory=4G,vcores=2,gpu=0 \
>>>  --ps_launch_cmd "cmd for ps"
>>>
>>> *Question 1: What if I have more than one dataset, each in a separate HDFS
>>> path? Can --input_path take multiple paths in some fashion, or are all the
>>> datasets expected to be maintained under one path?*
>>>
>>> "DOCKER_JAVA_HOME points to JAVA_HOME inside Docker image"
>>> and "DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside Docker
>>> image".
>>>
>>> *Question 2: What is the exact expectation here? That is, is there any
>>> relation/connection with the Hadoop running outside the Docker container?
>>> I guess reading HDFS data into the Docker container happens during container
>>> localization, but how does writing the output data back to the HDFS running
>>> outside the Docker container happen?*
>>>
>>> Assume a scenario where Application 1 creates a model and Application 2
>>> performs scoring, and both applications run in separate Docker containers.
>>> I would like to understand how the data reads and writes across applications
>>> happen in this case.
>>> It would be of great help if anyone could guide me in understanding this or
>>> direct me to a blog or write-up which explains the above.
>>>
>>> *Thanks and regards*
>>> *Vinay Kashyap*
>>>
>>
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>
