Hi Vinay,

IIRC, YARN will set the host's Hadoop environment in the container launch script by default. In the Submarine case, the user's worker command is used to generate a worker script, which is invoked from that container launch script. If Submarine doesn't override the default Hadoop environment variables, HDFS reads/writes inside the container may fail because the Hadoop location is missing or wrong. So even if a Docker image is built with the correct Hadoop environment, it still needs this override to use the HDFS library inside the container. This is a consequence of YARN's Docker support, and Submarine is working around it here.
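To make it concrete, here is a rough sketch of what the override effectively has to achieve inside the container (the paths come from the example further down in this thread; the actual launch script Submarine generates may look different):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/   # from --env DOCKER_JAVA_HOME=...
export HADOOP_HDFS_HOME=/hadoop-3.1.0                     # from --env DOCKER_HADOOP_HDFS_HOME=...
export PATH=$HADOOP_HDFS_HOME/bin:$JAVA_HOME/bin:$PATH
# quick sanity check that the HDFS client shipped in the image can reach the cluster
hdfs dfs -ls hdfs://default/dataset/cifar-10-data

Without the override, the host's Hadoop locations leak into the container and point at paths that don't exist in the image.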
Submarine is evolving rapidly, so please share your thoughts if anything is inconvenient for you.

Thanks,
Zhankun

On Mon, 25 Feb 2019 at 12:22, Vinay Kashyap <vinu.k...@gmail.com> wrote:

> Hi Zhankun,
> Thanks for the reply.
>
> Regarding Question 1: Okay, I understand. Let me try configuring multiple
> input path placeholders and refer to them in the worker launch command.
>
> Regarding Question 2:
> What I did not understand is why YARN has to set anything related to the
> Hadoop which runs inside the container. The Hadoop environment and the
> worker code that reads it are completely isolated within the Docker
> container, so the worker scripts should already know where HADOOP_HOME is
> inside the container, right? There is another argument,
> *--checkpoint_path*, which acts as the path where all the outputs (models
> or datasets) produced by the worker code inside the Docker container are
> written. Hence, *--input_path* acts as the entry point which will be
> localized and *--checkpoint_path* acts as the exit point, and both of
> these are HDFS paths on the cluster running outside the Docker container.
> So why should YARN know the Hadoop configuration which is inside the
> container?
>
> Thanks and regards
> Vinay Kashyap
>
> On Fri, Feb 22, 2019 at 7:39 PM zhankun tang <tangzhan...@gmail.com>
> wrote:
>
>> Hi Vinay,
>>
>> For question one, IIRC, we cannot set multiple *--input_path* flags at
>> present. "--input_path" was originally designed as a placeholder that
>> stores a path, and that path is then used to replace "%input_path%" in
>> the worker command, like "python worker.sh -input %input_path% ..".
>> So from this perspective, you can directly append the other input paths
>> to your worker command in your own way.
>>
>> For question two, YARN might set a wrong HADOOP_COMMON_HOME by default,
>> so Submarine provides the environment variables to be set in the
>> worker's launch script if the worker wants to access HDFS.
>> And there is no data-plane relation between the outside Hadoop and the
>> container, except that YARN will localize resources for the container.
>>
>> Hope this answers your questions.
>>
>> Best Regards,
>> Zhankun
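To make that question-one workaround concrete, here is a rough sketch of the idea (only --input_path, --worker_launch_cmd and %input_path% are real Submarine pieces; the extra flag name is a placeholder your own worker script would define):

  --input_path hdfs://default/dataset/cifar-10-data \
  --worker_launch_cmd "python worker.sh -input %input_path% -extra_input hdfs://default/dataset/other-data .." \

Submarine only substitutes the single --input_path into the command; any additional HDFS paths passed this way have to be read by your own worker code, which is another reason the HDFS client inside the image needs to work.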
>>
>> On Fri, 22 Feb 2019 at 15:35, Vinay Kashyap <vinu.k...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am using Hadoop 3.2.0 and trying a few examples with Submarine to run
>>> TensorFlow jobs in a Docker container.
>>> I would like to understand a few details regarding reading/writing HDFS
>>> data during/after application launch/execution. I have highlighted the
>>> questions inline.
>>>
>>> When launching an application which reads input from HDFS, we configure
>>> *--input_path* to an HDFS path, as in the standard example:
>>>
>>> yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
>>>   --name tf-job-001 --docker_image <your docker image> \
>>>   --input_path hdfs://default/dataset/cifar-10-data \
>>>   --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
>>>   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>>>   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>>>   --num_workers 2 \
>>>   --worker_resources memory=8G,vcores=2,gpu=1 \
>>>   --worker_launch_cmd "cmd for worker ..." \
>>>   --num_ps 2 \
>>>   --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps" \
>>>
>>> *Question 1:* What if I have more than one dataset in separate HDFS
>>> paths? Can --input_path take multiple paths in any fashion, or is it
>>> expected that all the datasets are maintained under one path?
>>>
>>> Regarding the statements "DOCKER_JAVA_HOME points to JAVA_HOME inside
>>> the Docker image" and "DOCKER_HADOOP_HDFS_HOME points to
>>> HADOOP_HDFS_HOME inside the Docker image":
>>>
>>> *Question 2:* What is the exact expectation here? That is, is there any
>>> relation/connection with the Hadoop running outside the Docker
>>> container? I guess reading HDFS data into the Docker container happens
>>> during container localization, but how does the output data get written
>>> back to the HDFS running outside the Docker container?
>>>
>>> Assume a scenario where Application 1 creates a model and Application 2
>>> performs scoring, with both applications running in separate Docker
>>> containers. I would like to understand how data is read and written
>>> across the applications in this case.
>>> It would be of great help if anyone could guide me in understanding
>>> this or direct me to a blog or write-up which explains the above.
>>>
>>> *Thanks and regards*
>>> *Vinay Kashyap*
>>>
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>
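P.S. On the two-application scenario at the end of the thread: since --checkpoint_path is just an HDFS directory that the first job writes its outputs to, one possible wiring is to hand the same path to the scoring job and let its worker read the model over HDFS. This is only a rough sketch (the job names, script names and their flags are placeholders, not Submarine options):

# Application 1: training, writes the model under its checkpoint path
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-train-001 --docker_image <your docker image> \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --worker_launch_cmd "python train.py -export_dir hdfs://default/tmp/cifar-10-jobdir .." \
  ...

# Application 2: scoring, reads the model that application 1 wrote
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-score-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/scoring-data \
  --worker_launch_cmd "python score.py -input %input_path% -model_dir hdfs://default/tmp/cifar-10-jobdir .." \
  ...

In both jobs the reads and writes go through the HDFS client inside the image, which is why the DOCKER_HADOOP_HDFS_HOME override discussed above matters.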