[ 
https://issues.apache.org/jira/browse/SUBMARINE-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated SUBMARINE-47:
------------------------------------
    Attachment: SUBMARINE-47.004.patch

> Provide an implementation to parse configuration values from a YAML file
> ------------------------------------------------------------------------
>
>                 Key: SUBMARINE-47
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-47
>             Project: Hadoop Submarine
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: SUBMARINE-47.001.patch, SUBMARINE-47.002.patch, 
> SUBMARINE-47.002.patch, SUBMARINE-47.003.patch, SUBMARINE-47.003.patch, 
> SUBMARINE-47.004.patch, valid-config.yaml
>
>
> In the Submarine design doc 
> ([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]),
>  under section "Job commands" (submarine job run), the document talks about 
> the need to implement a YAML parser as an alternative configuration source of 
> Submarine.
>  This jira provides an implementation of this YAML configuration parser.
> *Some details on the implementation:*
> *Data model and YAML library:* 
>  SnakeYAML was the library of choice, as we already include it in other 
> Hadoop maven modules.
>  SnakeYAML requires data classes to parse the YAML structure into Java 
> classes, this is very similar to parsing JSON data.
>  The data classes are under this package: 
> org.apache.hadoop.yarn.submarine.client.cli.param.yaml
>  The data model is the following: 
>  1. On the root level, we have the class YamlConfigFile. This class has 
> fields for configs, roles, scheduling, etc.
>  2. Class Configs is a class that stores various config values.
>  3. We have the class Roles, that stores the role configurations, currently 
> we only have Worker and Ps (ParameterServer).
>  4. There's a base class for Roles, called Role. For some reasons, SnakeYAML 
> did not like this class as an abstract class, but it's not that important to 
> have it like that, so I stopped caring.
> There are several other data classes, please note that all of these classes 
> are trying to "mirror" the example YAML file from the design doc, so you can 
> check that for more details.
> *Testing:* 
>  Added a new class called TestRunJobCliParsingYaml that exclusively verifies 
> the correctness of YAML parsing.
>  I tried to define the names of the testcases as descriptive and easy to 
> understand as possible.
>  Covering some edge cases was also a quite important factor, like checking 
> how parsing handles non-existing files, empty files, etc.
> *ParameterHolder:*
> In order to minimize changes in RunJobParameters, I introduced a 
> ParameterHolder class that is passed 
> RunJobParameters#updateParametersByParsedCommandline instead of the parsed 
> CLI object we had before. This way, I could use the ParameterHolder to store 
> the parsed CLI values along with parsed YAML values, so it's a single source 
> of configuration. This class can act as a "decision point" about future needs 
> on value precedences: For example, if we want to have precedence for some 
> configs coming from the CLI over the same configs coming from the YAML file 
> (or vica-versa), it is easily doable with the current implementation.
>  Currently, all values coming from the YAML config are in precedence.
> *What CLI options are covered by YAML:*
> Every CLI option have it's YAML counterpart, so the coverage is 100%.
>  See this method for all the available options: RunJobCli#generateOptions.
>  However, the design doc contains 2 fields (called reservation and placement) 
> under the 'scheduling' section that are not added to the YAML data classes. 
> The reason for this is that I haven't found any implementation for these 
> properties with the original CLI parser, either.
> *Questions:* 
>  1. Do we want to allow mixing of CLI configs and YAML configs? We could 
> either fail-fast if a config is defined twice. With my current 
> implementation, the values coming from YAML always have precedence over CLI 
> values, so defining a config twice is not an error case yet.
>  2. What types of roles should we allow? Is there any additional apart from 
> Worker and Ps?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to