[ https://issues.apache.org/jira/browse/SUBMARINE-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794139#comment-16794139 ]
Xun Liu commented on SUBMARINE-47: ---------------------------------- Thank you very much for your contribution. :) > Provide an implementation to parse configuration values from a YAML file > ------------------------------------------------------------------------ > > Key: SUBMARINE-47 > URL: https://issues.apache.org/jira/browse/SUBMARINE-47 > Project: Hadoop Submarine > Issue Type: New Feature > Reporter: Szilard Nemeth > Assignee: Szilard Nemeth > Priority: Major > Attachments: SUBMARINE-47.001.patch, SUBMARINE-47.002.patch, > valid-config.yaml > > > In the Submarine design doc > ([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]), > under section "Job commands" (submarine job run), the document talks about > the need to implement a YAML parser as an alternative configuration source of > Submarine. > This jira provides an implementation of this YAML configuration parser. > *Some details on the implementation:* > *Data model and YAML library:* > SnakeYAML was the library of choice, as we already include it in other > Hadoop maven modules. > SnakeYAML requires data classes to parse the YAML structure into Java > classes, this is very similar to parsing JSON data. > The data classes are under this package: > org.apache.hadoop.yarn.submarine.client.cli.param.yaml > The data model is the following: > 1. On the root level, we have the class YamlConfigFile. This class has > fields for configs, roles, scheduling, etc. > 2. Class Configs is a class that stores various config values. > 3. We have the class Roles, that stores the role configurations, currently > we only have Worker and Ps (ParameterServer). > 4. There's a base class for Roles, called Role. For some reasons, SnakeYAML > did not like this class as an abstract class, but it's not that important to > have it like that, so I stopped caring. > There are several other data classes, please note that all of these classes > are trying to "mirror" the example YAML file from the design doc, so you can > check that for more details. > *Testing:* > Added a new class called TestRunJobCliParsingYaml that exclusively verifies > the correctness of YAML parsing. > I tried to define the names of the testcases as descriptive and easy to > understand as possible. > Covering some edge cases was also a quite important factor, like checking > how parsing handles non-existing files, empty files, etc. > *ParameterHolder:* > In order to minimize changes in RunJobParameters, I introduced a > ParameterHolder class that is passed > RunJobParameters#updateParametersByParsedCommandline instead of the parsed > CLI object we had before. This way, I could use the ParameterHolder to store > the parsed CLI values along with parsed YAML values, so it's a single source > of configuration. This class can act as a "decision point" about future needs > on value precedences: For example, if we want to have precedence for some > configs coming from the CLI over the same configs coming from the YAML file > (or vica-versa), it is easily doable with the current implementation. > Currently, all values coming from the YAML config are in precedence. > *What CLI options are covered by YAML:* > Every CLI option have it's YAML counterpart, so the coverage is 100%. > See this method for all the available options: RunJobCli#generateOptions. > However, the design doc contains 2 fields (called reservation and placement) > under the 'scheduling' section that are not added to the YAML data classes. > The reason for this is that I haven't found any implementation for these > properties with the original CLI parser, either. > *Questions:* > 1. Do we want to allow mixing of CLI configs and YAML configs? We could > either fail-fast if a config is defined twice. With my current > implementation, the values coming from YAML always have precedence over CLI > values, so defining a config twice is not an error case yet. > 2. What types of roles should we allow? Is there any additional apart from > Worker and Ps? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)