[ 
https://issues.apache.org/jira/browse/SUBMARINE-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated SUBMARINE-47:
------------------------------------
    Description: 
In the Submarine design doc 
([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]),
 under section "Job commands" (submarine job run), the document talks about the 
need to implement a YAML parser as an alternative configuration source of 
Submarine.
 This jira provides an implementation of this YAML configuration parser.

*Some details on the implementation:*

*Data model and YAML library:* 
 SnakeYAML was the library of choice, as we already include it in other Hadoop 
maven modules.
 SnakeYAML requires data classes to parse the YAML structure into Java classes, 
this is very similar to parsing JSON data.
 The data classes are under this package: 
org.apache.hadoop.yarn.submarine.client.cli.param.yaml
 The data model is the following: 
 1. On the root level, we have the class YamlConfigFile. This class has fields 
for configs, roles, scheduling, etc.
 2. Class Configs is a class that stores various config values.
 3. We have the class Roles, that stores the role configurations, currently we 
only have Worker and Ps (ParameterServer).
 4. There's a base class for Roles, called Role. For some reasons, SnakeYAML 
did not like this class as an abstract class, but it's not that important to 
have it like that, so I stopped caring.

There are several other data classes, please note that all of these classes are 
trying to "mirror" the example YAML file from the design doc, so you can check 
that for more details.

*Testing:* 
 Added a new class called TestRunJobCliParsingYaml that exclusively verifies 
the correctness of YAML parsing.
 I tried to define the names of the testcases as descriptive and easy to 
understand as possible.
 Covering some edge cases was also a quite important factor, like checking how 
parsing handles non-existing files, empty files, etc.

*ParameterHolder:*

In order to minimize changes in RunJobParameters, I introduced a 
ParameterHolder class that is passed 
RunJobParameters#updateParametersByParsedCommandline instead of the parsed CLI 
object we had before. This way, I could use the ParameterHolder to store the 
parsed CLI values along with parsed YAML values, so it's a single source of 
configuration. This class can act as a "decision point" about future needs on 
value precedences: For example, if we want to have precedence for some configs 
coming from the CLI over the same configs coming from the YAML file (or 
vica-versa), it is easily doable with the current implementation.
 Currently, all values coming from the YAML config are in precedence.

*What CLI options are covered by YAML:*

Every CLI option have it's YAML counterpart, so the coverage is 100%.
 See this method for all the available options: RunJobCli#generateOptions.
 However, the design doc contains 2 fields (called reservation and placement) 
under the 'scheduling' section that are not added to the YAML data classes. The 
reason for this is that I haven't found any implementation for these properties 
with the original CLI parser, either.

*Questions:* 
 1. Do we want to allow mixing of CLI configs and YAML configs? We could either 
fail-fast if a config is defined twice. With my current implementation, the 
values coming from YAML always have precedence over CLI values, so defining a 
config twice is not an error case yet.
 2. What types of roles should we allow? Is there any additional apart from 
Worker and Ps?

 

  was:
In the Submarine design doc 
([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]),
 under section "Job commands" (submarine job run), the document talks about the 
need to implement a YAML parser as an alternative configuration source of 
Submarine.
 This jira provides an implementation of this YAML configuration parser.

*Some details on the implementation:*

*Data model and YAML library:* 
 SnakeYAML was the library of choice, as we already include it in other Hadoop 
maven modules.
 SnakeYAML requires data classes to parse the YAML structure into Java classes, 
this is very similar to parsing JSON data.
 The data classes are under this package: 
org.apache.hadoop.yarn.submarine.client.cli.param.yaml
 The data model is the following: 
 1. On the root level, we have the class YamlConfigFile. This class has fields 
for configs, roles, scheduling, etc.
 2. Class Configs is a class that stores various config values.
 3. We have the class Roles, that stores the role configurations, currently we 
only have Worker and Ps (ParameterServer).
 4. There's a base class for Roles, called Role. For some reasons, SnakeYAML 
did not like this class as an abstract class, but it's not that important to 
have it like that, so I stopped caring.

There are several other data classes, please note that all of these classes are 
trying to "mirror" the example YAML file from the design doc, so you can check 
that for more details.

*Testing:* 
 Added a new class called TestRunJobCliParsingYaml that exclusively verifies 
the correctness of YAML parsing.
 I tried to defined the names of the testcases as descriptive and easy to 
understand as possible.
 Covering some edge cases was also a quite important factor, like checking how 
parsing handles non-existing files, empty files, etc.

*ParameterHolder:*

In order to minimize changes in RunJobParameters, I introduced a 
ParameterHolder class that is passed 
RunJobParameters#updateParametersByParsedCommandline instead of the parsed CLI 
object we had before. This way, I could use the ParameterHolder to store the 
parsed CLI values along with parsed YAML values, so it's a single source of 
configuration. This class can act as a "decision point" about future needs on 
value precedences: For example, if we want to have precedence for some configs 
coming from the CLI over the same configs coming from the YAML file (or 
vica-versa), it is easily doable with the current implementation.
 Currently, all values coming from the YAML config are in precedence.

*What CLI options are covered by YAML:*

Every CLI option have it's YAML counterpart, so the coverage is 100%.
 See this method for all the available options: RunJobCli#generateOptions.
 However, the design doc contains 2 fields (called reservation and placement) 
under the 'scheduling' section that are not added to the YAML data classes. The 
reason for this is that I haven't found any implementation for these properties 
with the original CLI parser, either.

*Questions:* 
 1. Do we want to allow mixing of CLI configs and YAML configs? We could either 
fail-fast if a config is defined twice. With my current implementation, the 
values coming from YAML always have precedence over CLI values, so defining a 
config twice is not an error case yet.
 2. What types of roles should we allow? Is there any additional apart from 
Worker and Ps?

 


> Provide an implementation to parse configuration values from a YAML file
> ------------------------------------------------------------------------
>
>                 Key: SUBMARINE-47
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-47
>             Project: Hadoop Submarine
>          Issue Type: New Feature
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: SUBMARINE-47.001.patch
>
>
> In the Submarine design doc 
> ([https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.klmv5hrbgx9l]),
>  under section "Job commands" (submarine job run), the document talks about 
> the need to implement a YAML parser as an alternative configuration source of 
> Submarine.
>  This jira provides an implementation of this YAML configuration parser.
> *Some details on the implementation:*
> *Data model and YAML library:* 
>  SnakeYAML was the library of choice, as we already include it in other 
> Hadoop maven modules.
>  SnakeYAML requires data classes to parse the YAML structure into Java 
> classes, this is very similar to parsing JSON data.
>  The data classes are under this package: 
> org.apache.hadoop.yarn.submarine.client.cli.param.yaml
>  The data model is the following: 
>  1. On the root level, we have the class YamlConfigFile. This class has 
> fields for configs, roles, scheduling, etc.
>  2. Class Configs is a class that stores various config values.
>  3. We have the class Roles, that stores the role configurations, currently 
> we only have Worker and Ps (ParameterServer).
>  4. There's a base class for Roles, called Role. For some reasons, SnakeYAML 
> did not like this class as an abstract class, but it's not that important to 
> have it like that, so I stopped caring.
> There are several other data classes, please note that all of these classes 
> are trying to "mirror" the example YAML file from the design doc, so you can 
> check that for more details.
> *Testing:* 
>  Added a new class called TestRunJobCliParsingYaml that exclusively verifies 
> the correctness of YAML parsing.
>  I tried to define the names of the testcases as descriptive and easy to 
> understand as possible.
>  Covering some edge cases was also a quite important factor, like checking 
> how parsing handles non-existing files, empty files, etc.
> *ParameterHolder:*
> In order to minimize changes in RunJobParameters, I introduced a 
> ParameterHolder class that is passed 
> RunJobParameters#updateParametersByParsedCommandline instead of the parsed 
> CLI object we had before. This way, I could use the ParameterHolder to store 
> the parsed CLI values along with parsed YAML values, so it's a single source 
> of configuration. This class can act as a "decision point" about future needs 
> on value precedences: For example, if we want to have precedence for some 
> configs coming from the CLI over the same configs coming from the YAML file 
> (or vica-versa), it is easily doable with the current implementation.
>  Currently, all values coming from the YAML config are in precedence.
> *What CLI options are covered by YAML:*
> Every CLI option have it's YAML counterpart, so the coverage is 100%.
>  See this method for all the available options: RunJobCli#generateOptions.
>  However, the design doc contains 2 fields (called reservation and placement) 
> under the 'scheduling' section that are not added to the YAML data classes. 
> The reason for this is that I haven't found any implementation for these 
> properties with the original CLI parser, either.
> *Questions:* 
>  1. Do we want to allow mixing of CLI configs and YAML configs? We could 
> either fail-fast if a config is defined twice. With my current 
> implementation, the values coming from YAML always have precedence over CLI 
> values, so defining a config twice is not an error case yet.
>  2. What types of roles should we allow? Is there any additional apart from 
> Worker and Ps?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to