[jira] [Comment Edited] (YARN-10422) Create the script responsible for collecting the bundle data

Szilard Nemeth (Jira) Tue, 20 Oct 2020 01:12:00 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213170#comment-17213170
 ]


Szilard Nemeth edited comment on YARN-10422 at 10/20/20, 8:10 AM:
------------------------------------------------------------------

Hi [~bteke] ,

Thanks for working on this.
 I think in general, this looks good.
 I know this is PoC, but anyway I'm leaving my comments:

1. Where do you want to indicate the expected python version (interpreter)? Do 
you have some plans to add a document + adding this requirement to the python 
file (diagnostics_collector.py) as well?
 To me it doesn't really matter which Python version we are targeting, but we 
need to keep in mind there are huge differences between Python 2 vs. 3

2. I find it more cleaner to define a main method (you can call whatever you 
want but I think main is just fine) and call it with:
{code:java}
if __name__ == '__main__':
    main()
{code}
I usually place this at the bottom of the file, see this as an example:
 
[https://github.com/szilard-nemeth/google-chrome-toolkit/blob/ccbef22e8438420e299fa81c7d0627a191f1eb18/googlechrometoolkit/main.py#L367]
 I don't know how conventional is this, but I find it cleaner than just adding 
the code to the "global context", like you did starting from line 219.

3. In the help text of --command, you have: "Initiate the diagnostic 
information collecton", there's a typo in word 'collection'.

4. You can use shorthands for expressions like:
{code:java}
if RM_ADDRESS is None:
{code}
just by using
{code:java}
if not RM_ADDRESS:
{code}
5. Can you print what command will be executed in a generic way?
{code:java}
func = ISSUE_MAP[args.command]
# Print the command here
print(func())
{code}
6. Nit: ISSUE_MAP can be moved to the top of the file.

7. Can you add documentation to method 'application_failed'? I mean you have 
some list of the functionality, but it would be better to have complete 
sentences / explanation of what the function is doing.

8. Nit: The following code can be shortened as well:
{code:java}
if args.arguments is None or len(args.arguments) is 0:
{code}
to:
{code:java}
if not args.arguments or len(args.arguments) is 0:
{code}
or simply to:
{code:java}
if not args.arguments:
{code}
I think Python gives Falsy value for both cases: When the object is None or 
when an iterable object is "empty".
 You may try this out in a Python console first :)

9. It would be useful if create_output_dir would log something when it creates 
the directory. As you are not using logger just simple print statements, it's 
your call if you want to do this or not.

10. run_command would be more readable to me if the command string would be 
passed with a single argument and inside the method you can split it by spaces 
easily and pass it to subprocess.Popen. I mean, you can hide the implementation 
details of subprocess.Popen this way.

11. Nit: Can you extract the constant part of the JHS / RM URLs?
 I mean these are repeated:
 - /ws/v1/history/mapreduce/jobs/
 - /ws/v1/cluster/apps/

12. Generic question: What if the urllib2.urlopen calls fail, i.e. the servers 
are not available, any network error happened, etc? 
 I think this should be handled in one place with proper exit code / logging.

13. The following line can cause issues if the config does not contain the NM 
http address:
{code:java}
nm_address = job_attempts.find(".//nodeHttpAddress").text
{code}
Similarly for line:
{code:java}
nm_address = app_info.find("amHostHttpAddress").text
{code}
I know this is highly unlikely, but you need to handle this case somehow. I 
think the call to .text would raise a
{code:java}
AttributeError: 'NoneType' object has no attribute 'text'
{code}
or something similar.

14. You are using 'node_log' as the filename of the RM / NM logs. The call to
{code:java}
write_output(os.path.join(output_path, "node_log"), "nodemanager_log",
                     get_node_logs(nm_address, NM_LOG_REGEX))
{code}
will simply overwrite the previous file. This happens for both "job" and "app" 
cases.

15. You may add some error handling for this line:
{code:java}
output_path = create_output_dir(os.path.join(TEMP_DIR, 
"node_failure_{}".format(node_id.split(":")[0])))
{code}
For example, if the node id is not in the expected (colon-separated) format, 
you should indicate this with an error.


was (Author: snemeth):
Hi @bteke,

Thanks for working on this.
I think in general, this looks good.
I know this is PoC, but anyway I'm leaving my comments:


1. Where do you want to indicate the expected python version (interpreter)? Do 
you have some plans to add a document + adding this requirement to the python 
file (diagnostics_collector.py) as well?
To me it doesn't really matter which Python version we are targeting, but we 
need to keep in mind there are huge differences between Python 2 vs. 3

2. I find it more cleaner to define a main method (you can call whatever you 
want but I think main is just fine) and call it with: 
{code}
if __name__ == '__main__':
    main()
{code}
I usually place this at the bottom of the file, see this as an example:
https://github.com/szilard-nemeth/google-chrome-toolkit/blob/ccbef22e8438420e299fa81c7d0627a191f1eb18/googlechrometoolkit/main.py#L367
I don't know how conventional is this, but I find it cleaner than just adding 
the code to the "global context", like you did starting from line 219.

3. In the help text of --command, you have: "Initiate the diagnostic 
information collecton", there's a typo in word 'collection'.

4. You can use shorthands for expressions like: 
{code}
if RM_ADDRESS is None:
{code}

just by using 
{code}
if not RM_ADDRESS:
{code}

5. Can you print what command will be executed in a generic way?

{code}
func = ISSUE_MAP[args.command]
# Print the command here
print(func())
{code}

6. Nit: ISSUE_MAP can be moved to the top of the file.

7. Can you add documentation to method 'application_failed'? I mean you have 
some list of the functionality, but it would be better to have complete 
sentences / explanation of what the function is doing.

8. Nit: The following code can be shortened as well: 
{code}
if args.arguments is None or len(args.arguments) is 0:
{code}

to: 
{code}
if not args.arguments or len(args.arguments) is 0:
{code}

or simply to:
{code}
if not args.arguments:
{code}

I think Python gives Falsy value for both cases: When the object is None or 
when an iterable object is "empty".
You may try this out in a Python console first :)

9. It would be useful if create_output_dir would log something when it creates 
the directory. As you are not using logger just simple print statements, it's 
your call if you want to do this or not.

10. run_command would be more readable to me if the command string would be 
passed with a single argument and inside the method you can split it by spaces 
easily and pass it to subprocess.Popen. I mean, you can hide the implementation 
details of subprocess.Popen this way.

11. Nit: Can you extract the constant part of the JHS / RM URLs?
I mean these are repeated: 
- /ws/v1/history/mapreduce/jobs/
- /ws/v1/cluster/apps/

12. Generic question: What if the urllib2.urlopen calls fail, i.e. the servers 
are not available, any network error happened, etc? 
I think this should be handled in one place with proper exit code / logging.

13. The following line can cause issues if the config does not contain the NM 
http address: 
{code}
nm_address = job_attempts.find(".//nodeHttpAddress").text
{code}

Similarly for line: 
{code}
nm_address = app_info.find("amHostHttpAddress").text
{code}

I know this is highly unlikely, but you need to handle this case somehow. I 
think the call to .text would raise a 
{code}
AttributeError: 'NoneType' object has no attribute 'text'
{code}
or something similar.

14. You are using 'node_log' as the filename of the RM / NM logs. The call to 
{code}
write_output(os.path.join(output_path, "node_log"), "nodemanager_log",
                     get_node_logs(nm_address, NM_LOG_REGEX))
{code}
will simply overwrite the previous file. This happens for both "job" and "app" 
cases.

15. You may add some error handling for this line: 
{code}
output_path = create_output_dir(os.path.join(TEMP_DIR, 
"node_failure_{}".format(node_id.split(":")[0])))
{code}

For example, if the node id is not in the expected (colon-separated) format, 
you should indicate this with an error.

> Create the script responsible for collecting the bundle data
> ------------------------------------------------------------
>
>                 Key: YARN-10422
>                 URL: https://issues.apache.org/jira/browse/YARN-10422
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10422.POC.001.patch, YARN-10422.POC.002.patch, 
> YARN-10422.POC.003.patch
>
>
> The script should provide the list of diagnostic use-cases described in 
> YARN-10421. If a request comes in to the YarnDiagnosticCollector servlet, the 
> script will be invoked. It collects all the information required for that 
> diagnostic category and saves it into a configurable directory as a 
> compressed tar file. 
> An example of how the script could look like:
> {code:java}
> if [$1 = "listcommonissues"]    
>   echo "1, Application Failed"    
>   echo "2, Application Hanging"    
>   echo "3, Scheduler Related Issue"    
>   echo "4, RM failure to start"    
>   echo "5, NM failure to start"
> elif [$1 = "collect"]
>   if [$2 == 1]       
>     appId = $3      
>     mkdir /tmp/$appId
>     yarn logs -applicationId $appId > /tmp/$appId/joblogs   
>     curl <JHS>/{appId}/conf > /tmp/$appId/conf      
>     curl <RM>/logs | grep container > /tmp/$appId/rmlogs      
>     curl <NM>/logs | grep container > /tmp/$appId/nmlogs      
>     outputpath = /tmp/$appId    
>   elif      ...    
>   elif      ...    
> fi   tar and compress outputpath.{code}
>  
> During class load YarnDiagnosticsCollector reads the list of common issues 
> from the script and keeps it in memory. On every startup of YARN UI2 
> diagnostics page, it fetches the list from the servlet and displays them. The 
> servlet should handle the script changes, so if a new diagnostic case is 
> added,  a YARN UI2 reload should show it. This way the users can easily plug 
> new categories without any UI2 or Servlet code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-10422) Create the script responsible for collecting the bundle data

Reply via email to