[jira] [Commented] (YARN-9398) Javadoc error on FPGA related java files

2019-03-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797267#comment-16797267
 ] 

Peter Bacsko commented on YARN-9398:


I'll try to fix this quickly, including the warnings.

> Javadoc error on FPGA related java files
> 
>
> Key: YARN-9398
> URL: https://issues.apache.org/jira/browse/YARN-9398
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: YARN-9398.001.patch
>
>
> {code}
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/AbstractFpgaVendorPlugin.java:46:
>  warning: no @param for conf
> [ERROR]   boolean initPlugin(Configuration conf);
> [ERROR]   ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/AbstractFpgaVendorPlugin.java:46:
>  warning: no @return
> [ERROR]   boolean initPlugin(Configuration conf);
> [ERROR]   ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/AbstractFpgaVendorPlugin.java:51:
>  warning: no @param for timeout
> [ERROR]   boolean diagnose(int timeout);
> [ERROR]   ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/AbstractFpgaVendorPlugin.java:51:
>  warning: no @return
> [ERROR]   boolean diagnose(int timeout);
> [ERROR]   ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/AbstractFpgaVendorPlugin.java:64:
>  warning: no @return
> [ERROR]   String getFpgaType();
> [ERROR]  ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/FpgaDiscoverer.java:119:
>  warning: no @return
> [ERROR]   public List discover()
> [ERROR] ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/FpgaDiscoverer.java:119:
>  warning: no @throws for 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException
> [ERROR]   public List discover()
> [ERROR] ^
> [ERROR] 
> /home/eyang/test/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/fpga/IntelFpgaOpenclPlugin.java:156:
>  error: bad HTML entity
> [ERROR]*  Helper class to run aocl diagnose & determine major/minor 
> numbers.
> {code}
> YARN-9266 introduced some javadoc compilation errors.
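
For reference, here is a minimal sketch of what javadoc fixes for the messages above could look like. The interface and class names, the `Map` standing in for Hadoop's `Configuration`, the millisecond unit and the `IllegalStateException` are assumptions made for illustration; only the method names come from the log, and this is not the content of the attached patch. Every method carries the `@param`, `@return` and `@throws` tags the tool asks for, and the bare `&` is written as `&amp;`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a vendor plugin interface (not the Hadoop class). */
interface VendorPluginSketch {

  /**
   * Initializes the plugin.
   *
   * @param conf settings to read; a plain map stands in for Hadoop's Configuration
   * @return true if the plugin initialized successfully
   */
  boolean initPlugin(Map<String, String> conf);

  /**
   * Runs the vendor diagnose step.
   *
   * @param timeout maximum time to wait, assumed here to be in milliseconds
   * @return true if the diagnose step succeeded
   */
  boolean diagnose(int timeout);

  /**
   * Returns the FPGA type handled by this plugin.
   *
   * @return the FPGA type name
   */
  String getFpgaType();
}

/** Helper class to run a diagnose &amp; report devices (note the escaped ampersand). */
final class DemoPlugin implements VendorPluginSketch {
  private boolean initialized;

  @Override
  public boolean initPlugin(Map<String, String> conf) {
    initialized = conf != null;
    return initialized;
  }

  @Override
  public boolean diagnose(int timeout) {
    return initialized && timeout > 0;
  }

  @Override
  public String getFpgaType() {
    return "demo";
  }

  /**
   * Lists the devices known to this plugin.
   *
   * @return the discovered device names, empty if there are none
   * @throws IllegalStateException if the plugin has not been initialized
   */
  public List<String> discover() {
    if (!initialized) {
      throw new IllegalStateException("plugin not initialized");
    }
    return Collections.singletonList("acl0");
  }
}
```

In the log above, the missing tags are reported as warnings and only the bad HTML entity as an error, so the `&amp;` change is the one that unblocks the build.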



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9356) Add more tests to ratio method in TestResourceCalculator

2019-03-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797175#comment-16797175
 ] 

Peter Bacsko commented on YARN-9356:


+1 (non-binding)

> Add more tests to ratio method in TestResourceCalculator 
> -
>
> Key: YARN-9356
> URL: https://issues.apache.org/jira/browse/YARN-9356
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9356.001.patch
>
>
> TestResourceCalculator has some edge-case testcases to verify how division by 
> zero is handled with ResourceCalculator.
> We need other basic tests like we have for other ResourceCalculator methods.






[jira] [Commented] (YARN-9398) Javadoc error on FPGA related java files

2019-03-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797190#comment-16797190
 ] 

Peter Bacsko commented on YARN-9398:


Shall we fix the warnings in a separate JIRA?

> Javadoc error on FPGA related java files
> 
>
> Key: YARN-9398
> URL: https://issues.apache.org/jira/browse/YARN-9398
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: YARN-9398.001.patch
>
>
> YARN-9266 introduced some javadoc compilation errors.






[jira] [Comment Edited] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797023#comment-16797023
 ] 

Peter Bacsko edited comment on YARN-9267 at 3/20/19 10:32 AM:
--

[~devaraj.k] the reason I introduced the method reference is that I wanted 
to avoid unnecessary file creation/deletion in the unit tests, so I can replace 
it with a simple piece of code which returns a string. That's usually good 
practice in unit tests, but I'm not a fundamentalist, so I can go the 
file-creation way if you think it's better. Or we keep this solution and add a 
comment to make it clear.
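
To illustrate the idea, here is a minimal sketch of injecting the hashing step as a function, so that a unit test can pass a lambda instead of creating files. The class name, the method names and the placeholder digest are hypothetical and are not taken from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical handler that receives its hashing step as a function. */
final class HashingHandlerSketch {
  private final Function<String, String> hashFunction;

  /** Production-style constructor: hashes by reading the file from disk. */
  HashingHandlerSketch() {
    this(HashingHandlerSketch::hashFromDisk);
  }

  /** Test-friendly constructor: the caller supplies the hashing step. */
  HashingHandlerSketch(Function<String, String> hashFunction) {
    this.hashFunction = hashFunction;
  }

  /** Returns the identifier the handler would record for the given file. */
  String idFor(String path) {
    return hashFunction.apply(path);
  }

  /** Placeholder digest (not SHA-256), used only to keep the sketch short. */
  private static String hashFromDisk(String path) {
    try {
      byte[] content = Files.readAllBytes(Paths.get(path));
      return Integer.toHexString(Arrays.hashCode(content));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test can then write `new HashingHandlerSketch(path -> "fixed-hash")` and never touch the file system, while the no-argument constructor keeps the disk-based behaviour.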

Regarding the loop: that's a valid comment.


was (Author: pbacsko):
[~devaraj.k] the reason I introduced the method reference is because I wanted 
to avoid unnecessary file creation/deletion in the unit tests, so I can replace 
it with a simple code which returns a string. It's usually a good practice to 
do in unit tests, but I'm not a fundamentalist so I can go a file-creation way 
if you think it's better. Or we keep this solution and add a comment to make it 
clear.

Regarding the loop: that's a valid comment.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch, YARN-9267-007.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<->IPID mapping, store Node<->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports
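
The hash suggestion in the first bullet can be sketched as follows. This is a simplified illustration under stated assumptions: the class name, the method names and the device-keyed map are hypothetical, and the real handler may track state differently. It uses `java.security.MessageDigest` for SHA-256 and reprograms only when the stored hash for a device differs from the hash of the localized file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that decides whether a card needs reprogramming. */
final class ReprogramDeciderSketch {
  // Last hash programmed onto each device, keyed by a device name such as "acl0".
  private final Map<String, String> lastHashByDevice = new HashMap<>();

  /** Returns the lowercase hex SHA-256 digest of the given bytes. */
  static String sha256Hex(byte[] data) {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 not available", e);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest(data)) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Hashes a whole file; reads it into memory, which is fine for a sketch. */
  static String sha256Hex(Path file) throws IOException {
    return sha256Hex(Files.readAllBytes(file));
  }

  /** Records the new hash and reports whether it differs from the previous one. */
  boolean needsReprogramming(String device, String fileHash) {
    String previous = lastHashByDevice.put(device, fileHash);
    return !fileHash.equals(previous);
  }
}
```

With a content hash as the key, a recompiled aocx file that keeps its old name still produces a different digest, so the card would be reprogrammed without renaming the file.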






[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797023#comment-16797023
 ] 

Peter Bacsko commented on YARN-9267:


[~devaraj.k] the reason I introduced the method reference is because I wanted 
to avoid unnecessary file creation/deletion in the unit tests, so I can replace 
it with a simple code which returns a string. It's usually a good practice to 
do in unit tests, but I'm not a fundamentalist so I can go a file-creation way 
if you think it's better. Or we keep this solution and add a comment to make it 
clear.

Regarding the loop: that's a valid comment.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch, YARN-9267-007.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<->IPID mapping, store Node<->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Commented] (YARN-9358) Add javadoc to new methods introduced in FSQueueMetrics with YARN-9322

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796143#comment-16796143
 ] 

Peter Bacsko commented on YARN-9358:


Just minor things:
* "fair share for queue" --> I think "fair share of a queue" sounds better
* "The {@link Resource} object given" --> "The given resource object" sounds 
better

> Add javadoc to new methods introduced in FSQueueMetrics with YARN-9322
> --
>
> Key: YARN-9358
> URL: https://issues.apache.org/jira/browse/YARN-9358
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9358.001.patch, YARN-9358.002.patch, 
> YARN-9358.003.patch
>
>
> This is a follow-up for YARN-9322, covering javadoc changes as discussed with 
> [~templedf] earlier.
> As discussed with Daniel, we need to add javadoc for the new methods 
> introduced with YARN-9322 and also for the modified methods. 
> The javadoc should refer to the fact that Resource Types are also included in 
> the Resource object in case of get/set as well.
> The methods are: 
> 1. getFairShare / setFairShare
> 2. getSteadyFairShare / setSteadyFairShare
> 3. getMinShare / setMinShare
> 4. getMaxShare / setMaxShare
> 5. getMaxAMShare / setMaxAMShare
> 6. getAMResourceUsage / setAMResourceUsage
> Moreover, a javadoc could be added to the constructor of FSQueueMetrics as 
> well.






[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796011#comment-16796011
 ] 

Peter Bacsko commented on YARN-9267:


[~devaraj.k] I think now you can submit the patch.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch, YARN-9267-007.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<->IPID mapping, store Node<->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Commented] (YARN-9268) General improvements in FpgaDevice

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795944#comment-16795944
 ] 

Peter Bacsko commented on YARN-9268:


Just as I thought, it fails to apply. It should be good after YARN-9267.

> General improvements in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch, YARN-9268-004.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.
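
A minimal sketch of a device class that follows the points above, under stated assumptions: the class name, the `type` field and the `serialVersionUID` value are illustrative and are not taken from the patch. It does not implement `Comparable`, keeps only identifying fields, and takes primitive `int` major/minor numbers in the constructor so they can never be null.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical immutable device descriptor identified by major/minor numbers. */
final class FpgaDeviceSketch implements Serializable {
  // Arbitrary illustrative value; a real one would normally be generated.
  private static final long serialVersionUID = 4826491035791728356L;

  private final String type;
  private final int major;
  private final int minor;

  /** All identifying values are required up front; primitives rule out nulls. */
  FpgaDeviceSketch(String type, int major, int minor) {
    this.type = Objects.requireNonNull(type, "type");
    this.major = major;
    this.minor = minor;
  }

  String getType() {
    return type;
  }

  int getMajor() {
    return major;
  }

  int getMinor() {
    return minor;
  }

  // Equality is based on identity fields only, not on volatile readings
  // such as temperature or power usage.
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof FpgaDeviceSketch)) {
      return false;
    }
    FpgaDeviceSketch that = (FpgaDeviceSketch) other;
    return major == that.major && minor == that.minor && type.equals(that.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(type, major, minor);
  }

  @Override
  public String toString() {
    return "FpgaDeviceSketch[type=" + type + ", major=" + major + ", minor=" + minor + "]";
  }
}
```

Collections that need a stable order can still sort such objects with an explicit `Comparator`, which avoids claiming a natural ordering that does not exist.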






[jira] [Updated] (YARN-9268) General improvements in FpgaDevice

2019-03-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Attachment: YARN-9268-004.patch

> General improvements in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch, YARN-9268-004.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.






[jira] [Updated] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-007.patch

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch, YARN-9267-007.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<->IPID mapping, store Node<->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Commented] (YARN-9398) Javadoc error on FPGA related java files

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795928#comment-16795928
 ] 

Peter Bacsko commented on YARN-9398:


Thanks [~eyang] for reporting. Will fix these soon.

> Javadoc error on FPGA related java files
> 
>
> Key: YARN-9398
> URL: https://issues.apache.org/jira/browse/YARN-9398
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Peter Bacsko
>Priority: Major
>
> YARN-9266 introduced some javadoc compilation errors.






[jira] [Assigned] (YARN-9398) Javadoc error on FPGA related java files

2019-03-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-9398:
--

Assignee: Peter Bacsko

> Javadoc error on FPGA related java files
> 
>
> Key: YARN-9398
> URL: https://issues.apache.org/jira/browse/YARN-9398
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Peter Bacsko
>Priority: Major
>
> YARN-9266 introduced some javadoc compilation errors.






[jira] [Commented] (YARN-9268) General improvements in FpgaDevice

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795918#comment-16795918
 ] 

Peter Bacsko commented on YARN-9268:


[~devaraj.k] this patch depends on the previous ones (the same applies to the 
remaining ones). I'll upload a new one which can be tried after YARN-9267 is on 
trunk, but the Jenkins build will most likely fail and need a re-trigger.

> General improvements in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.






[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795916#comment-16795916
 ] 

Peter Bacsko commented on YARN-9267:


[~devaraj.k] yes, there was a commit in the meantime that caused conflicts. 
Please try patch v7.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch, YARN-9267-007.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<->IPID mapping, store Node<->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-006.patch

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Comment Edited] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795068#comment-16795068
 ] 

Peter Bacsko edited comment on YARN-9267 at 3/18/19 2:39 PM:
-

[~devaraj.k] please check patch v6 which should be free of checkstyle issues.

Just to be safe, I added two SHA-256 related tests to 
{{TestFpgaResourceHandler}}.


was (Author: pbacsko):
[~devaraj.k] please check patch v6 which should be free of checkstyle issues.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795068#comment-16795068
 ] 

Peter Bacsko commented on YARN-9267:


[~devaraj.k] please check patch v6 which should be free of checkstyle issues.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch, 
> YARN-9267-006.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-005.patch

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch, YARN-9267-005.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Commented] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794947#comment-16794947
 ] 

Peter Bacsko commented on YARN-9267:


[~devaraj.k] thanks for the tip, will check this out.

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-14 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-004.patch

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch, YARN-9267-004.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9267) General improvements in FpgaResourceHandlerImpl

2019-03-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Summary: General improvements in FpgaResourceHandlerImpl  (was: Various 
fixes are needed in FpgaResourceHandlerImpl)

> General improvements in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9268) General improvements in FpgaDevice

2019-03-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Summary: General improvements in FpgaDevice  (was: Various fixes are needed 
in FpgaDevice)

> General improvements in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.
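
As a rough illustration of the points above, a device POJO along these lines would avoid the issues listed. This is a sketch only, not the actual {{FpgaDevice}} class or the patch: the class name {{FpgaDeviceSketch}}, the field set and the {{serialVersionUID}} value are all assumptions made for the example.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical sketch of an FPGA device POJO; not the real Hadoop class. */
final class FpgaDeviceSketch implements Serializable {
  // Arbitrary illustrative value; a real class would use a generated number.
  private static final long serialVersionUID = -4678487141824092751L;

  private final String type;
  private final int major; // primitive, so it can never be null
  private final int minor; // primitive, so it can never be null

  /** Major/minor are demanded up front because they identify the card. */
  FpgaDeviceSketch(String type, int major, int minor) {
    this.type = Objects.requireNonNull(type, "type");
    this.major = major;
    this.minor = minor;
  }

  String getType() {
    return type;
  }

  int getMajor() {
    return major;
  }

  int getMinor() {
    return minor;
  }

  // Identity is defined by equals/hashCode only; no Comparable, since the
  // devices have no natural ordering.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FpgaDeviceSketch)) {
      return false;
    }
    FpgaDeviceSketch other = (FpgaDeviceSketch) o;
    return major == other.major && minor == other.minor
        && type.equals(other.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(type, major, minor);
  }

  @Override
  public String toString() {
    return "FPGA Device (" + type + ", " + major + ":" + minor + ")";
  }
}
```

Volatile readings such as temperature and power usage are deliberately left out; if they are ever needed, they would be queried on demand rather than cached in the object.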






[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-03-11 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789441#comment-16789441
 ] 

Peter Bacsko commented on YARN-9267:


Thanks [~tangzhankun] - this patch needs a rebase, so don't commit it just yet. 
I'll upload a new version after YARN-9266 is committed.

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-03-08 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-008.patch

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch, YARN-9266-005.patch, 
> YARN-9266-006.patch, YARN-9266-007.patch, YARN-9266-008.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is already downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes
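
To make the {{contains()}} pitfall concrete, here is a small sketch. It does not reproduce the plugin's code: the class {{IpFileMatchSketch}} and both methods are hypothetical, and it assumes the requested IP id is the file name without the {{.aocx}} extension.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of substring vs. exact file name matching. */
class IpFileMatchSketch {

  /** Substring match: the first localized file whose name contains the id wins. */
  static Optional<String> findByContains(List<String> localizedFiles, String ipId) {
    return localizedFiles.stream()
        .filter(name -> name.endsWith(".aocx") && name.contains(ipId))
        .findFirst();
  }

  /** Exact match: only "<ipId>.aocx" is accepted. */
  static Optional<String> findExact(List<String> localizedFiles, String ipId) {
    String expected = ipId + ".aocx";
    return localizedFiles.stream()
        .filter(expected::equals)
        .findFirst();
  }
}
```

If the localized files are {{hello2.aocx}} and {{hello.aocx}} (in that order) and the requested id is {{hello}}, the substring version picks {{hello2.aocx}}, while the exact version picks {{hello.aocx}}.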






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-03-07 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-009.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch, YARN-9265-008.patch, 
> YARN-9265-009.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it assumes that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.
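
For Problem #1, the WARN lines show that the plugin looks for a {{bus:slot.func = ...}} line, which the PAC output above does not contain; it prints {{PCIe 08:00.0}} instead. The sketch below shows one way such a line could be matched. It is an assumption-laden illustration, not the fix in the attached patches: the class name, the method name and the regular expression are all made up for this example, and the line format is inferred only from the sample output in this issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "PCIe bus:slot.func" line of aocl diagnose. */
class AoclDiagnoseSketch {

  // Assumed format, taken from the sample output only: "PCIe 08:00.0".
  private static final Pattern PCIE_LINE = Pattern.compile(
      "PCIe\\s+(\\p{XDigit}{2}):(\\p{XDigit}{2})\\.(\\p{XDigit})");

  /** Returns "bus:slot.func" from the first PCIe line, if there is one. */
  static Optional<String> findPcieAddress(String diagnoseOutput) {
    Matcher m = PCIE_LINE.matcher(diagnoseOutput);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(m.group(1) + ":" + m.group(2) + "." + m.group(3));
  }
}
```

Note that this only addresses recognizing the output. It does not solve Problem #2, since the device file under {{/dev}} still cannot be derived from the "Physical Dev Name".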






[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-03-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787300#comment-16787300
 ] 

Peter Bacsko commented on YARN-9265:


Patch v9: addressed the failing unit test, the ASF license problem and the checkstyle issues.

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch, YARN-9265-008.patch, 
> YARN-9265-009.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it assumes that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-03-07 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-008.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch, YARN-9265-008.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it assumes that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.






[jira] [Commented] (YARN-9322) Store metrics for custom resource types into FSQueueMetrics and query them in FairSchedulerQueueInfo

2019-02-21 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774338#comment-16774338
 ] 

Peter Bacsko commented on YARN-9322:


+1 lgtm (non-binding)

> Store metrics for custom resource types into FSQueueMetrics and query them in 
> FairSchedulerQueueInfo
> 
>
> Key: YARN-9322
> URL: https://issues.apache.org/jira/browse/YARN-9322
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screen Shot 2019-02-21 at 12.06.46.png, 
> YARN-9322.001.patch
>
>
> YARN-8842 implemented storing and exposing of metrics of custom resources.
> FSQueueMetrics should have a similar implementation.
> All metrics stored in this class should have their custom resource 
> counterpart.
> Because metrics were not stored for custom resource types, 
> FairSchedulerQueueInfo did not contain those values, so UI v1 
> could not show them. 
> See that gpu is missing from the value of "AM Max Resources" on the attached 
> screenshot.
> Additionally, the callees of the following methods (in class 
> FairSchedulerQueueInfo) should consider querying values for custom resource 
> types too: 
> getMaxAMShareMB
> getMaxAMShareVCores
> getAMResourceUsageMB
> getAMResourceUsageVCores






[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773193#comment-16773193
 ] 

Peter Bacsko commented on YARN-9265:


I made a slight modification in {{FpgaDiscoverer.discover()}}, replaced the 
existing iterator-based logic with some nice streams/lambda logic.
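
For readers unfamiliar with the style, the kind of change meant here can be sketched as follows. This is a generic illustration and not the code in {{FpgaDiscoverer.discover()}}: the class {{DeviceFilterSketch}}, the method names and the "allowed device names" filter are assumptions chosen only to contrast the two styles.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical example: the same filter written with an iterator and a stream. */
class DeviceFilterSketch {

  /** Iterator-based: copy the list, then remove entries that are not allowed. */
  static List<String> filterWithIterator(List<String> devices, Set<String> allowed) {
    List<String> result = new ArrayList<>(devices);
    Iterator<String> it = result.iterator();
    while (it.hasNext()) {
      if (!allowed.contains(it.next())) {
        it.remove();
      }
    }
    return result;
  }

  /** Stream-based: keep only the allowed entries, preserving their order. */
  static List<String> filterWithStream(List<String> devices, Set<String> allowed) {
    return devices.stream()
        .filter(allowed::contains)
        .collect(Collectors.toList());
  }
}
```

Both methods return the same list; the stream version simply states the filter condition directly instead of mutating a copy.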

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it assumes that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.






[jira] [Comment Edited] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-20 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773193#comment-16773193
 ] 

Peter Bacsko edited comment on YARN-9265 at 2/20/19 4:56 PM:
-

I made a slight modification in {{FpgaDiscoverer.discover()}}, replaced the 
existing iterator-based logic with some nice streams/lambda.


was (Author: pbacsko):
I made a slight modification in {{FpgaDiscoverer.discover()}}, replaced the 
existing iterator-based logic with some nice streams/lambda logic.
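
For readers unfamiliar with this kind of change, here is a minimal sketch of an iterator-to-stream refactoring. It is not the code from the patch: the {{Device}} class, its {{minor}} field and the {{keepAllowed*}} method names are hypothetical, and the sketch only assumes that discovery ends by keeping the devices whose minor number is in an allowed set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a discovered device; not the real FpgaDevice. */
final class Device {
  final String name;
  final int minor;

  Device(String name, int minor) {
    this.name = name;
    this.minor = minor;
  }
}

final class DiscoverSketch {
  /** Iterator-based style: copy the list, then remove the rejected entries. */
  static List<Device> keepAllowedWithIterator(
      List<Device> found, Set<Integer> allowedMinors) {
    List<Device> result = new ArrayList<>(found);
    Iterator<Device> it = result.iterator();
    while (it.hasNext()) {
      if (!allowedMinors.contains(it.next().minor)) {
        it.remove();
      }
    }
    return result;
  }

  /** Stream-based style: the same filter expressed as a single pipeline. */
  static List<Device> keepAllowedWithStream(
      List<Device> found, Set<Integer> allowedMinors) {
    return found.stream()
        .filter(d -> allowedMinors.contains(d.minor))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Device> found = Arrays.asList(
        new Device("acl0", 0), new Device("acl1", 1), new Device("acl2", 2));
    Set<Integer> allowed = new HashSet<>(Arrays.asList(0, 2));
    // Both variants keep acl0 and acl2 and drop acl1.
    System.out.println(keepAllowedWithIterator(found, allowed).size());
    System.out.println(keepAllowedWithStream(found, allowed).size());
  }
}
```

Both variants leave the input list untouched; the stream version mainly removes the explicit iterator bookkeeping.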

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>  
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}} which is not the case, the actual 
> file is {{/dev/intel-fpga-port.0}}.
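
The two "Couldn't find ... pattern" warnings in the log above come from regular expressions that look for lines the PAC's aocl diagnose output does not contain. The sketch below illustrates the mismatch on an excerpt of the quoted output. The first pattern is copied from the warning; the {{PCIE_LINE}} pattern, the class name and the method name are assumptions made for illustration and are not taken from the attached patches.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DiagnoseParseSketch {
  /** Pattern copied from the first WARN line in the log above. */
  static final Pattern BUS_SLOT_FUNC =
      Pattern.compile("(?i)bus:slot.func\\s=\\s.*,");

  /** Hypothetical pattern for the "PCIe 08:00.0" line printed for the PAC. */
  static final Pattern PCIE_LINE = Pattern.compile(
      "(?i)PCIe\\s+(\\p{XDigit}{2}):(\\p{XDigit}{2})\\.(\\p{XDigit})");

  /** Excerpt of the aocl diagnose output quoted in the issue. */
  static final String PAC_OUTPUT = String.join("\n",
      "Vendor: Intel Corp",
      "pac_a10_f20 Passed PAC Arria 10 Platform (pac_a10_f20)",
      "  PCIe 08:00.0",
      "  FPGA temperature = 79 degrees C.",
      "DIAGNOSTIC_PASSED");

  /** Returns the bus:slot.func string (for example "08:00.0") if present. */
  static Optional<String> findPcieAddress(String diagnoseOutput) {
    Matcher m = PCIE_LINE.matcher(diagnoseOutput);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(m.group(1) + ":" + m.group(2) + "." + m.group(3));
  }
}
```

With this excerpt, {{BUS_SLOT_FUNC}} finds nothing, which matches the warning, while the hypothetical {{PCIE_LINE}} pattern recovers the address. This only covers Problem #1; the device file name from Problem #2 still cannot be derived from this output.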






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-20 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-007.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch, YARN-9265-007.patch
>
>






[jira] [Commented] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin

2019-02-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771997#comment-16771997
 ] 

Peter Bacsko commented on YARN-9264:


[~sunilg] [~tangzhankun] please review the first three patches: YARN-9265, 
YARN-9266 and YARN-9267. 

After committing YARN-9265, I'll perform a rebase if necessary.

> [Umbrella] Follow-up on IntelOpenCL FPGA plugin
> ---
>
> Key: YARN-9264
> URL: https://issues.apache.org/jira/browse/YARN-9264
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The Intel FPGA resource type support was released in Hadoop 3.1.0.
> Right now the plugin implementation has some deficiencies that need to be 
> fixed. This JIRA lists all problems that need to be resolved.






[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771960#comment-16771960
 ] 

Peter Bacsko commented on YARN-9267:


[~snemeth] you can check it again.

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure the card if it already holds an IP 
> with the same ID, which we see as a problem: if you recompile the FPGA 
> application, you must rename the aocx file, otherwise the card will not be 
> reprogrammed. Suggestion: instead of storing the Node<\->IPID mapping, store 
> the Node<\->IPID hash (like the SHA-256 of the localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports
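
The hashing suggestion in the first item can be sketched as follows. This is a minimal illustration that assumes the localized aocx file is reachable as a local path; the class and method names are hypothetical and this is not the helper from the attached patches.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class AocxHashSketch {
  /** Streams the file through SHA-256 and returns the digest as lowercase hex. */
  static String sha256Hex(Path file) throws IOException {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform, so this is not expected.
      throw new IllegalStateException(e);
    }
    byte[] buffer = new byte[8192];
    try (InputStream in =
        new DigestInputStream(Files.newInputStream(file), digest)) {
      while (in.read(buffer) != -1) {
        // Reading is enough: DigestInputStream updates the digest as it goes.
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}
```

Keying the per-node bookkeeping on such a digest, rather than on the IP ID alone, would mean that a recompiled aocx file with an unchanged name yields a different key, so the handler could tell that the card needs reprogramming.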






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-006.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch
>
>






[jira] [Updated] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-003.patch

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>






[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771240#comment-16771240
 ] 

Peter Bacsko commented on YARN-9267:


Thanks for the comment [~snemeth].

"FpgaDevice.hash should include in its name somehow that this is the hash of 
the aocx file."
OK.

"I would move the method FpgaResourceHandlerImpl#getSha256 to some more common 
place, this seems to be a helper function that could be used elsewhere."
I've been thinking about this. I can move it, but mainly for testability.

"I would also modify the error message"
The exception with stack trace is logged, so I think we're fine.

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch
>
>






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-005.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch
>
>






[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771097#comment-16771097
 ] 

Peter Bacsko commented on YARN-9265:


Fixed checkstyle + findbugs issues.

Regarding Findbugs, I removed the {{synchronized}} modifier from all methods. 
They're called either from {{ResourcePluginManager.initialize()}} or 
{{ResourceHandlerChain.bootstrap()}}, both of which run during initialization 
on a single thread.

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch
>
>






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-004.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch
>
>






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-003.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-18 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771005#comment-16771005
 ] 

Peter Bacsko commented on YARN-9265:


Thanks [~zsiegl] it's a reasonable improvement. Will update the patch with this.

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   StatusInformation
>  
> pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20)
>   PCIe 08:00.0
>   FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.
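
The failure in Problem #1 can be reproduced outside the NodeManager with a small check. This is only a sketch: the first pattern is copied from the WARN line in the log above and the sample text from the aocl output above, while PCIE_ADDRESS and all class and method names are hypothetical and not the plugin's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AoclPatternCheck {
  // Pattern text copied from the WARN line in the log above.
  static final Pattern BUS_SLOT_FUNC =
      Pattern.compile("(?i)bus:slot.func\\s=\\s.*,");

  // Hypothetical alternative, for illustration only: picks up "PCIe 08:00.0".
  static final Pattern PCIE_ADDRESS =
      Pattern.compile("(?i)PCIe\\s+(\\p{XDigit}{2}:\\p{XDigit}{2}\\.\\p{XDigit})");

  // Relevant lines of the PAC output quoted in the issue.
  static final String PAC_OUTPUT =
      "pac_a10_f20 Passed PAC Arria 10 Platform (pac_a10_f20)\n"
      + "PCIe 08:00.0\n"
      + "FPGA temperature = 79 degrees C.\n";

  // True if the pattern from the log occurs anywhere in the output.
  static boolean matchesOldPattern(String output) {
    return BUS_SLOT_FUNC.matcher(output).find();
  }

  // Returns the bus:slot.func string after "PCIe", or null if absent.
  static String pcieAddress(String output) {
    Matcher m = PCIE_ADDRESS.matcher(output);
    return m.find() ? m.group(1) : null;
  }
}
```

With the PAC output, matchesOldPattern() finds nothing because the text never contains "bus:slot.func", which is consistent with the warning in the log.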






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-18 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-007.patch

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch, YARN-9266-005.patch, 
> YARN-9266-006.patch, YARN-9266-007.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes
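
The {{contains()}} problem from the list above can be illustrated with a short sketch. It assumes the requested IP id is the file name without the .aocx extension; the class and method names are hypothetical and this is not the code from the patch.

```java
import java.nio.file.Paths;

final class IpFileMatch {
  // Substring check as described in the issue: "hello2.aocx" passes for "hello".
  static boolean containsMatch(String localizedPath, String ipId) {
    return localizedPath.contains(ipId);
  }

  // Stricter check: the whole file name must equal "<ipId>.aocx".
  static boolean exactMatch(String localizedPath, String ipId) {
    String fileName = Paths.get(localizedPath).getFileName().toString();
    return fileName.equals(ipId + ".aocx");
  }
}
```

Comparing the complete file name avoids the surprise at the cost of requiring an exact id, which seems to be the intent of the bullet.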






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-15 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-006.patch

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch, YARN-9266-005.patch, 
> YARN-9266-006.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Commented] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-15 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769392#comment-16769392
 ] 

Peter Bacsko commented on YARN-9266:


Test failure is unrelated, see https://issues.apache.org/jira/browse/YARN-7145

Will address the remaining checkstyle problem.

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch, YARN-9266-005.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-15 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-005.patch

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch, YARN-9266-005.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Commented] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768648#comment-16768648
 ] 

Peter Bacsko commented on YARN-9266:


"Some of the refactors that you performed are not included in the description 
of the jira. Could you update it? I'm thinking of for example moving the parser 
to separate file and fixing checkstyles."
I added checkstyle. Moving the parser is mentioned, albeit not explicitly 
("parseDiagnoseInfo() is too heavyweight – it should be in its own class for 
better testability")

"Actually fixing checkstyles in general should be avoided as it's making the 
backports harder and making the git history dirtier. If you could static import 
the assertTrue/False/Equals functions, replace the 
assert.AssertTrue/False/Equals ones and ALSO fixing checkstyles, I can get away 
with that."
This is generally true, however, FPGA is a bit different IMO.
1. It's still considered beta, not really used by anyone
2. The code is relatively new
3. Changes are very isolated
4. It will likely be deprecated by the pluggable device framework (not sure 
when)

Having said that, I can revert the checkstyle changes. I'd rather keep it and 
focus on static importing the asserts though.

"Can we move the comments in function preStart to a javadoc?"
Let's do this in YARN-9267.

"Wildcard imports"
This might be another personal preference thing, but I just don't like it. I 
try to eliminate them every time I see one :D

AoclDiagnosticOutputParser.java --> mostly valid comments (I didn't touch the 
parsing logic on purpose, but these are small changes)

IntelFpgaOpenclPlugin.java
1. Another "*" to eliminate :)
2. "Do we need the conf Configuration object of AbstractFpgaVendorPlugin at 
all?" - It's needed in the implementation of {{initPlugin()}}. But 
{{setConf()}} / {{getConf()}} can go.
3. Exception object -> yep, let's log into the console
4. msg variable -> agree, unnecessary

I'll do these changes tomorrow.



> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768590#comment-16768590
 ] 

Peter Bacsko commented on YARN-9098:


After some discussion with Szilard, now I can +1 this (non-binding).

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch, YARN-9098.005.patch, YARN-9098.006.patch, 
> YARN-9098.007.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code should be extracted from these places and be added in a new 
> class as this is a separate responsibility.
> As the output of the file parser is a Map>, it's better 
> to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance.
> ResourceHandlerModule has a method named parseConfiguredCGroupPath that is 
> responsible for producing the same result (Map>) to 
> store cgroups data. It does not operate on the mtab file, but looks at the 
> filesystem for cgroup settings. As the output is the same, CGroupsMountConfig 
> should be used here, too.
> Again, this code should not be part of ResourceHandlerModule as it is a 
> different responsibility.
> One more thing which is strongly related to the methods above is 
> CGroupsHandlerImpl.initializeFromMountConfig: This method processes the 
> result of a parsed mtab file or a parsed cgroups filesystem data and stores 
> file system paths for all available controllers. This method invokes 
> findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl 
> and CGroupsLCEResourcesHandler, so it should be moved to a single place. To 
> store filesystem path and controller mappings, a new domain object could be 
> introduced.
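
A rough sketch of the domain object proposed above, to make the idea concrete. The key and value types (mount path mapped to a set of controller names) are an assumption, since the generic parameters are not legible in the description, and the lookup method is a hypothetical stand-in for findControllerPathInMountConfig rather than the code from the patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CGroupsMountConfig {
  // Mount path -> controllers mounted there (types are assumed).
  private final Map<String, Set<String>> controllersByPath;

  CGroupsMountConfig(Map<String, Set<String>> parsed) {
    // Defensive deep copy so later changes to the input are not visible here.
    Map<String, Set<String>> copy = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : parsed.entrySet()) {
      copy.put(e.getKey(),
          Collections.unmodifiableSet(new HashSet<>(e.getValue())));
    }
    this.controllersByPath = Collections.unmodifiableMap(copy);
  }

  // Returns a mount path that lists the controller, or null if none does.
  String findControllerPath(String controller) {
    for (Map.Entry<String, Set<String>> e : controllersByPath.entrySet()) {
      if (e.getValue().contains(controller)) {
        return e.getKey();
      }
    }
    return null;
  }
}
```

Both the mtab parser and the filesystem scanner could return such an object, so the lookup logic would live in one place instead of two.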






[jira] [Commented] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768437#comment-16768437
 ] 

Peter Bacsko commented on YARN-9266:


Thanks [~adam.antal] I'll go through your suggestions one-by-one and I'll make 
the modifications if necessary.

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-14 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Description: 
Problems identified in this class:
 * {{InnerShellExecutor}} ignores the timeout parameter
 * {{configureIP()}} uses printStackTrace() instead of logging
 * {{configureIP()}} does not log the output of aocl if the exit code != 0
 * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
for better testability
 * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
also matches)
 * method name {{downloadIP()}} is misleading – it actually tries to find the 
file. Everything is downloaded (localized) at this point.
 * {{@VisibleForTesting}} methods should be package private
 * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} class
 * checkstyle fixes

  was:
Problems identified in this class:
 * {{InnerShellExecutor}} ignores the timeout parameter
 * {{configureIP()}} uses printStackTrace() instead of logging
 * {{configureIP()}} does not log the output of aocl if the exit code != 0
 * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
for better testability
 * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
also matches)
 * method name {{downloadIP()}} is misleading – it actually tries to find the 
file. Everything is downloaded (localized) at this point.
 * {{@VisibleForTesting}} methods should be package private
 * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} class


> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
>  * checkstyle fixes






[jira] [Commented] (YARN-9135) NM State store ResourceMappings serialization are tested with Strings instead of real Device objects

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768360#comment-16768360
 ] 

Peter Bacsko commented on YARN-9135:


{quote}Using the immutable type in the definition of the methods informs 
clients that they can't modify the collection{quote}

What I found is that people have different preferences on this: 
[https://stackoverflow.com/questions/38087900/is-it-better-to-return-an-immutablemap-or-a-map]

I'm still leaning towards returning a more generic Map. BTW if you use 
{{Collections.unmodifiableMap()}} like this, it behaves the same way:

{{Map copy = Collections.unmodifiableMap(new HashMap(original));  // returns Map}}

Anyway I'm ok to +1 it if you want to keep it.
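
To make the point about {{Collections.unmodifiableMap()}} concrete, here is a minimal sketch. The class and method names are made up for illustration; only the java.util calls are real.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ReadOnlyMapDemo {
  // Declared return type is the generic Map, yet the result is read-only.
  static Map<String, Integer> readOnlyCopy(Map<String, Integer> original) {
    return Collections.unmodifiableMap(new HashMap<>(original));
  }

  // True if the map refuses modification.
  static boolean rejectsWrites(Map<String, Integer> map) {
    try {
      map.put("x", 1);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }
}
```

The trade-off is the one discussed above: the caller only learns at runtime that the map is read-only, whereas an immutable type in the signature documents it at compile time.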

> NM State store ResourceMappings serialization are tested with Strings instead 
> of real Device objects
> 
>
> Key: YARN-9135
> URL: https://issues.apache.org/jira/browse/YARN-9135
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9135.001.patch, YARN-9135.003.patch, 
> YARN-9135.004.patch
>
>







[jira] [Commented] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768340#comment-16768340
 ] 

Peter Bacsko commented on YARN-9138:


+1 non-binding

> Test error handling of nvidia-smi binary execution of GpuDiscoverer
> ---
>
> Key: YARN-9138
> URL: https://issues.apache.org/jira/browse/YARN-9138
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9138.001.patch, YARN-9138.002.patch, 
> YARN-9138.003.patch
>
>
> The code that executes nvidia-smi (doing GPU device auto-discovery) doesn't 
> have much test coverage.
> This patch adds tests to this part of the code.






[jira] [Commented] (YARN-9139) Simplify initializer code of GpuDiscoverer

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768333#comment-16768333
 ] 

Peter Bacsko commented on YARN-9139:


+1 non-binding

> Simplify initializer code of GpuDiscoverer
> --
>
> Key: YARN-9139
> URL: https://issues.apache.org/jira/browse/YARN-9139
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9139.001.patch, YARN-9139.002.patch, 
> YARN-9139.003.patch, YARN-9139.004.patch
>
>







[jira] [Commented] (YARN-9235) If linux container executor is not set for a GPU cluster GpuResourceHandlerImpl is not initialized and NPE is thrown

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768331#comment-16768331
 ] 

Peter Bacsko commented on YARN-9235:


[~bsteinbach] can you add a simple unit test for this scenario?

> If linux container executor is not set for a GPU cluster 
> GpuResourceHandlerImpl is not initialized and NPE is thrown
> 
>
> Key: YARN-9235
> URL: https://issues.apache.org/jira/browse/YARN-9235
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Major
> Attachments: YARN-9235.001.patch
>
>
> If GPU plugin is enabled for the NodeManager, it is possible to run jobs with 
> GPU.
> However, if LinuxContainerExecutor is not configured, an NPE is thrown when 
> calling 
> {code:java}
> GpuResourcePlugin.getNMResourceInfo{code}
> Also, there are no warnings in the log if GPU is misconfigured like this. 






[jira] [Commented] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768158#comment-16768158
 ] 

Peter Bacsko commented on YARN-9264:


Suggested order of committing the patches: YARN-9265 and YARN-9266 should go in 
first. Then I'll verify them on a local machine with an FPGA card.

If everything is OK, we can proceed with the rest.

> [Umbrella] Follow-up on IntelOpenCL FPGA plugin
> ---
>
> Key: YARN-9264
> URL: https://issues.apache.org/jira/browse/YARN-9264
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The Intel FPGA resource type support was released in Hadoop 3.1.0.
> Right now the plugin implementation has some deficiencies that need to be 
> fixed. This JIRA lists all problems that need to be resolved.






[jira] [Updated] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-14 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9270:
---
Attachment: YARN-9270-003.patch

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch, 
> YARN-9270-003.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.
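
A minimal sketch of the {{envProvider}} idea described above. Only the field itself follows the description; the class name, the setter name, the sdkRoot() consumer and the ALTERAOCLSDKROOT variable name are assumptions made for illustration.

```java
import java.util.function.Function;

final class EnvAwarePlugin {
  // Defaults to the real process environment via System.getenv(String).
  private Function<String, String> envProvider = System::getenv;

  // Test hook (hypothetical name): lets a test supply its own lookup.
  void setEnvProvider(Function<String, String> envProvider) {
    this.envProvider = envProvider;
  }

  // Example consumer (hypothetical): resolves a variable through the provider.
  String sdkRoot() {
    return envProvider.apply("ALTERAOCLSDKROOT");
  }
}
```

A test can then pass a lambda or a Map::get reference instead of patching the JVM's environment through reflection.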






[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-14 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768139#comment-16768139
 ] 

Peter Bacsko commented on YARN-9270:


Patch v3 - handled checkstyle issues.

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch, 
> YARN-9270-003.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.






[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-14 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9269:
---
Attachment: YARN-9269-003.patch

> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9269-001.patch, YARN-9269-002.patch, 
> YARN-9269-003.patch
>
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private
>  * get rid of {{*}} imports






[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767351#comment-16767351
 ] 

Peter Bacsko commented on YARN-9118:


"If I put those method names into a newline, it looks really weird"

Just use {{@SuppressWarnings("checkstyle:linelength")}} if it doesn't make 
sense.

> Handle issues with parsing user defined GPU devices in GpuDiscoverer
> 
>
> Key: YARN-9118
> URL: https://issues.apache.org/jira/browse/YARN-9118
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9118.001.patch, YARN-9118.002.patch, 
> YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, 
> YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch
>
>
> getGpusUsableByYarn has the following issues: 
> - Duplicate GPU device definitions are not denied: This seems to be the 
> biggest issue as it could increase the number of devices on the node if the 
> device ID is defined 2 or more times.
> - An empty string is accepted; it behaves as if the user did not want to use 
> auto-discovery and had not defined any GPU devices. This results in an 
> empty device list, but there is no explicit empty-string check in 
> the code, so this behavior is just coincidental.
> - Number validation does not happen on GPU device IDs (separated by commas)
> Many testcases are added as the coverage was already very low.
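
The three checks listed above (duplicates, empty input, number validation) can be sketched as follows. This is a minimal illustration, not the code from the patch: the class name {{GpuDeviceListParser}} is hypothetical, the {{index:minor}} entry format is an assumption, and rejecting an empty string outright is just one possible policy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the entry format "index:minor" is an assumption.
class GpuDeviceListParser {
  // Parses comma-separated "index:minor" pairs, e.g. "0:0,1:1".
  static List<int[]> parse(String value) {
    // Explicit empty-string check instead of relying on accidental behavior.
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalArgumentException("GPU device list is empty");
    }
    List<int[]> devices = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String entry : value.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Malformed device entry: " + entry);
      }
      int index;
      int minor;
      try {
        // Number validation on both halves of the entry.
        index = Integer.parseInt(parts[0].trim());
        minor = Integer.parseInt(parts[1].trim());
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Non-numeric device entry: " + entry, e);
      }
      // Reject duplicates so a device cannot be counted twice.
      if (!seen.add(index + ":" + minor)) {
        throw new IllegalArgumentException("Duplicate device entry: " + entry);
      }
      devices.add(new int[] {index, minor});
    }
    return devices;
  }
}
```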






[jira] [Comment Edited] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767351#comment-16767351
 ] 

Peter Bacsko edited comment on YARN-9118 at 2/13/19 4:03 PM:
-

"If I put those method names into a newline, it looks really weird"

Just use {{@SuppressWarnings("checkstyle:linelength")}} if that's the case


was (Author: pbacsko):
"If I put those method names into a newline, it looks really weird"

Just use {{@SuppressWarnings("checkstyle:linelength")}} if it doesn't make 
sense.

> Handle issues with parsing user defined GPU devices in GpuDiscoverer
> 
>
> Key: YARN-9118
> URL: https://issues.apache.org/jira/browse/YARN-9118
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9118.001.patch, YARN-9118.002.patch, 
> YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, 
> YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch
>
>
> getGpusUsableByYarn has the following issues: 
> - Duplicate GPU device definitions are not denied: This seems to be the 
> biggest issue as it could increase the number of devices on the node if the 
> device ID is defined 2 or more times.
> - An empty string is accepted; it behaves as if the user did not want to use 
> auto-discovery and had not defined any GPU devices. This results in an 
> empty device list, but there is no explicit empty-string check in 
> the code, so this behavior is just coincidental.
> - Number validation does not happen on GPU device IDs (separated by commas)
> Many testcases are added as the coverage was already very low.






[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767298#comment-16767298
 ] 

Peter Bacsko commented on YARN-9098:


Maybe it's just nitpicking, but...

{noformat}
  public List<String> getPathsForController(String controller) {
    return mappings.entrySet().stream()
        .filter(e -> e.getValue().contains(controller))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
{noformat}

Is it OK to use {{contains()}} here? If cpu and cpuacct are mounted to two 
different directories, then we might return the wrong path for cpu, no? Usually 
they're mounted to the same directory like {{/sys/fs/cgroup/cpu,cpuacct}} but 
it's something to think about.
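
Whether this is a real problem depends on what {{contains()}} is called on, which cannot be determined from this thread alone. The sketch below contrasts the two cases under stated assumptions: if the map values were plain strings, {{String.contains()}} would match "cpu" inside "cpuacct"; if they are sets of controller names, {{Set.contains()}} is an exact membership test. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of the two possible meanings of contains().
class ControllerPathLookup {
  // Values are plain strings: "cpu" is also a substring of "cpuacct".
  static List<String> bySubstring(Map<String, String> mounts, String controller) {
    return mounts.entrySet().stream()
        .filter(e -> e.getValue().contains(controller))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  // Values are sets of controller names: membership is an exact match.
  static List<String> byExactName(Map<String, Set<String>> mounts, String controller) {
    return mounts.entrySet().stream()
        .filter(e -> e.getValue().contains(controller))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```

With cpu and cpuacct mounted to separate directories, the substring variant returns both paths for "cpu", while the set-based variant returns only the cpu mount.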

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch, YARN-9098.005.patch, YARN-9098.006.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code should be extracted from these places and be added in a new 
> class as this is a separate responsibility.
> As the output of the file parser is a Map<String, Set<String>>, it's better 
> to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance.
> ResourceHandlerModule has a method named parseConfiguredCGroupPath that is 
> responsible for producing the same result (Map<String, Set<String>>) to 
> store cgroups data; it does not operate on the mtab file, but looks at the 
> filesystem for cgroup settings. As the output is the same, CGroupsMountConfig 
> should be used here, too.
> Again, this code should not be part of ResourceHandlerModule as it is a 
> different responsibility.
> One more thing which is strongly related to the methods above is 
> CGroupsHandlerImpl.initializeFromMountConfig: This method processes the 
> result of a parsed mtab file or a parsed cgroups filesystem data and stores 
> file system paths for all available controllers. This method invokes 
> findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl 
> and CGroupsLCEResourcesHandler, so it should be moved to a single place. To 
> store filesystem path and controller mappings, a new domain object could be 
> introduced.






[jira] [Commented] (YARN-9123) Clean up and split testcases in TestNMWebServices for GPU support

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767291#comment-16767291
 ] 

Peter Bacsko commented on YARN-9123:


" testGetNMResourceInfoFailBecauseOfUnknownPlugin is a bit lengthy: 47 
character."

I think this is fine (I've seen much worse). Another name could be something like 
{{testGetNMResourceInfoWhenPluginIsUnknown}}, which is also a popular naming 
scheme (I mean using "when").

Talking about repetitions, this could be extracted too:
{noformat}
ClientResponse response = r.path("ws").path("v1").path("node")
    .path("resources").path("resource-2")
    .accept(MediaType.APPLICATION_JSON)
    .get(ClientResponse.class);
{noformat}


> Clean up and split testcases in TestNMWebServices for GPU support
> -
>
> Key: YARN-9123
> URL: https://issues.apache.org/jira/browse/YARN-9123
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-9123.001.patch, YARN-9123.002.patch, 
> YARN-9123.003.patch, YARN-9123.004.patch
>
>
> The following testcases can be cleaned up a bit: 
> TestNMWebServices#testGetNMResourceInfo - Can be split up to 3 different cases
> TestNMWebServices#testGetYarnGpuResourceInfo






[jira] [Commented] (YARN-9135) NM State store ResourceMappings serialization are tested with Strings instead of real Device objects

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767287#comment-16767287
 ] 

Peter Bacsko commented on YARN-9135:


Thanks for updating the patch [~snemeth].

Please make sure that these methods return a standard {{Map}} 
instead of {{ImmutableMap}} (the more generic the better).
{{public ImmutableMap getNodeVsCpus()}}
{{public ImmutableMap getNodeVsCpus()}}
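
A minimal sketch of the suggestion, assuming nothing about the real class: the getter declares the general {{Map}} interface while still handing out a read-only view. The class name {{NodeResourceView}}, the setter and the key/value types are hypothetical (the generic parameters are not visible in this thread).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; the String/Integer type parameters are assumptions.
class NodeResourceView {
  private final Map<String, Integer> nodeVsCpus = new HashMap<>();

  void setCpus(String node, int cpus) {
    nodeVsCpus.put(node, cpus);
  }

  // Callers only see the Map interface; the returned view rejects modification.
  public Map<String, Integer> getNodeVsCpus() {
    return Collections.unmodifiableMap(nodeVsCpus);
  }
}
```

Callers then depend only on {{java.util.Map}}, so the backing implementation can change without touching them.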



> NM State store ResourceMappings serialization are tested with Strings instead 
> of real Device objects
> 
>
> Key: YARN-9135
> URL: https://issues.apache.org/jira/browse/YARN-9135
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9135.001.patch, YARN-9135.003.patch, 
> YARN-9135.004.patch
>
>







[jira] [Commented] (YARN-9133) Make tests more easy to comprehend in TestGpuResourceHandler

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767279#comment-16767279
 ] 

Peter Bacsko commented on YARN-9133:


+1 (non-binding)



> Make tests more easy to comprehend in TestGpuResourceHandler
> 
>
> Key: YARN-9133
> URL: https://issues.apache.org/jira/browse/YARN-9133
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9133.001.patch, YARN-9133.001.patch, 
> YARN-9133.002.patch, YARN-9133.003.patch, YARN-9133.004.patch, 
> YARN-9133.005.patch
>
>
> Tests are not quite easy to read: 
> - Some more helper methods would improve readability.
> - Eliminating the boolean flag that controls if docker is used would also 
> improve readability and clarity.






[jira] [Commented] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263
 ] 

Peter Bacsko commented on YARN-9138:


[~snemeth] now you can remove this unnecessary code-paths:

{noformat}
if (Shell.WINDOWS) {
...
} else {
...
{noformat}

> Test error handling of nvidia-smi binary execution of GpuDiscoverer
> ---
>
> Key: YARN-9138
> URL: https://issues.apache.org/jira/browse/YARN-9138
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9138.001.patch, YARN-9138.002.patch
>
>
> The code that executes nvidia-smi (doing GPU device auto-discovery) doesn't 
> have much test coverage.
> This patch adds tests to this part of the code.






[jira] [Comment Edited] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263
 ] 

Peter Bacsko edited comment on YARN-9138 at 2/13/19 2:47 PM:
-

[~snemeth]

1. now you can remove these unnecessary code-paths:

{noformat}
if (Shell.WINDOWS) {
...
} else {
...
{noformat}

2. OK, I know this is annoying, but could you static import assert calls? We 
use it everywhere else, so let's be consistent.

3. String "PATH" is used multiple times, it's worth making it static final. 
Same applies to "u+x".


was (Author: pbacsko):
[~snemeth] now you can remove these unnecessary code-paths:

{noformat}
if (Shell.WINDOWS) {
...
} else {
...
{noformat}

> Test error handling of nvidia-smi binary execution of GpuDiscoverer
> ---
>
> Key: YARN-9138
> URL: https://issues.apache.org/jira/browse/YARN-9138
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9138.001.patch, YARN-9138.002.patch
>
>
> The code that executes nvidia-smi (doing GPU device auto-discovery) doesn't 
> have much test coverage.
> This patch adds tests to this part of the code.






[jira] [Comment Edited] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263
 ] 

Peter Bacsko edited comment on YARN-9138 at 2/13/19 2:38 PM:
-

[~snemeth] now you can remove these unnecessary code-paths:

{noformat}
if (Shell.WINDOWS) {
...
} else {
...
{noformat}


was (Author: pbacsko):
[~snemeth] now you can remove this unnecessary code-paths:

{noformat}
if (Shell.WINDOWS) {
...
} else {
...
{noformat}

> Test error handling of nvidia-smi binary execution of GpuDiscoverer
> ---
>
> Key: YARN-9138
> URL: https://issues.apache.org/jira/browse/YARN-9138
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9138.001.patch, YARN-9138.002.patch
>
>
> The code that executes nvidia-smi (doing GPU device auto-discovery) doesn't 
> have much test coverage.
> This patch adds tests to this part of the code.






[jira] [Commented] (YARN-9139) Simplify initializer code of GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767257#comment-16767257
 ] 

Peter Bacsko commented on YARN-9139:


[~snemeth]

1. Please fix the remaining checkstyle issues
2. Why is the {{TestFpgaDiscoverer}} class referenced in 
{{TestGpuResourceHandler.java}}?
3. Repeated use of {{Configuration conf = createDefaultConfig();}} - extract 
{{conf}} to a class variable and initialize once


> Simplify initializer code of GpuDiscoverer
> --
>
> Key: YARN-9139
> URL: https://issues.apache.org/jira/browse/YARN-9139
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9139.001.patch, YARN-9139.002.patch, 
> YARN-9139.003.patch
>
>







[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767226#comment-16767226
 ] 

Peter Bacsko commented on YARN-9118:


Minor:
{{Configuration conf = new Configuration(false);}} - this line keeps repeating 
in the tests. How about making {{conf}} a class variable and instantiating it 
in {{setup()}}?

Otherwise +1 non-binding.



> Handle issues with parsing user defined GPU devices in GpuDiscoverer
> 
>
> Key: YARN-9118
> URL: https://issues.apache.org/jira/browse/YARN-9118
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9118.001.patch, YARN-9118.002.patch, 
> YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, 
> YARN-9118.006.patch, YARN-9118.007.patch
>
>
> getGpusUsableByYarn has the following issues: 
> - Duplicate GPU device definitions are not denied: This seems to be the 
> biggest issue as it could increase the number of devices on the node if the 
> device ID is defined 2 or more times.
> - An empty string is accepted; it behaves as if the user did not want to use 
> auto-discovery and had not defined any GPU devices. This results in an 
> empty device list, but there is no explicit empty-string check in 
> the code, so this behavior is just coincidental.
> - Number validation does not happen on GPU device IDs (separated by commas)
> Many testcases are added as the coverage was already very low.






[jira] [Commented] (YARN-9217) Nodemanager will fail to start if GPU is misconfigured on the node or GPU drivers missing

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767203#comment-16767203
 ] 

Peter Bacsko commented on YARN-9217:


Minor comments:
1. Do we need a separate variable here?
{noformat}
if (usableGpus.isEmpty()) {
  String message = "GPU is enabled on the NodeManager, but couldn't find "
      + "any usable GPU devices, please double check configuration.";
  LOG.warn(message);
{noformat}

2. Similar thing in GpuNodeResourceUpdateHandler
{noformat}
if (usableGpus.isEmpty()) {
  String message = "GPU is enabled, but couldn't find any usable GPUs on the "
  + "NodeManager.";
  LOG.warn(message);
{noformat}

3. I would rename {{checkErrorNumber()}} to {{checkErrorCount()}}

4. By the way -- is it reasonable to perform GPU discovery in a loop? What's 
the idea here? Is "nvidia-smi" flaky sometimes? What condition are we trying to 
avoid? I realized that this part of the code existed before, but still... 
anyone? :) 

5. {{NvidiaBinaryHelper}} - {{@return}} clause is missing in the JavaDoc

6. {{NvidiaBinaryHelper}} - this class is very small. If it's introduced for 
testing purposes, I strongly recommend using a replaceable lambda function, like 
this:

{noformat}
Function> gpuDeviceRetriever = 
this::getGpuDeviceInformation;
...
@VisibleForTesting
void setGpuDeviceRetriever(Function> 
func) {
  this.gpuDeviceRetriever = func;
}
...
lastDiscoveredGpuInformation = gpuDeviceRetriever.apply(pathOfGpuBinary);
{noformat}

Then you can set your own retrieving logic in the test. A lambda assigned to a 
{{Function}} can't throw checked exceptions, so you have to wrap incorrect 
return values in {{Optional}}.
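
A self-contained sketch of this idea is shown below. It is only an illustration under assumptions: the class {{GpuInfoSource}}, the methods {{runBinary()}} and {{describe()}}, and the use of {{String}} as a stand-in for the device information type are all hypothetical, not the actual {{GpuDiscoverer}} code.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical class; String stands in for the real device information type.
class GpuInfoSource {
  // Default retriever; a test can replace it through the setter below.
  private Function<String, Optional<String>> gpuDeviceRetriever = this::runBinary;

  // Package-private so only tests in the same package can swap the logic.
  void setGpuDeviceRetriever(Function<String, Optional<String>> func) {
    this.gpuDeviceRetriever = func;
  }

  // Placeholder: a real implementation would execute the binary and map
  // any failure to Optional.empty() instead of throwing a checked exception.
  private Optional<String> runBinary(String pathOfGpuBinary) {
    return Optional.empty();
  }

  String describe(String pathOfGpuBinary) {
    return gpuDeviceRetriever.apply(pathOfGpuBinary)
        .orElse("no GPU information");
  }
}
```

In a test, {{setGpuDeviceRetriever(path -> Optional.of("..."))}} replaces the binary execution entirely, so no helper class or mock framework is needed.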

*Fundamental question*: is this how we want to use this plugin? Just 
asking because we might accidentally mask erratic behavior. E.g. a Hadoop user 
might think that he has a cluster with 10 GPUs. In reality, the plugin failed 
to detect some cards, and only 5 NMs support GPU scheduling. If it's not 
explicitly displayed, the user might be under the impression that 10 GPUs are 
ready to run YARN workloads. This can be very misleading.

At the very least, a fail-fast method should be considered.

> Nodemanager will fail to start if GPU is misconfigured on the node or GPU 
> drivers missing
> -
>
> Key: YARN-9217
> URL: https://issues.apache.org/jira/browse/YARN-9217
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Major
> Attachments: YARN-9217.001.patch, YARN-9217.002.patch, 
> YARN-9217.003.patch, YARN-9217.004.patch
>
>
> Nodemanager will not start
> 1. If Autodiscovery is enabled:
>  * If nvidia-smi path is misconfigured or the file does not exist.
>  * There is 0 GPU found
>  * If the file exists but it is not pointing to an nvidia-smi
>  * if the binary is ok but there is an IOException
> 2. If the manually configured GPU devices are misconfigured
>  * Any index:minor number format failure will cause a problem
>  * 0 configured device will cause a problem
>  * NumberFormatException is not handled
> It would be a better option to add warnings about the configuration, set 0 
> available GPUs and let the node work and run non-gpu jobs.






[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767153#comment-16767153
 ] 

Peter Bacsko commented on YARN-9270:


Uploaded v2. Changes:
* FpgaDiscoverer is no longer singleton
* Removed unnecessary synchronized methods (checked the call hierarchy)

"We request the instance of the FpgaDiscoverer 5 times, and then call the 
setResourceHanderPlugin on it with the same parameter (openclPlugin)"
This is no longer relevant now.

"Also could you move the previous comments/description of the test cases to the 
new tests' javadoc?"
Removed those altogether. Tests are short now, should be obvious what they do.

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up to 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.






[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Description: 
Need to fix the following in the class {{FpgaDevice}}:
 * It implements {{Comparable}}, but returns 0 in every case. There is no 
natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but 
this seems too forced and unnecessary. We think this class should not implement 
{{Comparable}} at all, at least not like that.
 * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
one, these are never needed in the code. Secondly, temp and power usage change 
constantly. It's pointless to store these in this POJO.
 * {{serialVersionUID}} is 1L - let's generate a number for this
 * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
uniquely identifies the card, then let's demand them in the constructor and 
don't store Integers that can be null.

  was:
Need to fix the following the class {{FpgaDevice}}:
 * It implements {{Comparable}}, but returns 0 in every case. There is no 
natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but 
this seems too forced and unnecessary.We think this class should not implement 
{{Comparable}} at all, at least not like that.
 * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
one, these are never needed in the code. Secondly, temp and power usage changes 
constantly. It's pointless to store these in this POJO.
 * {{serialVersionUID}} is 1L - let's generate a number for this
 * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
uniquely identifies the card, then let's demand them in the constructor and 
don't store Integers that can be null.


> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temp and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.
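
The points above could translate into a value class along these lines. This is a sketch under assumptions, not the patched {{FpgaDevice}}: the class name {{FpgaDeviceSketch}} and the {{type}} field are hypothetical, and serialization is left out for brevity.

```java
import java.util.Objects;

// Hypothetical value class: no Comparable, primitive ints, no volatile metrics.
final class FpgaDeviceSketch {
  private final String type;
  private final int major;
  private final int minor;

  // major/minor are demanded up front, so they can never be null.
  FpgaDeviceSketch(String type, int major, int minor) {
    this.type = Objects.requireNonNull(type, "type");
    this.major = major;
    this.minor = minor;
  }

  String getType() {
    return type;
  }

  int getMajor() {
    return major;
  }

  int getMinor() {
    return minor;
  }

  // Identity is defined by type plus major/minor only.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof FpgaDeviceSketch)) {
      return false;
    }
    FpgaDeviceSketch other = (FpgaDeviceSketch) obj;
    return major == other.major && minor == other.minor
        && type.equals(other.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(type, major, minor);
  }
}
```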






[jira] [Updated] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9270:
---
Attachment: YARN-9270-002.patch

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9270-001.patch, YARN-9270-002.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up to 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.






[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-13 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767085#comment-16767085
 ] 

Peter Bacsko commented on YARN-9270:


" could we remove the wildcard import import java.util.*."
Certainly, let's do this in YARN-9266.

"don't see why the constructor of Configuration is called with false"
[...]
"Also the 5th testcase (testLinuxFpgaResourceDiscoverPluginWithSdkRootSet) uses 
another Configuration object in the original testcase"

I think the idea here is that the original conf object was created with "false" 
so that it doesn't load the default values, but in that particular test (5th), 
we do. I see no significant difference though. Just tried it, test result is 
the same. 

I'm also thinking about making {{FpgaDiscoverer}} non-singleton. It's much 
better to test that way.

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9270-001.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up to 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-12 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Attachment: YARN-9268-003.patch

> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch, 
> YARN-9268-003.patch
>
>
> Need to fix the following the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices, perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temp and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-12 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-004.patch

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch, 
> YARN-9266-003.patch, YARN-9266-004.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses {{printStackTrace()}} instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for the file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is already downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class
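The {{contains()}} surprise can be shown with a small, self-contained sketch. It does not reproduce the real {{downloadIP()}} code: the class and method names are hypothetical, and it assumes the requested IP is identified by the file name without its extension ({{hello}}), which is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper comparing substring and exact file name matching. */
final class IpFileMatchSketch {

  // Substring check, as described in the issue: may match unrelated files.
  static List<String> matchByContains(List<String> fileNames, String ipId) {
    List<String> result = new ArrayList<>();
    for (String name : fileNames) {
      if (name.endsWith(".aocx") && name.contains(ipId)) {
        result.add(name);
      }
    }
    return result;
  }

  // Exact check: only "<ipId>.aocx" is accepted.
  static List<String> matchExactly(List<String> fileNames, String ipId) {
    String wanted = ipId + ".aocx";
    List<String> result = new ArrayList<>();
    for (String name : fileNames) {
      if (name.equals(wanted)) {
        result.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> localized = Arrays.asList("hello2.aocx", "hello.aocx");
    System.out.println(matchByContains(localized, "hello"));
    System.out.println(matchExactly(localized, "hello"));
  }
}
```

With the substring check, whichever matching file is encountered first would win, so the result depends on iteration order; the exact comparison removes that ambiguity.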






[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-12 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-002.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.1.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status   Information
>  
> pac_a10_f20         Passed   PAC Arria 10 Platform (pac_a10_f20)
>                              PCIe 08:00.0
>                              FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.
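Problem #1 can be reproduced in isolation with a few lines of Java. The two patterns below are copied from the WARN lines in the log (including the trailing comma that the first one is printed with), and the sample text is a whitespace-normalized copy of the PAC output quoted above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical check: do the logged patterns occur in the PAC output? */
final class DiagnosePatternSketch {
  // Copied from the "Couldn't find ... pattern" messages in the log.
  static final Pattern BUS_PATTERN =
      Pattern.compile("(?i)bus:slot.func\\s=\\s.*,");
  static final Pattern POWER_PATTERN =
      Pattern.compile("(?i)Total\\sCard\\sPower\\sUsage\\s=\\s.*");

  // Whitespace-normalized copy of the "aocl diagnose" output for the PAC.
  static final String PAC_OUTPUT = String.join("\n",
      "Device Name:",
      "acl0",
      "Vendor: Intel Corp",
      "pac_a10_f20 Passed PAC Arria 10 Platform (pac_a10_f20)",
      "PCIe 08:00.0",
      "FPGA temperature = 79 degrees C.",
      "DIAGNOSTIC_PASSED");

  static boolean found(Pattern pattern, String text) {
    return pattern.matcher(text).find();
  }

  public static void main(String[] args) {
    System.out.println("bus pattern found:   " + found(BUS_PATTERN, PAC_OUTPUT));
    System.out.println("power pattern found: " + found(POWER_PATTERN, PAC_OUTPUT));
  }
}
```

Neither pattern occurs in this output: the PAC reports its PCIe address as {{PCIe 08:00.0}} rather than in a {{bus:slot.func = ...}} line, and it prints no total card power usage at all.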






[jira] [Updated] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-11 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-002.patch

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure the card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because otherwise the card will not be reprogrammed. Suggestion: instead of 
> storing the Node<->IPID mapping, store a Node<->IPID hash mapping (such as the 
> SHA-256 of the localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports
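The hash suggestion in the first bullet could look roughly like this. It is a minimal sketch using only the JDK ({{java.security.MessageDigest}}); the class and method names are hypothetical, and how the hash would be stored per node is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper computing the SHA-256 of a localized aocx file. */
final class IpFileHashSketch {

  // Streams the content through SHA-256 and returns a lowercase hex string.
  static String sha256Hex(InputStream in) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read);
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 not available", e);
    }
  }

  static String sha256Hex(Path file) {
    try (InputStream in = Files.newInputStream(file)) {
      return sha256Hex(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The card needs reprogramming when the stored hash differs from the new one.
  static boolean needsReprogramming(String storedHash, String newHash) {
    return !newHash.equals(storedHash);
  }

  public static void main(String[] args) {
    if (args.length == 1) {
      System.out.println(sha256Hex(Paths.get(args[0])));
    }
  }
}
```

Because the hash depends only on the file content, a recompiled aocx file with an unchanged name would yield a different value and trigger reprogramming, while an identical file would not.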






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-11 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-003.patch




[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-11 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-001.patch




[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-11 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764841#comment-16764841
 ] 

Peter Bacsko commented on YARN-9265:


I started to work on this.

I decided to handle the script output the same way as we do with the property. 
So the same format is expected from the script.




[jira] [Comment Edited] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763083#comment-16763083
 ] 

Peter Bacsko edited comment on YARN-9265 at 2/7/19 9:16 PM:


[~snemeth] yes that is a possibility. A property value has to be compact, so 
it's pretty much necessary to squeeze the data into a single line. The script 
might use a different output format, like:


{noformat}
acl0 243 0
acl1 243 1
{noformat}

I'm fine with either of those.






[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763083#comment-16763083
 ] 

Peter Bacsko commented on YARN-9265:


[~snemeth] yes that is a possibility. A property value has to be compact, so 
it's pretty much necessary to squeeze data into a single line. The script might 
use a different output format, like:


{noformat}
acl0 243 0
acl1 243 1
{noformat}

I'm fine with either of those.
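For illustration, a line-oriented format like the one above could be parsed as follows. This is a sketch under assumptions: the three columns are taken to be the device alias followed by the major and minor device numbers, and the class names are hypothetical; nothing here is taken from an actual patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for "<alias> <major> <minor>" lines. */
final class DiscoveryScriptOutputSketch {

  static final class Device {
    final String alias;
    final int major;
    final int minor;

    Device(String alias, int major, int minor) {
      this.alias = alias;
      this.major = major;
      this.minor = minor;
    }
  }

  static List<Device> parse(String scriptOutput) {
    List<Device> devices = new ArrayList<>();
    for (String line : scriptOutput.split("\\R")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate blank lines
      }
      String[] fields = trimmed.split("\\s+");
      if (fields.length != 3) {
        throw new IllegalArgumentException(
            "Expected '<alias> <major> <minor>' but got: " + line);
      }
      // Integer.parseInt rejects non-numeric major/minor values.
      devices.add(new Device(fields[0],
          Integer.parseInt(fields[1]), Integer.parseInt(fields[2])));
    }
    return devices;
  }

  public static void main(String[] args) {
    for (Device d : parse("acl0 243 0\nacl1 243 1\n")) {
      System.out.println(d.alias + " -> " + d.major + ":" + d.minor);
    }
  }
}
```

Failing fast on malformed lines keeps a broken script from silently producing an empty or partial device list.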




[jira] [Updated] (YARN-9270) Minor cleanup in TestFpgaDiscoverer

2019-02-07 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9270:
---
Attachment: YARN-9270-001.patch

> Minor cleanup in TestFpgaDiscoverer
> ---
>
> Key: YARN-9270
> URL: https://issues.apache.org/jira/browse/YARN-9270
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9270-001.patch
>
>
> Let's do some cleanup in this class.
> * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split 
> up into 5 different tests, because it tests 5 different scenarios.
> * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a 
> {{Function}} in the plugin class like {{Function<String, String> envProvider 
> = System::getenv}} plus a setter method which allows the test to modify 
> {{envProvider}}. Much simpler and more straightforward.
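The {{envProvider}} idea might look like the sketch below. The class name, the {{lookupSdkRoot()}} method and the {{ALTERAOCLSDKROOT}} variable name are illustrative assumptions; only the {{Function}} field with a setter comes from the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical plugin fragment with an injectable environment lookup. */
final class EnvProviderSketch {
  // Production default: read from the real process environment.
  private Function<String, String> envProvider = System::getenv;

  // Package-private setter so a test can substitute a fake environment.
  void setEnvProvider(Function<String, String> envProvider) {
    this.envProvider = envProvider;
  }

  // Illustrative lookup; the variable name is an assumption of this sketch.
  String lookupSdkRoot() {
    return envProvider.apply("ALTERAOCLSDKROOT");
  }

  public static void main(String[] args) {
    Map<String, String> fakeEnv = new HashMap<>();
    fakeEnv.put("ALTERAOCLSDKROOT", "/opt/fake-sdk");

    EnvProviderSketch plugin = new EnvProviderSketch();
    plugin.setEnvProvider(fakeEnv::get);
    System.out.println(plugin.lookupSdkRoot());
  }
}
```

A test then only needs a plain map, and no reflection on the JVM's internal environment map is required.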






[jira] [Commented] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762946#comment-16762946
 ] 

Peter Bacsko commented on YARN-9269:


Thanks [~adam.antal] for the comments.

1. {{allowedFpgas}} is populated only once and never modified again. It's 
called from the {{bootstrap()}} method when the NM starts up. So I add the 
elements to a local list then wrap it in an immutable list.

2. Yeah but that technically cannot happen. Also, we no longer retrieve the 
device object from a list - we have it already as a method argument. Anyway 
I'll have to double check that this is still OK (I haven't tested any 
modifications on a cluster yet).

3. Unused stuff in FpgaDevice: yeah, there's a separate Jira for that: 
YARN-9268. We store stuff which is unnecessary. Only aliasDevName, minor and 
major are important.

4. Ok, I will do the rename + remove the extra spaces.
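Point 1 (collect into a local list, then wrap it) could be sketched as below. The class name and the use of {{String}} elements are simplifications for the example; the real code holds {{FpgaDevice}} objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical allocator fragment with a populate-once, read-only list. */
final class AllowedFpgasSketch {
  private List<String> allowedFpgas = Collections.emptyList();

  // Called once at startup: fill a local list, then publish a read-only view.
  void bootstrap(List<String> discovered) {
    List<String> local = new ArrayList<>();
    for (String device : discovered) {
      local.add(device);
    }
    // The local list is never exposed, so the wrapper cannot change later.
    allowedFpgas = Collections.unmodifiableList(local);
  }

  List<String> getAllowedFpgas() {
    return allowedFpgas;
  }

  public static void main(String[] args) {
    AllowedFpgasSketch allocator = new AllowedFpgasSketch();
    allocator.bootstrap(Arrays.asList("acl0", "acl1"));
    System.out.println(allocator.getAllowedFpgas());
  }
}
```

Guava's {{ImmutableList.copyOf()}} would be an alternative to {{Collections.unmodifiableList()}}; either way, callers that try to modify the returned list get an {{UnsupportedOperationException}}.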

 

> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: YARN-9269-001.patch, YARN-9269-002.patch
>
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private
>  * get rid of {{*}} imports






[jira] [Comment Edited] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762595#comment-16762595
 ] 

Peter Bacsko edited comment on YARN-9266 at 2/7/19 11:54 AM:
-

Patch v2 - removed {{aliasMap}}, fixed a lot of checkstyle issues.






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-07 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Attachment: YARN-9266-002.patch




[jira] [Commented] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-07 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762595#comment-16762595
 ] 

Peter Bacsko commented on YARN-9266:


Patch v2 - removed {{aliasMap}}, fixed a lot of checkstyle issues.

> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch, YARN-9266-002.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is already downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class






[jira] [Comment Edited] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-06 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761821#comment-16761821
 ] 

Peter Bacsko edited comment on YARN-9265 at 2/6/19 3:08 PM:


Today we had a meeting with [~snemeth], [~shuzirra], [~zsiegl], [~adam.antal].

Right now it's difficult to reliably parse the output of "aocl diagnose" 
because it's intended for humans to read. Figuring out the device file can 
also be a challenge. Our proposal is:
 * for backward compatibility reasons, let's keep the existing parsing logic
 * introduce a script-based solution: we define an external script in a 
property ({{yarn.nodemanager.resource-plugins.fpga.device-discovery-script}}) 
which provides info about the available FPGA cards in the system. This is 
already a working approach in Hadoop: the network topology script and the disk 
checker (if provided) are also invoked as external programs. Obviously the 
script should print its output in a strict format.
 * if users don't want to mess around with the script, they can provide the 
available devices as the value of a property (like 
{{yarn.nodemanager.resource-plugins.fpga.available-devices}}), in the format 
defined above: {{acl0/243:0,acl1/243:1}}. The external script might actually 
return its output in the same format.

Priority if multiple methods are enabled:
# List of devices defined by the user
# Script execution
# Default parsing method
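
As a rough illustration of the strict format mentioned above, a parser for a value like {{acl0/243:0,acl1/243:1}} could look like the sketch below. It assumes each entry is {{<device name>/<major>:<minor>}}; the class name {{FpgaDeviceListParser}} and its nested {{Entry}} type are made up for this example and are not existing Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the proposed device list format
// "<name>/<major>:<minor>[,<name>/<major>:<minor>...]".
final class FpgaDeviceListParser {
  // Simple value holder for one parsed entry.
  static final class Entry {
    final String name;
    final int major;
    final int minor;

    Entry(String name, int major, int minor) {
      this.name = name;
      this.major = major;
      this.minor = minor;
    }
  }

  private FpgaDeviceListParser() {
  }

  // Throws IllegalArgumentException if any entry is malformed
  // (NumberFormatException is a subclass of it).
  static List<Entry> parse(String value) {
    List<Entry> result = new ArrayList<>();
    for (String item : value.split(",")) {
      String trimmed = item.trim();
      int slash = trimmed.indexOf('/');
      int colon = trimmed.indexOf(':', slash + 1);
      if (slash <= 0 || colon < 0) {
        throw new IllegalArgumentException(
            "Illegal device specification: " + trimmed);
      }
      String name = trimmed.substring(0, slash);
      int major = Integer.parseInt(trimmed.substring(slash + 1, colon));
      int minor = Integer.parseInt(trimmed.substring(colon + 1));
      result.add(new Entry(name, major, minor));
    }
    return result;
  }
}
```

Failing fast on a malformed entry seems preferable to skipping it silently, since both the property and the script output are meant to be machine-readable.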


was (Author: pbacsko):
Today we had a meeting with [~snemeth], [~shuzirra], [~zsiegl], [~adam.antal].

Right now it's difficult to reliably parse the output of "aocl diagnose" 
because it's intended for humans to read. Figuring out the device file can 
also be a challenge. Our proposal is:
 * for backward compatibility reasons, let's keep the existing parsing logic
 * introduce a script-based solution: we define an external script in a 
property ({{yarn.nodemanager.resource-plugins.fpga.device-discovery-script}}) 
which provides info about the available FPGA cards in the system. This is 
already a working approach in Hadoop: the network topology script and the disk 
checker (if provided) are also invoked as external programs. Obviously the 
script should print its output in a strict format.
 * if users don't want to mess around with the script, they can provide the 
available devices as the value of a property (like 
{{yarn.nodemanager.resource-plugins.fpga.available-devices}}), in the format 
defined above: {{acl0/243:0,acl1/243:1}}. The external script might actually 
return its output in the same format.

Priority if multiple methods are enabled:
* List of devices defined by the user
* Script execution
* Default parsing method

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Priority: Critical
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   StatusInformation
>  
> pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20)
>   PCIe 08:00.0
>   FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaR

[jira] [Assigned] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-06 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-9265:
--

Assignee: Peter Bacsko

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   StatusInformation
>  
> pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20)
>   PCIe 08:00.0
>   FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}} which is not the case, the actual 
> file is {{/dev/intel-fpga-port.0}}.
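
To make Problem #1 concrete, the small check below runs the two patterns from the WARN lines against an abridged, whitespace-normalized copy of the PAC diagnose output quoted above. The class name {{DiagnosePatternCheck}} is hypothetical, the trailing comma in the first pattern is copied from the log line as-is (whether it belongs to the pattern or to the log message is an assumption), and {{PCIE_PATTERN}} is only an illustrative pattern, not something the plugin uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check, not part of the plugin.
final class DiagnosePatternCheck {
  // Pattern texts as they appear in the WARN lines of the log.
  static final Pattern BUS_PATTERN =
      Pattern.compile("(?i)bus:slot.func\\s=\\s.*,");
  static final Pattern POWER_PATTERN =
      Pattern.compile("(?i)Total\\sCard\\sPower\\sUsage\\s=\\s.*");
  // Illustrative only: the PAC output reports the bus as "PCIe 08:00.0".
  static final Pattern PCIE_PATTERN = Pattern.compile("(?i)PCIe\\s+(\\S+)");

  // Abridged lines from the PAC "aocl diagnose" output in the description.
  static final String PAC_OUTPUT = String.join("\n",
      "pac_a10_f20 Passed PAC Arria 10 Platform (pac_a10_f20)",
      "PCIe 08:00.0",
      "FPGA temperature = 79 degrees C.",
      "DIAGNOSTIC_PASSED");

  private DiagnosePatternCheck() {
  }

  static boolean found(Pattern pattern, String text) {
    return pattern.matcher(text).find();
  }

  // Returns the token following "PCIe", or null if there is none.
  static String pcieAddress(String text) {
    Matcher m = PCIE_PATTERN.matcher(text);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(found(BUS_PATTERN, PAC_OUTPUT)); // false
    System.out.println(found(POWER_PATTERN, PAC_OUTPUT)); // false
    System.out.println(pcieAddress(PAC_OUTPUT)); // 08:00.0
  }
}
```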






[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-06 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761821#comment-16761821
 ] 

Peter Bacsko commented on YARN-9265:


Today we had a meeting with [~snemeth], [~shuzirra], [~zsiegl], [~adam.antal].

Right now it's difficult to reliably parse the output of "aocl diagnose" 
because it's intended for humans to read. Figuring out the device file can 
also be a challenge. Our proposal is:
 * for backward compatibility reasons, let's keep the existing parsing logic
 * introduce a script-based solution: we define an external script in a 
property ({{yarn.nodemanager.resource-plugins.fpga.device-discovery-script}}) 
which provides info about the available FPGA cards in the system. This is 
already a working approach in Hadoop: the network topology script and the disk 
checker (if provided) are also invoked as external programs. Obviously the 
script should print its output in a strict format.
 * if users don't want to mess around with the script, they can provide the 
available devices as the value of a property (like 
{{yarn.nodemanager.resource-plugins.fpga.available-devices}}), in the format 
defined above: {{acl0/243:0,acl1/243:1}}. The external script might actually 
return its output in the same format.

Priority if multiple methods are enabled:
* List of devices defined by the user
* Script execution
* Default parsing method

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Priority: Critical
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   StatusInformation
>  
> pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20)
>   PCIe 08:00.0
>   FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}} which is not the case, the actual 
> file is {{/dev/intel-fpga-port.0}}.




[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-06 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9269:
---
Attachment: YARN-9269-002.patch

> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9269-001.patch, YARN-9269-002.patch
>
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same one that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private
>  * get rid of {{*}} imports
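
Two of the points above (warning on duplicate devices and an immutable {{allowedFpgas}}) could look like the sketch below. It is not the real {{FpgaResourceAllocator}}: the class {{AllocatorSketch}} is made up, devices are represented as plain strings, and {{java.util.logging}} is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch, not the real FpgaResourceAllocator.
final class AllocatorSketch {
  private static final Logger LOG =
      Logger.getLogger(AllocatorSketch.class.getName());

  // FPGA type -> available devices; a plain HashMap is used here because
  // this sketch never relies on the iteration order of the keys.
  private final Map<String, List<String>> availableFpgas = new HashMap<>();
  private final List<String> allowedFpgas;

  AllocatorSketch(List<String> allowed) {
    // Defensive copy wrapped in an unmodifiable view.
    this.allowedFpgas =
        Collections.unmodifiableList(new ArrayList<>(allowed));
  }

  List<String> getAllowedFpgas() {
    return allowedFpgas;
  }

  // Returns false and logs a warning if the device is already known.
  boolean addFpga(String type, String device) {
    List<String> devices =
        availableFpgas.computeIfAbsent(type, k -> new ArrayList<>());
    if (devices.contains(device)) {
      LOG.warning("Duplicate FPGA device ignored: " + device);
      return false;
    }
    devices.add(device);
    return true;
  }
}
```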






[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-06 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9269:
---
Attachment: YARN-9269-001.patch

> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9269-001.patch
>
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same one that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private
>  * get rid of {{*}} imports






[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Attachment: YARN-9268-002.patch

> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch, YARN-9268-002.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.
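
A device class along the lines described above might look like this. It is a sketch under assumptions, not the committed patch: the name {{SimpleFpgaDevice}} is hypothetical, the {{serialVersionUID}} is an arbitrary example value, and keeping the alias name (e.g. "acl0") in the class follows the suggestion in YARN-9266.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical reworked device class: no Comparable, primitive
// major/minor demanded in the constructor, no volatile sensor data.
final class SimpleFpgaDevice implements Serializable {
  // Arbitrary example value, not taken from any patch.
  private static final long serialVersionUID = -4678487141824092751L;

  private final int major;
  private final int minor;
  private final String aliasDevName;

  SimpleFpgaDevice(int major, int minor, String aliasDevName) {
    this.major = major;
    this.minor = minor;
    this.aliasDevName =
        Objects.requireNonNull(aliasDevName, "aliasDevName");
  }

  int getMajor() {
    return major;
  }

  int getMinor() {
    return minor;
  }

  String getAliasDevName() {
    return aliasDevName;
  }

  // Identity is based on major/minor only, assuming these uniquely
  // identify the card.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof SimpleFpgaDevice)) {
      return false;
    }
    SimpleFpgaDevice other = (SimpleFpgaDevice) obj;
    return major == other.major && minor == other.minor;
  }

  @Override
  public int hashCode() {
    return Objects.hash(major, minor);
  }

  @Override
  public String toString() {
    return aliasDevName + "/" + major + ":" + minor;
  }
}
```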






[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9269:
---
Description: 
Some stuff that we observed:
 * {{addFpga()}} - we check for duplicate devices, but we don't print any 
error/warning if there are any.
 * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this 
method even needed? We already receive an {{FpgaDevice}} instance in 
{{updateFpga()}} which I believe is the same one that we're looking up.
 * variable {{IPIDpreference}} is confusing
 * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
{{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
{{HashMap}} suffice?
 * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
 * {{allowedFpgas}} should be an immutable list
 * {{@VisibleForTesting}} methods should be package private
 * get rid of {{*}} imports

  was:
Some stuff that we observed:
 * {{addFpga()}} - we check for duplicate devices, but we don't print any 
error/warning if there are any.
 * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this 
method even needed? We already receive an {{FpgaDevice}} instance in 
{{updateFpga()}} which I believe is the same one that we're looking up.
 * variable {{IPIDpreference}} is confusing
 * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
{{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
{{HashMap}} suffice?
 * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
 * {{allowedFpgas}} should be an immutable list
 * {{@VisibleForTesting}} methods should be package private


> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same one that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private
>  * get rid of {{*}} imports






[jira] [Updated] (YARN-9269) Minor cleanup in FpgaResourceAllocator

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9269:
---
Description: 
Some stuff that we observed:
 * {{addFpga()}} - we check for duplicate devices, but we don't print any 
error/warning if there are any.
 * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this 
method even needed? We already receive an {{FpgaDevice}} instance in 
{{updateFpga()}} which I believe is the same one that we're looking up.
 * variable {{IPIDpreference}} is confusing
 * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
{{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
{{HashMap}} suffice?
 * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
 * {{allowedFpgas}} should be an immutable list
 * {{@VisibleForTesting}} methods should be package private

  was:
Some stuff that we observed:

* {{addFpga()}} - we check for duplicate devices, but we don't print any 
error/warning if there are any.
* {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is this 
method even needed? We already receive an {{FpgaDevice}} instance in 
{{updateFpga()}} which I believe is the same one that we're looking up.
* variable {{IPIDpreference}} is confusing
* {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
{{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
{{HashMap}} suffice?
* {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
* {{allowedFpgas}} should be an immutable list


> Minor cleanup in FpgaResourceAllocator
> --
>
> Key: YARN-9269
> URL: https://issues.apache.org/jira/browse/YARN-9269
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Some stuff that we observed:
>  * {{addFpga()}} - we check for duplicate devices, but we don't print any 
> error/warning if there are any.
>  * {{findMatchedFpga()}} should be called {{findMatchingFpga()}}. Also, is 
> this method even needed? We already receive an {{FpgaDevice}} instance in 
> {{updateFpga()}} which I believe is the same one that we're looking up.
>  * variable {{IPIDpreference}} is confusing
>  * {{availableFpga}} / {{usedFpgaByRequestor}} are instances of 
> {{LinkedHashMap}}. What's the rationale behind this? Doesn't a simple 
> {{HashMap}} suffice?
>  * {{usedFpgaByRequestor}} should be renamed, naming is a bit unclear
>  * {{allowedFpgas}} should be an immutable list
>  * {{@VisibleForTesting}} methods should be package private






[jira] [Commented] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-05 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760766#comment-16760766
 ] 

Peter Bacsko commented on YARN-9268:


The patch depends (at least) on YARN-9266, so it has to be rebased once that 
is committed.

> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.






[jira] [Updated] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9264:
---
Affects Version/s: (was: 3.1.1)
   3.1.0

> [Umbrella] Follow-up on IntelOpenCL FPGA plugin
> ---
>
> Key: YARN-9264
> URL: https://issues.apache.org/jira/browse/YARN-9264
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The Intel FPGA resource type support was released in Hadoop 3.1.0.
> Right now the plugin implementation has some deficiencies that need to be 
> fixed. This JIRA lists all problems that need to be resolved.






[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Attachment: YARN-9268-001.patch

> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9268-001.patch
>
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.






[jira] [Updated] (YARN-9266) Various fixes are needed in IntelFpgaOpenclPlugin

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9266:
---
Description: 
Problems identified in this class:
 * {{InnerShellExecutor}} ignores the timeout parameter
 * {{configureIP()}} uses printStackTrace() instead of logging
 * {{configureIP()}} does not log the output of aocl if the exit code != 0
 * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
for better testability
 * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
also matches)
 * method name {{downloadIP()}} is misleading – it actually tries to find the 
file. Everything is already downloaded (localized) at this point.
 * {{@VisibleForTesting}} methods should be package private
 * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} class

  was:
Problems identified in this class:

* {{InnerShellExecutor}} ignores the timeout parameter
* {{configureIP()}} uses printStackTrace() instead of logging
* {{configureIP()}} does not log the output of aocl if the exit code != 0
* {{parseDiagnoseInfo()}} is too heavyweight -- it should be in its own class 
for better testability
* {{downloadIP()}} uses {{contains()}} for file name check -- this can really 
surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
also matches)
* method name {{downloadIP()}} is misleading -- it actually tries to find the 
file. Everything is already downloaded (localized) at this point.
* {{@VisibleForTesting}} methods should be package private


> Various fixes are needed in IntelFpgaOpenclPlugin
> -
>
> Key: YARN-9266
> URL: https://issues.apache.org/jira/browse/YARN-9266
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9266-001.patch
>
>
> Problems identified in this class:
>  * {{InnerShellExecutor}} ignores the timeout parameter
>  * {{configureIP()}} uses printStackTrace() instead of logging
>  * {{configureIP()}} does not log the output of aocl if the exit code != 0
>  * {{parseDiagnoseInfo()}} is too heavyweight – it should be in its own class 
> for better testability
>  * {{downloadIP()}} uses {{contains()}} for file name check – this can really 
> surprise users in some cases (e.g. you want to use hello.aocx but hello2.aocx 
> also matches)
>  * method name {{downloadIP()}} is misleading – it actually tries to find 
> the file. Everything is already downloaded (localized) at this point.
>  * {{@VisibleForTesting}} methods should be package private
>  * {{aliasMap}} is not needed - store the acl number in the {{FpgaDevice}} 
> class






[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice

2019-02-05 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9268:
---
Description: 
Need to fix the following in the class {{FpgaDevice}}:
 * It implements {{Comparable}}, but returns 0 in every case. There is no 
natural ordering among FPGA devices. Perhaps "acl0" comes before "acl1", but 
this seems too forced and unnecessary. We think this class should not implement 
{{Comparable}} at all, at least not like that.
 * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
one, these are never needed in the code. Secondly, temperature and power usage 
change constantly. It's pointless to store these in this POJO.
 * {{serialVersionUID}} is 1L - let's generate a number for this
 * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
uniquely identifies the card, then let's demand them in the constructor and 
don't store Integers that can be null.

  was:
Need to fix the following in the class {{FpgaDevice}}:
 * It implements {{Comparable}}, but not {{Comparable<FpgaDevice>}}, so we have 
a raw type warning. It also returns 0 in every case. There is no natural 
ordering among FPGA devices. Perhaps "acl0" comes before "acl1", but this seems 
too forced and unnecessary. We think this class should not implement 
{{Comparable}} at all, at least not like that.
 * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
one, these are never needed in the code. Secondly, temperature and power usage 
change constantly. It's pointless to store these in this POJO.
 * {{serialVersionUID}} is 1L - let's generate a number for this
 * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
uniquely identifies the card, then let's demand them in the constructor and 
don't store Integers that can be null.
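
To illustrate why a {{compareTo()}} that returns 0 in every case is risky, here is a minimal sketch. The class {{AlwaysEqualDevice}} is hypothetical and only stands in for the behaviour described above. The issue does not say that devices are kept in sorted collections; the sketch only shows what would happen if they were.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in: compareTo() claims every device equals every other.
class AlwaysEqualDevice implements Comparable<AlwaysEqualDevice> {
  final int minor;

  AlwaysEqualDevice(int minor) {
    this.minor = minor;
  }

  @Override
  public int compareTo(AlwaysEqualDevice other) {
    return 0; // mirrors the "returns 0 in every case" behaviour
  }
}

class CompareToDemo {

  // Adds two different devices to a sorted set and reports its size.
  static int sizeAfterAddingTwo() {
    Set<AlwaysEqualDevice> devices = new TreeSet<>();
    devices.add(new AlwaysEqualDevice(0));
    devices.add(new AlwaysEqualDevice(1)); // treated as a duplicate
    return devices.size();
  }

  public static void main(String[] args) {
    System.out.println(sizeAfterAddingTwo()); // prints 1
  }
}
```

A {{TreeSet}} decides equality through {{compareTo()}}, so the second device is silently dropped. Not implementing {{Comparable}} at all avoids this class of bug.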


> Various fixes are needed in FpgaDevice
> --
>
> Key: YARN-9268
> URL: https://issues.apache.org/jira/browse/YARN-9268
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Need to fix the following in the class {{FpgaDevice}}:
>  * It implements {{Comparable}}, but returns 0 in every case. There is no 
> natural ordering among FPGA devices. Perhaps "acl0" comes before "acl1", but 
> this seems too forced and unnecessary. We think this class should not 
> implement {{Comparable}} at all, at least not like that.
>  * Stores unnecessary fields: devName, busNum, temperature, power usage. For 
> one, these are never needed in the code. Secondly, temperature and power usage 
> change constantly. It's pointless to store these in this POJO.
>  * {{serialVersionUID}} is 1L - let's generate a number for this
>  * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor 
> uniquely identifies the card, then let's demand them in the constructor and 
> don't store Integers that can be null.
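
The points above could translate into a value class along the following lines. This is a sketch only: the class name {{FpgaDeviceSketch}}, the {{type}} field and the {{serialVersionUID}} value are assumptions made for illustration and are not the contents of any patch.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical sketch; field set and names are assumptions.
final class FpgaDeviceSketch implements Serializable {
  // Arbitrary illustrative value, not a generated one from the real class.
  private static final long serialVersionUID = -4678487141824092751L;

  private final String type;
  private final int major; // primitive: can never be null
  private final int minor; // primitive: can never be null

  // major and minor are demanded up front instead of being set later.
  FpgaDeviceSketch(String type, int major, int minor) {
    this.type = Objects.requireNonNull(type, "type must not be null");
    this.major = major;
    this.minor = minor;
  }

  String getType() { return type; }
  int getMajor() { return major; }
  int getMinor() { return minor; }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof FpgaDeviceSketch)) {
      return false;
    }
    FpgaDeviceSketch other = (FpgaDeviceSketch) obj;
    return major == other.major && minor == other.minor
        && type.equals(other.type);
  }

  @Override
  public int hashCode() {
    return Objects.hash(type, major, minor);
  }

  public static void main(String[] args) {
    FpgaDeviceSketch a = new FpgaDeviceSketch("IntelOpenCL", 246, 0);
    FpgaDeviceSketch b = new FpgaDeviceSketch("IntelOpenCL", 246, 0);
    System.out.println(a.equals(b));
  }
}
```

Identity rests on {{equals()}} and {{hashCode()}} rather than on an ordering, and volatile readings such as temperature are left out of the object entirely.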


