[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758296#comment-16758296
 ] 

Peter Bacsko commented on YARN-9265:
------------------------------------

We _really_ need a good solution to this. There are several approaches I can think of:

1. Extend/fix existing parsing logic:
 * Pros: everything is at a single place
 * Cons: parsing is already convoluted

2. Pluggable parser implementation:
 * Pros: logic can be separated for different cards, and users can plug in 
their own implementation
 * Cons: we might not be able to use two different FPGA devices in a single node

3. Users can override FPGA devices with properties. Basically we only need 
three things: acl numbers and major/minor device numbers. So a property like 
{{yarn.nodemanager.resource-plugins.fpga.available-devices = 
acl0/243:0,acl1/244:0}}
 * Pros: we don't rely on a semi-structured output
 * Cons: manual steps are necessary for successful configuration. Users have to 
know what they're doing.

4. Try to determine devices from {{/sys/class/fpga}}
 * Pros: can be more reliable than using aocl + reading stuff under {{/dev}}
 * Cons: what if there are multiple entries? E.g. for PAC, we have 
{{/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0}} and 
{{/sys/class/fpga/intel-fpga-dev.0/intel-fpga-port.0}}.

5. Something else that I'm not aware of :)
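
To make option 3 more concrete, here is a minimal sketch of how a value in the {{alias/major:minor}} form could be parsed. The class and method names ({{FpgaDeviceProperty}}, {{Device}}, {{parse}}) are hypothetical and not part of the existing plugin; the only thing taken from the proposal is the value format {{acl0/243:0,acl1/244:0}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not existing YARN code: parses a comma-separated
// list of "alias/major:minor" entries as proposed in option 3.
final class FpgaDeviceProperty {

  /** One configured device: alias (e.g. acl0) plus major/minor numbers. */
  static final class Device {
    final String alias;
    final int major;
    final int minor;

    Device(String alias, int major, int minor) {
      this.alias = alias;
      this.major = major;
      this.minor = minor;
    }

    @Override
    public String toString() {
      return alias + "/" + major + ":" + minor;
    }
  }

  /** Returns an empty list for a null or blank value; rejects malformed entries. */
  static List<Device> parse(String value) {
    List<Device> devices = new ArrayList<>();
    if (value == null || value.trim().isEmpty()) {
      return devices;
    }
    for (String entry : value.split(",")) {
      String trimmed = entry.trim();
      int slash = trimmed.indexOf('/');
      int colon = trimmed.indexOf(':', slash + 1);
      if (slash <= 0 || colon < 0) {
        throw new IllegalArgumentException("Malformed device entry: " + trimmed);
      }
      try {
        int major = Integer.parseInt(trimmed.substring(slash + 1, colon));
        int minor = Integer.parseInt(trimmed.substring(colon + 1));
        devices.add(new Device(trimmed.substring(0, slash), major, minor));
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Malformed device entry: " + trimmed, e);
      }
    }
    return devices;
  }

  public static void main(String[] args) {
    System.out.println(parse("acl0/243:0,acl1/244:0"));
  }
}
```

Failing fast on a malformed entry seems preferable here, since a manually configured value that silently yields no devices would be hard to debug.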

[~leftnoteasy], [~tangzhankun], [~snemeth], [~shuzirra] opinions, ideas?

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> ----------------------------------------------------------------
>
>                 Key: YARN-9265
>                 URL: https://issues.apache.org/jira/browse/YARN-9265
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.1.0
>            Reporter: Peter Bacsko
>            Priority: Critical
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> --------------------------------------------------------------------
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   Status            Information
>  
> pac_a10_f200000     Passed            PAC Arria 10 Platform (pac_a10_f200000)
>                                       PCIe 08:00.0
>                                       FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> --------------------------------------------------------------------
>  
> Call "aocl diagnose <device-names>" to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f300000
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f300000}}, which is not the case; the actual 
> file is {{/dev/intel-fpga-port.0}}.
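
The first WARN in Problem #1 can be reproduced in isolation. The sketch below applies the pattern quoted in the log to the three device lines of the PAC output, assuming the plugin searches the output with {{Matcher.find()}} (an assumption, not confirmed here). The class name {{PacDiagnoseCheck}} and the {{PCIe}} pattern are hypothetical and only illustrate that the address is present in a different form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the YARN code base.
final class PacDiagnoseCheck {

  // Pattern copied from the WARN line in the log above.
  static final Pattern BUS_PATTERN =
      Pattern.compile("(?i)bus:slot.func\\s=\\s.*,");

  // Hypothetical alternative that matches the "PCIe 08:00.0" line of the PAC output.
  static final Pattern PCIE_PATTERN = Pattern.compile("PCIe\\s+(\\S+)");

  // Device lines taken from the "aocl diagnose" output quoted above.
  static final String PAC_OUTPUT =
      "pac_a10_f200000     Passed            PAC Arria 10 Platform (pac_a10_f200000)\n"
      + "                                      PCIe 08:00.0\n"
      + "                                      FPGA temperature = 79 degrees C.\n";

  /** True if the output contains a "bus:slot.func = ...," line. */
  static boolean hasBusLine(String output) {
    return BUS_PATTERN.matcher(output).find();
  }

  /** Returns the token following "PCIe", or null if there is none. */
  static String pcieAddress(String output) {
    Matcher m = PCIE_PATTERN.matcher(output);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println("bus:slot.func line found: " + hasBusLine(PAC_OUTPUT));
    System.out.println("PCIe address: " + pcieAddress(PAC_OUTPUT));
  }
}
```

With this input the existing pattern finds nothing, while the PCIe address is still recoverable from the {{PCIe 08:00.0}} line.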



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

