[ https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758296#comment-16758296 ]
Peter Bacsko commented on YARN-9265:
------------------------------------

We _really_ need a good solution to this. There are several approaches I can think of:

1. Extend/fix the existing parsing logic:
* Pros: everything is in a single place
* Cons: parsing is already convoluted

2. Pluggable parser implementation:
* Pros: logic can be separated for different cards, and users can plug in their own implementation
* Cons: we might not be able to use two different FPGA devices in a single node

3. Users can override FPGA devices with properties. Basically we only need three things: acl numbers and major/minor device numbers. So a property like {{yarn.nodemanager.resource-plugins.fpga.available-devices = acl0/243:0,acl1/244:0}}
* Pros: we don't rely on semi-structured output
* Cons: manual steps are necessary for successful configuration. Users have to know what they're doing.

4. Try to determine devices from {{/sys/class/fpga}}
* Pros: can be more reliable than using aocl + reading stuff under {{/dev}}
* Cons: what if there are more entries? E.g. for PAC, we have {{/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0}} and {{/sys/class/fpga/intel-fpga-dev.0/intel-fpga-port.0}}.

5. Something else that I'm not aware of :)

[~leftnoteasy], [~tangzhankun], [~snemeth], [~shuzirra] opinions, ideas?

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> ----------------------------------------------------------------
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.1.0
> Reporter: Peter Bacsko
> Priority: Critical
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> --------------------------------------------------------------------
> Device Name:
> acl0
>
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>
> Vendor: Intel Corp
>
> Physical Dev Name Status Information
>
> pac_a10_f200000 Passed PAC Arria 10 Platform (pac_a10_f200000)
> PCIe 08:00.0
> FPGA temperature = 79 degrees C.
>
> DIAGNOSTIC_PASSED
> --------------------------------------------------------------------
>
> Call "aocl diagnose <device-names>" to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin: Using FPGA vendor plugin: org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer: Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule: Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl: FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: Failed to get major-minor number from reading /dev/pac_a10_f300000
> 2019-01-25 06:46:03,252 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the
> "Physical Dev Name", but this is wrong. For example, it thinks that the
> device file is {{/dev/pac_a10_f300000}} which is not the case, the actual
> file is {{/dev/intel-fpga-port.0}}.
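
To make option 3 from the comment more concrete, here is a minimal sketch of how a value such as {{acl0/243:0,acl1/244:0}} could be parsed into an acl name plus major/minor numbers. The property itself is only a proposal in this thread, and the class and field names below ({{FpgaDeviceOverrideParser}}, {{Device}}) are hypothetical, not part of the existing plugin. The entry format is assumed to be exactly {{<aclN>/<major>:<minor>}}, separated by commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: parses the device override
// value proposed in option 3, e.g. "acl0/243:0,acl1/244:0".
final class FpgaDeviceOverrideParser {

  // One configured device: acl alias plus major/minor device numbers.
  static final class Device {
    final String aliasName;
    final int major;
    final int minor;

    Device(String aliasName, int major, int minor) {
      this.aliasName = aliasName;
      this.major = major;
      this.minor = minor;
    }
  }

  // Assumed entry format: <aclN>/<major>:<minor>
  private static final Pattern ENTRY =
      Pattern.compile("(acl\\d+)/(\\d+):(\\d+)");

  private FpgaDeviceOverrideParser() {
  }

  // Returns the configured devices in order; an unset or blank value
  // yields an empty list, a malformed entry is rejected outright.
  static List<Device> parse(String value) {
    List<Device> devices = new ArrayList<>();
    if (value == null || value.trim().isEmpty()) {
      return devices;
    }
    for (String entry : value.split(",")) {
      Matcher m = ENTRY.matcher(entry.trim());
      if (!m.matches()) {
        throw new IllegalArgumentException(
            "Malformed FPGA device entry: '" + entry + "'");
      }
      devices.add(new Device(
          m.group(1),
          Integer.parseInt(m.group(2)),
          Integer.parseInt(m.group(3))));
    }
    return devices;
  }
}
```

Calling {{FpgaDeviceOverrideParser.parse("acl0/243:0,acl1/244:0")}} would give two entries, {{acl0}} with 243:0 and {{acl1}} with 244:0. Rejecting malformed entries instead of skipping them is a deliberate choice in this sketch: since the user is configuring devices by hand, a typo should fail loudly at startup rather than silently hide a card.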