Re: [slurm-users] Seff error with Slurm-18.08.1
Oh, thanks Paddy for your patch, it works very well!!

Miguel A. Sánchez Gómez
System Administrator
Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
Barcelona Biomedical Research Park (office 4.80)
Doctor Aiguader 88 | 08003 Barcelona (Spain)
Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
e-mail: miguelangel.sanc...@upf.edu

On 11/09/2018 07:59 AM, Marcus Wagner wrote:
> Thanks Paddy,
>
> just something learned again ;)
>
> Best
> Marcus
>
> On 11/08/2018 05:07 PM, Paddy Doyle wrote:
>> Hi all,
>>
>> It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} value.
>>
>> I've submitted a bug report and patch:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=6004
>>
>> The minimal changes needed would be in the attached seff.patch.
>>
>> Hope that helps,
>>
>> Paddy
Re: [slurm-users] Seff error with Slurm-18.08.1
Hi all,

It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} value.

I've submitted a bug report and patch:

https://bugs.schedmd.com/show_bug.cgi?id=6004

The minimal changes needed would be in the attached seff.patch.

Hope that helps,

Paddy

On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote:
> Hi Miguel,
>
> this is because SchedMD changed the stats field. There exists no more rss_max, cmp. line 225 of seff.
> You need to evaluate the field stats{tres_usage_in_max}, and there the value after '2=', but this is the memory value in bytes instead of kbytes, so this should be divided by 1024 additionally.
>
> Best
> Marcus
Re: [slurm-users] Seff error with Slurm-18.08.1
Hi Miguel,

this is because SchedMD changed the stats field. The rss_max field no longer exists (cf. line 225 of seff). You need to evaluate the field stats{tres_usage_in_max} instead and take the value after '2='. That value is the memory in bytes rather than kbytes, so it additionally has to be divided by 1024.

Best
Marcus

On 11/08/2018 11:06 AM, Miguel A. Sánchez wrote:
> Yesterday I installed Slurm-18.08.3 on the controller machine to check whether the seff command works correctly with this latest release. The behavior has improved, but I still receive an error message:
>
> # /usr/local/slurm-18.08.3/bin/seff 1694112
> *Use of uninitialized value $lmem in numeric lt (<) at /usr/local/slurm-18.08.3/bin/seff line 130, line 624.*
> [...]

-- 
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
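The manual workaround Marcus describes (take the value after '2=' in stats{tres_usage_in_max} and divide it by 1024) can be sketched as follows. This is only an illustration in Python, not the Perl code of seff and not Paddy's patch, which uses the API instead of parsing. The function name `max_rss_kb` and the sample string are made up for this example, and the comma-separated `id=value` layout of the field is an assumption.

```python
def max_rss_kb(tres_usage_in_max: str) -> float:
    """Return the memory entry (the value after '2=') converted from bytes to KB.

    Assumes the field is a comma-separated list of 'id=value' pairs.
    Returns 0.0 when no '2=' entry is present.
    """
    for item in tres_usage_in_max.split(","):
        tres_id, sep, value = item.partition("=")
        if sep and tres_id.strip() == "2":
            # Per the thread: the value is in bytes, the old rss_max was in KB.
            return int(value) / 1024
    return 0.0


# Hypothetical sample value, not taken from a real job record.
sample = "1=5973,2=2621440000,6=0"
print(max_rss_kb(sample))
```

Returning 0.0 for a missing entry mirrors the "0.00 MB" that seff printed in the thread; a real fix might prefer to report the value as unknown.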
Re: [slurm-users] Seff error with Slurm-18.08.1
Hi, and thanks for all your answers; sorry for the delay in replying. Yesterday I installed Slurm-18.08.3 on the controller machine to check whether the seff command works correctly with this latest release. The behavior has improved, but I still receive an error message:

# /usr/local/slurm-18.08.3/bin/seff 1694112
*Use of uninitialized value $lmem in numeric lt (<) at /usr/local/slurm-18.08.3/bin/seff line 130, line 624.*
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
[root@hydra ~]#

Because of this problem, every job is reported with 0.00 MB of memory utilized.

With slurm-17.11.1 it works fine:

# /usr/local/slurm-17.11.0/bin/seff 1694112
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 2.44 GB
Memory Efficiency: 62.57% of 3.91 GB
[root@hydra bin]#

Miguel A. Sánchez Gómez
System Administrator
Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
Barcelona Biomedical Research Park (office 4.80)
Doctor Aiguader 88 | 08003 Barcelona (Spain)
Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
e-mail: miguelangel.sanc...@upf.edu

On 11/06/2018 06:30 PM, Mike Cammilleri wrote:
> Thanks for this. We'll try the workaround script. It is not mission-critical, but our users have gotten accustomed to seeing these metrics at the end of each run and it's nice to have. We are currently doing this in a test VM environment, so by the time we actually upgrade the cluster perhaps the fix will be available.
>
> Mike Cammilleri
> Systems Administrator
> Department of Statistics | UW-Madison
> 1300 University Ave | Room 1280
> 608-263-6673 | mi...@stat.wisc.edu
>
> *From:* slurm-users on behalf of Chris Samuel
> *Sent:* Tuesday, November 6, 2018 5:03 AM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1
>
> [...]
Re: [slurm-users] Seff error with Slurm-18.08.1
On 6/11/18 7:49 pm, Baker D.J. wrote:

> The good news is that I am assured by SchedMD that the bug has been fixed in v18.08.3.

Looks like it's fixed in this commit.

commit 3d85c8f9240542d9e6dfb727244e75e449430aac
Author: Danny Auble
Date: Wed Oct 24 14:10:12 2018 -0600

    Handle symbol resolution errors in the 18.08 slurmdbd.

    Caused by b1ff43429f6426c when moving the slurmdbd agent internals.

    Bug 5882.

> Having said that we will probably live with this issue rather than disrupt users with another upgrade so soon.

An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though, should it? We just flip a symlink and the users see the new binaries, libraries, etc. immediately. We can then restart the daemons as and when we need to (in the right order of course: slurmdbd, slurmctld and then the slurmds).

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
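The procedure Chris outlines (flip a symlink, then restart the daemons in order) could look roughly like the sketch below. It rests on several assumptions that are not in the thread: that each Slurm version lives in its own directory with a generic symlink pointing at the active one, and that the daemons run as systemd units named slurmdbd, slurmctld and slurmd. The helper names `flip_symlink` and `restart_commands` are hypothetical, and the script only prints the restart commands rather than running them.

```python
import os

# Restart order given in the thread: slurmdbd, slurmctld, then the slurmds.
RESTART_ORDER = ("slurmdbd", "slurmctld", "slurmd")


def flip_symlink(link_path: str, new_target: str) -> None:
    """Repoint link_path at new_target without a window where the link is missing.

    A temporary symlink is created next to the real one and then renamed over
    it; on POSIX systems the rename replaces the old link in a single step.
    """
    tmp_link = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)


def restart_commands(order=RESTART_ORDER):
    """Build the restart commands in order (assumes systemd unit names)."""
    return [f"systemctl restart {name}" for name in order]


# Only print the commands; running them is left to the administrator.
for command in restart_commands():
    print(command)
```

A call such as `flip_symlink("/usr/local/slurm", "/usr/local/slurm-18.08.3")` would be the symlink step, with both paths being examples only. slurmdbd and slurmctld run on the management hosts while slurmd runs on every compute node, so in practice the last command would be issued across the cluster.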
Re: [slurm-users] Seff error with Slurm-18.08.1
I'm also interested in this issue since I came across the same error today. We built Slurm-18.08.1 with the contribs packages on Ubuntu Bionic, and seff also complains:

$ /s/slurm/bin/seff 36
perl: error: plugin_load_from_file: dlopen(/s/slurm/lib/slurm/accounting_storage_slurmdbd.so): /s/slurm/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
perl: error: plugin_load_from_file: dlopen(/s/slurm/lib/slurm/accounting_storage_slurmdbd.so): /s/slurm/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
Job not found.

Mike Cammilleri
Systems Administrator
Department of Statistics | UW-Madison
1300 University Ave | Room 1280
608-263-6673 | mi...@stat.wisc.edu

From: slurm-users on behalf of Miguel A. Sánchez
Sent: Tuesday, October 23, 2018 10:26 AM
To: slurm-us...@schedmd.com
Subject: [slurm-users] Seff error with Slurm-18.08.1

Hi all,

I have updated my Slurm from version 17.11.0 to 18.08.1. With the previous version, 17.11.0, the seff tool worked fine, but with 18.08.1 I receive the following error message when I try to run it:

# ./seff
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
Job not found.
#

Both Slurm installations have been compiled from source on the same computer, but only the seff that was compiled with version 17.11.0 works. To compile the seff tool, from the Slurm source tree:

cd contribs
make
make install

I think the problem is in the perlapi. Could it be a bug? Any idea how I can fix this problem?

Thanks a lot.

-- 
Miguel A. Sánchez Gómez
System Administrator
Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
Barcelona Biomedical Research Park (office 4.80)
Doctor Aiguader 88 | 08003 Barcelona (Spain)
Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
e-mail: miguelangel.sanc...@upf.edu