[slurm-dev] Re: fix of segfault in srun

2015-10-07 Thread David Bigagli
Fixed in 15.08.2. commit 30a5d6778fc86f8799cefc4fbea4f9ae7eac8d92 Author: Hongjia Cao Date: Wed Oct 7 15:05:24 2015 +0200 Thanks for your contribution. On 10/07/2015 12:15 PM, Hongjia Cao wrote: attached. -- Thanks, /David/Bigagli da...@schedmd.com

[slurm-dev] Re: scontrol show conf crashes slurmctld 14.11.9

2015-09-08 Thread David Bigagli
Can you show us the stack using gdb ? Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015, Washington D.C. http://slurm.schedmd.com/slurm_ug_agenda.html > On 08 Sep 2015, at 16:45, Mar

[slurm-dev] Re: Limiting count of array tasks started during backfill

2015-09-07 Thread David Bigagli
y to 4. The minimum index value is 0. The maximum value is one less than the configuration parameter MaxArraySize. Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015,

[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-07 Thread David Bigagli
Those pmi2 files are the server side of the pmi2 protocol implemented in slurmstepd; they are always installed. It is the client side that gets installed from the contribs directory. Thanks /David/Bigagli da...@schedmd.com

[slurm-dev] Re: Unstable communication

2015-09-03 Thread David Bigagli
/David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015, Washington D.C. http://slurm.schedmd.com/slurm_ug_agenda.html > On 03 Sep 2015, at 00:02, Ulf Markwardt wrote: > > find an address, check slurm.conf

[slurm-dev] Re: slurm questions - limit paging?

2015-08-12 Thread David Bigagli
Is there a way to limit paging outside of Slurm? There are memory limits in Slurm but no paging limit. There is a backup controller in Slurm, you can read about it here: http://slurm.schedmd.com/slurm.conf.html Thanks /David/Bigagli da...@schedmd.com

[slurm-dev] Re: sacct vs sstat

2015-07-30 Thread David Bigagli
available when they start. Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16 September 2015, Washington D.C. http://slurm.schedmd.com/slurm_ug_agenda.html > On 30 Jul 2015, at 16:27, Yair Yarom wr

[slurm-dev] Re: race condition in stepd i/o shutdown

2015-07-27 Thread David Bigagli
I think that Hydra should keep 0,1,2 open and dup them to /dev/null so that any children’s file descriptors will be greater than 2. This is the standard Unix way. Thanks /David/Bigagli da...@schedmd.com === Slurm User Group Meeting, 15-16
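
The idea above can be sketched in a few lines. This is only an illustration, not code from Hydra or Slurm; the helper name ensure_std_fds_open is made up for the example. It relies on the POSIX rule that open() returns the lowest free descriptor, so opening /dev/null repeatedly fills any closed slot among 0, 1 and 2 first.

```python
import os

def ensure_std_fds_open():
    """Point any closed descriptor among 0, 1, 2 at /dev/null.

    Hypothetical helper for illustration. Returns the list of
    descriptors that had to be filled in.
    """
    filled = []
    while True:
        # open() always returns the lowest unused descriptor number.
        fd = os.open(os.devnull, os.O_RDWR)
        if fd > 2:
            # 0, 1 and 2 are all occupied now; drop the extra descriptor.
            os.close(fd)
            break
        # Python opens descriptors close-on-exec by default; make this one
        # inheritable so child processes also see 0-2 occupied.
        os.set_inheritable(fd, True)
        filled.append(fd)
    return filled

if __name__ == "__main__":
    print("descriptors filled with /dev/null:", ensure_std_fds_open())
```

After this runs, any descriptor the process or its children open later gets a number greater than 2, so output written to 1 or 2 cannot land in an unrelated file or socket.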

[slurm-dev] Re: PATCH: fix comparison of env-var option value in srun/sbatch/salloc

2015-07-20 Thread David Bigagli
Committed to master branch. Thanks for your contribution. commit 2678533ade11852b155126197430ed66b8c09f26 Author: Hongjia Cao Date: Mon Jul 20 04:47:00 2015 -0700 Fix comparison of env-var option value in srun/sbatch/salloc David Bigagli da...@schedmd.com

[slurm-dev] Re: module environment

2015-06-29 Thread David Bigagli
module is not a command in /bin. To make it work you have to source the module startup file in your script, for example: . /usr/local/Modules/3.2.10/init/bash then you can use the module file. > On Jun 29, 2015, at 4:43 PM, Antonia Mey wrote: > > Dear all, > > I may have a very basic quest

[slurm-dev] Re: Segfault when using accounting_storage/filetxt in 15.08.0-0pre4

2015-06-04 Thread David Bigagli
has anything to do with any of the changes I made. Is this a known problem with some kind of workaround, or should I file a bug on it? Eric -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] RE: scontrol suspen only allowd to slurm user and root?

2015-05-21 Thread David Bigagli
uch means to me that job suspension (via scontrol suspend) is allowed only to root and to SlurmUser Is that intentional? -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm and docker/containers

2015-05-19 Thread David Bigagli
on running on all of your compute nodes, and provided users can access the docker socket/port, they can submit jobs that call "docker run", can't they? Cheers, -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Inadvertent access to all qos'

2015-05-14 Thread David Bigagli
I go and create a new qos, all users instantly can utilize this qos. This is very strange, I wonder if some setting has been munged in the database somewhere mistakenly? Any ideas? Thanks. Best, Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Intel MPI, perhost, and SLURM: Can I override SLURM?

2015-04-30 Thread David Bigagli
since you brought it up. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm and PMI2

2015-04-27 Thread David Bigagli
em but to search for the library name looks like a reasonable start to me. I would hope you can help me with this. Thanks, Ulf -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm and PMI2

2015-04-27 Thread David Bigagli
ible for the problem but to search for the library name looks like a reasonable start to me. I would hope you can help me with this. Thanks, Ulf -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread David Bigagli
should fix both of these issues. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi <mailto:janne.blomqv...@aalto.fi> -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: What is SPANK logging function slurm_debug?

2015-04-17 Thread David Bigagli
-- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Question about prologging

2015-04-14 Thread David Bigagli
es, despite the file actually being there. I've even set extended ACL's on the directory so that the SlurmUser can see all of the files (sudo -u slurm ls -lR failed with permission denied). Could anyone tell me why the slurm_script file cannot be read via prolog? Thank you! John DeSantis

[slurm-dev] Re: Problems running job

2015-03-30 Thread David Bigagli
on some form of device I/O. I know some people have reported strange interactions between Slurm being on an NFSv4 mount (NFSv3 is fine). Good luck! Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-25 Thread David Bigagli
The slurm.spec file decides whether to install the init.d scripts or the systemd stuff. On 03/24/2015 07:24 PM, Fred Liu wrote: -Original Message- From: David Bigagli [mailto:da...@schedmd.com] Sent: Wednesday, March 25, 2015 1:19 To: slurm-dev Subject: [slurm-dev] Re: successful systemd

[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-24 Thread David Bigagli
all layouts are now unloaded. Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon. -- Subject: Unit slurmctld.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: possible bug in srun --unbuffered option

2015-03-09 Thread David Bigagli
Yes it works with or without --unbuffered. I don't think data are buffered inside of Slurm. On 03/09/2015 10:15 AM, Lipari, Don wrote: -Original Message- From: David Bigagli [mailto:da...@schedmd.com] Sent: Thursday, March 05, 2015 10:49 AM To: slurm-dev Subject: [slurm-de

[slurm-dev] Re: possible bug in srun --unbuffered option

2015-03-05 Thread David Bigagli
deed broken on Slurm 14.11. We just took our first cluster running 14.11 into production this week, so probably not many users have run into this yet. Regards, Pär Lindfors, NSC -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Two problems with 14.11.4

2015-03-04 Thread David Bigagli
(after which it runs fine). Problem number 2: 'scontrol show jobs' shows jobs in state RUNNING that don't actually appear to exist. Some of these are days old. What might be going on here? -- Jon Nelson Dyn / Senior Software Engineer p. +1 (603) 263-8029 -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
where it couldn't resolve user id's. So right after the job tried to launch it failed and requeued. We just let the scheduler do what it will when it lists Node_fail. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set your node down? If I run a job and

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
I'm just trying to figure out why it sent them into a held state as opposed to just simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case do the prolog fa

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli
by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon- -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Two patches for jobacct_gather.

2015-02-06 Thread David Bigagli
for us that have working cgroups memory limits. Best regards, Magnus Jonsson -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Small bug in scontrol output

2015-01-28 Thread David Bigagli
Gres=(null) Reservation=(null) Shared=0 Contiguous=0 Licenses=(null) Network=(null) Command=./test.sh WorkDir=/home/adm17 StdErr=/home/adm17/test-e%j.txt <- here %j is not expanded StdIn=/dev/null StdOut=/home/adm17/test-o12032.txt Regards, Uwe

[slurm-dev] Re: GresTypes typo in docs

2015-01-06 Thread David Bigagli
ght out by a typo on http://slurm.schedmd.com/gres.html where the example has GresType=gpu,bandwith rather than GresTypes=... Could you please fix the doc! BTW. Slurm was quite ungracious about having that bad entry in slurm.conf Regards, Gareth -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: QOS reporting question

2014-12-18 Thread David Bigagli
what amount of time within a QOS. sacct can give me information on an account level but I can't seem to get it to report on a QOS level on a user by user bases. Thanks Jackie -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: _slurm_cgroup_destroy message?

2014-11-18 Thread David Bigagli
partment for Research Computing, University of Oslo -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Programmatically submit a job

2014-11-07 Thread David Bigagli
ngly typed interface, and it would be nice if I could use an interface with stronger types. Cheers, Walter Landry -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path

2014-11-07 Thread David Bigagli
I agree. Done in commit c8f34560c87cfbbf. On 11/06/2014 06:46 PM, Christopher Samuel wrote: On 07/11/14 11:53, David Bigagli wrote: Hi, Hiya David, it used to be logged at debug level in 2.6 and now it is an error. This seems to be an issue with cgroups which does not allow that path

[slurm-dev] Re: slurmstepd: _slurm_cgroup_destroy: problem deleting step cgroup path

2014-11-06 Thread David Bigagli
No I can't use srun directly as we get poor scaling, the next thing in the list (after SC14) is to migrate to Open-MPI 1.8.4 which is due out shortly which should address this. cheers, Chris -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: make dist patch

2014-11-03 Thread David Bigagli
\ etc.33.1.3/topology.conf\ - etc.33.1.4/slurm.conf \ etc.33.1.4/testcases\ etc.33.1.4/topology.conf\ test34.1\ -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Database incompletely configured

2014-10-31 Thread David Bigagli
On 10/31/2014 12:49 PM, David Bigagli wrote: The database is created by the slurmdbd daemon. Have you granted access to the database to the slurm user? Yes, I did a grant all on slurm_acct_db.* TO 'slurm'@'localhost'; -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Database incompletely configured

2014-10-31 Thread David Bigagli
s. How can I for a rebuild of the database? I have been grepping through the source tree, but I haven't stumbled on the script that creates the tables and columns needed. ~Charles~ -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Documentation mismatch: man pages / html

2014-10-20 Thread David Bigagli
## Does anyone know the correct name and usage of this parameter? Thank you. Regards, Uwe -- -- Carles Fenoy -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Determining hostname of srun given job_id/step_id

2014-10-10 Thread David Bigagli
I think the question was about the submission node, the node where the srun/sbatch was executed from. On 10/10/2014 04:14 PM, Franco Broi wrote: Are we talking about alloc_node? You can retrieve it using the perl api. On 11 Oct 2014 06:53, David Bigagli wrote: Hi, the information is in

[slurm-dev] Re: Determining hostname of srun given job_id/step_id

2014-10-10 Thread David Bigagli
process resides given a known job_id/step_id? Thanks, Andrew -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: jobfilter plugin

2014-09-24 Thread David Bigagli
files [2014-09-24T13:47:52.151] error: cannot find job_submit plugin for job_submit/defaults [2014-09-24T13:47:52.151] error: cannot create job_submit context for job_submit/defaults [2014-09-24T13:47:52.151] fatal: failed to initialize job_submit plugin On Tue, 16 Sep 2014, David Bigagli wrote:

[slurm-dev] Re: Bug (?) and Bugfix in update reservation

2014-09-22 Thread David Bigagli
http://twitter.com/bull_de ** Bull Firmenprofil bei XING: https://www.xing.com/companies/bullgmbh -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: jobfilter plugin

2014-09-16 Thread David Bigagli
e impression from the docu it was included in slurm. The slurm.conf line reads: JobSubmitPlugins=default in compliance with the documentation. Thanks for your help Eva -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: sacctmgr

2014-09-16 Thread David Bigagli
anks Eva -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: make eio_shutdown_time per eio handle

2014-08-28 Thread David Bigagli
Code committed to 14.03.8. On 08/28/2014 05:24 AM, Hongjia Cao wrote: The patch changes the global eio_shutdown_time to a field in the eio handle to allow multiple eio handles in one process. This will be convenient for a process to launch multiple job steps. -- Thanks, /David/Bigagli

[slurm-dev] Re: cgroup freezer throwing "Device or resource busy" upon job cancel or kill - 14.03.6

2014-08-18 Thread David Bigagli
Unfortunately the article refers to the memory sub system which gets removed without problem. The issue happens on the freezer, however it is just an error message without consequences. On 08/13/2014 04:16 PM, Kilian Cavalotti wrote: On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli wrote

[slurm-dev] Re: cgroup freezer throwing "Device or resource busy" upon job cancel or kill - 14.03.6

2014-08-13 Thread David Bigagli
Interesting indeed. Let me have a look at it and experiment with it a bit. On 08/13/2014 04:16 PM, Kilian Cavalotti wrote: On Wed, Aug 13, 2014 at 10:00 AM, David Bigagli wrote: For some reason at the first attempt rmdir(2) returns EBUSY. Would writing to memory.force_empty before

[slurm-dev] Re: cgroup freezer throwing "Device or resource busy" upon job cancel or kill - 14.03.6

2014-08-13 Thread David Bigagli
same kernel. Cheers, -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: How to interpret sacct output

2014-08-11 Thread David Bigagli
sumption of the batch step higher than either of the two job steps? Thank you, Robert On 8/5/2014 1:41 PM, David Bigagli wrote: Yes that is correct. The first entry is the allocation which has 2 cpus, -n 2 was specified, the second entry is the batch step that run for 81 seconds, so the tota

[slurm-dev] Re: How to interpret sacct output

2014-08-05 Thread David Bigagli
navailable to anyone else, Slurm is just showing that he effectively used 162 cpu seconds (81 x 2). Thank you, Robert On 8/5/2014 9:53 AM, David Bigagli wrote: Hi Robert, the first line is the allocation and the second the batch step, the batch step runs on one cpu. I am not sure what are
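
The arithmetic in that reply, written out as a tiny sketch. The function name is invented for the example and is not part of any Slurm tool; it only shows that the allocation is charged for every allocated CPU for the whole elapsed time, whether or not the batch step used them all.

```python
def allocated_cpu_seconds(alloc_cpus, elapsed_seconds):
    """CPU seconds charged to an allocation: CPUs held times wall time."""
    return alloc_cpus * elapsed_seconds

# Figures from the thread: -n 2 gives 2 CPUs, and the job ran for 81 seconds.
print(allocated_cpu_seconds(2, 81))  # 162
```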

[slurm-dev] Re: How to interpret sacct output

2014-08-05 Thread David Bigagli
nc. 2880 Zanker Road Suite 203 San Jose, CA 95134 Tel: +1 408 300 9448 Fax: +1 408 715 0102 www.BrightComputing.com <http://www.brightcomputing.com> -- Thanks, /David/Bigagli Slurm User Group Meeting September 23-24, Lugano, Switzerland Find out more http://slurm.schedmd.com/slurm_ug_agenda.html www.schedmd.com

[slurm-dev] Re: Interactive job array

2014-07-17 Thread David Bigagli
Julien Collas wrote: Hi, How would you do to run an interactive job array ? By interactive, I mean that the command only exit at the end of the array ? Regards, Julien -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Segfault at slutmctld-14.03.3-2 start on CentOS 5.10

2014-05-28 Thread David Bigagli
= 0, db_fail = 0xf0b2ff, db_resumed = 0xc2} dir_name = Remembering some previous problems I suspect that some uninitialised variable in some structure (which represents some omitted option in slurmd.conf) may cause such effect. Could someone please give me some hints? Thanks! -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: SlurmdSpoolDir change without impacting jobs

2014-05-23 Thread David Bigagli
, /David/Bigagli www.schedmd.com

[slurm-dev] Re: srun interactive failure after upgrade

2014-05-21 Thread David Bigagli
ED MESSAGE- Hash: SHA1 On 21/05/14 05:56, David Bigagli wrote: Yes, it is a change in behaviour. There was a fix in the I/O module that unfortunately introduced this side effect. Oh dear, that's going to be a fun bit of user re-education if we go to 14.x. Hopefully we can abstract it o

[slurm-dev] Re: srun interactive failure after upgrade

2014-05-20 Thread David Bigagli
on its own and not within an salloc, is no longer supported and expected to fail? Thanks Martins On 5/20/14 1:23 PM, David Bigagli wrote: In 14.03 you should use the SallocDefaultCommand as documented in http://slurm.schedmd.com/slurm.conf.html to srun with the --pty option. On 05/19/2014 10

[slurm-dev] Re: PMI2 related error

2014-05-20 Thread David Bigagli
яков Артем Юрьевич Best regards, Artem Y. Polyakov -- С Уважением, Поляков Артем Юрьевич Best regards, Artem Y. Polyakov -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: srun interactive failure after upgrade

2014-05-20 Thread David Bigagli
some debugging on and we are receiving task exit from the tasks on the secondary node right after startup. Let me know what other debugging output might be useful here. Thanks, Mike Robbert -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Fwd: slurm installation

2014-04-25 Thread David Bigagli
WCKey=yes # # Database info StorageType=accounting_storage/mysql #StorageHost=localhost #StoragePort=1234 StoragePass=slurm_pass StorageUser=slurm StorageLoc=slurm_acct_db -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working

2014-04-11 Thread David Bigagli
Errata corrige. The core file is in the log directory. On 04/11/2014 12:08 PM, David Bigagli wrote: Hi, this Slurm bug has been fixed and it will be available in 14.03.1 which will be released soon. Otherwise it is available in the HEAD. You should find a core file of slurmstepd in the

[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working

2014-04-11 Thread David Bigagli
tify_io_failure: aborting, io error with slurmstepd on node 0 srun: Job step aborted: Waiting up to 2 seconds for job step to finish. srun: error: Timed out waiting for job step to complete Launching with salloc/sbatch works. - Anthony -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Website correction

2014-04-11 Thread David Bigagli
cores on a node only. LICENSE_ONLY -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Patch for squeue submit time

2014-04-02 Thread David Bigagli
output. There are very few format chars left so I just picked a free one. Thanks Martins -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Gres GPU Problem with new slurm cluster

2014-03-31 Thread David Bigagli
nks, -J -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: performance improvement for hostlist

2014-03-21 Thread David Bigagli
"xmalloc usecs: %lld", delta_t); START_TIMER; for (i = 0; i < COUNT; i++) { ptr = malloc(MAX_RANGES * sizeof(struct _range)); free(ptr); } END_TIMER; info("malloc usecs: %lld", delta_t); return 0; }

[slurm-dev] Re: performance improvement for hostlist

2014-03-20 Thread David Bigagli
multithread patch (commit 17449c066af69441b741110ef51fc2f534272871) does not help. Replacing hostlist_push with hostlist_push_host (commit 1b0b135f9579e253ddd5bf680d2ea70ad12f9bda) fixes the problem of sinfo, but I think the root cause is in xmalloc. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: performance improvement for hostlist

2014-03-20 Thread David Bigagli
hostlist_push with hostlist_push_host (commit 1b0b135f9579e253ddd5bf680d2ea70ad12f9bda) fixes the problem of sinfo, but I think the root cause is in xmalloc. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Slurm upgrade documentation

2014-03-18 Thread David Bigagli
have received this message in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this message is not permitted and may be unlawful. -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: fix sjstat >1TB memory printout

2014-03-18 Thread David Bigagli
nks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: SLURM or Slurm?

2014-03-11 Thread David Bigagli
but, in fact, this is just an abbreviation and in the repository I see "SLURM" is used (for example in README or in COPYING). Thanks, Taras -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: SegFault On FreeBSD with AllowGroups Specified in Partition

2014-02-23 Thread David Bigagli
mat_access #PartitionName=lion Nodes=lion-[1-48] Default=NO MaxTime=2880 State=DOWN AllowGroups=lion.che_cluster_access -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: PMI library

2014-02-14 Thread David Bigagli
question about SLURM's libpmi. I am currently adopting DMTCP project (checkpointer) to support SLURM. Currently I am working on PMI support. And looking into _kvs_put function I have the following question: for (i=0; i -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: slurm_terminate_job versus slurm_kill_job

2014-01-29 Thread David Bigagli
s neater (or at least more thorough), but kill is getting the job done. Is there a reason that "slurm_kill_job" shouldn't be used? Thanks Michael -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Patch to contribs/torque/pbsnodes.pl to work more like TORQUE pbsnodes command

2013-11-12 Thread David Bigagli
offline (-o), reset (-r), clear (-c), and -N (set note/reason) command line options * adds the -l (brief list) and -n (list with notes) command line options * format the output in the default verbose list mode more like TORQUE's pbsnodes does --Troy -- Thanks, /David/Bi

[slurm-dev] Re: Job Dependencies on Arrays

2013-10-28 Thread David Bigagli
? I want to make sure that all the jobs in the array have completed prior to the job I want to run runs. Would you reference the primary job ID? Or would you reference the entire span of jobs namely JobID_[1-100]? -Paul Edmon- -- Thanks, /David/Bigagli www.schedmd.com

[slurm-dev] Re: Interactive Jobs Not Launching Under High Load

2013-10-25 Thread David Bigagli
hanks, /David/Bigagli www.schedmd.com voice: +1 415 320 2776

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-07 Thread David Bigagli
This is the link to the commit. https://github.com/SchedMD/slurm/commit/6ef96d5aae739197e5512ea50ea55eef46f1975c On 10/07/2013 10:26 AM, David Bigagli wrote: Absolutely. It is fixed now. On 10/07/2013 09:00 AM, Ralph Castain wrote: Oops!! You put the fix in the wrong place, I'm a

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-07 Thread David Bigagli
er on * back again to send out the right protocol * message size. */ remaining_len -= PMII_COMMANDLEN_SIZE; That change needed to go *before* the highlighted snprintf. I'm afraid 2.6.3 continues to segfault :-( On Oct 3, 2013, at 12:12 PM, Ralph Castain wrote: Cool - thanks David

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
Fixed. I chose the method proposed by Michael, subtract first, add later. :-) On 10/03/2013 10:29 AM, Ralph Castain wrote: On Oct 3, 2013, at 10:16 AM, David Bigagli wrote: I am not saying that remaining_len is correct or that mpich is bugless :-) I am only saying that decrementing

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
ed to at least change the snprintf command to reflect the reduced size of the "c" buffer. On Oct 3, 2013, at 9:44 AM, David Bigagli wrote: Hi, I don't know the details of the segfault but the code in question is correct. If you decrease the length then the file cmdlen: cmdlen

[slurm-dev] Re: Bug in pmi2_api.c

2013-10-03 Thread David Bigagli
Hi, I don't know the details of the segfault but the code in question is correct. If you decrease the length then the file cmdlen: cmdlen = PMII_MAX_COMMAND_LEN - remaining_len; will not be correct and wrong length will be sent to the pmi2 server. This code is taken verbatim from mpich2-

[slurm-dev] Re: Bug in Slurm time=days-hours:minutes parsing?

2013-10-02 Thread David Bigagli
Sounds good. :-) Thanks for the patch it is going to be in Slurm 2.6.3. On 10/02/2013 12:28 AM, Mark Nelson wrote: Hi All, It does look like there is a bug in time_str2secs(): If we give it a time of format: days-0:min, we exit the for loop with days set, but with min set to our hours value
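
For readers following the days-hours:minutes discussion, here is a small sketch of how such time strings can be parsed so that the fields after the dash are always read as hours first. It is not the time_str2secs() code from Slurm and not the patch mentioned above; the function name is made up, and the accepted forms are the ones listed in the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```python
def time_str_to_secs(text):
    """Convert a Slurm-style time limit string to seconds (illustrative only)."""
    days = 0
    if "-" in text:
        # With a day prefix the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields: %r" % text)
        hours, minutes, seconds = fields + [0] * (3 - len(fields))
    else:
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields: %r" % text)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

if __name__ == "__main__":
    # The case from the report: one day, zero hours, thirty minutes.
    print(time_str_to_secs("1-0:30"))  # 88200
```

Splitting on the dash first keeps the two layouts apart, so a zero hours field cannot shift the minutes value into the hours slot.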

[slurm-dev] Re: Overtime job exit code

2013-09-10 Thread David Bigagli
Hi, this issue has been fixed in the 2.6.2 release. On 09/10/2013 09:01 AM, Michael Gutteridge wrote: We allow jobs to overrun their wall time via "OverTimeLimit". We've noticed that jobs that complete successfully but go over the wall time are reported as having "JobState=TIMEOUT" in the

[slurm-dev] Re: Required node not available (down or drained)

2013-08-26 Thread David Bigagli
This link points to SLURM 2.3 documentation. For more updated versions and the currently released version 2.6.1 you may want to use this documentation: http://slurm.schedmd.com/troubleshoot.html#nodes On 08/26/2013 10:10 AM, Nikita Burtsev wrote: https://computing.llnl.gov/linux/slurm/trou

[slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0

2013-07-25 Thread David Bigagli
Hello, it is a requirement to specify --mpi=pmi2 otherwise the srun will not load the pmi2 library implementing the server side pmi2 functionalities. There was an error in the contribs/pmi2/pmi2_api.c causing the 'no value for req' message, this was the ->1488 remaining_len -= PMII_COM

[slurm-dev] Re: slurm integration with FlexLM license manager

2013-07-02 Thread David Bigagli
ctually checks the licenses out during which interval an external user > checks out licenses unbeknownst to the scheduler, but I suspect they have > done nothing. > If anyone hears of anything different, I, for one, would be happy to know. > > Gary D. Brown > > On Tue, Ju

[slurm-dev] Re: slurm integration with FlexLM license manager

2013-07-02 Thread David Bigagli
Indeed currently there is no integration between Flexlm and SLURM, but some ideas are being passed around what to do about it. I am one of the original designers and developers of Platform License Scheduler. The item 1) you mentioned is certainly the first step but consider even that may not be ea

[slurm-dev] Re: Slurmctld multithreaded?

2013-06-12 Thread David Bigagli
Hi, the gstack command will show you the activities of each thread in the slurmctld process. This is an example: david@prometeo ~>gstack 14432 Thread 8 (Thread 0x7fa9c9190700 (LWP 14433)): #0 0x0035b90acb8d in nanosleep () from /lib64/libc.so.6 #1 0x0035b90aca00 in sleep () from /li

[slurm-dev] prova

2013-05-29 Thread David Bigagli
test */David*

[slurm-dev] Re: Easy Backfilling Plugin for SLURM

2013-04-25 Thread David Bigagli
The available slurm documentation can be found here: http://slurm.schedmd.com */David* On Thu, Apr 25, 2013 at 11:41 AM, David Bigagli wrote: > Hi, slurm.conf has the following parameter as documented in the > slurm.conf man page: > > max_job_bf=# > The

[slurm-dev] Re: Easy Backfilling Plugin for SLURM

2013-04-25 Thread David Bigagli
Hi, slurm.conf has the following parameter as documented in the slurm.conf man page: max_job_bf=# The maximum number of jobs to attempt backfill scheduling for (i.e. the queue depth). Higher values result in more overhead and less responsiveness.

[slurm-dev] Re: slurmdbd dies

2013-03-28 Thread David Bigagli
Try to start it so that it does not daemonize; I think the option is -D, but better check the man page. See if it core dumps. sent from galaxy nexus On Mar 27, 2013 9:31 AM, "Pablo Sanz Mercado" wrote: > > > Hi Alejandro, > > Sorry, the messages we obtain about the "couldn't suspend job"

[slurm-dev] Re: node switching / selection

2013-03-22 Thread David Bigagli
Is it possible the job runs on several nodes, say -N 3, then one node is lost so it ends up running on 2 nodes only? Such a job should have been submitted with --no-kill. /David On Fri, Mar 22, 2013 at 4:06 PM, Michael Colonno wrote: > > Actually did mean node below. The job launched on

[slurm-dev] Re: Memory swapping, and transition delay issues.

2013-03-12 Thread David Bigagli
Hi, the problem of memory over-subscription is discussed in 'man slurm.conf'. Have a look at DefMemPerCPU, DefMemPerNode and the suggested configuration when using CR_CPU_Memory. */David* On Tue, Mar 12, 2013 at 3:15 PM, Joo-Kyung Kim wrote: > Hi, > > I am using SLURM 2.4.0.

[slurm-dev] Re: change select() to poll() in src/common/fd.c

2013-03-12 Thread David Bigagli
This is the way select() works regardless of the version of redhat or any other distribution. The fd_set is a bit array defined in of __FD_SETSIZE which is defined as 1024 in */David* On Tue, Mar 12, 2013 at 11:30 AM, Hongjia Cao wrote: > When launching tasks on about 1000 nodes, I get the f
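
To illustrate the thread subject, a minimal sketch of waiting on a descriptor with poll() instead of select(). It is not the change made to src/common/fd.c; the function name wait_readable is invented for the example. The point is that poll() takes a list of descriptors rather than a fixed-size bit array, so it keeps working when a descriptor number reaches the 1024 limit mentioned above. select.poll is available on Unix only.

```python
import os
import select

def wait_readable(fd, timeout_ms):
    """Return True if fd becomes readable (or hangs up) within timeout_ms."""
    poller = select.poll()
    # Descriptors are registered individually, so there is no fd_set
    # size limit on the descriptor number.
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(
        ev_fd == fd and bool(mask & (select.POLLIN | select.POLLHUP))
        for ev_fd, mask in events
    )

if __name__ == "__main__":
    r, w = os.pipe()
    print(wait_readable(r, 0))      # nothing written yet
    os.write(w, b"x")
    print(wait_readable(r, 1000))   # data is waiting
    os.close(r)
    os.close(w)
```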

[slurm-dev] Re: slurmctld in version 2.5.3 is segfaulting in communication with slurmdbd via munge

2013-03-06 Thread David Bigagli
Have you updated the slurmdb daemon first as described here: http://schedmd.com/slurmdocs/quickstart_admin.html */David* On Wed, Mar 6, 2013 at 5:58 PM, Lennart Karlsson wrote: > > Hi, > > Today I upgraded SLURM from v 2.4.3 to v 2.5.3. > > It seems like a mistake, because slurmctld crashes. A

[slurm-dev] Re: Optimize nodes usage

2013-03-06 Thread David Bigagli
If the jobs run one after another slurm will pick the first host that can run the job. You could group your jobs running them as job steps and also have a look at the --distribution option of sbatch and srun.
