It has been a few days, so I am sending this again in case someone has
seen this problem or has a suggestion about where to look, and why these
settings might not be taking effect with 5.2 when they did with 5.1.

On Mon, Aug 4, 2008 at 2:00 PM, Rob Lines <[EMAIL PROTECTED]> wrote:
> We were previously running 5.1 x86_64 and recently updated to 5.2
> using yum.  Under 5.1 we were having problems running jobs under
> torque, and the solution was to add the following items to the files
> noted:
> "*          soft    memlock         unlimited" in /etc/security/limits.conf
> "session    required" in /etc/pam.d/{rsh,sshd}
> This changed the max locked memory setting in ulimit as follows:
> Before the change:
> rsh nodeX ulimit -a
> gave us
> max locked memory       (kbytes, -l) 32
> After the change:
> rsh nodeX ulimit -a
> max locked memory       (kbytes, -l) 16505400
> The nodes have 16 GB of memory.
> Now, after the 5.2 update, those files are all unchanged. We have not
> yet rebooted most of the nodes because of long-running processes, but a
> few nodes have been restarted, and now that jobs are starting to be put
> on them we are back to a max locked memory of 32 KB rather than 16 GB.
> The error we are receiving on those jobs is:
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>    This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>    This will severely limit memory registrations.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> rank 45 in job 1  nodeX_35175   caused collective abort of all ranks
>  exit status of rank 45: return code 1
> rank 44 in job 1  nodeX_35175   caused collective abort of all ranks
>  exit status of rank 44: return code 1
> The full output of:
> rsh nodeX ulimit -a
> connect to address x.x.x.x port 544: Connection refused
> Trying krb4 rsh...
> connect to address x.x.x.x port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 135168
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 135168
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> Any ideas, suggestions or items I could roll back would be
> appreciated.  I looked through the list of packages that were updated
> and the only one that I could see that was related was pam.  ssh and
> rsh were not updated.
> Thank you,
> Rob
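
A note for anyone comparing the numbers in the quoted message: ulimit -l
reports the limit in kbytes while libibverbs reports it in bytes, so the
"32" from ulimit and the "32768 bytes" in the warning are the same limit.
Below is a minimal sketch of that conversion. The function name
memlock_to_bytes is made up for illustration and is not part of any
package; it assumes the shell reports -l in units of 1024 bytes, as the
ulimit output above indicates.

```shell
# Convert a "ulimit -l" value to bytes.
# Input is either a number of kbytes or the word "unlimited".
# memlock_to_bytes is a hypothetical helper name, not an existing tool.
memlock_to_bytes() {
    if [ "$1" = "unlimited" ]; then
        echo "unlimited"
    else
        echo $(( $1 * 1024 ))
    fi
}

# Example use on a node (the value a login shell actually gets):
#   memlock_to_bytes "$(ulimit -l)"
```

Running the example line through rsh or ssh on a restarted node and on a
node that has not been rebooted should show whether the session started
there really inherits the limits.conf setting.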
CentOS mailing list
