>
> ------------------------------
>
> Message: 6
> Date: Tue, 24 Feb 2015 16:14:20 +0000
> From: Simon Andrews <[email protected]>
> To: Reuti <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] Memory errors even after setting
>         h_vmem
> Message-ID: <d11252bf.1a585%[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
>
>
> On 24/02/2015 15:40, "Reuti" <[email protected]> wrote:
>
> >
> >> Despite this we're getting jobs which are dying due to not being able to
> >> allocate memory.  The nodes on which these failures happen still have
> >> plenty of free memory and the jobs are dying from internal malloc errors,
> >> rather than being killed due to the limit which was imposed by grid
> >> engine.  I suspect that what is happening is that we're getting memory
> >> fragmentation, so that even though there is plenty of memory available,
> >> the programs aren't able to allocate a large enough contiguous block of
> >> memory and are therefore dying.
> >
>

You might also check how much vmem vs. resident memory the jobs/applications
are requesting. I found that some applications (Matlab, I'm looking at
you!!!) request MUCH more vmem than the resident memory they actually use.
We use CentOS 6, so my h_vmem defaults and limits are a lot higher than
what's actually used. The main offender seems to be the Java engine that
Matlab and some other apps use; it allocates way too much vmem.
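
If you want to see the gap for a given job, the numbers come straight out
of /proc. A minimal sketch in C (Linux-only; it prints its own VmSize vs.
VmRSS lines as documented in proc(5); point it at /proc/<pid>/status
instead to inspect a running job):

/* vmem_vs_rss.c: print a process's virtual size (VmSize) and
 * resident set size (VmRSS) from /proc/self/status.  h_vmem polices
 * the former; what the job actually occupies in RAM is the latter.
 * Build: gcc -o vmem_vs_rss vmem_vs_rss.c
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];

    if (!f) {
        perror("fopen /proc/self/status");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmSize:", 7) == 0 ||  /* total virtual size */
            strncmp(line, "VmRSS:", 6) == 0)     /* resident size */
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}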

Mathworks helped me debug and there's a workaround. Here's the relevant
info:

"I have one other thing I’d like you to try.  One of our developers
mentioned that there was a change made to the memory manager in Red Hat 6.
Additional virtual memory is allocated in 64 MB blocks (they call them
“arenas”) in order to eliminate false sharing among multiple cores.  An
environment variable is available to limit the number of arenas,
MALLOC_ARENA_MAX .  Could you set the value of this environment variable to
1 and see if that makes a difference?  I tested it on a Red Hat 6 machine
in our quality lab.  On that machine, when I started MATLAB with -nojvm, it
used 806 MB virtual memory.  I then set the environment variable to 1, and
it used 673 MB.  The difference is close to two 64 MB blocks.  (I used
-nojvm so I didn't have to set up X forwarding on the machine, but you
should try it without –nojvm.)

If this reduces the memory, you could try various values.  With a value of
1, false sharing may occur under certain circumstances (the same behavior
that always occurred in Red Hat 5).  The larger the number, the less likely
false sharing.  I looked online and found a suggested value of 4."
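
For what it's worth, the same cap can also be applied from inside a program
with mallopt() instead of the environment variable. A minimal sketch,
assuming glibc >= 2.10 (the version that introduced per-thread arenas, i.e.
what RHEL/CentOS 6 ships):

/* arena_cap.c: programmatic equivalent of MALLOC_ARENA_MAX=1.
 * mallopt(M_ARENA_MAX, n) limits glibc to n malloc arenas, so virtual
 * size stops growing in 64 MB "arena" steps per allocating thread.
 * Build: gcc -o arena_cap arena_cap.c
 */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
#ifdef M_ARENA_MAX
    /* Call before any threads start allocating; returns 1 on success. */
    if (mallopt(M_ARENA_MAX, 1) != 1)
        fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");
#endif
    void *p = malloc(1 << 20);  /* all threads now share one arena */
    printf("1 MB allocated at %p with the arena cap in effect\n", p);
    free(p);
    return 0;
}

(Only useful for code you build yourself, of course; for Matlab and other
prebuilt binaries the environment variable is the way to go.)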

Some discussion online:
Hadoop says set to 4:
https://issues.apache.org/jira/browse/HADOOP-7154

IBM:
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

I've been using 4 (i.e. I set it for all users at login), and Matlab right
off the bat allocates 1.3GB of vmem instead of 3.5GB.

It'd be great to limit based on resident memory directly, but OGS can't do
that. One of the commercial forks has that capability now; I can't remember
offhand which one. In any case, I set my host h_vmem limit somewhat above
actual physical memory to account for these peculiarities.

-M
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
