Hi Greg,

Thanks for the explanations - now it is much clearer to me, especially why it does not happen after a restart (I guess the locking "situation" is different then). We always use only a local file system; NFS should not be involved. Some of the machines that failed were virtual machines, some were nodes of a cluster. The screenshot I showed was from our big server, which has 128 GB of RAM, so the growth didn't cause a crash there. Next time I will monitor this more closely and try stopping the child processes, as you suggested, to get more insight. At the moment we have finished all the renderings (and they look great in Oculus), but I'm sure more will come in a few weeks.

Thanks for the help!

best
Jan

On 04/25/2018 07:14 PM, Gregory J. Ward wrote:
Hi Jan,

This could be an unexpected "hang" condition with one of the rtrace processes, 
where a single ray evaluation is blocked waiting for access to the ambient file, while 
the other processes continue computing away, filling up the queue with results after the 
blocked one.  I could see this becoming a runaway memory condition, but I don't know why 
a process would be blocked.  NFS file locking is used on the ambient file, and this has 
been known to fail on some Linux builds, but I haven't seen it fail by refusing to 
unlock.  (The problem in the past has been unlocking when it shouldn't.)

If you can monitor your processes, watch for when the parent becomes large, 
then stop all the child processes (kill -stop pid1 pid2 ...) and restart them 
one by one (kill -continue pidN).  If the parent process starts to shrink 
after that, or at least stops growing, that would support my hypothesis.

The other thing to look for is a child process with 0% CPU time.  If none of 
the child processes are hung, then I'm not sure why memory would be growing in 
the parent.

There's no sense trying to fix such an unusual problem until we have a firmer 
idea of the cause.

Cheers,
-Greg

From: Jan Wienold <[email protected]>
Date: April 25, 2018 1:28:44 AM PDT

Hi Greg,

While doing the renderings for the VR of Kynthia, we sometimes (i.e., not 
100% reproducibly) ran into problems with memory.

We rendered 4 images at the same time sharing one ambient file; each rtrace 
was using the -n 2 or -n 3 option.

I made a screenshot of top showing some of the processes. If you look at PID 88263, the 
"mother process" appears to use 41 GB (virt) in total!! Since some of our machines don't 
have a large swap space, some of these processes failed with "cannot allocate memory". I know 
that virtual memory is not a real indicator of what is actually used, but of our 400 jobs, 
around 10 failed with this issue.

The "children" use around 800-900 MB, which is fine and what we expected. But 
we don't know how to estimate the total memory usage: say a single rtrace needs 
500 MB, then I would have expected running -n 2 to use 1 GB, but there is also the mother 
process, whose size is somewhat unpredictable and sometimes explodes.

This "growth" of the mother process always happens near the end of the images 
(say, 90% finished).

Interestingly, after restarting the processes the failure never happened again 
(though I have to admit I didn't explicitly restart the simulation on the same 
machine, since I had a fully automated process in which the failed jobs were 
automatically restarted on one of the 50 machines we had available).

In the end we finished all 400(!) renderings with very good quality.

So this is not an urgent issue, but we wanted to report it. Maybe you have 
some rule of thumb for estimating the memory usage with the -n option when 
the usage of a single process is known?

best

Jan


--
Dr.-Ing.  Jan Wienold
Ecole Polytechnique Fédérale de Lausanne (EPFL)
EPFL ENAC IA LIPID

http://people.epfl.ch/jan.wienold
LE 1 111 (Office)
Phone    +41 21 69 30849


_______________________________________________
Radiance-dev mailing list
[email protected]
https://www.radiance-online.org/mailman/listinfo/radiance-dev
