Hi Greg,
unfortunately the currently running process was started from a build
without the -g switch. I will recompile and test it to try to get the
backtrace. I am pretty sure that there is no reflection of 100% or greater.
The one thing that I suspected to be the culprit is how I mask the
rendering. Does rfluxmtx properly digest zero direction vectors as
rtrace does? I have observed that the rcontrib process is stuck once the
last visible pixel has been rendered. The remaining part is all "out of
view", i.e. a 0 0 0 direction vector. So it might be that rcontrib is just
busy computing the zero length rays (most of my image is masked) but
makes no visible progress. I expected these rays to result in little
load, assuming that oversampling would not apply for zero length vectors
- and I have a pretty high "oversampling" set with the -c N parameter.
Is it possible that oversampling (accumulating) collides with my use of
the "dummy rays" to mask the image?
Cheers, Lars.
Hi Lars,
Did you check to make sure that none of your surfaces has 100% or greater
reflection? This can throw ray processing into a loop.
If you compile with the "-g" in place of "-O" and terminate the process when it's stuck
with a "kill -QUIT" signal, you should be able to get a backtrace to find out where the process was
hung.
Cheers,
-Greg
P.S. I did introduce a change to the ray queuing code used by both rtrace and
rcontrib that should prevent runaway memory growth, but it won't prevent an
infinite loop. Those are the worst.
From: "Lars O. Grobe" <[email protected]>
Date: May 4, 2018 8:43:55 AM PDT
Hi,
a quick follow-up just to clarify - I cannot prove that the stop / cont signals
caused the completion of the task, it may have been coincidence... Right now I
have a never-ending rcontrib-process again, and stop / cont does not help this
time.
Cheers,
Lars.
Hi Greg, Jan,
I just observed a similar problem with rcontrib. I am running a chain of vwrays,
rtrace, awk, rfluxmtx to calculate daylight coefficients in an image region
(rtrace returns view origin, direction and modifier, and awk filters so that
rays are passed into rfluxmtx only if a defined modifier is hit). This in
general works pretty well, even with 38 processes in parallel, but I just had
one rcontrib process stuck at 100% CPU (no memory effects though). Issuing a
kill -stop PID; kill -cont PID sequence on the rcontrib process made it
immediately complete the task. The ambient file can be excluded here as a
cause, since rcontrib does not utilize the ambient cache. This is all on
non-networked filesystems, Ubuntu Linux.
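
A minimal sketch of the masking step described above, written in Python rather than awk. It assumes each input line holds the ray origin (3 values), the ray direction (3 values) and the name of the modifier that was hit, separated by whitespace; that layout, the function name `mask_rays` and the modifier name used in the usage note are illustrative assumptions, not the actual script. Rays that miss the chosen modifier are not dropped but replaced by a 0 0 0 direction, so the number of output rays still matches the number of pixels.

```python
def mask_rays(lines, target_modifier):
    """Yield one 'ox oy oz dx dy dz' ray per non-blank input line.

    Assumed input layout (hypothetical): ox oy oz dx dy dz modifier
    Rays whose modifier differs from target_modifier keep their origin
    but get a zero direction vector, which marks the pixel as masked.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 7:
            raise ValueError("expected 6 numbers and a modifier: %r" % line)
        ox, oy, oz, dx, dy, dz = fields[:6]
        if fields[6] != target_modifier:
            dx = dy = dz = "0"  # masked pixel: zero-length direction
        yield " ".join((ox, oy, oz, dx, dy, dz))
```

For example, with the made-up modifier name "window", the line `0 0 1 1 0 0 wall` becomes `0 0 1 0 0 0`, while `0 0 1 0 1 0 window` passes through with its direction unchanged.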
Cheers, Lars.
Hi Jan,
This could be an unexpected "hang" condition with one of the rtrace processes,
where a single ray evaluation is blocked waiting for access to the ambient file, while
the other processes continue computing away, filling up the queue with results after the
blocked one. I could see this becoming a runaway memory condition, but I don't know why
a process would be blocked. NFS file locking is used on the ambient file, and this has
been known to fail on some Linux builds, but I haven't seen it fail by refusing to
unlock. (The problem in the past has been unlocking when it shouldn't.)
If you can monitor your processes, watch for when the parent becomes large,
then stop all the child processes (kill -stop pid1 pid2 ...) and restart them
one by one (kill -cont pidN). If the parent process starts to shrink
after that, or at least doesn't continue to grow, then this would support my
hypothesis.
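
The stop-and-restart test above could be scripted roughly as follows. This is only a sketch in Python: the function name `stop_then_resume`, the pause length, and the `send` and `sleep` parameters (there only so the logic can be tried without touching real processes) are assumptions, and the child PIDs have to be looked up by hand.

```python
import os
import signal
import time


def stop_then_resume(child_pids, pause=5.0, send=os.kill, sleep=time.sleep):
    """Suspend every child process, then resume them one at a time.

    SIGSTOP and SIGCONT are the signals behind 'kill -stop' and
    'kill -cont'. The pause after each resume leaves time to check
    whether the parent process shrinks or stops growing.
    """
    for pid in child_pids:
        send(pid, signal.SIGSTOP)  # suspend all children first
    for pid in child_pids:
        send(pid, signal.SIGCONT)  # wake up one child ...
        sleep(pause)               # ... and watch the parent for a while
```

Calling `stop_then_resume([pid1, pid2])` with the real child PIDs while watching the parent in top would then run the experiment.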
The other thing to look for is a child process with 0% CPU time. If none of
the child processes are hung, then I'm not sure why memory would be growing in
the parent.
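
Looking for a child with 0% CPU can also be automated. The sketch below assumes a process listing with three whitespace-separated columns (PID, PPID, %CPU), for instance captured from ps; the column layout, the function name `idle_children` and the threshold are assumptions made for illustration. Note that the %CPU reported by ps is typically an average over the process lifetime, so a recently blocked process may still show a non-zero value there; top gives a more current reading.

```python
def idle_children(ps_text, parent_pid, cpu_threshold=0.5):
    """Return the PIDs of children of parent_pid whose %CPU is below the threshold.

    Assumed layout per line (hypothetical): PID PPID %CPU
    Header lines and lines that do not fit this layout are skipped.
    """
    idle = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # header or unrelated line
        pid, ppid, pcpu = int(fields[0]), int(fields[1]), float(fields[2])
        if ppid == parent_pid and pcpu < cpu_threshold:
            idle.append(pid)
    return idle
```

An empty result would mean that none of the children looks hung, which is the case where the growth of the parent remains unexplained.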
There's no sense trying to fix such an unusual problem until we have a firmer
idea of the cause.
Cheers,
-Greg
From: Jan Wienold <[email protected]>
Date: April 25, 2018 1:28:44 AM PDT
Hi Greg,
While doing the renderings for the VR of Kynthia we encountered "sometimes"
(that means not 100% reproducible) problems with the memory.
We rendered 4 images at the same time sharing an ambient file, each rtrace was
using the -n 2 or -n 3 option.
I made a screenshot of top and some of the processes. If you look at id 88263, it seems like the
"mother process" uses 41 GB in total (virt)! Since some of our machines don't have a
large swap space, some of these processes failed with "cannot allocate memory". I know
that the virtual memory size is not a real indicator of what is actually used, but of our 400 jobs,
around 10 failed with this issue.
The "children" use around 800-900 MB, so this is fine and what we expected. But
we don't know how to estimate the total memory usage (let's say a single rtrace would need
500 MB, I would have expected running -n 2 to use 1 GB, but there is also the mother
process, whose size is a bit unpredictable and sometimes explodes).
This "growth" of the mother process always happens towards the end of the images
(let's say 90% finished).
Interestingly, when restarting the processes the failure never happened again (but
I have to admit I didn't restart the simulation explicitly on the same machine,
since I had a fully automated process in which the failed jobs were automatically
restarted on one of the 50 machines we had available).
Finally we finished all 400(!) renderings with a very good quality.
So this is not an urgent issue, but we wanted to report this. Maybe you have
some rules of thumb to calculate the memory usage when applying the -n option
when the usage of a single process is known?
best
Jan
--
Dr.-Ing. Jan Wienold
_______________________________________________
Radiance-dev mailing list
[email protected]
https://www.radiance-online.org/mailman/listinfo/radiance-dev