Hi Lars,

Did you check to make sure that none of your surfaces has 100% or greater 
reflection?  This can throw ray processing into a loop.
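
If you want a quick scan for that, something along these lines might help.
It's only a rough sketch: it assumes your materials are plain plastic, metal,
or mirror primitives written in the usual multi-line layout, all in one file
(here hypothetically "materials.rad"), and it ignores any specularity added on
top of the diffuse RGB:

    awk '$2=="plastic" || $2=="metal" || $2=="mirror" {
            name = $3
            getline; getline; getline   # skip the "0" string and "0" integer
                                        # argument lines; land on the real args
            if ($2 >= 1 || $3 >= 1 || $4 >= 1)   # fields 2-4 are the R G B values
                print "suspect material:", name
        }' materials.rad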

If you compile with "-g" in place of "-O" and terminate the process with a 
"kill -QUIT" signal when it's stuck, you should be able to get a backtrace to 
find out where the process was hung.
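
For example (assuming gdb is installed and core dumps aren't disabled by
ulimit or redirected by a system core handler), the steps would look roughly
like this:

    ulimit -c unlimited        # allow a core file to be written in this shell
    # ... run rcontrib as usual, wait for it to get stuck ...
    kill -QUIT <pid>           # SIGQUIT terminates the process with a core dump
    gdb rcontrib core          # the core file may be named core.<pid> instead
    (gdb) bt                   # print the backtrace

Alternatively, you can attach to the live process with "gdb -p <pid>" and type
"bt" there, without killing it.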

Cheers,
-Greg

P.S.  I did introduce a change to the ray queuing code used by both rtrace and 
rcontrib that should prevent runaway memory growth, but it won't prevent an 
infinite loop.  Those are the worst.

> From: "Lars O. Grobe" <[email protected]>
> Date: May 4, 2018 8:43:55 AM PDT
> 
> Hi,
> 
> a quick follow-up just to clarify - I cannot prove that the stop / cont 
> signals caused the completion of the task; it may have been coincidence... 
> Right now I have a never-ending rcontrib process again, and stop / cont does 
> not help this time.
> 
> Cheers,
> Lars.
> 
>> Hi Greg, Jan,
>> 
>> I just observed a similar problem with rcontrib. I am running a chain of 
>> vwrays, rtrace, awk, and rfluxmtx to calculate daylight coefficients in an 
>> image region (rtrace returns view origin, direction, and modifier, and awk 
>> filters so that rays are passed to rfluxmtx only if a defined modifier is 
>> hit). In general this works pretty well, even with 38 processes in 
>> parallel, but I just had one rcontrib process stuck at 100% CPU (no memory 
>> effects, though). Issuing a kill -stop PID; kill -cont PID sequence on the 
>> rcontrib process made it complete the task immediately. The ambient file 
>> can be excluded as a cause here, since rcontrib does not use the ambient 
>> cache. This is all on non-networked filesystems, Ubuntu Linux.
>> 
>> Cheers, Lars.
>> 
>>> Hi Jan,
>>> 
>>> This could be an unexpected "hang" condition with one of the rtrace 
>>> processes, where a single ray evaluation is blocked waiting for access to 
>>> the ambient file, while the other processes continue computing away, 
>>> filling up the queue with results after the blocked one.  I could see this 
>>> becoming a runaway memory condition, but I don't know why a process would 
>>> be blocked.  NFS file locking is used on the ambient file, and this has 
>>> been known to fail on some Linux builds, but I haven't seen it fail by 
>>> refusing to unlock.  (The problem in the past has been unlocking when it 
>>> shouldn't.)
>>> 
>>> If you can monitor your processes, watch for when the parent becomes 
>>> large, then stop all the child processes (kill -stop pid1 pid2 ...) and 
>>> restart them one by one (kill -cont pidN).  If the parent process starts 
>>> to shrink after that, or at least doesn't continue to grow, then this 
>>> would support my hypothesis.
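>>> 
>>> Something along these lines (untested; assumes pgrep is available and that 
>>> PARENT_PID holds the parent rtrace's process ID) would do the stop/restart 
>>> part:
>>> 
>>>     kids=$(pgrep -P "$PARENT_PID")  # direct children of the parent rtrace
>>>     kill -STOP $kids                # freeze all the children
>>>     for pid in $kids; do
>>>         kill -CONT $pid             # resume one child
>>>         sleep 60                    # watch the parent's size in top
>>>     done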
>>> 
>>> The other thing to look for is a child process with 0% CPU time.  If none 
>>> of the child processes are hung, then I'm not sure why memory would be 
>>> growing in the parent.
>>> 
>>> There's no sense trying to fix such an unusual problem until we have a 
>>> firmer idea of the cause.
>>> 
>>> Cheers,
>>> -Greg
>>> 
>>>> From: Jan Wienold <[email protected]>
>>>> Date: April 25, 2018 1:28:44 AM PDT
>>>> 
>>>> Hi Greg,
>>>> 
>>>> While doing the renderings for the VR of Kynthia, we encountered memory 
>>>> problems that were only "sometimes" reproducible (that is, not 100% 
>>>> reproducible).
>>>> 
>>>> We rendered 4 images at the same time sharing an ambient file; each 
>>>> rtrace was using the -n 2 or -n 3 option.
>>>> 
>>>> I made a screenshot of top showing some of the processes. If you look at 
>>>> PID 88263, it seems the "mother" process uses 41 GB (virt) in total!  
>>>> Since some of our machines don't have a large swap space, some of these 
>>>> processes failed with "cannot allocate memory". I know that virtual 
>>>> memory is not a real indicator of what is actually used, but of our 400 
>>>> jobs, around 10 failed with this issue.
>>>> 
>>>> The "children" use around 800-900 MB, which is fine and what we expected. 
>>>> But we don't know how to estimate the total memory usage. Say a single 
>>>> rtrace needs 500 MB; I would have expected running with -n 2 to use 1 GB, 
>>>> but there is also the mother process, whose size is somewhat 
>>>> unpredictable and sometimes explodes.
>>>> 
>>>> This "growth" of the mother process always happens near the end of the 
>>>> images (say, around 90% finished).
>>>> 
>>>> Interestingly, when restarting the processes the failure never happened 
>>>> again (though I have to admit I didn't explicitly restart the simulation 
>>>> on the same machine, since I had a fully automated process in which the 
>>>> failed jobs were restarted on one of the 50 machines we had available).
>>>> 
>>>> In the end we finished all 400(!) renderings with very good quality.
>>>> 
>>>> So this is not an urgent issue, but we wanted to report it. Maybe you 
>>>> have a rule of thumb for estimating the memory usage with the -n option 
>>>> when the usage of a single process is known?
>>>> 
>>>> best
>>>> 
>>>> Jan
>>>> 
>>>> -- 
>>>> Dr.-Ing.  Jan Wienold

_______________________________________________
Radiance-dev mailing list
[email protected]
https://www.radiance-online.org/mailman/listinfo/radiance-dev
