Hi Lars,

Thanks for sending this.  Your delays are not caused by ray flushing.  Your 
settings for -x and -y override the flushing behavior, so performance should 
not be affected by your zero rays.  This does leave me puzzled as to the cause 
of your delays, however.  You say the rcontrib process is at 100%?  If that's 
the case, then kill -QUIT on a binary compiled with "-g" should give us a clue 
where it's getting stuck.  If the process is at 0%, then a backtrace from kill 
-QUIT may or may not be helpful.
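
For reference, one way to capture that backtrace, assuming core dumps are 
enabled and gdb is available (the PID and install path below are placeholders):

    ulimit -c unlimited    # in the shell that launches rcontrib
    kill -QUIT 12345       # SIGQUIT makes the stuck process dump core
    gdb /usr/local/bin/rcontrib core
    (gdb) bt               # shows where it was hung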

Cheers,
-Greg

> From: "Lars O. Grobe" <[email protected]>
> Date: May 4, 2018 11:23:17 AM PDT
> 
> Hi Greg,
> 
> this is how the rcontrib process was started by rfluxmtx (grepped from ps ax):
> 
> rcontrib -fo+ -n 38 -ab 2 -ad 128 -lw .008 -ss 16 -st .01
>   -o /tmp/Cellular_fisheye_celWg01XVBP_d_02_sys_Klems_celWg02XVBP_r_02_sys_Klems_celWg03XVBP_r_02_sys_Klems_TUR_Izmir.172180_IWEC.glazingdcs/%04d.hdr
>   -x 1024 -y 1024 -ld- -fac -c 32 -bn 1 -b if(-Dx*0-Dy*0-Dz*1,0,-1)
>   -m ground_glow -f reinhartb.cal -p MF=1,rNx=0,rNy=0,rNz=-1,Ux=0,Uy=1,Uz=0,RHS=+1
>   -bn Nrbins -b rbin -m sky_glow
>   !oconv -f offices/cellularOffice/Cellular.rad offices/cellularOffice/Cellular_wg01XVBPd02o_wg02XVBPr02o_wg03XVBPr02o.rad uniformSky.rad
> 
> Maybe it is just the constant flushing (after every pixel...), and I had, by 
> coincidence, managed to issue the kill -stop; kill -cont right before the 
> process was done. Is there a better way to get just parts of a view rendered, 
> if the zero rays affect performance that drastically?
> 
> Cheers, Lars.
>> Hi Lars,
>> 
>> If you're running rfluxmtx in pass-through mode, then it's rcontrib that is 
>> actually handling the zero direction vectors.  It would help me to know 
>> exactly the parameters being used in that case, as the logic in rcontrib is 
>> pretty complicated, especially when it comes to multiprocessing.  What 
>> rcontrib command is reported by rfluxmtx using the '-v' option?
>> 
>> Under some circumstances, the zero rays are interpreted as "flush requests", 
>> which can slow things down a bit.  There shouldn't be any infinite loops, 
>> however, and I don't think flushing happens if you specify both -x and -y > 
>> 0 to rcontrib.
>> 
>> Oversampling should work fine with dummy rays.  In any case, you should get 
>> one result for every N input rays, even if their directions are 0 0 0.  The 
>> results will just be zero for those records.  (I assume you know that to 
>> have gotten this far.)
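>> 
>> For example, a minimal sketch (scene.oct stands in for the real octree; the 
>> second record is a dummy ray with a 0 0 0 direction) should yield exactly 
>> one output record for the two inputs:
>> 
>>   rcontrib -fa -c 2 -m sky_glow scene.oct <<EOF
>>   0 0 1  0 0 -1
>>   0 0 0  0 0 0
>>   EOF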
>> 
>> Cheers,
>> -Greg
>> 
>>> From: "Lars O. Grobe" <[email protected]>
>>> Date: May 4, 2018 9:38:42 AM PDT
>>> 
>>> Hi Greg,
>>> 
>>> unfortunately the currently running process was started from a build 
>>> without the -g switch. I will recompile and test again to try to get the 
>>> backtrace. I am pretty sure that there is no >100% reflection.
>>> 
>>> The one thing that I suspected to be the culprit is how I mask the 
>>> rendering. Does rfluxmtx properly digest zero direction vectors the way 
>>> rtrace does? I have observed that the rcontrib process gets stuck once the 
>>> last visible pixel has been rendered. The remaining part is all "out of 
>>> view", i.e. has a 0 0 0 direction vector. So it might be that rcontrib is 
>>> just busy computing the zero-length rays (most of my image is masked) but 
>>> makes no visible progress. I expected these rays to result in little load, 
>>> assuming that oversampling would not apply to zero-length vectors, and I 
>>> have a pretty high "oversampling" set with the -c N parameter. Is it 
>>> possible that oversampling (accumulating) collides with my use of the 
>>> "dummy rays" to mask the image?
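>>> 
>>> For reference, my masking step looks roughly like this (a sketch; view.vf, 
>>> scene.oct and win_mat are placeholder names, and I assume ASCII I/O 
>>> throughout):
>>> 
>>>   vwrays -fa -vf view.vf -x 1024 -y 1024 \
>>>     | rtrace -h- -fa -oodm scene.oct \
>>>     | awk '$7 == "win_mat" { print $1,$2,$3,$4,$5,$6; next }
>>>            { print "0 0 0  0 0 0" }' \
>>>     | rfluxmtx ...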
>>> 
>>> Cheers, Lars.
>>>> Hi Lars,
>>>> 
>>>> Did you check to make sure that none of your surfaces has 100% or greater 
>>>> reflection?  This can throw ray processing into a loop.
>>>> 
>>>> If you compile with "-g" in place of "-O" and, when the process is stuck, 
>>>> terminate it with a "kill -QUIT" signal, you should be able to get a 
>>>> backtrace to find out where the process was hung.
>>>> 
>>>> Cheers,
>>>> -Greg
>>>> 
>>>> P.S.  I did introduce a change to the ray queuing code used by both rtrace 
>>>> and rcontrib that should prevent runaway memory growth, but it won't 
>>>> prevent an infinite loop.  Those are the worst.
>>>> 
>>>>> From: "Lars O. Grobe" <[email protected]>
>>>>> Date: May 4, 2018 8:43:55 AM PDT
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> a quick follow-up just to clarify - I cannot prove that the stop / cont 
>>>>> signals caused the completion of the task; it may have been 
>>>>> coincidence... Right now I have a never-ending rcontrib process again, 
>>>>> and stop / cont does not help this time.
>>>>> 
>>>>> Cheers,
>>>>> Lars.
>>>>> 
>>>>>> Hi Greg, Jan,
>>>>>> 
>>>>>> I just observed a similar problem with rcontrib. I am running a chain of 
>>>>>> vwrays, rtrace, awk, and rfluxmtx to calculate daylight coefficients in 
>>>>>> an image region (rtrace returns view origin, direction, and modifier, 
>>>>>> and awk filters so that rays are passed into rfluxmtx only if a defined 
>>>>>> modifier is hit). In general this works pretty well, even with 38 
>>>>>> processes in parallel, but I just had one rcontrib process stuck at 100% 
>>>>>> CPU (no memory effects, though). Issuing a kill -STOP PID; kill -CONT 
>>>>>> PID sequence on the rcontrib process made it immediately complete the 
>>>>>> task. The ambient file can be excluded as a cause here, since rcontrib 
>>>>>> does not utilize the ambient cache. This is all on non-networked 
>>>>>> filesystems, under Ubuntu Linux.
>>>>>> 
>>>>>> Cheers, Lars.
>>>>>> 
>>>>>>> Hi Jan,
>>>>>>> 
>>>>>>> This could be an unexpected "hang" condition with one of the rtrace 
>>>>>>> processes, where a single ray evaluation is blocked waiting for access 
>>>>>>> to the ambient file, while the other processes continue computing away, 
>>>>>>> filling up the queue with results after the blocked one.  I could see 
>>>>>>> this becoming a runaway memory condition, but I don't know why a 
>>>>>>> process would be blocked.  NFS file locking is used on the ambient 
>>>>>>> file, and this has been known to fail on some Linux builds, but I 
>>>>>>> haven't seen it fail by refusing to unlock.  (The problem in the past 
>>>>>>> has been unlocking when it shouldn't.)
>>>>>>> 
>>>>>>> If you can monitor your processes, watch for when the parent becomes 
>>>>>>> large, then stop all the child processes (kill -STOP pid1 pid2 ...) and 
>>>>>>> restart them one by one (kill -CONT pidN).  If the parent process 
>>>>>>> starts to shrink after that, or at least doesn't continue to grow, then 
>>>>>>> this would support my hypothesis.
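>>>>>>> 
>>>>>>> In shell terms, the sequence would be something like this (the PIDs 
>>>>>>> are placeholders):
>>>>>>> 
>>>>>>>   kill -STOP 8821 8822 8823   # freeze all children at once
>>>>>>>   kill -CONT 8821             # resume one child, watch the parent,
>>>>>>>                               # then -CONT the next, and so on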
>>>>>>> 
>>>>>>> The other thing to look for is a child process with 0% CPU time.  If 
>>>>>>> none of the child processes are hung, then I'm not sure why memory 
>>>>>>> would be growing in the parent.
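>>>>>>> 
>>>>>>> On Linux, a quick way to check for that (using the parent PID 88263 
>>>>>>> from your screenshot as an example):
>>>>>>> 
>>>>>>>   ps -o pid,%cpu,rss,stat,comm --ppid 88263
>>>>>>> 
>>>>>>> A hung child will show ~0 %CPU, typically in state "S" or "D" (the 
>>>>>>> latter meaning blocked in the kernel, e.g. waiting on a file lock).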
>>>>>>> 
>>>>>>> There's no sense trying to fix such an unusual problem until we have a 
>>>>>>> firmer idea of the cause.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> -Greg
>>>>>>> 
>>>>>>>> From: Jan Wienold <[email protected]>
>>>>>>>> Date: April 25, 2018 1:28:44 AM PDT
>>>>>>>> 
>>>>>>>> Hi Greg,
>>>>>>>> 
>>>>>>>> While doing the renderings for the VR of Kynthia, we encountered 
>>>>>>>> occasional (that is, not 100% reproducible) problems with memory.
>>>>>>>> 
>>>>>>>> We rendered 4 images at the same time sharing an ambient file; each 
>>>>>>>> rtrace was using the -n 2 or -n 3 option.
>>>>>>>> 
>>>>>>>> I made a screenshot of top showing some of the processes. If you look 
>>>>>>>> at PID 88263, it seems the "mother" process uses 41 GB (virtual) in 
>>>>>>>> total!  Since some of our machines don't have a large swap space, some 
>>>>>>>> of these processes failed with "cannot allocate memory". I know that 
>>>>>>>> virtual memory is not a reliable indicator of what is actually used, 
>>>>>>>> but out of our 400 jobs we had around 10 fail with this issue.
>>>>>>>> 
>>>>>>>> The "children" use around 800-900 MB, which is fine and what we 
>>>>>>>> expected. But we don't know how to estimate the total memory usage: 
>>>>>>>> say a single rtrace needs 500 MB; I would have expected running with 
>>>>>>>> -n 2 to use 1 GB, but there is also the mother process, whose size is 
>>>>>>>> a bit unpredictable and sometimes explodes.
>>>>>>>> 
>>>>>>>> This "growth" of the mother process always happens near the end of an 
>>>>>>>> image (say, 90% finished).
>>>>>>>> 
>>>>>>>> Interestingly, when restarting the processes the failure never 
>>>>>>>> happened again (though I have to admit I didn't explicitly restart the 
>>>>>>>> simulation on the same machine, since I had a fully automated process 
>>>>>>>> in which failed jobs were automatically restarted on one of the 50 
>>>>>>>> machines we had available).
>>>>>>>> 
>>>>>>>> In the end we finished all 400(!) renderings with very good quality.
>>>>>>>> 
>>>>>>>> So this is not an urgent issue, but we wanted to report it. Maybe you 
>>>>>>>> have a rule of thumb for estimating the memory usage with the -n 
>>>>>>>> option when the usage of a single process is known?
>>>>>>>> 
>>>>>>>> best
>>>>>>>> 
>>>>>>>> Jan
