Thanks to all! I am not sure I fully understand why it behaves the way it does, but I changed the matrix format from ASCII to float, and the run time was reduced considerably (by roughly 90%?).
I may do some more tests in the future. Thanks to all!

Germán

2017-05-10 19:41 GMT-03:00 Ian Ashdown <ian_ashd...@helios32.com>:

> > You could be suffering from memory allocation costs, not so much the
> > operations themselves. The rmtxop program uses doubles and 3 components
> > per matrix entry, so that's (3 x 2048 x 2305 x 8 bytes), or about 108 MB,
> > for each of your matrices. When you multiply one such matrix by a sky
> > vector, you only need the additional memory for the vector (54 KB). When
> > you add matrices, rmtxop keeps two of them in memory at a time, or
> > 216 MB of memory. That's not a lot for most PCs these days, but
> > allocating and freeing that much space may take some time if malloc is
> > not efficient.
>
> The problem here is not the amount of memory, but the CPU's access to it.
> When the CPU is accessing the arrays, the data is staged through a
> hierarchy of caches. For a modern Intel Core i7, for example, there are
> typically four L2 caches of 256 KB each (one per core) and a slower L3
> cache of 8 MB that is shared by the CPU cores.
>
> If the arrays can be stored in the L2 cache, the processor can usually
> run at full speed. If not, the CPU will typically have to wait while the
> data is retrieved from the slower L3 cache.
>
> The caches are on-chip. If the arrays exceed the L3 cache capacity, the
> data must be retrieved from the much slower main memory and transferred
> over the memory bus. With 216 MB of array data to contend with, this is
> most likely the culprit.
>
> Performance optimization typically involves arranging the array data in
> memory so that it can be loaded in cache lines, organizing the array
> stride, splitting the arrays into subarrays for multithreaded processing,
> and so on. However, these techniques are mostly processor-family
> specific. (For Intel CPUs, SSE and AVX instructions are also available to
> improve parallelism.
> Optimizing compilers can help here, but hand coding may be needed for
> optimal performance on specific processors.)
>
> If the arrays are stored in ASCII rather than binary, you will typically
> see a performance hit of several hundred times, as the CPU spends most of
> its time parsing the strings into floating-point data.
>
> Ian Ashdown, P. Eng. (Ret.), FIES
> Senior Scientist
> SunTracker Technologies Ltd.
> www.suntrackertech.com
>
> _______________________________________________
> Radiance-dev mailing list
> Radiance-dev@radiance-online.org
> https://www.radiance-online.org/mailman/listinfo/radiance-dev
_______________________________________________ Radiance-dev mailing list Radiance-dev@radiance-online.org https://www.radiance-online.org/mailman/listinfo/radiance-dev