Hi Chuck,

Mea culpa. I forgot to set the KNOB_MAX_WORKER_THREADS env variable in my swr 
tests. Sorry for this oversight.


Your suspicion was right: the large virtual memory consumption was indeed due
to a massive oversubscription of threads spawned by swr by default. Setting
KNOB_MAX_WORKER_THREADS to 1 fixed the thread oversubscription and the memory
consumption issue when the number of MPI processes is equal to the number of
cores requested.
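
In practice that boils down to something like this in my job script (the
launcher and the core count of 24 are just illustrative):

    # one swr worker thread per pvserver rank, since #ranks == #cores here
    export GALLIUM_DRIVER=swr
    export KNOB_MAX_WORKER_THREADS=1
    mpirun -np 24 pvserver   # 24 = number of cores requested on the node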


I have one last question about the performance of the swr library vs the default
llvmpipe library. I have found some benchmarks of swr vs llvmpipe showing
acceleration factors ranging from 30 to 50 depending on the number of tets
(http://openswr.org/slides/SWR_Sept15.pdf). However, it is not clear to me which
part of this acceleration is due to better threading in the swr library and
which part is due to a potentially better usage of the AVX* instructions. If I
do not allow threading in swr by setting KNOB_MAX_WORKER_THREADS to 1 (basically
serial swr vs serial llvmpipe), do you know what performance gain I could still
expect from a potentially better usage of the AVX instructions alone?


Thanks a lot for your help.


Best regards,


Michel


________________________________
From: Chuck Atkins <chuck.atk...@kitware.com>
Sent: Tuesday, February 7, 2017 10:03:50 PM
To: Michel Rasquin
Cc: paraview@paraview.org
Subject: Re: [Paraview] Issues with PVSB 5.2 and OSMesa support

Hi Michel,

Indeed, I built PVSB 5.2 with the Intel 2016.2.181 compilers and Intel MPI
5.1.3.181, then ran the resulting pvserver on Haswell CPU nodes (Intel
E5-2680v3), which support AVX2 instructions.  So this fits exactly the known
issue you mentioned in your email.

Yep, that'll do it.  The problem is due to a bug in the Intel compiler
performing over-aggressive vectorized code generation.  I'm not sure whether
it's fixed in >= 17, but I definitely know it's broken in <= 16.x.
GALLIUM_DRIVER=SWR is going to give you the best performance in this situation
anyway, and it is the recommended OSMesa driver on x86_64 CPUs.



Exporting the GALLIUM_DRIVER env variable to swr then leads to an interesting
behavior. With the swr driver, the good news is that I can connect my pvserver
built in release mode without crashing.

Great!


For the record, the llvmpipe driver compiled in release mode crashes during
the client/server connection, whereas the llvmpipe driver compiled in debug
mode works fine.

This lines up with the issue being bad vectorization, since the compiler won't
be doing most of those optimizations in a debug build.


However, our PBS scheduler quickly killed my interactive job because the
virtual memory was exhausted, which was puzzling. Increasing the number of
cores requested for my job and keeping some of them idle allowed me to increase
the available memory at the cost of wasted CPU resources.

I suspect the problem is a massive oversubscription of threads by swr.  The
default behavior of swr is to use all available CPU cores on the node.
However, when running multiple MPI processes per node, they have no way of
knowing about each other.  So if you've got 24 cores per node and run 24
pvservers, you'll end up with 24^2 = 576 rendering threads on a node; not so
great.  You can control this with the KNOB_MAX_WORKER_THREADS environment
variable.  Typically you'll want to set it to the number of cores per node
divided by the number of processes per node your job is running.  So if your
node has 24 cores and you run 24 processes per node, then set
KNOB_MAX_WORKER_THREADS to 1, but if you're running 4 processes per node, then
set it to 6; you get the idea.  That should address the virtual memory
problem.  It's a balance, since typically rendering will perform better with
fewer processes per node and more threads per process, but the filters, like
Contour, parallelize at the MPI level and work better with more processes per
node.  You'll need to find the right balance for your use case depending on
whether it's render-heavy or pipeline-processing heavy.
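
In job-script form that works out to something like this (using the 24-core /
4-process example above; the exact mpirun invocation will depend on your MPI
and scheduler):

    # 24-core node, 4 pvserver ranks per node -> 6 swr worker threads per rank
    CORES_PER_NODE=24
    PROCS_PER_NODE=4
    export GALLIUM_DRIVER=swr
    export KNOB_MAX_WORKER_THREADS=$(( CORES_PER_NODE / PROCS_PER_NODE ))  # = 6
    mpirun -np ${PROCS_PER_NODE} pvserver   # single node assumed for simplicity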


Would you also know if this known issue with the llvmpipe driver will be fixed
in PV 5.3 (agreeing on the fact that the swr driver should be faster on Intel
CPUs, provided that it does not exhaust the memory)?

It's actually an Intel compiler bug and not a ParaView (or even Mesa for that
matter) issue, so probably not.  It may be fixed in future releases of icc but
I wouldn't know without testing it.


- Chuck
