On Mon, Dec 7, 2009 at 3:05 PM, Chris Kees <christopher.e.kees at usace.army.mil> wrote:
> This whole thread has been really interesting. I'm interested in Barry's suggestion, and I know a handful of people not on the list who would also be interested. I've got a couple of questions/responses to the MatSetValues, point-wise physics, and "support tools" issues that were raised.
>
> Did you guys settle the issue that Jed brought up about the Python Global Interpreter Lock (GIL)? There was some discussion of it at SciPy this summer, but I didn't come away feeling like anybody had an immediate solution or that Guido was in favor of eliminating the GIL anytime soon. Are you guys assuming that the "worker threads" for the multiple cores/GPUs are going to be set up by C code (hand- or machine-generated)?

I think this is going to be solved by 1) rearranging the code to segregate reductions or conflicts, 2) using atomics, and then 3) intelligent reordering to minimize the conflicts. I have never seen this be a big problem, but some code is always going to have the bad tradeoff, I guess.

> It seems to me that the MatSetValues discussion moved on from the threading issue (since you can't call MatSetValues efficiently in Python anyway) to the more general issue of how PETSc is going to explicitly support shared memory parallelism. I interpret the last interchange as implying that the solution is another level of domain decomposition that would be hidden from the mat assembly loop level (i.e. users aren't going to have to write their code as if the "thread part" is their parallel subdomain). Is that a correct interpretation? Something along the lines of additional args to MatCreate that indicate how many threads to use per subdomain, and some OpenMP-style instructions to parallelize the assembly loop?

I am not advocating that. I want a CUDA-like solution, but of course you can probably convert any one to the other with a tremendous amount of pain. I wish for everything to be matrix-free applications and vector assemblies, because then I think the problem is very simple. I have not seen any compelling argument against this approach, and all the best codes I know use it.
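To make that concrete, here is roughly the shape I have in mind on the PETSc side: the application provides only the action of the operator and the vector assembly, and the Krylov solver only ever sees a shell matrix. This is just a minimal sketch; the names and the empty physics loop are placeholders, not code from any existing application.

  #include "petscksp.h"

  typedef struct {
    void *physics;  /* whatever mesh/physics state is needed to apply the operator */
  } AppCtx;

  /* y = A*x computed on the fly by looping over elements and quadrature
     points; no sparse matrix is ever assembled */
  PetscErrorCode ApplyOperator(Mat A, Vec x, Vec y)
  {
    AppCtx        *app;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatShellGetContext(A, (void **) &app); CHKERRQ(ierr);
    /* ... element loop calling the point-wise physics goes here ... */
    PetscFunctionReturn(0);
  }

  /* the solver is wired up to the action alone */
  PetscErrorCode SetupSolver(MPI_Comm comm, PetscInt nlocal, PetscInt nglobal,
                             AppCtx *app, KSP ksp)
  {
    Mat            A;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatCreateShell(comm, nlocal, nlocal, nglobal, nglobal,
                          (void *) app, &A); CHKERRQ(ierr);
    ierr = MatShellSetOperation(A, MATOP_MULT, (void (*)(void)) ApplyOperator); CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN); CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

With that organization the whole threading/GPU question lives inside ApplyOperator and the vector assembly rather than inside MatSetValues.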
> On the issue of the point-wise physics (e.g. quadrature point routines, Riemann solvers, ...): it seems like this is a long-standing bottleneck for object-oriented approaches to PDEs, going back to the OO FEM research in the 80s and 90s. The SciPy group that met to talk about PDEs spent some time on it, partly related to the femhub project. I agree that a code generation approach seems to be the only way forward. Ondrej Certik from the femhub project has written a symbolic math module for Python, and we talked some about using it to represent the physics in a way that can be used for code generation. I'm not sure what progress he's made. At the end of the day you've got to allow people to write code for the physics without knowledge of the assembly process while still injecting that code into the inner loop, otherwise you'll never approach the speed of monolithic Fortran codes.

This solution has been "obvious" for some time, but it needs a form even a Fortran programmer could love. Or at least a bad C programmer.

> One issue that you might also want to consider with multi-language approaches is the configuration and build tools. I agree that the debugging support is a major issue, but what has taken far more of my time is the lack of cross-platform configuration and build tools for multi-language software. We have looked at CMake and SCons but have had to stick with a pretty ugly mixture of Python scripts and make. I feel like other scientific computing packages like Sage and EPD are successfully growing because they distribute binaries that include all dependencies, but I don't see that being feasible for HPC software anytime soon. It seems like to really be successful you may have to rely on a full-featured and fully documented tool like SCons is aiming to be, otherwise you'll end up with an approach that works for you (because you wrote it) but not very well for others trying to use the new PETSc.

I think we do a better job than anyone with PETSc right now, but we will have to work a little harder to get everything right.

> We have a small group at the US Army ERDC that has been developing a Python FEM framework for multiscale/multiphysics modeling that tries to break the coupling between the physics and the FEM methods (i.e. we try to support CG, DG, and non-conforming elements with variable order and element topologies running from the same description of the physics). The code uses our own wrappers to PETSc and MPI, but only because we started before mpi4py and petsc4py existed. We have a memory-intensive physics API based on computing and storing quadrature point values of PDE coefficients and numerical fluxes, and residual and Jacobian assembly processes along the lines of Matt's first approach of storing element residuals and Jacobians. We also have handwritten "optimized" codes that collapse all the loops and eliminate storage for direct residual and Jacobian assembly. We did that last step out of necessity and as a precursor to working on a code generation approach (it's what we want the code generator to write). It also has the side effect of providing a low-level API for people who want to write traditional monolithic Fortran codes for residual and Jacobian assembly.

You should also look at Lisandro's fluids code.

> Our approach for the "user", or more correctly the "physics developer", is 1) prototype the physics in Python, 2) "vectorize" the physics by rewriting it in C/C++/Fortran using swig, f2py, or hand-written wrappers, and 3) take the inner loop of 2) and build monolithic residual and Jacobian routines. This is obviously not satisfactory and not maintainable in the long run, but we are solving "real" nonlinear multi-physics problems in 3D with it (e.g. incompressible two-phase flow, two-phase flow in porous media, shallow water equations, and a few other models). Recent profiles have put about 95% of the compute time inside KSPSolve, but we've got a lot of work to do on better preconditioning and tolerance selection before that number provides reliable guidance. Anyway, if you guys go forward with Barry's suggestion I'd be interested in seeing if there's some way to work out some inter-agency collaboration in addition to the good academic collaboration that you already have.

I would say an improvement to this would be CUDA kernels for your inner loops.
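Something like the following flat loop is what I picture as both the target for a code generator and the natural unit to turn into a GPU kernel later; all of it is illustrative (the names and the placeholder reaction term are made up, not taken from your framework):

  /* One pass over elements and quadrature points with the point-wise physics
     inlined; no per-element residual storage.  The line marked "physics" is
     what a generator would emit from a symbolic description, and it is the
     part worth moving into a CUDA kernel. */
  void AssembleResidual(int nElem, int nQuad, int nDof,
                        const double *w,    /* quadrature weights    [q]    */
                        const double *phi,  /* test functions        [q][i] */
                        const double *u_q,  /* solution at q points  [e][q] */
                        double *r)          /* residual              [e][i] */
  {
    for (int e = 0; e < nElem; ++e) {
      for (int q = 0; q < nQuad; ++q) {
        /* physics: placeholder nonlinear reaction term f(u) = u*u */
        double f = u_q[e*nQuad + q] * u_q[e*nQuad + q];
        for (int i = 0; i < nDof; ++i) {
          r[e*nDof + i] += w[q] * phi[q*nDof + i] * f;
        }
      }
    }
  }

That structure maps naturally onto one GPU thread per element or per quadrature point, which is why I keep pushing in that direction.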
   Matt

> Chris
>
> On Dec 6, 2009, at 9:43 AM, Jed Brown wrote:
>
>> On Sat, 5 Dec 2009 14:23:45 -0600, Matthew Knepley <knepley at gmail.com> wrote:
>>
>>> We do impose some of our process on users already, such as CHKERRQ and PetscMalloc.
>>
>> CHKERRQ is still optional, it's just recommended in order to get a decent trace. It can't be used, for example, for top-level calls within my VisIt plugin because those functions necessarily return something else (I have an alternative which checks the return code, generates debugging messages, possibly tries to deallocate local resources, and throws the appropriate exception). In C99 projects, I use an alternate macro that uses __func__ instead of __FUNCT__.
>>
>> There is PetscMallocSet if the user doesn't want to use PetscMalloc, but users are rarely required to directly interact with PetscMalloc/PetscFree anyway, since most arrays are copied or managed by an object.
>>
>> Jed

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
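P.S. For anyone curious about the kind of macro Jed mentions, a C99 sketch of the general shape might look like the following. This is only a guess at the flavor, not his actual code, and DoWork() is just a stand-in for any call that returns an error code.

  #include <stdio.h>

  /* __func__ is a predefined identifier in C99, so no per-function
     __FUNCT__ string has to be declared and kept up to date by hand. */
  #define CHK(ierr) do {                                            \
      if (ierr) {                                                   \
        fprintf(stderr, "Error code %d in %s() at %s:%d\n",         \
                (int)(ierr), __func__, __FILE__, __LINE__);         \
        return ierr;                                                \
      }                                                             \
    } while (0)

  int DoWork(void);  /* hypothetical callee returning 0 on success */

  int Driver(void)
  {
    int ierr;
    ierr = DoWork(); CHK(ierr);
    return 0;
  }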