[petsc-dev] MatZeroRowsColumnsStencil does not have a fortran binding
On Nov 4, 2012, at 1:28 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> Pushed to 3.3. Thanks. In the future, can you make patches by committing locally and using hg export (or even hg email)? This patch had DOS-style formatting that needed to be converted before applying.
>
> http://petsc.cs.iit.edu/petsc/releases/petsc-3.3/rev/ca8b914ed071

Weird... the patch I sent shows up with Unix (LF) line endings in BBEdit. It looks like LSU's Exchange server is doing something funky with line endings.

Blaise

On Sun, Nov 4, 2012 at 12:52 PM, Blaise A Bourdin <bourdin at lsu.edu> wrote:

> Hi,
>
> Right now, MatZeroRowsStencil has a Fortran binding, but MatZeroRowsColumnsStencil does not. Is it possible to add the binding for MatZeroRowsColumnsStencil in petsc-3.3 and petsc-dev? I am attaching a patch for 3.3.
>
> Blaise

--
Department of Mathematics and Center for Computation and Technology
Louisiana State University, Baton Rouge, LA 70803, USA
Tel. +1 (225) 578 1612, Fax +1 (225) 578 4276
http://www.math.lsu.edu/~bourdin
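[For context, MatZeroRowsColumnsStencil zeros both the rows and the columns associated with grid points addressed by DMDA stencil coordinates. A minimal C sketch of a call, assuming A, x, and b come from an existing 2D DMDA (signature as in petsc-3.3):

  MatStencil row = {0};      /* zero-initialize k, j, i, c        */
  row.i = 1;  row.j = 2;     /* grid point (1,2), component c = 0 */

  /* zero the row and column for that point, place 1.0 on the diagonal,
     and adjust the right-hand side b using the boundary values in x */
  MatZeroRowsColumnsStencil(A, 1, &row, 1.0, x, b);
]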
[petsc-dev] OpenMPI is a menace
On 11/5/12 10:13 AM, Matthew Knepley wrote:

> I am tired of wasting time on their stupid bugs which we report but are never fixed. Is there a way to retaliate without grenades?

Switch over to HPX? http://stellar.cct.lsu.edu/tag/hpx/

I'm not sure it's at a point where it can handle PETSc, but keep it in mind for 2013. I'm cc'ing one of the devs, who is probably not subscribed but can answer questions (bandwidth permitting, and maybe delayed until SC12 or after).

Option 2: we (PathScale) have been considering taking on shipping a supported version of OpenMPI for some time. If anyone would be interested in add-on support, or just in paying a bounty to fix bugs, we may be able to work it out. (Not the perfect open-source answer, but maybe it's better (cheaper?) than grenades.)

Which bugs are you specifically interested in getting fixed?
[petsc-dev] OpenMPI is a menace
On 11/5/12 10:48 AM, Matthew Knepley wrote:

> On Sun, Nov 4, 2012 at 10:38 PM, C. Bergström <cbergstrom at pathscale.com> wrote:
>
>> Which bugs are you specifically interested in getting fixed?
>
> When they install to a non-standard location, they do not add RPATH support for the library link in their compiler wrappers.

(That's really it? You're going to throw a grenade over this?)

I just pulled the svn nightly tarball of OpenMPI, and I'm wondering whether either of the options below could be used to help. Have you tried them and found them buggy? They seem to propagate to the wrapper scripts:

  --with-wrapper-ldflags   Extra flags to add to LDFLAGS when using wrapper compilers
  --with-wrapper-libs      Extra flags to add to LIBS when using wrapper compilers
[petsc-dev] OpenMPI is a menace
On 11/5/12 10:55 AM, Jed Brown wrote:

> I just bumped the thread on this over on their mailing list. Educating users about LD_LIBRARY_PATH is a losing battle, in my opinion.

(I wholly agree, and I wish -rpath matched Solaris -R behavior. Preloading can be a slightly better alternative in some cases.)

I'm still not getting exactly what's being done. What --prefix= is used to build OpenMPI? Where is it installed? Can the installed location be known at compile time?
[petsc-dev] OpenMPI is a menace
On 11/5/12 11:02 AM, Matthew Knepley wrote:

> These don't solve the problem. Check the mailing list.

Please forward the emails/thread to me so I can catch up. I'll honestly try to resolve this if I can.
[petsc-dev] OpenMPI is a menace
On 11/5/12 12:21 PM, Bryce Lelbach wrote:

> When they install to a non-standard location, they do not add RPATH support for the library link in their compiler wrappers.

Bingo. Users aren't setting --with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib. IMHO it's a bug, and that flag should be considered part of the minimal set. Hopefully I can get Jeff's patch in, or it gets finished/resolved, before SC12.
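[A minimal sketch of what that fix looks like at configure time, assuming a hypothetical installation prefix of /opt/openmpi:

  ./configure --prefix=/opt/openmpi \
      --with-wrapper-ldflags="-Wl,-rpath,/opt/openmpi/lib"
  make install

  # mpicc now embeds the rpath, so the resulting executable runs
  # without LD_LIBRARY_PATH pointing at /opt/openmpi/lib
  mpicc -o app app.c
]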
[petsc-dev] OpenMPI is a menace
On Sun, Nov 4, 2012 at 11:21 PM, Bryce Lelbach <blelbach at cct.lsu.edu> wrote:

> HPX is drastically different from MPI. Comparison table:
>
> HPX: Intra-node (threading) and inter-node (distributed); provides extremely fine-grained threading (millions of short-lived threads per node).
> MPI: Only does distributed.

Uh, MPI does intra-node. In fact, the main reason for explicit threads within a node is specifically to enhance sharing, mostly cache. There are viable solutions to network device contention, including explicitly via subcomms and implicitly via lighter-weight MPI ranks (e.g., AMPI).

> HPX: Sends work to data.
> MPI: Sends data to work.

Uh, you run whatever task you want on the data at hand. You _can_ send data, but that is in no way dictated by the MPI model.

> HPX: Provides an active global address space to abstract local memory boundaries across nodes.
> MPI: Forces user code to explicitly perform remote communication.

This is why we have libraries. The utility of a global address space is mostly a philosophical debate. That remote data must be copied to local storage if it will be accessed frequently is not controversial.

> HPX: Hides latencies by overlapping them with our computations.
> MPI: Only option to deal with latencies is latency avoidance.
>
> HPX: Utilizes local synchronization and zealously avoids explicit global barriers, allowing computation to proceed as far as possible without communicating/synchronizing.
> MPI: Strongly emphasizes global barriers.

Bullshit. An explicit barrier (like MPI_Barrier) is almost never necessary or desirable. Indeed, it implies side-channel synchronization, such as via the file system. Reductions like MPI_Allreduce() represent a data dependence with no opportunity for further local work in the majority of existing algorithms, such as Krylov methods and convergence tests. We have a number of new algorithms, such as pipelined GMRES, for which non-blocking collectives (MPI_Iallreduce, MPI_Ibarrier) are important.

> HPX: Supports the transmission of POD data, polymorphic types, functions, and higher-order functions (e.g. functions with bound arguments, etc).
> MPI: Only does POD data.

MPI_Datatypes are much more general.

> HPX: Diverse set of runtime services (builtin, intrinsic instrumentation, error handling facilities, logging, runtime configuration, loadable modules).
> MPI: None of the above.
>
> HPX: Supports dynamic, heuristic load balancing (both at the intra-node and inter-node level).
> MPI: Limited builtin support for static load balancing.
>
> HPX is a general-purpose C++11 runtime system for parallel and distributed computing of any scale (from quad-core tablets to exascale systems). It is freely available under a BSD-style license, and developed by a growing community of international collaborators. It is an integral part of US DoE's exascale software stack, and is supported heavily by the US NSF.
>
> stellar.cct.lsu.edu for benchmarks, papers, and links to the github repository.

I wish you the best in this project, but frankly, I'm unimpressed by the performance obtained in your papers, and it appears that this project has overlooked many of the same library features that GASNet and other PGAS systems have missed.
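[To make Jed's point about non-blocking collectives concrete, here is a minimal MPI-3 sketch (not from the thread) that starts a reduction and overlaps it with local work, which is the pattern pipelined GMRES relies on:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    double      local = 1.0, sum;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    /* start the reduction; control returns immediately */
    MPI_Iallreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    /* ... useful local work goes here, e.g. applying the operator ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* reduction result now available */
    printf("sum = %f\n", sum);
    MPI_Finalize();
    return 0;
  }
]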
[petsc-dev] Current status: GPUs for PETSc
Hi Karli, and others,

On 05/11/2012 01:51, Karl Rupp wrote:

> ... That's it for now; after some more refining I'll start with a careful migration of the code/concepts into PETSc. Comments are, of course, always welcome.

So we're working on FE assembly + solve on GPUs using FEniCS kernels (github.com/OP2/PyOP2). For the GPU solve, it would be nice if we could backdoor assembled matrices straight onto the GPU. That is, create a Mat saying "this is the sparsity pattern" and then, rather than calling MatSetValues on the host, just pass a pointer to the device data.

At the moment, we're doing a similar thing using CUSP, but we are looking at doing multi-GPU assembly + solve and would like not to have to reinvent too many wheels, in particular the MPI-parallel layer. Additionally, we're already using PETSc for the CPU-side linear algebra, so it would be nice to use the same interface everywhere.

I guess effectively we'd like something like MatCreateSeqAIJWithArrays and MatCreateMPIAIJWithSplitArrays, but with the ability to pass device pointers rather than host pointers. Is there any roadmap in PETSc for this kind of thing? Would patches in this direction be welcome?

Cheers,
Lawrence

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
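[For reference, the existing host-pointer routine Lawrence mentions wraps CSR arrays that already live in host memory (signature as in petsc-3.3); the request is for a device-pointer analogue of this:

  PetscInt    rowptr[] = {0, 2, 3};        /* CSR row offsets for a 2x2 matrix */
  PetscInt    colind[] = {0, 1, 1};        /* CSR column indices               */
  PetscScalar vals[]   = {2.0, -1.0, 1.0}; /* CSR values (host memory)         */
  Mat         A;

  MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, 2, 2, rowptr, colind, vals, &A);
]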
[petsc-dev] Current status: GPUs for PETSc
On Mon, Nov 5, 2012 at 6:27 AM, Lawrence Mitchell <lawrence.mitchell at ed.ac.uk> wrote:

> So we're working on FE assembly + solve on GPUs using FEniCS kernels (github.com/OP2/PyOP2). For the GPU solve, it would be nice if we could backdoor assembled matrices straight onto the GPU. That is, create a Mat saying "this is the sparsity pattern" and then, rather than calling MatSetValues on the host, just pass a pointer to the device data.

Yes, I was doing the same thing. GPU assembly is not really any faster than CPU, so I did not publish it, but not having to transfer the values back is very nice. In Karl's scheme it's easy, since you just go in and get the device pointer from the handle. We can talk about a user-level interface after you verify that this works and is what you want.

   Matt

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
[petsc-dev] Current status: GPUs for PETSc
Hi Lawrence,

> So we're working on FE assembly + solve on GPUs using FEniCS kernels (github.com/OP2/PyOP2). For the GPU solve, it would be nice if we could backdoor assembled matrices straight onto the GPU. That is, create a Mat saying "this is the sparsity pattern" and then, rather than calling MatSetValues on the host, just pass a pointer to the device data.

Thanks for the input. My reference implementation supports this kind of backdooring, so there is no conceptual problem with that. What I don't know yet is 'The Right Way' of integrating this functionality into the existing PETSc interface routines. Anyhow, I see this as an essential feature, so it's on my roadmap already.

> At the moment, we're doing a similar thing using CUSP, but we are looking at doing multi-GPU assembly + solve and would like not to have to reinvent too many wheels, in particular the MPI-parallel layer. Additionally, we're already using PETSc for the CPU-side linear algebra, so it would be nice to use the same interface everywhere.

Yes, that's what we are aiming for. The existing MPI layer just works, irrespective of whether you're dealing with CPUs or GPUs on each rank.

> I guess effectively we'd like something like MatCreateSeqAIJWithArrays and MatCreateMPIAIJWithSplitArrays, but with the ability to pass device pointers rather than host pointers. Is there any roadmap in PETSc for this kind of thing? Would patches in this direction be welcome?

Type safety is a bit nasty: CUDA deals with plain 'void *', while OpenCL expects cl_mem. This suggests something like MatCreateSeqAIJWithCUDAArrays() and MatCreateSeqAIJWithOpenCLArrays(), but as I said above, I haven't come to a decision on that yet. I'm not aware of any roadmap for the GPU part, but I want to integrate this sooner rather than later. Patches are of course welcome, either for the current branch or later on based on the refurbished GPU extensions.

Best regards,
Karli
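[The type-safety contrast can be sketched with the two hypothetical constructors Karl names; neither exists in PETSc, and cl_mem comes from <CL/cl.h>:

  /* Hypothetical device-pointer constructors from the discussion.      */
  /* CUDA device memory is addressed with raw pointers ...              */
  PetscErrorCode MatCreateSeqAIJWithCUDAArrays(MPI_Comm comm, PetscInt m, PetscInt n,
                                               PetscInt *i, PetscInt *j, PetscScalar *a,
                                               Mat *A);
  /* ... while OpenCL wraps buffers in the opaque cl_mem handle, so the */
  /* same prototype cannot be reused without losing type safety.       */
  PetscErrorCode MatCreateSeqAIJWithOpenCLArrays(MPI_Comm comm, PetscInt m, PetscInt n,
                                                 cl_mem i, cl_mem j, cl_mem a,
                                                 Mat *A);
]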
[petsc-dev] Current status: GPUs for PETSc
On Mon, Nov 5, 2012 at 9:06 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Type safety is a bit nasty: CUDA deals with plain 'void *', while OpenCL expects cl_mem. This suggests something like MatCreateSeqAIJWithCUDAArrays() and MatCreateSeqAIJWithOpenCLArrays(), but as I said above, I haven't come to a decision on that yet.

Let me be more specific. I would not support this. I think it is wrong. You should create the Mat in the normal way and then pull out the backend storage. This way we have one simple interface for creation and preallocation, and eventually we make a nicer FEM interface to cover up the device pointer extraction.

   Matt
[petsc-dev] Current status: GPUs for PETSc
On 05/11/2012 14:00, Matthew Knepley wrote:

> Yes, I was doing the same thing. GPU assembly is not really any faster than CPU, so I did not publish it, but not having to transfer the values back is very nice.

Our experience is that we can outperform Dolfin's CPU assembly by a reasonable amount, so for us it does seem worthwhile. But given that we're also able to generate CPU assembly + solve, it would be nice to use PETSc in all cases. And of course, even in the case where GPU and CPU assembly are the same speed, if the GPU solve goes faster it's nice to be able to avoid all the host-device transfers that CPU assembly necessitates for nonlinear systems.

> In Karl's scheme it's easy, since you just go in and get the device pointer from the handle. We can talk about a user-level interface after you verify that this works and is what you want.

That should be enough, I think. In the MPI case, I guess it means providing the relevant data structure for the two MatSeq (diagonal and off-diagonal) bits.

Lawrence
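[Those two blocks are already reachable on the host side; a minimal sketch using MatMPIAIJGetSeqAIJ, an existing PETSc routine, from which per-block device storage would then be extracted:

  Mat            Ad, Ao;   /* on-diagonal and off-diagonal SeqAIJ blocks     */
  const PetscInt *colmap;  /* maps Ao's compressed columns to global columns */

  MatMPIAIJGetSeqAIJ(A, &Ad, &Ao, &colmap);
  /* a GPU backend would extract device pointers from Ad and Ao separately */
]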
[petsc-dev] Current status: GPUs for PETSc
On 05/11/2012 14:10, Matthew Knepley wrote:

>> Type safety is a bit nasty: CUDA deals with plain 'void *', while OpenCL expects cl_mem. This suggests something like MatCreateSeqAIJWithCUDAArrays() and MatCreateSeqAIJWithOpenCLArrays(), but as I said above, I haven't come to a decision on that yet.
>
> Let me be more specific. I would not support this. I think it is wrong. You should create the Mat in the normal way and then pull out the backend storage. This way we have one simple interface for creation and preallocation, and eventually we make a nicer FEM interface to cover up the device pointer extraction.

So something like:

  MatCreate(m, ...);
  MatSetType(m, PETSC_GPU_TYPE);
  MatSetPreallocation(m, ...);   // now the device data is allocated
  MatSetColumnIndices(m, ...);   // now the sparsity pattern is set up
  MatGetHandle(m, handle);
  handle->device_data;           // pointer we can pass to external assembly routine

?

Cheers,
Lawrence
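[Filled in with the concrete petsc-3.3 calls this sketch would map to; MATSEQAIJCUSP is the real GPU matrix type, while MatGetHandle and device_data remain hypothetical names from the discussion:

  Mat A;
  MatCreate(PETSC_COMM_SELF, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, n);
  MatSetType(A, MATSEQAIJCUSP);                 /* GPU (CUSP) storage backend   */
  MatSeqAIJSetPreallocation(A, 0, nnz_per_row); /* allocates the device arrays  */
  /* hypothetical extraction step, per the discussion: */
  MatGetHandle(A, &handle);                     /* not a released PETSc API     */
  assemble_on_device(handle->device_data);      /* external GPU assembly kernel */
]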
[petsc-dev] Current status: GPUs for PETSc
On Mon, Nov 5, 2012 at 9:17 AM, Lawrence Mitchell <lawrence.mitchell at ed.ac.uk> wrote:

> So something like:
>
>   MatCreate(m, ...);
>   MatSetType(m, PETSC_GPU_TYPE);
>   MatSetPreallocation(m, ...);   // now the device data is allocated
>   MatSetColumnIndices(m, ...);

This step is unnecessary.

   Matt

>   // now the sparsity pattern is set up
>   MatGetHandle(m, handle);
>   handle->device_data;           // pointer we can pass to external assembly routine
>
> ?
[petsc-dev] Current status: GPUs for PETSc
On 05/11/2012 14:18, Matthew Knepley wrote:

>> MatSetColumnIndices(m, ...);
>
> This step is unnecessary.

How else is the matrix going to know what the column indices of the data are? I'm envisaging not calling MatSetValues on the device, so afaict there's no other way for PETSc to know.

Lawrence
[petsc-dev] Current status: GPUs for PETSc
Hi Matt,

> Let me be more specific. I would not support this. I think it is wrong. You should create the Mat in the normal way and then pull out the backend storage. This way we have one simple interface for creation and preallocation, and eventually we make a nicer FEM interface to cover up the device pointer extraction.

Thanks for the input; I also prefer the 'one simple interface' approach.

Best regards,
Karli
[petsc-dev] Current status: GPUs for PETSc
On 11/05/2012 08:23 AM, Lawrence Mitchell wrote:

>>> MatSetColumnIndices(m, ...);
>>
>> This step is unnecessary.
>
> How else is the matrix going to know what the column indices of the data are? I'm envisaging not calling MatSetValues on the device, so afaict there's no other way for PETSc to know.

I think the SetColumnIndices() step can possibly be dealt with in the same way as the entries, e.g.

  handle1->column_data = your_col_data_array;
  handle2->data        = your_matrix_entries_array;

because you probably don't want to use the CPU-bound MatSetColumnIndices() for data that already resides on the GPU.

Best regards,
Karli