[petsc-dev] MatZeroRowsColumnsStencil does not have a fortran binding

2012-11-05 Thread Blaise A Bourdin

On Nov 4, 2012, at 1:28 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

Pushed to 3.3.
Thanks

In the future, can you make patches by committing locally and using hg export 
(or even hg email)? This patch had DOS-style formatting that needed to be 
converted before applying.

Weird... the patch I sent shows up with unix (LF) line endings in BBEdit. It 
looks like LSU's Exchange server is doing something funky with line endings.

Blaise


http://petsc.cs.iit.edu/petsc/releases/petsc-3.3/rev/ca8b914ed071


On Sun, Nov 4, 2012 at 12:52 PM, Blaise A Bourdin <bourdin at lsu.edu> wrote:
Hi,

Right now, MatZeroRowsStencil has a fortran binding, but 
MatZeroRowsColumnsStencil does not.
Is it possible to add the binding for MatZeroRowsColumnsStencil in petsc-3.3 
and petsc-dev?

I am attaching a patch for 3.3

Blaise


--
Department of Mathematics and Center for Computation & Technology
Louisiana State University, Baton Rouge, LA 70803, USA
Tel. +1 (225) 578 1612, Fax +1 (225) 578 4276, http://www.math.lsu.edu/~bourdin









--
Department of Mathematics and Center for Computation & Technology
Louisiana State University, Baton Rouge, LA 70803, USA
Tel. +1 (225) 578 1612, Fax +1 (225) 578 4276, http://www.math.lsu.edu/~bourdin









[petsc-dev] OpenMPI is a menace

2012-11-05 Thread C. Bergström
On 11/ 5/12 10:13 AM, Matthew Knepley wrote:
 I am tired of wasting time on their stupid bugs which we report but 
 are never fixed. Is there a way to retaliate without grenades?
Switch over to HPX?
http://stellar.cct.lsu.edu/tag/hpx/

I'm not sure it's at a point where it can handle petsc, but keep it in 
mind for 2013.  I'm cc'ing one of the devs who is probably not 
subscribed, but can answer questions.  (bandwidth permitting and maybe 
delayed until @SC12 or after)
---
Option 2 - We (PathScale) have been considering taking on shipping a 
supported version of OpenMPI for some time.  If anyone would be 
interested in add-on support, or just paying a bounty to fix bugs, we 
may be able to work it out.  (Not the perfect open source answer, but 
maybe it's better (cheaper?) than grenades...)

Which bugs are you specifically interested in getting fixed?



[petsc-dev] OpenMPI is a menace

2012-11-05 Thread C. Bergström
On 11/ 5/12 10:48 AM, Matthew Knepley wrote:

 On Sun, Nov 4, 2012 at 10:38 PM, C. Bergström 
 <cbergstrom at pathscale.com> wrote:


 Which bugs are you specifically interested in getting fixed?


 When they install to a non-standard location, they do not add RPATH 
 support for the
 library link in their compiler wrappers.
(That's really it?  You're going to throw a grenade over this?)

I just pulled the svn nightly tarball of OpenMPI and I'm wondering if one of 
the options below could be used to help.  Have you tried them and found them 
buggy?  They seem to propagate to the wrapper scripts:

   --with-wrapper-ldflags  Extra flags to add to LDFLAGS when using wrapper compilers
   --with-wrapper-libs     Extra flags to add to LIBS when using wrapper compilers



[petsc-dev] OpenMPI is a menace

2012-11-05 Thread C. Bergström
On 11/ 5/12 10:55 AM, Jed Brown wrote:
 I just bumped the thread on this over on their mailing list.

 Educating users about LD_LIBRARY_PATH is a losing battle, in my opinion.
(I wholly agree, and I wish -rpath matched Solaris -R behavior.  Preloading 
can be a slightly better alternative in some cases.)  I'm still not getting 
exactly what's being done:

What --prefix= is used to build OpenMPI?
Where is it installed?
Can the installed location be known at compile time?


[petsc-dev] OpenMPI is a menace

2012-11-05 Thread C. Bergström
On 11/ 5/12 11:02 AM, Matthew Knepley wrote:


 These don't solve the problem. Check the mailing list.

Please forward the emails/thread to me so I can catch up.  I'll honestly 
try to resolve this if I can.


[petsc-dev] OpenMPI is a menace

2012-11-05 Thread C. Bergström
On 11/ 5/12 12:21 PM, Bryce Lelbach wrote:

 When they install to a non-standard location, they do not add RPATH support
 for the
 library link in their compiler wrappers.
Bingo - and users aren't setting 
--with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib
-
IMHO it's a bug, and the rpath should be considered part of the minimal flags. 
Hopefully I can get Jeff's patch, or it will get finished/resolved, before SC12.


[petsc-dev] OpenMPI is a menace

2012-11-05 Thread Jed Brown
On Sun, Nov 4, 2012 at 11:21 PM, Bryce Lelbach <blelbach at cct.lsu.edu> wrote:

 HPX is drastically different from MPI. Comparison table:

 HPX: Intra-node (threading) and inter-node (distributed); provides
 extremely fine-grained threading (millions of short-lived threads per node)
 MPI: Only does distributed.


Uh, MPI does intra-node. In fact, the main reason for explicit threads
within a node is specifically to enhance sharing, mostly cache. There are
viable solutions to network device contention, including explicitly via
subcomms and implicitly via lighter weight MPI ranks (e.g., AMPI).
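
For instance, restricting collectives to the ranks that share a node is one
explicit-subcomm route; a minimal MPI-3 sketch (MPI_Comm_split_type is new in
MPI-3.0, so availability depends on the implementation):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    MPI_Comm nodecomm;

    MPI_Init(&argc, &argv);
    /* All ranks that can share memory (i.e., live on the same node) land in
       the same nodecomm; collectives on it never touch the network device. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    /* ... use nodecomm for on-node staging/aggregation ... */
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
  }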



 HPX: Sends work to data.
 MPI: Sends data to work.


Uh, you run whatever task you want on the data at hand. You _can_ send
data, but that is in no way dictated by the MPI model.



 HPX: Provides an active global address space to abstract local memory
 boundaries across nodes.
 MPI: Forces user code to explicitly perform remote communication.


This is why we have libraries. The utility of a global address space is
mostly a philosophical debate. That remote data must be copied to local
storage if it will be accessed frequently is not controversial.



 HPX: Hides latencies by overlapping them with other computations.
 MPI: Only option to deal with latencies is latency avoidance.

 HPX: Utilizes local synchronization and zealously avoids explicit global
 barriers, allowing computation to proceed as far as possible without
 communicating/synchronizing.
 MPI: Strongly emphasizes global barriers.


Bullshit. An explicit barrier (like MPI_Barrier) is almost never necessary
or desirable. Indeed, it implies side-channel synchronization, such as via
the file system. Reductions like MPI_Allreduce() represent a data
dependence with no opportunity for further local work in the majority of
existing algorithms, such as Krylov methods and convergence tests. We have
a number of new algorithms, such as pipelined GMRES, for which non-blocking
collectives (MPI_Iallreduce, MPI_Ibarrier) are important.
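
As a minimal sketch of the kind of overlap this enables (the local-work
placeholder is illustrative, not PETSc code):

  #include <mpi.h>

  /* Start a non-blocking reduction for a dot product, then do independent
     local work (e.g. an SpMV) before the global value is actually needed. */
  void overlapped_dot(double local_dot, double *global_dot, MPI_Comm comm)
  {
    MPI_Request req;
    MPI_Iallreduce(&local_dot, global_dot, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
    /* ... independent local work goes here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* *global_dot is now valid */
  }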



 HPX: Supports the transmission of POD data, polymorphic types, functions
 and higher order functions (e.g. functions with bound arguments, etc).
 MPI: Only does POD data.


MPI_Datatypes are much more general.
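
For example, a derived datatype can describe an entire C struct, padding
included, so it can be sent natively rather than packed by hand; a minimal
sketch (the Particle type is just an illustration):

  #include <mpi.h>
  #include <stddef.h>

  typedef struct { double coords[3]; int id; } Particle;

  /* Describe the layout of Particle to MPI once; afterwards Particle arrays
     can be sent/received directly, with MPI handling the mixed members and
     any padding. */
  MPI_Datatype particle_type(void)
  {
    MPI_Datatype t;
    MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};
    int          blocklens[2] = {3, 1};
    MPI_Aint     disps[2]     = {offsetof(Particle, coords), offsetof(Particle, id)};

    MPI_Type_create_struct(2, blocklens, disps, types, &t);
    MPI_Type_commit(&t);
    return t;
  }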



 HPX: Diverse set of runtime services (builtin, intrinsic instrumentation,
 error handling facilities, logging, runtime configuration, loadable
 modules).
 MPI: None of the above.

 HPX: Supports dynamic, heuristic load balancing (both at the intra-node
 and inter-node level).
 MPI: Limited builtin support for static load balancing.

 HPX is a general-purpose C++11 runtime system for parallel and distributed
 computing
 of any scale (from quad-core tablets to exascale systems). It is freely
 available
 under a BSD-style license, and developed by a growing community of
 international
 collaborators. It is an integral part of US DoE's exascale software stack,
 and is
 supported heavily by the US NSF.

 stellar.cct.lsu.edu for benchmarks, papers and links to the github
 repository.


I wish you the best in this project, but frankly, I'm unimpressed by the
performance obtained in your papers and it appears that this project has
overlooked many of the same library features that GASNet and other PGAS
systems have missed.


[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Lawrence Mitchell
Hi Karli, and others,

On 05/11/2012 01:51, Karl Rupp wrote:

...

 That's it for now, after some more refining I'll start with a careful 
 migration of the code/concepts into PETSc. Comments are, of course, 
 always welcome.

So we're working on FE assembly + solve on GPUs using FEniCS kernels
(github.com/OP2/PyOP2).  For the GPU solve, it would be nice if we could
backdoor assembled matrices straight onto the GPU.  That is, create a Mat
saying this is the sparsity pattern and then, rather than calling
MatSetValues on the host, just pass a pointer to the device data.

At the moment, we're doing a similar thing using CUSP, but are looking at
doing multi-GPU assembly + solve and would like not to have to reinvent too
many wheels, in particular, the MPI-parallel layer.  Additionally, we're
already using PETSc for the CPU-side linear algebra so it would be nice to
use the same interface everywhere.

I guess effectively we'd like something like MatCreateSeqAIJWithArrays and
MatCreateMPIAIJWithSplitArrays but with the ability to pass device pointers
rather than host pointers.  Is there any roadmap in PETSc for this kind of
thing?  Would patches in this direction be welcome?
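
For reference, the existing host-pointer path looks roughly like this (a
minimal sketch around the documented MatCreateSeqAIJWithArrays call; the
WrapHostCSR helper is hypothetical, and a device variant would take the same
(i, j, a) CSR triplet, just living in GPU memory):

  #include <petscmat.h>

  /* Wrap CSR arrays we assembled ourselves into a Mat without copying. */
  PetscErrorCode WrapHostCSR(PetscInt m, PetscInt n, PetscInt *i, PetscInt *j,
                             PetscScalar *a, Mat *A)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, m, n, i, j, a, A);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }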

Cheers,
Lawrence

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Matthew Knepley
On Mon, Nov 5, 2012 at 6:27 AM, Lawrence Mitchell 
lawrence.mitchell at ed.ac.uk wrote:

 Hi Karli, and others,

 On 05/11/2012 01:51, Karl Rupp wrote:

 ...

  That's it for now, after some more refining I'll start with a careful
  migration of the code/concepts into PETSc. Comments are, of course,
  always welcome.

 So we're working on FE assembly + solve on GPUs using fenics kernels
 (github.com/OP2/PyOP2).  For the GPU solve, it would be nice if we could
 backdoor assembled matrices straight on to the GPU.  That is, create a Mat
 saying this is the sparsity pattern and then, rather than calling
 MatSetValues on the host, just pass a pointer to the device data.


Yes, I was doing the same thing. GPU assembly is not really any faster than
CPU, so I did not publish it, but not having to transfer the values back is
very nice.

In Karl's scheme it's easy, since you just go in and get the device pointer
from the handle. We can talk about a user-level interface after you verify
that this works and is what you want.

Matt


 At the moment, we're doing a similar thing using CUSP, but are looking at
 doing multi-GPU assembly + solve and would like not to have to reinvent too
 many wheels, in particular, the MPI-parallel layer.  Additionally, we're
 already using PETSc for the CPU-side linear algebra so it would be nice to
 use the same interface everywhere.

 I guess effectively we'd like something like MatCreateSeqAIJWithArrays and
 MatCreateMPIAIJWithSplitArrays but with the ability to pass device pointers
 rather than host pointers.  Is there any roadmap in PETSc for this kind of
 thing?  Would patches in this direction be welcome?

 Cheers,
 Lawrence

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Karl Rupp
Hi Lawrence,


 That's it for now, after some more refining I'll start with a careful
 migration of the code/concepts into PETSc. Comments are, of course,
 always welcome.

 So we're working on FE assembly + solve on GPUs using fenics kernels
 (github.com/OP2/PyOP2).  For the GPU solve, it would be nice if we could
 backdoor assembled matrices straight on to the GPU.  That is, create a Mat
 saying this is the sparsity pattern and then, rather than calling
 MatSetValues on the host, just pass a pointer to the device data.

Thanks for the input. My reference implementation supports this kind of 
backdooring, so there is no conceptual problem with it. What I don't 
know yet is 'The Right Way' of integrating this functionality into the 
existing PETSc interface routines. Anyhow, I see this as an essential 
feature, so it's on my roadmap already.


 At the moment, we're doing a similar thing using CUSP, but are looking at
 doing multi-GPU assembly + solve and would like not to have to reinvent too
 many wheels, in particular, the MPI-parallel layer.  Additionally, we're
 already using PETSc for the CPU-side linear algebra so it would be nice to
 use the same interface everywhere.

Yes, that's what we are aiming for. The existing MPI-layer just works 
well irrespective of whether you're dealing with CPUs or GPUs on each rank.


 I guess effectively we'd like something like MatCreateSeqAIJWithArrays and
 MatCreateMPIAIJWithSplitArrays but with the ability to pass device pointers
 rather than host pointers.  Is there any roadmap in PETSc for this kind of
 thing?  Would patches in this direction be welcome?

Type safety is a bit nasty. CUDA allows one to deal with plain 'void *', 
while OpenCL expects cl_mem. This suggests using something like
   MatCreateSeqAIJWithCUDAArrays(),
   MatCreateSeqAIJWithOpenCLArrays(),
but as I said above, I haven't come to a decision on that yet.

I'm not aware of any roadmap on the GPU part, but I want to integrate 
this sooner rather than later. Patches are of course welcome, either for 
the current branch, or later on based on the refurbished GPU extensions.

Best regards,
Karli



[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Matthew Knepley
On Mon, Nov 5, 2012 at 9:06 AM, Karl Rupp rupp at mcs.anl.gov wrote:

 Hi Lawrence,



  That's it for now, after some more refining I'll start with a careful
 migration of the code/concepts into PETSc. Comments are, of course,
 always welcome.


 So we're working on FE assembly + solve on GPUs using fenics kernels
 (github.com/OP2/PyOP2).  For the GPU solve, it would be nice if we could
 backdoor assembled matrices straight on to the GPU.  That is, create a Mat
 saying this is the sparsity pattern and then, rather than calling
 MatSetValues on the host, just pass a pointer to the device data.


 Thanks for the input. My reference implementation supports such kind of
 backdooring, so there is no conceptional problem with that. What I don't
 know yet is 'The Right Way' of integrating this functionality into the
 existing PETSc interface routines. Anyhow, I see this as an essential
 feature, so it's on my roadmap already.



  At the moment, we're doing a similar thing using CUSP, but are looking at
 doing multi-GPU assembly + solve and would like not to have to reinvent
 too
 many wheels, in particular, the MPI-parallel layer.  Additionally, we're
 already using PETSc for the CPU-side linear algebra so it would be nice to
 use the same interface everywhere.


 Yes, that's what we are aiming for. The existing MPI-layer just works well
 irrespective of whether you're dealing with CPUs or GPUs on each rank.



  I guess effectively we'd like something like MatCreateSeqAIJWithArrays and
 MatCreateMPIAIJWithSplitArrays but with the ability to pass device
 pointers
 rather than host pointers.  Is there any roadmap in PETSc for this kind of
 thing?  Would patches in this direction be welcome?


 Type safety is a bit nasty. CUDA allows to deal with plain 'void *', while
 OpenCL expects cl_mem. This suggests to use something like
   MatCreateSeqAIJWithCUDAArrays(),
   MatCreateSeqAIJWithOpenCLArrays(),
 but as I said above, I haven't come to a decision on that yet.


Let me be more specific. I would not support this. I think it is wrong.

You should create the Mat in the normal way and then pull out the backend
storage. This way we have one simple interface for creation and
preallocation, and eventually we make a nicer FEM interface to cover up
the device pointer extraction.
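
Roughly something like the following, assuming a petsc-3.3 build with the
CUSP-backed type (a sketch only; the device-pointer extraction at the end is
exactly the part that has no public interface yet):

  #include <petscmat.h>

  PetscErrorCode CreateGPUMat(PetscInt m, PetscInt n, const PetscInt *nnz, Mat *A)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatCreate(PETSC_COMM_SELF, A);CHKERRQ(ierr);
    ierr = MatSetSizes(*A, m, n, m, n);CHKERRQ(ierr);
    ierr = MatSetType(*A, MATSEQAIJCUSP);CHKERRQ(ierr);          /* GPU-backed AIJ */
    ierr = MatSeqAIJSetPreallocation(*A, 0, nnz);CHKERRQ(ierr);  /* sparsity/preallocation */
    /* ... then reach into the backend storage for the device pointer,
       via whatever interface comes out of this discussion ... */
    PetscFunctionReturn(0);
  }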

   Matt


 I'm not aware of any roadmap on the GPU part, but I want to integrate this
 rather sooner than later. Patches are of course welcome, either for the
 current branch, or later on based on the refurbished GPU extensions.

 Best regards,
 Karli




-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Lawrence Mitchell
On 05/11/2012 14:00, Matthew Knepley wrote:
 
 On Mon, Nov 5, 2012 at 6:27 AM, Lawrence Mitchell
 <lawrence.mitchell at ed.ac.uk> wrote:
 
 Hi Karli, and others,
 
 On 05/11/2012 01:51, Karl Rupp wrote:
 
 ...
 
  That's it for now, after some more refining I'll start with a careful
  migration of the code/concepts into PETSc. Comments are, of course,
  always welcome.
 
 So we're working on FE assembly + solve on GPUs using fenics kernels
 (github.com/OP2/PyOP2).  For the GPU
 solve, it would be nice if we could
 backdoor assembled matrices straight on to the GPU.  That is, create a Mat
 saying this is the sparsity pattern and then, rather than calling
 MatSetValues on the host, just pass a pointer to the device data.
 
 
 Yes, I was doing the same thing. GPU assembly is not really any faster than 
 CPU,
 so I did not publish it, but not having to transfer the values back is very
 nice.

Our experience is that we can outperform Dolfin's CPU assembly by a
reasonable amount, so for us it does seem worthwhile.  But given that we're
also able to generate CPU assembly+solve, it would be nice to use PETSc in
all cases.  And of course, even in the case where GPU and CPU assembly are
the same speed, if the GPU solve goes faster it's nice to be able to avoid
all the host-device transfers that CPU assembly necessitates for non-linear
systems.

 In Karl's scheme its easy, since you just go in and get the device pointer
 from the
 handle. We can talk about a user level interface after you verify that this
 works and
 is what you want.

That should be enough I think.  In the MPI case I guess it means providing
the relevant data structure for the two MatSeq (diagonal and off-diagonal) bits.
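
Something along these lines, assuming the existing MatMPIAIJGetSeqAIJ
accessor (a sketch, not a worked-out interface):

  #include <petscmat.h>

  /* Fetch the per-process pieces of an MPIAIJ matrix: the diagonal block Ad,
     the off-diagonal block Ao, and the map from Ao's local columns to global
     columns.  These are the two SeqAIJ "bits" an external assembly routine
     would need to fill. */
  PetscErrorCode GetLocalBlocks(Mat A, Mat *Ad, Mat *Ao, const PetscInt **colmap)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatMPIAIJGetSeqAIJ(A, Ad, Ao, colmap);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }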

Lawrence

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Lawrence Mitchell
On 05/11/2012 14:10, Matthew Knepley wrote:
 
 Type safety is a bit nasty. CUDA allows to deal with plain 'void *',
 while OpenCL expects cl_mem. This suggests to use something like
   MatCreateSeqAIJWithCUDAArrays(),
   MatCreateSeqAIJWithOpenCLArrays(),
 but as I said above, I haven't come to a decision on that yet.
 
 
 Let me be more specific. I would not support this. I think it is wrong.
 
 You should create the Mat in the normal way and then pull out the backend
 storage. This way we have one simple interface for creation and preallocation,
 and eventually we make a nicer FEM interface to cover up the device pointer
 extraction.

So something like:

MatCreate(m, ...);
MatSetType(m, PETSC_GPU_TYPE);
MatSetPreallocation(m, ...);

// now the device data is allocated

MatSetColumnIndices(m, ...);

// now the sparsity pattern is set up

MatGetHandle(m, handle);

handle->device_data; // pointer we can pass to external assembly routine

?

Cheers

Lawrence

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Matthew Knepley
On Mon, Nov 5, 2012 at 9:17 AM, Lawrence Mitchell 
lawrence.mitchell at ed.ac.uk wrote:

 On 05/11/2012 14:10, Matthew Knepley wrote:
 
  Type safety is a bit nasty. CUDA allows to deal with plain 'void *',
  while OpenCL expects cl_mem. This suggests to use something like
    MatCreateSeqAIJWithCUDAArrays(),
    MatCreateSeqAIJWithOpenCLArrays(),
  but as I said above, I haven't come to a decision on that yet.
 
 
  Let me be more specific. I would not support this. I think it is wrong.
 
  You should create the Mat in the normal way and then pull out the backend
  storage. This way we have one simple interface for creation and
 preallocation,
  and eventually we make a nicer FEM interface to cover up the device
 pointer
  extraction.

 So something like:

 MatCreate(m, ...);
 MatSetType(m, PETSC_GPU_TYPE);
 MatSetPreallocation(m, ...);

 // now the device data is allocated

 MatSetColumnIndices(m, ...);


This step is unnecessary.

   Matt


 // now the sparsity pattern is set up

 MatGetHandle(m, handle);

 handle->device_data; // pointer we can pass to external assembly routine

 ?

 Cheers

 Lawrence

 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Lawrence Mitchell
On 05/11/2012 14:18, Matthew Knepley wrote:
 MatSetColumnIndices(m, ...);
 
 This step is unnecessary.

How else is the matrix going to know what the column indices of the data
are?  I'm envisaging not calling MatSetValues on the device, so AFAICT
there's no other way for PETSc to know.

Lawrence

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Karl Rupp
Hi Matt,

  I guess effectively we'd like something like
 MatCreateSeqAIJWithArrays and
 MatCreateMPIAIJWithSplitArrays but with the ability to pass
 device pointers
 rather than host pointers.  Is there any roadmap in PETSc for
 this kind of
 thing?  Would patches in this direction be welcome?


 Type safety is a bit nasty. CUDA allows to deal with plain 'void *',
 while OpenCL expects cl_mem. This suggests to use something like
MatCreateSeqAIJWithCUDAArrays(__),
MatCreateSeqAIJWithOpenCLArray__s(),
 but as I said above, I haven't come to a decision on that yet.


 Let me be more specific. I would not support this. I think it is wrong.

 You should create the Mat in the normal way and then pull out the backend
 storage. This way we have one simple interface for creation and
 preallocation,
 and eventually we make a nicer FEM interface to cover up the device pointer
 extraction.


Thanks for the input, I also prefer the 'one simple interface' approach.

Best regards,
Karli


[petsc-dev] Current status: GPUs for PETSc

2012-11-05 Thread Karl Rupp

On 11/05/2012 08:23 AM, Lawrence Mitchell wrote:
 On 05/11/2012 14:18, Matthew Knepley wrote:
  MatSetColumnIndices(m, ...);

 This step is unnecessary.

 How else is the matrix going to know what the column of the data is?  I'm
 envisaging not calling MatSetValues on the device, so afaict, there's no
 other way of PETSc knowing.


I think the SetColumnIndices() step can possibly be dealt with in the 
same way as the entries, e.g.
   handle1->column_data = your_col_data_array;
   handle2->data = your_matrix_entries_array;
because you probably don't want to use the CPU-bound 
MatSetColumnIndices() for data that already resides on the GPU.
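
To make the shape of those two arrays concrete, here is a tiny AIJ/CSR
example (the handle field names above are hypothetical, purely illustrative):

  #include <petscmat.h>

  /* The 2x3 matrix
         [ 10   0  20 ]
         [  0  30   0 ]
     stored in AIJ/CSR form: */
  PetscInt    rowptr[] = {0, 2, 3};     /* row i occupies entries [rowptr[i], rowptr[i+1]) */
  PetscInt    colidx[] = {0, 2, 1};     /* column index of each stored entry (column_data) */
  PetscScalar values[] = {10, 20, 30};  /* the nonzero entries themselves (data)           */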

Best regards,
Karli