> I think the construction of the prolongation/restriction operators, the
> local part of the coarse matrix and the assembling (or subassembling) of
> the global coarse matrix should all belong to PCIS (with PCBDDC and PCNN as
> subclasses). In fact, for both PCBDDC and PCNN everything involved in the
> preconditioner application can be viewed as subassembled matrices
> (Prolongation/Restrictions, static condensation also). This would require
> changing the current structure of MATIS to allow for the creation of
> rectangular operators mapping between two different spaces; MATIS creation
> will thus need both LocalToGlobalMapping (row mapping) and
> GlobalToLocalMapping (column mapping) arguments.

How is the column mapping a global-to-local?

Global-to-local is a *bad* thing and best avoided by algorithms.
Fortunately, I don't think you actually mean that; you just mean a column
scatter, which is perfectly scalable.
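
To make the distinction concrete, here is a minimal sketch (plain Python,
not PETSc code) of applying a rectangular subassembled operator using only
local-to-global index lists for both rows and columns. The function name
apply_subassembled and the tiny two-subdomain data are made up for
illustration; the point is only that the column side is a gather through a
local-to-global list, never a global-to-local lookup.

```python
def apply_subassembled(local_mats, row_l2g, col_l2g, x, n_rows):
    """Compute y = sum_i R_i^T A_i C_i x for unassembled local blocks A_i.

    row_l2g[i] / col_l2g[i] list the global row / column index of each
    local row / column of subdomain i (both are local-to-global maps).
    """
    y = [0.0] * n_rows
    for A_i, rows, cols in zip(local_mats, row_l2g, col_l2g):
        # Column scatter: gather the entries of x this subdomain needs.
        x_loc = [x[g] for g in cols]
        # Local dense matvec.
        y_loc = [sum(a * v for a, v in zip(row, x_loc)) for row in A_i]
        # Row scatter with addition: shared rows accumulate contributions.
        for l, g in enumerate(rows):
            y[g] += y_loc[l]
    return y


# Two subdomains, 3 global rows (row 1 is shared), 4 global columns.
local_mats = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
row_l2g = [[0, 1], [1, 2]]
col_l2g = [[0, 1], [2, 3]]
y = apply_subassembled(local_mats, row_l2g, col_l2g, [1.0, 1.0, 1.0, 1.0], 3)
```

Here the shared global row 1 receives contributions from both subdomains,
which is all that "subassembled" means in this sketch.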

> A brand new logic for the PCIS setup that I would like could be
> PCISSetup() /* common to all PCIS methods */
> "PCISDealWithAllLocalStuffNeededByTheSpecificNonOverlappingMethod()"
> PCISCreateRestrictionAndProlongationOperators(pc)
> PCISAssembleLocalCoarseMat(pc)
> PCISCreatePartitionOfCoarseMesh(pc,&partition)
> PCISAssembleGlobalCoarseMat(pc,partition)

So this is starting to look rather similar to multigrid. I think that's a
good thing: there is extra flexibility and consistency in making everything
look like a variant of multigrid.

> "PCISDealWithAllLocalStuffNeededByTheSpecificNonOverlappingMethod()"  will
> contain the construction of the "Neumann" solver (for BDDC, it is actually
> a saddle point problem)

I think what you're talking about is the pinned Neumann problem. Dohrmann
solves this by first dropping the corners to make the remaining patch
non-singular, factoring the result, and then enforcing the integral
constraints with Lagrange multipliers (the Schur complement can be formed
and factored). Although this is practical, especially for 1-1
process-subdomain mapping, I'd rather not hard-code it. I can imagine
having huge multi-process subdomains with many integral constraints as well
as very small subdomains where it's more practical to just factor directly,
perhaps after a change of basis.
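
In formulas, the approach described above amounts to block elimination on
a saddle-point system. The notation is mine and only a sketch: A_{rr} is
assumed to be the subdomain matrix with the corner dofs dropped (hence
nonsingular), C the integral constraints on the remaining dofs, and f, g
generic right-hand sides.

```latex
\begin{bmatrix} A_{rr} & C^T \\ C & 0 \end{bmatrix}
\begin{bmatrix} u \\ \lambda \end{bmatrix}
=
\begin{bmatrix} f \\ g \end{bmatrix},
\qquad
S = C A_{rr}^{-1} C^T,
\qquad
\lambda = S^{-1}\left( C A_{rr}^{-1} f - g \right),
\qquad
u = A_{rr}^{-1}\left( f - C^T \lambda \right).
```

S is small and dense (one row per constraint), which is why forming and
factoring it is practical when there are few constraints per subdomain, and
less attractive when a subdomain carries many of them.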

Anyway, if I'm thinking of these methods as multigrid, we have:

1. composite smoother:
    weighted split of load, optionally harmonically extend with Dirichlet
    problems, solve pinned Neumann problem, balance correction

2. restriction/prolongation:
    formed by solving a series of pinned Neumann problems, Eq 2 of Dohrmann

3. Constructing coarse grid. Some write it as a Schur complement (which I
was thinking of earlier because it's convenient for analysis), but Dohrmann
writes it as PtAP. I realize that it's implemented differently from normal
PtAP because it's done as a local dense operation that gets (perhaps
partially) assembled.
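
As a sanity check on why the local PtAP comes almost for free, here is a
small exact-arithmetic sketch in plain Python (not PETSc code). It assumes
the usual constrained-minimization form, A Phi + C^T Lambda = 0 with
C Phi = I, from which Phi^T A Phi = -Lambda follows. The 3-node 1D Neumann
Laplacian, the choice of the two end nodes as constraints, and all helper
names are made up for illustration.

```python
from fractions import Fraction as F


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with row pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for k in range(n):
        # Pick any row at or below k with a nonzero entry in column k.
        p = next(r for r in range(k, n) if aug[r][k] != 0)
        aug[k], aug[p] = aug[p], aug[k]
        piv = aug[k][k]
        aug[k] = [v / piv for v in aug[k]]
        for r in range(n):
            if r != k and aug[r][k] != 0:
                fac = aug[r][k]
                aug[r] = [vr - fac * vk for vr, vk in zip(aug[r], aug[k])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(c) for c in zip(*X)]


# Singular local Neumann matrix and constraints picking the two end nodes.
A = [[F(1), F(-1), F(0)], [F(-1), F(2), F(-1)], [F(0), F(-1), F(1)]]
C = [[F(1), F(0), F(0)], [F(0), F(0), F(1)]]
n, m = 3, 2

# Saddle-point system [A C^T; C 0] [Phi; Lambda] = [0; I].
K = [A[i] + [C[j][i] for j in range(m)] for i in range(n)]
K += [C[j] + [F(0)] * m for j in range(m)]
rhs = [[F(0)] * m for _ in range(n)]
rhs += [[F(int(i == j)) for j in range(m)] for i in range(m)]

sol = solve(K, rhs)
Phi, Lam = sol[:n], sol[n:]

coarse_ptap = matmul(transpose(Phi), matmul(A, Phi))   # explicit local PtAP
coarse_from_lambda = [[-v for v in row] for row in Lam]  # read off multipliers
```

Both ways give the same local coarse matrix; the second needs no extra
products once the constrained solves are done, and only the assembly of
these small dense blocks is left.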

The coarse grid should be self-similar, so it would reuse the components.

For my variants, I want to be able to insert an overlapping strip
correction to the smoother and to add an operator-dependent adaptive
process to coarse space enrichment. I believe that the smoother above also
makes sense as a one-level method, when combined with some other correction
that allows the coarse points to move. I'd like to eventually factor the
code so that such things are easy.

> For PCBDDC and PCNN:
> PCISCreateRestrictionAndProlongation_NN will create a MATIS representing a
> default P which, in case of a scalar PDE, will be the constant function
> scaled by the partition of unity operator, with global dimensions N x
> sum^N_{i=1}pcis->n with N the number of subdomains and pcis->n the size of
> the local MATIS matrix (local Dirichlet plus interface nodes) (in case of
> more complex vector valued PDEs it will need a MatNearNullSpace object as
> already implemented in BDDC)

This sounds good. We can MatConvert this MATIS to something else if we want
it assembled. As mentioned earlier, I want MATIS to support multiprocess
subdomains as well as multiple subdomains per process.
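
For the scalar case, the default P described in the quoted text can be
sketched in a few lines of plain Python (not PETSc code). I am assuming the
simplest partition of unity, scaling by the inverse of the number of
subdomains sharing each node; other weightings are possible. The name
nn_default_prolongation is hypothetical.

```python
def nn_default_prolongation(l2g, n_global):
    """Local pieces of P: the constant 1 scaled by 1/multiplicity.

    l2g[i] lists the global node index of each local node of subdomain i.
    Returns one local vector per subdomain (one column of P each).
    """
    count = [0] * n_global
    for idx in l2g:
        for g in idx:
            count[g] += 1
    return [[1.0 / count[g] for g in idx] for idx in l2g]


# Two subdomains on 5 global nodes sharing node 2.
l2g = [[0, 1, 2], [2, 3, 4]]
P_local = nn_default_prolongation(l2g, 5)

# Partition-of-unity check: the scattered columns sum to the constant 1.
total = [0.0] * 5
for idx, col in zip(l2g, P_local):
    for g, v in zip(idx, col):
        total[g] += v
```

Each subdomain stores only its own column, so this is naturally an
unassembled (MATIS-like) object until someone asks for it assembled.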

> PCISCreateRestrictionAndProlongation_BDDC: (in case of exact solvers for
> the Dirichlet problems) default P will be of size
> n_coarse_dofs*sum^N_{i=1}pcis->n_B with local matrices of P the actual
> pcbddc->coarse_phi_B
> PCISCreateRestrictionAndProlongation_BDDC: (in case of inexact solvers for
> the Dirichlet problems) default P will be of size
> n_coarse_dofs*sum^N_{i=1}pcis->n with local matrices of P the actual
> pcbddc->coarse_phi_B concatenated with pcbddc->coarse_phi_D
> (pcis->n=pcis->n_B+pcis->n_D)

This sounds okay. The reduced space iteration with exact Dirichlet solvers
bothers me somewhat. We should be able to implement it by running
PCFieldSplit to restrict the inner iteration to the interface problem, but
with our current data structures, that may have thrown away the interior
information that we need.

> PCISAssembleLocalCoarseMat will assemble the sequential matrix
> representing subdomains' contribution to the global coarse matrix (_NN and
> _BDDC cases can be easily written using already existing codes)
> PCISAssembleCoarseMat(pc,IS partition) would then decide how to finally
> assemble the coarse matrix depending on the partition passed in (and
> possibly change the row mapping of the default prolongation operators).
> Does this logic fit what you have in mind?

Yeah, this sounds fine. I guess "local" here means "to the subdomain"
rather than process?

>>>> Isn't PtAP still the right operation, you just have a particular
>>>> implementation that takes advantage of structure?
>>> yes it is, but since it is an expensive operation (P is dense), in
>>> BDDC, once you have solved the local problems to create P, you get the
>>> columns of the coarse matrix almost directly (and at a very low cost). The
>>> latter can be obtained (as it is implemented in the code) as C^T\Lambda
>>> where C is the local sparse matrix of constraints, and \Lambda is a dense
>>> and small matrix of Lagrange multipliers.
>>>> I know you can also assemble B A^{-1} B^T, which is the same thing, and
>>>> maybe we should provide a generic op for that.
>>> What is B? the jump operator?
>> Your C above.
>> I have other algorithms in mind where the interpolants would be
>> constructed somewhat differently. I may need to think a bit about what the
>> right common operation is for that case. I just feel like we may be getting
>> too tightly dependent on the specific BDDC algorithm (which has exponential
>> condition number growth in the number of levels), which we probably want to
>> generalize in the future.
