Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Karl Rupp via petsc-dev




On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
I am using karlrupp/fix-cuda-streams, merged with master, and I get this 
error:


Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1 
printenv']":

Error, invalid argument:  1

My branch mark/fix-cuda-with-gamg-pintocpu seems to work, though I did edit 
the jsrun command; Karl's branch still fails. (SUMMIT was down today, 
so there could have been updates.)


Any suggestions?


Looks very much like a systems issue to me.

Best regards,
Karli
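
For reference, the backslash-escaped spaces in the failing command ("jsrun -g\ 1 -c\ 1 -a\ 1 ...") hint at the escaping issue that the balay/fix-mpiexec-shell-escape branch addresses later in this thread. A minimal Python sketch (an illustration only, not the actual configure code) of how escaping those spaces glues each flag to its value:

    import shlex

    # The command string from the error above, with backslash-escaped spaces.
    cmd = 'jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1 printenv'

    # A POSIX shell (and shlex) treats '\ ' as a literal space inside a token,
    # so each flag is fused with its value and jsrun receives '-g 1' as a
    # single argument -- consistent with "Error, invalid argument:  1".
    print(shlex.split(cmd))
    # -> ['jsrun', '-g 1', '-c 1', '-a 1', '--oversubscribe', '-n', '1', 'printenv']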


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
I double-checked that a clean build of your (master) branch has this error,
but my branch (mark/fix-cuda-with-gamg-pintocpu), which may include stuff
from Barry that is not yet in master, works.

On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

>
>
> On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > I am using karlrupp/fix-cuda-streams, merged with master, and I get this
> > error:
> >
> > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1
> > printenv']":
> > Error, invalid argument:  1
> >
> > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I did edit
> > the jsrun command but Karl's branch still fails. (SUMMIT was down today
> > so there could have been updates).
> >
> > Any suggestions?
>
> Looks very much like a systems issue to me.
>
> Best regards,
> Karli
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Karl Rupp via petsc-dev



I double checked that a clean build of your (master) branch has this 
error by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include 
stuff from Barry that is not yet in master, works.


so did master work recently (i.e. right before my branch got merged)?

Best regards,
Karli





On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:




On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
 > I am using karlrupp/fix-cuda-streams, merged with master, and I
get this
 > error:
 >
 > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1
 > printenv']":
 > Error, invalid argument:  1
 >
 > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
did edit
 > the jsrun command but Karl's branch still fails. (SUMMIT was down
today
 > so there could have been updates).
 >
 > Any suggestions?

Looks very much like a systems issue to me.

Best regards,
Karli



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
Mark,

Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu branch?

Satish

On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> I double checked that a clean build of your (master) branch has this error
> by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include stuff
> from Barry that is not yet in master, works.
> 
> On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> 
> >
> >
> > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > I am using karlrupp/fix-cuda-streams, merged with master, and I get this
> > > error:
> > >
> > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1
> > > printenv']":
> > > Error, invalid argument:  1
> > >
> > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I did edit
> > > the jsrun command but Karl's branch still fails. (SUMMIT was down today
> > > so there could have been updates).
> > >
> > > Any suggestions?
> >
> > Looks very much like a systems issue to me.
> >
> > Best regards,
> > Karli
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
Mark,

Can you try the fix in branch balay/fix-mpiexec-shell-escape and see if it 
works?

Satish

On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:

> Mark,
> 
> Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu branch?
> 
> Satish
> 
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> 
> > I double checked that a clean build of your (master) branch has this error
> > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include stuff
> > from Barry that is not yet in master, works.
> > 
> > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > petsc-dev@mcs.anl.gov> wrote:
> > 
> > >
> > >
> > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > I am using karlrupp/fix-cuda-streams, merged with master, and I get this
> > > > error:
> > > >
> > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n 1
> > > > printenv']":
> > > > Error, invalid argument:  1
> > > >
> > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I did edit
> > > > the jsrun command but Karl's branch still fails. (SUMMIT was down today
> > > > so there could have been updates).
> > > >
> > > > Any suggestions?
> > >
> > > Looks very much like a systems issue to me.
> > >
> > > Best regards,
> > > Karli
> > >
> > 
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 8:51 AM Karl Rupp  wrote:

>
> > I double checked that a clean build of your (master) branch has this
> > error by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > stuff from Barry that is not yet in master, works.
>
> so did master work recently (i.e. right before my branch got merged)?
>

This problem is from master:

10:16 1 (d1fb55d...)|BISECTING ~/petsc-karl$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[0542e31a63bf85c93992c9e34728883db83474ac] Large number of fixes,
optimizations for configure, speeds up the configure
10:18 (0542e31...)|BISECTING ~/petsc-karl$



> Best regards,
> Karli
>
>
>
> >
> > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev
> > <petsc-dev@mcs.anl.gov> wrote:
> >
> >
> >
> > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> >  > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > get this
> >  > error:
> >  >
> >  > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe
> -n 1
> >  > printenv']":
> >  > Error, invalid argument:  1
> >  >
> >  > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did edit
> >  > the jsrun command but Karl's branch still fails. (SUMMIT was down
> > today
> >  > so there could have been updates).
> >  >
> >  > Any suggestions?
> >
> > Looks very much like a systems issue to me.
> >
> > Best regards,
> > Karli
> >
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
Let me know if you still want me to test this fix.

On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish  wrote:

> Mark,
>
> Can you try the fix in branch balay/fix-mpiexec-shell-escape and see if it
> works?
>
> Satish
>
> On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
>
> > Mark,
> >
> > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu branch?
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > I double checked that a clean build of your (master) branch has this
> error
> > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> stuff
> > > from Barry that is not yet in master, works.
> > >
> > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > petsc-dev@mcs.anl.gov> wrote:
> > >
> > > >
> > > >
> > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> get this
> > > > > error:
> > > > >
> > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n
> 1
> > > > > printenv']":
> > > > > Error, invalid argument:  1
> > > > >
> > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I did
> edit
> > > > > the jsrun command but Karl's branch still fails. (SUMMIT was down
> today
> > > > > so there could have been updates).
> > > > >
> > > > > Any suggestions?
> > > >
> > > > Looks very much like a systems issue to me.
> > > >
> > > > Best regards,
> > > > Karli
> > > >
> > >
> >
>
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
Can you retry with the updated balay/fix-mpiexec-shell-escape branch?


The current mpiexec interface/code in PETSc is messy.

It's primarily needed for the test suite. But then you can't easily
run the test suite on machines like Summit.

Also, it assumes the provided mpiexec supports '-n 1'. However, if one
provides a non-standard mpiexec such as --with-mpiexec="jsrun -g 1",
what is the appropriate thing to do here?

And then configure needs to run some binaries for some checks - here
perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
to ncore.] So perhaps mpiexec is required for this purpose on Summit?

And then there is the code that escapes spaces in the path, for
Windows. [But we have to make sure this is not in the code path for a
user-specified --with-mpiexec="jsrun -g 1".]

Satish
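
A minimal sketch of the kind of probe described above (hypothetical, not the actual BuildSystem code): take a user-specified launcher verbatim, try it with '-n 1', and fall back instead of rewriting or escaping it:

    import shlex
    import subprocess

    def probe_mpiexec(mpiexec):
        """Return an argv prefix that launches one rank, or None if nothing works.

        A user-specified value such as 'jsrun -g 1' is split on whitespace and
        used verbatim; no extra escaping is applied to it.
        """
        base = shlex.split(mpiexec)
        for extra in (['-n', '1'], []):   # try '-n 1' first, then no count flag
            try:
                subprocess.run(base + extra + ['printenv'],
                               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                               check=True, timeout=60)
                return base + extra
            except (OSError, subprocess.SubprocessError):
                continue
        return None   # e.g. fall back to --with-batch=1 style behavior

    # Example (assumes it is run inside a job allocation on Summit):
    # probe_mpiexec('jsrun -g 1 -a 1')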

On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> No luck,
> 
> On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish  wrote:
> 
> > Mark,
> >
> > Can you try the fix in branch balay/fix-mpiexec-shell-escape and see if it
> > works?
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> >
> > > Mark,
> > >
> > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu branch?
> > >
> > > Satish
> > >
> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >
> > > > I double checked that a clean build of your (master) branch has this
> > error
> > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > stuff
> > > > from Barry that is not yet in master, works.
> > > >
> > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > > petsc-dev@mcs.anl.gov> wrote:
> > > >
> > > > >
> > > > >
> > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > get this
> > > > > > error:
> > > > > >
> > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe -n
> > 1
> > > > > > printenv']":
> > > > > > Error, invalid argument:  1
> > > > > >
> > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I did
> > edit
> > > > > > the jsrun command but Karl's branch still fails. (SUMMIT was down
> > today
> > > > > > so there could have been updates).
> > > > > >
> > > > > > Any suggestions?
> > > > >
> > > > > Looks very much like a systems issue to me.
> > > > >
> > > > > Best regards,
> > > > > Karli
> > > > >
> > > >
> > >
> >
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 12:44 PM Balay, Satish  wrote:

> Can you retry with updated balay/fix-mpiexec-shell-escape branch?
>
>
> current mpiexec interface/code in petsc is messy.
>
> Its primarily needed for the test suite. But then - you can't easily
> run the test suite on machines like summit.
>
> Also - it assumes mpiexec provided supports '-n 1'. However if one
> provides non-standard mpiexec such as --with-mpiexec="jsrun -g 1" -
> what is the appropriate thing here?
>

jsrun does take -n; it just has other args. I am trying to check whether it
requires them. I thought it did, but let me check.


>
> And then configure needs to run some binaries for some checks - here
> perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
> to ncore]. So perhaps mpiexec is required for this purpose on summit?
>
> And then there is this code to escape spaces in path - for
> windows. [but we have to make sure this is not in code-path for user
> specified --with-mpiexec="jsrun -g 1"
>
> Satish
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > No luck,
> >
> > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
> wrote:
> >
> > > Mark,
> > >
> > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and see
> if it
> > > works?
> > >
> > > Satish
> > >
> > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> > >
> > > > Mark,
> > > >
> > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
> branch?
> > > >
> > > > Satish
> > > >
> > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > > >
> > > > > I double checked that a clean build of your (master) branch has
> this
> > > error
> > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > > stuff
> > > > > from Barry that is not yet in master, works.
> > > > >
> > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > > > petsc-dev@mcs.anl.gov> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > > get this
> > > > > > > error:
> > > > > > >
> > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> --oversubscribe -n
> > > 1
> > > > > > > printenv']":
> > > > > > > Error, invalid argument:  1
> > > > > > >
> > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> did
> > > edit
> > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT was
> down
> > > today
> > > > > > > so there could have been updates).
> > > > > > >
> > > > > > > Any suggestions?
> > > > > >
> > > > > > Looks very much like a systems issue to me.
> > > > > >
> > > > > > Best regards,
> > > > > > Karli
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> On Wed, Sep 25, 2019 at 12:44 PM Balay, Satish  wrote:
> 
> > Can you retry with updated balay/fix-mpiexec-shell-escape branch?
> >
> >
> > current mpiexec interface/code in petsc is messy.
> >
> > Its primarily needed for the test suite. But then - you can't easily
> > run the test suite on machines like summit.
> >
> > Also - it assumes mpiexec provided supports '-n 1'. However if one
> > provides non-standard mpiexec such as --with-mpiexec="jsrun -g 1" -
> > what is the appropriate thing here?
> >
> 
> jsrun does take -n. It just has other args. I am trying to check if it
> requires other args. I thought it did but let me check.

https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/

-n  --nrs   Number of resource sets

Beta2 Change (October 17):
-n was replaced by -nnodes

So it's not the same functionality as 'mpiexec -n'.

Either way - please try the above branch

Satish

> 
> 
> >
> > And then configure needs to run some binaries for some checks - here
> > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
> > to ncore]. So perhaps mpiexec is required for this purpose on summit?
> >
> > And then there is this code to escape spaces in path - for
> > windows. [but we have to make sure this is not in code-path for user
> > specified --with-mpiexec="jsrun -g 1"
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > No luck,
> > >
> > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
> > wrote:
> > >
> > > > Mark,
> > > >
> > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and see
> > if it
> > > > works?
> > > >
> > > > Satish
> > > >
> > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> > > >
> > > > > Mark,
> > > > >
> > > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
> > branch?
> > > > >
> > > > > Satish
> > > > >
> > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > > > >
> > > > > > I double checked that a clean build of your (master) branch has
> > this
> > > > error
> > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > > > stuff
> > > > > > from Barry that is not yet in master, works.
> > > > > >
> > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > > > > petsc-dev@mcs.anl.gov> wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > > > get this
> > > > > > > > error:
> > > > > > > >
> > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> > --oversubscribe -n
> > > > 1
> > > > > > > > printenv']":
> > > > > > > > Error, invalid argument:  1
> > > > > > > >
> > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did
> > > > edit
> > > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT was
> > down
> > > > today
> > > > > > > > so there could have been updates).
> > > > > > > >
> > > > > > > > Any suggestions?
> > > > > > >
> > > > > > > Looks very much like a systems issue to me.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Karli
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mills, Richard Tran via petsc-dev
On 9/25/19 11:38 AM, Mark Adams via petsc-dev wrote:
[...]
> jsrun does take -n. It just has other args. I am trying to check if it
> requires other args. I thought it did but let me check.

https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/

-n  --nrs   Number of resource sets


-n is still supported. There are two versions of everything: one-letter 
options and more explanatory long ones.
Yes, it's supported, but it's a little different than what "-n" usually does in 
mpiexec, where it means the number of processes. For 'jsrun', it means the 
number of resource sets, which is multiplied by the "tasks per resource set" 
specified by "-a" to get the MPI process count. I think if we can specify that 
"-a 1" is part of our "mpiexec", then we should be OK with using -n as PETSc 
normally does.

--Richard
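
To make the resource-set arithmetic concrete, a tiny sketch (only an illustration of the jsrun semantics described above, nothing PETSc-specific):

    def jsrun_mpi_ranks(nrs, tasks_per_rs):
        """MPI process count for 'jsrun -n <nrs> -a <tasks_per_rs> ...'.

        With jsrun, -n counts resource sets rather than ranks; the rank count
        is resource sets times tasks per resource set.
        """
        return nrs * tasks_per_rs

    assert jsrun_mpi_ranks(6, 4) == 24   # 'jsrun -n 6 -a 4' -> 24 ranks
    assert jsrun_mpi_ranks(1, 1) == 1    # 'jsrun -a 1 -n 1' -> 1 rank

This is why pinning "-a 1" into the "mpiexec" string would let -n behave like the usual 'mpiexec -n'.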

In fact they have a nice little tool to visualize layouts, and it gives you the 
command line in this short form, e.g.,

https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=


Beta2 Change (October 17):
-n was be replaced by -nnodes

So its not the same functionality as 'mpiexec -n'

I am still waiting for an interactive shell to test just -n. That really should 
run


Either way - please try the above branch

Satish

>
>
> >
> > And then configure needs to run some binaries for some checks - here
> > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
> > to ncore]. So perhaps mpiexec is required for this purpose on summit?
> >
> > And then there is this code to escape spaces in path - for
> > windows. [but we have to make sure this is not in code-path for user
> > specified --with-mpiexec="jsrun -g 1"
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > No luck,
> > >
> > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
> > > <ba...@mcs.anl.gov>
> > wrote:
> > >
> > > > Mark,
> > > >
> > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and see
> > if it
> > > > works?
> > > >
> > > > Satish
> > > >
> > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> > > >
> > > > > Mark,
> > > > >
> > > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
> > branch?
> > > > >
> > > > > Satish
> > > > >
> > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > > > >
> > > > > > I double checked that a clean build of your (master) branch has
> > this
> > > > error
> > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > > > stuff
> > > > > > from Barry that is not yet in master, works.
> > > > > >
> > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > > > > petsc-dev@mcs.anl.gov> wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > > > get this
> > > > > > > > error:
> > > > > > > >
> > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> > --oversubscribe -n
> > > > 1
> > > > > > > > printenv']":
> > > > > > > > Error, invalid argument:  1
> > > > > > > >
> > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did
> > > > edit
> > > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT was
> > down
> > > > today
> > > > > > > > so there could have been updates).
> > > > > > > >
> > > > > > > > Any suggestions?
> > > > > > >
> > > > > > > Looks very much like a systems issue to me.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Karli
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>




Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
Oh, and I tested the branch and it didn't work. The file was attached.

On Wed, Sep 25, 2019 at 2:38 PM Mark Adams  wrote:

>
>
> On Wed, Sep 25, 2019 at 2:23 PM Balay, Satish  wrote:
>
>> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>>
>> > On Wed, Sep 25, 2019 at 12:44 PM Balay, Satish 
>> wrote:
>> >
>> > > Can you retry with updated balay/fix-mpiexec-shell-escape branch?
>> > >
>> > >
>> > > current mpiexec interface/code in petsc is messy.
>> > >
>> > > Its primarily needed for the test suite. But then - you can't easily
>> > > run the test suite on machines like summit.
>> > >
>> > > Also - it assumes mpiexec provided supports '-n 1'. However if one
>> > > provides non-standard mpiexec such as --with-mpiexec="jsrun -g 1" -
>> > > what is the appropriate thing here?
>> > >
>> >
>> > jsrun does take -n. It just has other args. I am trying to check if it
>> > requires other args. I thought it did but let me check.
>>
>>
>> https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/
>>
>> -n  --nrs   Number of resource sets
>>
>>
> -n is still supported. There are two versions of everything. One letter
> ones and more explanatory ones.
>
> In fact they have a nice little tool to viz layouts and they give you the
> command line with this short form, eg,
>
> https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=
>
>
>
>> Beta2 Change (October 17):
>> -n was be replaced by -nnodes
>>
>> So its not the same functionality as 'mpiexec -n'
>>
>
> I am still waiting for an interactive shell to test just -n. That really
> should run
>
>
>>
>> Either way - please try the above branch
>
>
>> Satish
>>
>> >
>> >
>> > >
>> > > And then configure needs to run some binaries for some checks - here
>> > > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
>> > > to ncore]. So perhaps mpiexec is required for this purpose on summit?
>> > >
>> > > And then there is this code to escape spaces in path - for
>> > > windows. [but we have to make sure this is not in code-path for user
>> > > specified --with-mpiexec="jsrun -g 1"
>> > >
>> > > Satish
>> > >
>> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>> > >
>> > > > No luck,
>> > > >
>> > > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
>> > > wrote:
>> > > >
>> > > > > Mark,
>> > > > >
>> > > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and
>> see
>> > > if it
>> > > > > works?
>> > > > >
>> > > > > Satish
>> > > > >
>> > > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
>> > > > >
>> > > > > > Mark,
>> > > > > >
>> > > > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
>> > > branch?
>> > > > > >
>> > > > > > Satish
>> > > > > >
>> > > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>> > > > > >
>> > > > > > > I double checked that a clean build of your (master) branch
>> has
>> > > this
>> > > > > error
>> > > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may
>> include
>> > > > > stuff
>> > > > > > > from Barry that is not yet in master, works.
>> > > > > > >
>> > > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
>> > > > > > > petsc-dev@mcs.anl.gov> wrote:
>> > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
>> > > > > > > > > I am using karlrupp/fix-cuda-streams, merged with master,
>> and I
>> > > > > get this
>> > > > > > > > > error:
>> > > > > > > > >
>> > > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
>> > > --oversubscribe -n
>> > > > > 1
>> > > > > > > > > printenv']":
>> > > > > > > > > Error, invalid argument:  1
>> > > > > > > > >
>> > > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work
>> but I
>> > > did
>> > > > > edit
>> > > > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT
>> was
>> > > down
>> > > > > today
>> > > > > > > > > so there could have been updates).
>> > > > > > > > >
>> > > > > > > > > Any suggestions?
>> > > > > > > >
>> > > > > > > > Looks very much like a systems issue to me.
>> > > > > > > >
>> > > > > > > > Best regards,
>> > > > > > > > Karli
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> >
>>
>>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
I made changes and asked you to retest with the latest state.

Satish

On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> Oh, and I tested the branch and it didn't work. file was attached.
> 
> On Wed, Sep 25, 2019 at 2:38 PM Mark Adams  wrote:
> 
> >
> >
> > On Wed, Sep 25, 2019 at 2:23 PM Balay, Satish  wrote:
> >
> >> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >>
> >> > On Wed, Sep 25, 2019 at 12:44 PM Balay, Satish 
> >> wrote:
> >> >
> >> > > Can you retry with updated balay/fix-mpiexec-shell-escape branch?
> >> > >
> >> > >
> >> > > current mpiexec interface/code in petsc is messy.
> >> > >
> >> > > Its primarily needed for the test suite. But then - you can't easily
> >> > > run the test suite on machines like summit.
> >> > >
> >> > > Also - it assumes mpiexec provided supports '-n 1'. However if one
> >> > > provides non-standard mpiexec such as --with-mpiexec="jsrun -g 1" -
> >> > > what is the appropriate thing here?
> >> > >
> >> >
> >> > jsrun does take -n. It just has other args. I am trying to check if it
> >> > requires other args. I thought it did but let me check.
> >>
> >>
> >> https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/
> >>
> >> -n  --nrs   Number of resource sets
> >>
> >>
> > -n is still supported. There are two versions of everything. One letter
> > ones and more explanatory ones.
> >
> > In fact they have a nice little tool to viz layouts and they give you the
> > command line with this short form, eg,
> >
> > https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=
> >
> >
> >
> >> Beta2 Change (October 17):
> >> -n was be replaced by -nnodes
> >>
> >> So its not the same functionality as 'mpiexec -n'
> >>
> >
> > I am still waiting for an interactive shell to test just -n. That really
> > should run
> >
> >
> >>
> >> Either way - please try the above branch
> >
> >
> >> Satish
> >>
> >> >
> >> >
> >> > >
> >> > > And then configure needs to run some binaries for some checks - here
> >> > > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
> >> > > to ncore]. So perhaps mpiexec is required for this purpose on summit?
> >> > >
> >> > > And then there is this code to escape spaces in path - for
> >> > > windows. [but we have to make sure this is not in code-path for user
> >> > > specified --with-mpiexec="jsrun -g 1"
> >> > >
> >> > > Satish
> >> > >
> >> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >> > >
> >> > > > No luck,
> >> > > >
> >> > > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
> >> > > wrote:
> >> > > >
> >> > > > > Mark,
> >> > > > >
> >> > > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and
> >> see
> >> > > if it
> >> > > > > works?
> >> > > > >
> >> > > > > Satish
> >> > > > >
> >> > > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> >> > > > >
> >> > > > > > Mark,
> >> > > > > >
> >> > > > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
> >> > > branch?
> >> > > > > >
> >> > > > > > Satish
> >> > > > > >
> >> > > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >> > > > > >
> >> > > > > > > I double checked that a clean build of your (master) branch
> >> has
> >> > > this
> >> > > > > error
> >> > > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may
> >> include
> >> > > > > stuff
> >> > > > > > > from Barry that is not yet in master, works.
> >> > > > > > >
> >> > > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> >> > > > > > > petsc-dev@mcs.anl.gov> wrote:
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> >> > > > > > > > > I am using karlrupp/fix-cuda-streams, merged with master,
> >> and I
> >> > > > > get this
> >> > > > > > > > > error:
> >> > > > > > > > >
> >> > > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> >> > > --oversubscribe -n
> >> > > > > 1
> >> > > > > > > > > printenv']":
> >> > > > > > > > > Error, invalid argument:  1
> >> > > > > > > > >
> >> > > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work
> >> but I
> >> > > did
> >> > > > > edit
> >> > > > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT
> >> was
> >> > > down
> >> > > > > today
> >> > > > > > > > > so there could have been updates).
> >> > > > > > > > >
> >> > > > > > > > > Any suggestions?
> >> > > > > > > >
> >> > > > > > > > Looks very much like a systems issue to me.
> >> > > > > > > >
> >> > > > > > > > Best regards,
> >> > > > > > > > Karli
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> >
> >>
> >>
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
I did test this and sent the log (error).

On Wed, Sep 25, 2019 at 2:58 PM Balay, Satish  wrote:

> I made changes and asked to retest with the latest changes.
>
> Satish
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > Oh, and I tested the branch and it didn't work. file was attached.
> >
> > On Wed, Sep 25, 2019 at 2:38 PM Mark Adams  wrote:
> >
> > >
> > >
> > > On Wed, Sep 25, 2019 at 2:23 PM Balay, Satish 
> wrote:
> > >
> > >> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >>
> > >> > On Wed, Sep 25, 2019 at 12:44 PM Balay, Satish 
> > >> wrote:
> > >> >
> > >> > > Can you retry with updated balay/fix-mpiexec-shell-escape branch?
> > >> > >
> > >> > >
> > >> > > current mpiexec interface/code in petsc is messy.
> > >> > >
> > >> > > Its primarily needed for the test suite. But then - you can't
> easily
> > >> > > run the test suite on machines like summit.
> > >> > >
> > >> > > Also - it assumes mpiexec provided supports '-n 1'. However if one
> > >> > > provides non-standard mpiexec such as --with-mpiexec="jsrun -g 1"
> -
> > >> > > what is the appropriate thing here?
> > >> > >
> > >> >
> > >> > jsrun does take -n. It just has other args. I am trying to check if
> it
> > >> > requires other args. I thought it did but let me check.
> > >>
> > >>
> > >>
> https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/
> > >>
> > >> -n  --nrs   Number of resource sets
> > >>
> > >>
> > > -n is still supported. There are two versions of everything. One letter
> > > ones and more explanatory ones.
> > >
> > > In fact they have a nice little tool to viz layouts and they give you
> the
> > > command line with this short form, eg,
> > >
> > > https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=
> > >
> > >
> > >
> > >> Beta2 Change (October 17):
> > >> -n was be replaced by -nnodes
> > >>
> > >> So its not the same functionality as 'mpiexec -n'
> > >>
> > >
> > > I am still waiting for an interactive shell to test just -n. That
> really
> > > should run
> > >
> > >
> > >>
> > >> Either way - please try the above branch
> > >
> > >
> > >> Satish
> > >>
> > >> >
> > >> >
> > >> > >
> > >> > > And then configure needs to run some binaries for some checks -
> here
> > >> > > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI
> defaults
> > >> > > to ncore]. So perhaps mpiexec is required for this purpose on
> summit?
> > >> > >
> > >> > > And then there is this code to escape spaces in path - for
> > >> > > windows. [but we have to make sure this is not in code-path for
> user
> > >> > > specified --with-mpiexec="jsrun -g 1"
> > >> > >
> > >> > > Satish
> > >> > >
> > >> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >> > >
> > >> > > > No luck,
> > >> > > >
> > >> > > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish <
> ba...@mcs.anl.gov>
> > >> > > wrote:
> > >> > > >
> > >> > > > > Mark,
> > >> > > > >
> > >> > > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape
> and
> > >> see
> > >> > > if it
> > >> > > > > works?
> > >> > > > >
> > >> > > > > Satish
> > >> > > > >
> > >> > > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> > >> > > > >
> > >> > > > > > Mark,
> > >> > > > > >
> > >> > > > > > Can you send configure.log from
> mark/fix-cuda-with-gamg-pintocpu
> > >> > > branch?
> > >> > > > > >
> > >> > > > > > Satish
> > >> > > > > >
> > >> > > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >> > > > > >
> > >> > > > > > > I double checked that a clean build of your (master)
> branch
> > >> has
> > >> > > this
> > >> > > > > error
> > >> > > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may
> > >> include
> > >> > > > > stuff
> > >> > > > > > > from Barry that is not yet in master, works.
> > >> > > > > > >
> > >> > > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > >> > > > > > > petsc-dev@mcs.anl.gov> wrote:
> > >> > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > >> > > > > > > > > I am using karlrupp/fix-cuda-streams, merged with
> master,
> > >> and I
> > >> > > > > get this
> > >> > > > > > > > > error:
> > >> > > > > > > > >
> > >> > > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> > >> > > --oversubscribe -n
> > >> > > > > 1
> > >> > > > > > > > > printenv']":
> > >> > > > > > > > > Error, invalid argument:  1
> > >> > > > > > > > >
> > >> > > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to
> work
> > >> but I
> > >> > > did
> > >> > > > > edit
> > >> > > > > > > > > the jsrun command but Karl's branch still fails.
> (SUMMIT
> > >> was
> > >> > > down
> > >> > > > > today
> > >> > > > > > > > > so there could have been updates).
> > >> > > > > > > > >
> > >> > > > > > > > > Any suggestions?
> > >> > > > > > > >
> > >> > > > > > > > Looks very much like a systems issue to me.
> > >> > > > > > > >
> > >> > > > > > > > Best rega

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
> Yes, it's supported, but it's a little different than what "-n" usually
> does in mpiexec, where it means the number of processes. For 'jsrun', it
> means the number of resource sets, which is multiplied by the "tasks per
> resource set" specified by "-a" to get the MPI process count. I think if we
> can specify that "-a 1" is part of our "mpiexec", then we should be OK with
> using -n as PETSc normally does.
>

jsrun does not run with just -n on SUMMIT. I have found that it works with
adding -g 1.


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> I did test this and sent the log (error).

Mark,

I made more changes - can you retry again and resend the log?

Satish


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
Defined "VERSION_GIT" to ""v3.11.3-2242-gb5e99a5""

This is not the latest state - It should be:

commit cb53a042369fb946804f53931a88b58e10588da1 (HEAD -> 
balay/fix-mpiexec-shell-escape, origin/balay/fix-mpiexec-shell-escape)

Try:

git fetch
git checkout origin/balay/fix-mpiexec-shell-escape

Satish
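
As a quick way to catch this kind of mismatch, a small sketch (a hypothetical helper, not part of PETSc) that compares the revision recorded in configure.log with the checked-out commit:

    import re
    import subprocess

    def configure_log_commit(path='configure.log'):
        """Return the revision configure recorded, e.g. 'v3.11.3-2242-gb5e99a5'."""
        with open(path) as f:
            for line in f:
                m = re.search(r'Defined "VERSION_GIT" to ""([^"]+)""', line)
                if m:
                    return m.group(1)
        return None

    # Both strings should end in the same short hash (here cb53a04, not b5e99a5).
    head = subprocess.run(['git', 'describe', '--always'],
                          capture_output=True, text=True).stdout.strip()
    print(configure_log_commit(), 'vs', head)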

On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> On Wed, Sep 25, 2019 at 4:57 PM Balay, Satish  wrote:
> 
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > I did test this and sent the log (error).
> >
> > Mark,
> >
> > I made more changes - can you retry again - and resend log.
> >
> > Satish
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
I will test this now but 

17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 119 (delta 49), reused 74 (delta 28)
Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
Resolving deltas: 100% (49/49), completed with 1 local objects.
From https://gitlab.com/petsc/petsc
 + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
origin/balay/fix-mpiexec-shell-escape  (forced update)
 + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
origin/jczhang/feature-sf-on-gpu  (forced update)
   cb9de97..f9ff08a  jolivet/fix-error-col-row ->
origin/jolivet/fix-error-col-row
   40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
origin/oanam/jacobf/cell-to-ref-mapping
 + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
origin/stefanozampini/hypre-cuda-rebased  (forced update)
18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
origin/balay/fix-mpiexec-shell-escape
Note: checking out 'origin/balay/fix-mpiexec-shell-escape'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at cb53a04... mpiexec: fix shell escape of path-to-mpiexec only
when using autodected-path. Also spectrum MPI uses OMPI_MAJOR_VERSION etc -
so check if mpiexec supports --oversubscribe - before using it.
18:16 (cb53a04...) ~/petsc-karl$

On Wed, Sep 25, 2019 at 5:58 PM Balay, Satish  wrote:

> Defined "VERSION_GIT" to ""v3.11.3-2242-gb5e99a5""
>
> This is not the latest state - It should be:
>
> commit cb53a042369fb946804f53931a88b58e10588da1 (HEAD ->
> balay/fix-mpiexec-shell-escape, origin/balay/fix-mpiexec-shell-escape)
>
> Try:
>
> git fetch
> git checkout origin/balay/fix-mpiexec-shell-escape
>
> Satish
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > On Wed, Sep 25, 2019 at 4:57 PM Balay, Satish  wrote:
> >
> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >
> > > > I did test this and sent the log (error).
> > >
> > > Mark,
> > >
> > > I made more changes - can you retry again - and resend log.
> > >
> > > Satish
> > >
> >
>
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
> 18:16 (cb53a04...) ~/petsc-karl$

So this is the commit I recommended you test against - and that's what
you have got now. Please go ahead and test.

[Note: the branch was rebased, so 'git pull' won't work - as you can
see from the "(forced update)" message and the '<>' status in the git
prompt on balay/fix-mpiexec-shell-escape. So perhaps it's easier to
deal with it in detached mode, which makes this obvious.]

Satish


On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> I will test this now but 
> 
> 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> remote: Enumerating objects: 119, done.
> remote: Counting objects: 100% (119/119), done.
> remote: Compressing objects: 100% (91/91), done.
> remote: Total 119 (delta 49), reused 74 (delta 28)
> Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> Resolving deltas: 100% (49/49), completed with 1 local objects.
> >From https://gitlab.com/petsc/petsc
>  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> origin/balay/fix-mpiexec-shell-escape  (forced update)
>  + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
> origin/jczhang/feature-sf-on-gpu  (forced update)
>cb9de97..f9ff08a  jolivet/fix-error-col-row ->
> origin/jolivet/fix-error-col-row
>40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
> origin/oanam/jacobf/cell-to-ref-mapping
>  + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
> origin/stefanozampini/hypre-cuda-rebased  (forced update)
> 18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
> origin/balay/fix-mpiexec-shell-escape
> Note: checking out 'origin/balay/fix-mpiexec-shell-escape'.
> 
> You are in 'detached HEAD' state. You can look around, make experimental
> changes and commit them, and you can discard any commits you make in this
> state without impacting any branches by performing another checkout.
> 
> If you want to create a new branch to retain commits you create, you may
> do so (now or later) by using -b with the checkout command again. Example:
> 
>   git checkout -b new_branch_name
> 
> HEAD is now at cb53a04... mpiexec: fix shell escape of path-to-mpiexec only
> when using autodected-path. Also spectrum MPI uses OMPI_MAJOR_VERSION etc -
> so check if mpiexec supports --oversubscribe - before using it.
> 18:16 (cb53a04...) ~/petsc-karl$
> 
> On Wed, Sep 25, 2019 at 5:58 PM Balay, Satish  wrote:
> 
> > Defined "VERSION_GIT" to ""v3.11.3-2242-gb5e99a5""
> >
> > This is not the latest state - It should be:
> >
> > commit cb53a042369fb946804f53931a88b58e10588da1 (HEAD ->
> > balay/fix-mpiexec-shell-escape, origin/balay/fix-mpiexec-shell-escape)
> >
> > Try:
> >
> > git fetch
> > git checkout origin/balay/fix-mpiexec-shell-escape
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > On Wed, Sep 25, 2019 at 4:57 PM Balay, Satish  wrote:
> > >
> > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > > >
> > > > > I did test this and sent the log (error).
> > > >
> > > > Mark,
> > > >
> > > > I made more changes - can you retry again - and resend log.
> > > >
> > > > Satish
> > > >
> > >
> >
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:

> > 18:16 (cb53a04...) ~/petsc-karl$
>
> So this is the commit I recommended you test against - and that's what
> you have got now. Please go ahead and test.
>
>
I sent the log for this. This is the output:

18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py
PETSC_DIR=$PWD
===
 Configuring PETSc to compile on your system

===
===

* WARNING: F77 (set to
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
  use ./configure F77=$F77 if you really want to use that value **


===


===

* WARNING: Using default optimization C flags -O

   You might consider manually setting
optimal optimization flags for your system with

 COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for
examples

 ===


===

* WARNING: You have an older version of Gnu make,
it will work,
but may not support all the
parallel testing options. You can install the
  latest
Gnu make with your package manager, such as brew or macports, or use

the --download-make option to get the latest Gnu make warning
message *

===

  TESTING: configureMPIEXEC from
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)

***
 UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for
details):
---
Unable to run jsrun -g 1 with option "-n 1"
Error: It is only possible to use js commands within a job allocation
unless CSM is running
09-25-2019 18:20:13:224 108023 main: Error initializing RM connection.
Exiting.
***

18:20 1 (cb53a04...) ~/petsc-karl$


> [note: the branch is rebased - so 'git pull' won't work -(as you can
> see from the "(forced update)" message - and '<>' status from git
> prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> deal with in detached mode - which makes this obvious]
>

I got this '<>' and "fixed" it by deleting the branch and pulling it again. I
guess I needed to fetch as well.

Mark


>
> Satish
>
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > I will test this now but 
> >
> > 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> > remote: Enumerating objects: 119, done.
> > remote: Counting objects: 100% (119/119), done.
> > remote: Compressing objects: 100% (91/91), done.
> > remote: Total 119 (delta 49), reused 74 (delta 28)
> > Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> > Resolving deltas: 100% (49/49), completed with 1 local objects.
> > >From https://gitlab.com/petsc/petsc
> >  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> > origin/balay/fix-mpiexec-shell-escape  (forced update)
> >  + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
> > origin/jczhang/feature-sf-on-gpu  (forced update)
> >cb9de97..f9ff08a  jolivet/fix-error-col-row ->
> > origin/jolivet/fix-error-col-row
> >40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
> > origin/oanam/jacobf/cell-to-ref-mapping
> >  + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
> > origin/stefanozampini/hypre-cuda-rebased  (forced update)
> > 18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
> > origin/balay/fix-mpiexec-shell-escape
> > Note: checking out 'origin/balay/fix-mpiexec-shell-escape'.
> >
> > You are in 'detached HEAD' state. You can look around, make experimental
> > changes and commit them, and you can discard any commits you make in this
> > state without impacting any branches by performing another checkout.
> >
> > If you want to create a new branch to retain commits you create, you may
> > do so (now or later) by using -b with the checkout command again.
> Example:
> >
> >   git checkout -b new_branch_name
> >
> > HEAD is now at cb53a04... mpiexec: fix shell escape of path-to-mpiexec
> only
> > when using autodected-path. Also spectrum MPI uses OMPI_MAJOR_VERSION
> etc -
> > so check if m

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
> Unable to run jsrun -g 1 with option "-n 1"
> Error: It is only possible to use js commands within a job allocation
> unless CSM is running


Nope, this is a different error message.

The message suggests you can't run 'jsrun -g 1 -n 1 binary'. Can you try this 
manually and see what you get?

jsrun -g 1 -n 1 printenv

Satish


On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:
> 
> > > 18:16 (cb53a04...) ~/petsc-karl$
> >
> > So this is the commit I recommended you test against - and that's what
> > you have got now. Please go ahead and test.
> >
> >
> I sent the log for this. This is the output:
> 
> 18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py
> PETSC_DIR=$PWD
> ===
>  Configuring PETSc to compile on your system
> 
> ===
> ===
> 
> * WARNING: F77 (set to
> /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
>   use ./configure F77=$F77 if you really want to use that value **
> 
> 
> ===
> 
> 
> ===
> 
> * WARNING: Using default optimization C flags -O
> 
>You might consider manually setting
> optimal optimization flags for your system with
> 
>  COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for
> examples
> 
>  
> ===
> 
> 
> ===
> 
> * WARNING: You have an older version of Gnu make,
> it will work,
> but may not support all the
> parallel testing options. You can install the
>   latest
> Gnu make with your package manager, such as brew or macports, or use
> 
> the --download-make option to get the latest Gnu make warning
> message *
> 
> ===
> 
>   TESTING: configureMPIEXEC from
> config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)
> 
> ***
>  UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for
> details):
> ---
> Unable to run jsrun -g 1 with option "-n 1"
> Error: It is only possible to use js commands within a job allocation
> unless CSM is running
> 09-25-2019 18:20:13:224 108023 main: Error initializing RM connection.
> Exiting.
> ***
> 
> 18:20 1 (cb53a04...) ~/petsc-karl$
> 
> 
> > [note: the branch is rebased - so 'git pull' won't work -(as you can
> > see from the "(forced update)" message - and '<>' status from git
> > prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> > deal with in detached mode - which makes this obvious]
> >
> 
> I got this <> and "fixed" it by deleting the branch and repulling it. I
> guess I needed to fetch also.
> 
> Mark
> 
> 
> >
> > Satish
> >
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > I will test this now but 
> > >
> > > 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> > > remote: Enumerating objects: 119, done.
> > > remote: Counting objects: 100% (119/119), done.
> > > remote: Compressing objects: 100% (91/91), done.
> > > remote: Total 119 (delta 49), reused 74 (delta 28)
> > > Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> > > Resolving deltas: 100% (49/49), completed with 1 local objects.
> > > >From https://gitlab.com/petsc/petsc
> > >  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> > > origin/balay/fix-mpiexec-shell-escape  (forced update)
> > >  + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
> > > origin/jczhang/feature-sf-on-gpu  (forced update)
> > >cb9de97..f9ff08a  jolivet/fix-error-col-row ->
> > > origin/jolivet/fix-error-col-row
> > >40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
> > > origin/oanam/jacobf/cell-to-ref-mapping
> > >  + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
> > > origin/stefanozampini/hypre-cuda-rebased  (forced update)
> > > 18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
> > > origin/balay/fix-mpiexec-shell-escape
> > > Note: checking out 'origin/balay/fix-mpiexec-shell-escape'.
> > >
> >

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
This log is from the wrong build. It says:

Defined "VERSION_GIT" to ""v3.11.3-2242-gb5e99a5""

i.e. it's not with commit cb53a04

Satish

On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> Here is the log.
> 
> On Wed, Sep 25, 2019 at 8:34 PM Mark Adams  wrote:
> 
> >
> >
> > On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:
> >
> >> > 18:16 (cb53a04...) ~/petsc-karl$
> >>
> >> So this is the commit I recommended you test against - and that's what
> >> you have got now. Please go ahead and test.
> >>
> >>
> > I sent the log for this. This is the output:
> >
> > 18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py
> > PETSC_DIR=$PWD
> >
> > ===
> >  Configuring PETSc to compile on your system
> >
> >
> > ===
> > ===
> >
> > * WARNING: F77 (set to
> > /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
> >   use ./configure F77=$F77 if you really want to use that value **
> >
> >
> > ===
> >
> >
> > ===
> >
> > * WARNING: Using default optimization C flags -O
> >
> >You might consider manually setting
> > optimal optimization flags for your system with
> >
> >  COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py for
> > examples
> >
> >  
> > ===
> >
> >
> > ===
> >
> > * WARNING: You have an older version of Gnu make,
> > it will work,
> > but may not support all the
> > parallel testing options. You can install the
> >   latest
> > Gnu make with your package manager, such as brew or macports, or use
> >
> > the --download-make option to get the latest Gnu make warning
> > message *
> >
> > ===
> >
> >   TESTING: configureMPIEXEC from
> > config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)
> >
> > ***
> >  UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for
> > details):
> >
> > ---
> > Unable to run jsrun -g 1 with option "-n 1"
> > Error: It is only possible to use js commands within a job allocation
> > unless CSM is running
> > 09-25-2019 18:20:13:224 108023 main: Error initializing RM connection.
> > Exiting.
> >
> > ***
> >
> > 18:20 1 (cb53a04...) ~/petsc-karl$
> >
> >
> >> [note: the branch is rebased - so 'git pull' won't work -(as you can
> >> see from the "(forced update)" message - and '<>' status from git
> >> prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> >> deal with in detached mode - which makes this obvious]
> >>
> >
> > I got this <> and "fixed" it by deleting the branch and repulling it. I
> > guess I needed to fetch also.
> >
> > Mark
> >
> >
> >>
> >> Satish
> >>
> >>
> >> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >>
> >> > I will test this now but 
> >> >
> >> > 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> >> > remote: Enumerating objects: 119, done.
> >> > remote: Counting objects: 100% (119/119), done.
> >> > remote: Compressing objects: 100% (91/91), done.
> >> > remote: Total 119 (delta 49), reused 74 (delta 28)
> >> > Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> >> > Resolving deltas: 100% (49/49), completed with 1 local objects.
> >> > >From https://gitlab.com/petsc/petsc
> >> >  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> >> > origin/balay/fix-mpiexec-shell-escape  (forced update)
> >> >  + ffdc635...7eeb5f9 jczhang/feature-sf-on-gpu ->
> >> > origin/jczhang/feature-sf-on-gpu  (forced update)
> >> >cb9de97..f9ff08a  jolivet/fix-error-col-row ->
> >> > origin/jolivet/fix-error-col-row
> >> >40ea605..de5ad60  oanam/jacobf/cell-to-ref-mapping ->
> >> > origin/oanam/jacobf/cell-to-ref-mapping
> >> >  + ecac953...9fb579e stefanozampini/hypre-cuda-rebased ->
> >> > origin/stefanozampini/hypre-cuda-rebased  (forced update)
> >> > 18:16 balay/fix-mpiexec-shell-escape<> ~/petsc-karl$ git checkout
> >> > origin/balay/fix-mpiexec-shell-escape

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
On Wed, Sep 25, 2019 at 8:40 PM Balay, Satish  wrote:

> > Unable to run jsrun -g 1 with option "-n 1"
> > Error: It is only possible to use js commands within a job allocation
> > unless CSM is running
>
>
> Nope  this is a different error message.
>
> The message suggests - you can't run 'jsrun -g 1 -n 1 binary' Can you try
> this manually and see
> what you get?
>
> jsrun -g 1 -n 1 printenv
>

I tested this earlier today, and originally when I was figuring out a
minimal run command:

22:08  /gpfs/alpine/geo127/scratch/adams$ jsrun -g 1 -n 1 printenv
GIT_PS1_SHOWDIRTYSTATE=1
XDG_SESSION_ID=494
SHELL=/bin/bash
HISTSIZE=100
PETSC_ARCH=arch-summit-opt64-pgi-cuda
SSH_CLIENT=160.91.202.152 48626 22
LC_ALL=
USER=adams
 ...


>
> Satish
>
>
> On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
>
> > On Wed, Sep 25, 2019 at 6:23 PM Balay, Satish  wrote:
> >
> > > > 18:16 (cb53a04...) ~/petsc-karl$
> > >
> > > So this is the commit I recommended you test against - and that's what
> > > you have got now. Please go ahead and test.
> > >
> > >
> > I sent the log for this. This is the output:
> >
> > 18:16 (cb53a04...) ~/petsc-karl$ ../arch-summit-opt64idx-pgi-cuda.py PETSC_DIR=$PWD
> > ===========================================================================
> >              Configuring PETSc to compile on your system
> > ===========================================================================
> > ===========================================================================
> >   * WARNING: F77 (set to
> > /autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linux
> >     use ./configure F77=$F77 if you really want to use that value **
> > ===========================================================================
> > ===========================================================================
> >   * WARNING: Using default optimization C flags -O
> >     You might consider manually setting optimal optimization flags for your
> >     system with COPTFLAGS="optimization flags" see config/examples/arch-*-opt.py
> >     for examples
> > ===========================================================================
> > ===========================================================================
> >   * WARNING: You have an older version of Gnu make, it will work,
> >     but may not support all the parallel testing options. You can install the
> >     latest Gnu make with your package manager, such as brew or macports, or use
> >     the --download-make option to get the latest Gnu make warning message *
> > ===========================================================================
> >   TESTING: configureMPIEXEC from
> > config.packages.MPI(config/BuildSystem/config/packages/MPI.py:174)
> > ***************************************************************************
> >   UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
> > ---------------------------------------------------------------------------
> > Unable to run jsrun -g 1 with option "-n 1"
> > Error: It is only possible to use js commands within a job allocation
> > unless CSM is running
> > 09-25-2019 18:20:13:224 108023 main: Error initializing RM connection. Exiting.
> > ***************************************************************************
> >
> > 18:20 1 (cb53a04...) ~/petsc-karl$
> >
> >
> > > [note: the branch is rebased - so 'git pull' won't work -(as you can
> > > see from the "(forced update)" message - and '<>' status from git
> > > prompt on balay/fix-mpiexec-shell-escape). So perhaps its easier to
> > > deal with in detached mode - which makes this obvious]
> > >
> >
> > I got this <> and "fixed" it by deleting the branch and repulling it. I
> > guess I needed to fetch also.
> >
> > Mark
> >
> >
> > >
> > > Satish
> > >
> > >
> > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > >
> > > > I will test this now but 
> > > >
> > > > 17:52 balay/fix-mpiexec-shell-escape= ~/petsc-karl$ git fetch
> > > > remote: Enumerating objects: 119, done.
> > > > remote: Counting objects: 100% (119/119), done.
> > > > remote: Compressing objects: 100% (91/91), done.
> > > > remote: Total 119 (delta 49), reused 74 (delta 28)
> > > > Receiving objects: 100% (119/119), 132.88 KiB | 0 bytes/s, done.
> > > > Resolving deltas: 100% (49/49), completed with 1 local objects.
> > > > From https://gitlab.com/petsc/petsc
> > > >  + b5e99a5...cb53a04 balay/fix-mpiexec-shell-escape ->
> > > > origin/balay/fix-mpiexec-shell-escape  (forced update)
> > > >  + ffdc635

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Balay, Satish via petsc-dev
On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:

> On Wed, Sep 25, 2019 at 8:40 PM Balay, Satish  wrote:
> 
> > > Unable to run jsrun -g 1 with option "-n 1"
> > > Error: It is only possible to use js commands within a job allocation
> > > unless CSM is running
> >
> >
> > Nope  this is a different error message.
> >
> > The message suggests - you can't run 'jsrun -g 1 -n 1 binary'. Can you try
> > this manually and see what you get?
> >
> > jsrun -g 1 -n 1 printenv
> >
> 
> I tested this earlier today and originally when I was figuring out the/a
> minimal run command:
> 
> 22:08  /gpfs/alpine/geo127/scratch/adams$ jsrun -g 1 -n 1 printenv
> GIT_PS1_SHOWDIRTYSTATE=1
> XDG_SESSION_ID=494
> SHELL=/bin/bash
> HISTSIZE=100
> PETSC_ARCH=arch-summit-opt64-pgi-cuda
> SSH_CLIENT=160.91.202.152 48626 22
> LC_ALL=
> USER=adams

from configure.log:

>
Executing: jsrun -g 1 -n 1 printenv

Unable to run jsrun -g 1 with option "-n 1"
Error: It is only possible to use js commands within a job allocation unless 
CSM is running
09-25-2019 22:11:56:169 68747 main: Error initializing RM connection. Exiting.
<

It's the exact same command. I don't know why it would work from the shell for you - 
but not from configure.

If jsrun is not functional from configure, alternatives are 
--with-mpiexec=/bin/true or --with-batch=1

Satish


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mark Adams via petsc-dev
>
> If jsrun is not functional from configure, alternatives are
> --with-mpiexec=/bin/true or --with-batch=1
>
>
--with-mpiexec=/bin/true  seems to be working.

Thanks,
Mark


> Satish
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-26 Thread Mark Adams via petsc-dev
Karl, I have it running but I am not seeing any difference from master. I
wonder if I have the right version:

Using Petsc Development GIT revision: v3.11.3-2207-ga8e311a

I could not find karlrupp/fix-cuda-streams on the gitlab page to check your
last commit SHA1 (???), and now I get:

08:37 karlrupp/fix-cuda-streams= ~/petsc-karl$ git pull origin
karlrupp/fix-cuda-streams
fatal: Couldn't find remote ref karlrupp/fix-cuda-streams
Unexpected end of command stream
10:09 1 karlrupp/fix-cuda-streams= ~/petsc-karl$



On Wed, Sep 25, 2019 at 8:51 AM Karl Rupp  wrote:

>
> > I double checked that a clean build of your (master) branch has this
> > error by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > stuff from Barry that is not yet in master, works.
>
> so did master work recently (i.e. right before my branch got merged)?
>
> Best regards,
> Karli
>
>
>
> >
> > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev
> > mailto:petsc-dev@mcs.anl.gov>> wrote:
> >
> >
> >
> > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> >  > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > get this
> >  > error:
> >  >
> >  > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe
> -n 1
> >  > printenv']":
> >  > Error, invalid argument:  1
> >  >
> >  > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did edit
> >  > the jsrun command but Karl's branch still fails. (SUMMIT was down
> > today
> >  > so there could have been updates).
> >  >
> >  > Any suggestions?
> >
> > Looks very much like a systems issue to me.
> >
> > Best regards,
> > Karli
> >
>


Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-26 Thread Balay, Satish via petsc-dev
Mark,

The branch karlrupp/fix-cuda-streams is already merged to master. [and
the branch is now deleted]

I guess - if you wish to compare the difference this feature makes - you
can compare with a master snapshot from before this merge.

i.e. compare master (which includes the karlrupp/fix-cuda-streams feature) with
daa275bd14416591c3d721c1d33cf5d68c84dfcc [master before this merge]

Satish

On Thu, 26 Sep 2019, Mark Adams via petsc-dev wrote:

> Karl, I have it running but I am not seeing any difference from master. I
> wonder if I have the right version:
> 
> Using Petsc Development GIT revision: v3.11.3-2207-ga8e311a
> 
> I could not find karlrupp/fix-cuda-streams on the gitlab page to check your
> last commit SHA1 (???), and now I get:
> 
> 08:37 karlrupp/fix-cuda-streams= ~/petsc-karl$ git pull origin
> karlrupp/fix-cuda-streams
> fatal: Couldn't find remote ref karlrupp/fix-cuda-streams
> Unexpected end of command stream
> 10:09 1 karlrupp/fix-cuda-streams= ~/petsc-karl$
> 
> 
> 
> On Wed, Sep 25, 2019 at 8:51 AM Karl Rupp  wrote:
> 
> >
> > > I double checked that a clean build of your (master) branch has this
> > > error by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > > stuff from Barry that is not yet in master, works.
> >
> > so did master work recently (i.e. right before my branch got merged)?
> >
> > Best regards,
> > Karli
> >
> >
> >
> > >
> > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev
> > > mailto:petsc-dev@mcs.anl.gov>> wrote:
> > >
> > >
> > >
> > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > >  > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > > get this
> > >  > error:
> > >  >
> > >  > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1 --oversubscribe
> > -n 1
> > >  > printenv']":
> > >  > Error, invalid argument:  1
> > >  >
> > >  > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > > did edit
> > >  > the jsrun command but Karl's branch still fails. (SUMMIT was down
> > > today
> > >  > so there could have been updates).
> > >  >
> > >  > Any suggestions?
> > >
> > > Looks very much like a systems issue to me.
> > >
> > > Best regards,
> > > Karli
> > >
> >
> 



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-27 Thread Karl Rupp via petsc-dev

Hi Mark,

OK, so now the problem has shifted somewhat in that it now manifests 
itself on small cases. In earlier investigation I was drawn to 
MatTranspose but had a hard time pinning it down. The bug seems more 
stable now or you probably fixed what looks like all the other bugs.


I added print statements with norms of vectors in mg.c (v-cycle) and 
found that the diffs between the CPU and GPU runs came in MatRestrict, 
which calls MatMultTranspose. I added identical print statements in the 
two versions of MatMultTranspose and see this. (pinning to the CPU does 
not seem to make any difference). Note that the problem comes in the 2nd 
iteration where the *output* vector is non-zero coming in (this should 
not matter).
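
(For reference, this is roughly what that instrumentation looks like; a minimal
sketch only, with an assumed helper name, not the actual code in mg.c or in the
MatMultTranspose implementations:)

#include <petscvec.h>

/* Print the 2-norm of a vector with a label so CPU and GPU runs can be diffed.
   Helper name and placement are assumptions for illustration. */
static PetscErrorCode PrintVecNorm(const char *label, Vec v)
{
  PetscErrorCode ierr;
  PetscReal      nrm;

  PetscFunctionBegin;
  ierr = VecNorm(v, NORM_2, &nrm);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "%s = %1.14e\n", label, (double)nrm);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}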


Karl, I zeroed out the output vector (yy) when I come into this method 
and it fixed the problem. This is with -n 4, and this always works with 
-n 3. See the attached process layouts. It looks like this comes when 
you use the 2nd socket.


So this looks like an Nvidia bug. Let me know what you think and I can 
pass it on to ORNL.
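
(For reference, the workaround amounts to the following; this is only a sketch
with the routine body elided, not the actual PETSc source, and the VecSet call
is the one-line change being described:)

#include <petscmat.h>

/* Sketch: clear the output vector on entry to the transpose multiply so any
   stale device data cannot leak into the result. This should be redundant if
   the kernel fully overwrites yy, but it removes the diffs seen with -n 4. */
PetscErrorCode MatMultTranspose_MPIAIJCUSPARSE(Mat A, Vec xx, Vec yy)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecSet(yy, 0.0);CHKERRQ(ierr);
  /* ... existing local and off-process transpose-multiply work ... */
  PetscFunctionReturn(0);
}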


Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point. 
I've addressed some of them, but I can't confidently say that all of the 
issues were fixed. Thus, I don't think it's a problem in NVIDIA's 
cuSparse, but rather something we need to fix in PETSc. Note that the 
problem shows up with multiple MPI ranks; if it were a problem in 
cuSparse, it would show up on a single rank as well.


Best regards,
Karli





06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1 ./ex56 -cells 8,12,16 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse

[0] 3465 global equations, 1155 vertices
[0] 3465 equations in vector, 1155 vertices
   0 SNES Function norm 1.725526579328e+01
     0 KSP Residual norm 1.725526579328e+01
         2) call Restrict with |r| = 1.402719214830704e+01
                         MatMultTranspose_MPIAIJCUSPARSE |x in| = 1.40271921483070e+01
*                        MatMultTranspose_MPIAIJ |y in| = 0.00e+00
*                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| = 0.00e+00
                         *** MatMultTranspose_MPIAIJCUSPARSE |yy| = 3.43436359545813e+00
                         MatMultTranspose_MPIAIJCUSPARSE final |yy| = 1.29055494844681e+01

                 3) |R| = 1.290554948446808e+01
         2) call Restrict with |r| = 4.109771717986951e+00
                         MatMultTranspose_MPIAIJCUSPARSE |x in| = 4.10977171798695e+00
*                        MatMultTranspose_MPIAIJ |y in| = 0.00e+00
*                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| = 0.00e+00
                         *** MatMultTranspose_MPIAIJCUSPARSE |yy| = 1.79415048609144e-01
                         MatMultTranspose_MPIAIJCUSPARSE final |yy| = 9.01083013948788e-01

                 3) |R| = 9.010830139487883e-01
                 4) |X| = 2.864698671963022e+02
                 5) |x| = 9.76328911783e+02
                 6) post smooth |x| = 8.940011621494751e+02
                 4) |X| = 8.940011621494751e+02
                 5) |x| = 1.005081556495388e+03
                 6) post smooth |x| = 1.029043994031627e+03
     1 KSP Residual norm 8.102614049404e+00
         2) call Restrict with |r| = 4.402603749876137e+00
                         MatMultTranspose_MPIAIJCUSPARSE |x in| = 4.40260374987614e+00
*                        MatMultTranspose_MPIAIJ |y in| = 1.29055494844681e+01
*                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| = 0.00e+00
                         *** MatMultTranspose_MPIAIJCUSPARSE |yy| = 1.68544559626318e+00
                         MatMultTranspose_MPIAIJCUSPARSE final |yy| = 1.82129824300863e+00

                 3) |R| = 1.821298243008628e+00
         2) call Restrict with |r| = 1.068309793900564e+00
                         MatMultTranspose_MPIAIJCUSPARSE |x in| = 1.06830979390056e+00
                         MatMultTranspose_MPIAIJ |y in| = 9.01083013948788e-01
                         MatMultTranspose_MPIAIJCUSPARSE |a->lvec| = 0.00e+00
                         *** MatMultTranspose_MPIAIJCUSPARSE |yy| = 1.40519177065298e-01
                         MatMultTranspose_MPIAIJCUSPARSE final |yy| = 1.01853904152812e-01

                 3) |R| = 1.018539041528117e-01
                 4) |X| = 4.949616392884510e+01
                 5) |x| = 9.309440014159884e+01
                 6) post smooth |x| = 5.432486021529479e+01
                 4) |X| = 5.432486021529479e+01
                 5) |x| = 8.246142532204632e+01
                 6) post smooth |x| = 7.605703654091440e+01
   Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
06:50  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1 ./ex56 -cells 8,12,16

[0] 3465 global equations, 1155 vertices
[0] 3465 equations in vector, 1155 ver

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-28 Thread Stefano Zampini via petsc-dev
Mark,


MatMultTransposeAdd_SeqAIJCUSPARSE checks if the matrix is in compressed
row storage, MatMultTranspose_SeqAIJCUSPARSE does not. Could this be the
issue? The CUSPARSE classes are kind of messy.
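
(For illustration, such a guard looks roughly like the snippet below; the field
names come from the private SeqAIJ header and are assumptions on my part, not a
statement about what the CUSPARSE code paths actually do:)

#include <../src/mat/impls/aij/seq/aij.h>  /* private header defining Mat_SeqAIJ */

/* Check whether the SeqAIJ matrix is stored in compressed row format before
   launching the transpose-multiply kernel (illustrative fragment only). */
Mat_SeqAIJ *a = (Mat_SeqAIJ*)A->data;
if (a->compressedrow.use) {
  /* only rows with nonzeros are stored; results must be scattered back
     through a->compressedrow.rindex instead of written to every row */
} else {
  /* plain CSR path: every local row is represented directly */
}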



On Sat, Sep 28, 2019 at 07:55 Karl Rupp via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> Hi Mark,
>
> > OK, so now the problem has shifted somewhat in that it now manifests
> > itself on small cases. In earlier investigation I was drawn to
> > MatTranspose but had a hard time pinning it down. The bug seems more
> > stable now or you probably fixed what looks like all the other bugs.
> >
> > I added print statements with norms of vectors in mg.c (v-cycle) and
> > found that the diffs between the CPU and GPU runs came in MatRestrict,
> > which calls MatMultTranspose. I added identical print statements in the
> > two versions of MatMultTranspose and see this. (pinning to the CPU does
> > not seem to make any difference). Note that the problem comes in the 2nd
> > iteration where the *output* vector is non-zero coming in (this should
> > not matter).
> >
> > Karl, I zeroed out the output vector (yy) when I come into this method
> > and it fixed the problem. This is with -n 4, and this always works with
> > -n 3. See the attached process layouts. It looks like this comes when
> > you use the 2nd socket.
> >
> > So this looks like an Nvidia bug. Let me know what you think and I can
> > pass it on to ORNL.
>
> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
> I've addressed some of them, but I can't confidently say that all of the
> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
> cuSparse, but rather something we need to fix in PETSc. Note that the
> problem shows up with multiple MPI ranks; if it were a problem in
> cuSparse, it would show up on a single rank as well.
>
> Best regards,
> Karli
>
>
>
>
>
> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun*-n 4 *-a 4 -c 4 -g 1
> > ./ex56 -cells 8,12,16 *-ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse*
> > [0] 3465 global equations, 1155 vertices
> > [0] 3465 equations in vector, 1155 vertices
> >0 SNES Function norm 1.725526579328e+01
> >  0 KSP Residual norm 1.725526579328e+01
> >  2) call Restrict with |r| = 1.402719214830704e+01
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.40271921483070e+01
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 3.43436359545813e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.29055494844681e+01
> >  3) |R| = 1.290554948446808e+01
> >  2) call Restrict with |r| = 4.109771717986951e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.10977171798695e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.79415048609144e-01
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 9.01083013948788e-01
> >  3) |R| = 9.010830139487883e-01
> >  4) |X| = 2.864698671963022e+02
> >  5) |x| = 9.76328911783e+02
> >  6) post smooth |x| = 8.940011621494751e+02
> >  4) |X| = 8.940011621494751e+02
> >  5) |x| = 1.005081556495388e+03
> >  6) post smooth |x| = 1.029043994031627e+03
> >  1 KSP Residual norm 8.102614049404e+00
> >  2) call Restrict with |r| = 4.402603749876137e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.40260374987614e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 1.29055494844681e+01
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.68544559626318e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.82129824300863e+00
> >  3) |R| = 1.821298243008628e+00
> >  2) call Restrict with |r| = 1.068309793900564e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.06830979390056e+00
> >  MatMultTranspose_MPIAIJ |y in| =
> > 9.01083013948788e-01
> >  MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.40519177065298e-01
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.01853904152812e-01
> >  3) |R| = 1.018539041528117e-01
> > 

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-28 Thread Mark Adams via petsc-dev
The logic is basically correct because I simply zero out the yy vector (the
output vector) and it runs great now. The numerics look fine without CPU
pinning.

AND, it worked with 1, 2, and 3 GPUs (one node, one socket), but failed with
4 GPUs, which uses the second socket. Strange.

On Sat, Sep 28, 2019 at 3:43 AM Stefano Zampini 
wrote:

> Mark,
>
>
> MatMultTransposeAdd_SeqAIJCUSPARSE checks if the matrix is in compressed
> row storage, MatMultTranspose_SeqAIJCUSPARSE does not. Probably is this
> the issue? The CUSPARSE classes are kind of messy
>
>
>
> On Sat, Sep 28, 2019 at 07:55 Karl Rupp via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> Hi Mark,
>>
>> > OK, so now the problem has shifted somewhat in that it now manifests
>> > itself on small cases. In earlier investigation I was drawn to
>> > MatTranspose but had a hard time pinning it down. The bug seems more
>> > stable now or you probably fixed what looks like all the other bugs.
>> >
>> > I added print statements with norms of vectors in mg.c (v-cycle) and
>> > found that the diffs between the CPU and GPU runs came in MatRestrict,
>> > which calls MatMultTranspose. I added identical print statements in the
>> > two versions of MatMultTranspose and see this. (pinning to the CPU does
>> > not seem to make any difference). Note that the problem comes in the
>> 2nd
>> > iteration where the *output* vector is non-zero coming in (this should
>> > not matter).
>> >
>> > Karl, I zeroed out the output vector (yy) when I come into this method
>> > and it fixed the problem. This is with -n 4, and this always works with
>> > -n 3. See the attached process layouts. It looks like this comes when
>> > you use the 2nd socket.
>> >
>> > So this looks like an Nvidia bug. Let me know what you think and I can
>> > pass it on to ORNL.
>>
>> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
>> I've addressed some of them, but I can't confidently say that all of the
>> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
>> cuSparse, but rather something we need to fix in PETSc. Note that the
>> problem shows up with multiple MPI ranks; if it were a problem in
>> cuSparse, it would show up on a single rank as well.
>>
>> Best regards,
>> Karli
>>
>>
>>
>>
>>
>> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun*-n 4 *-a 4 -c 4 -g 1
>> > ./ex56 -cells 8,12,16 *-ex56_dm_vec_type cuda -ex56_dm_mat_type
>> aijcusparse*
>> > [0] 3465 global equations, 1155 vertices
>> > [0] 3465 equations in vector, 1155 vertices
>> >0 SNES Function norm 1.725526579328e+01
>> >  0 KSP Residual norm 1.725526579328e+01
>> >  2) call Restrict with |r| = 1.402719214830704e+01
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 1.40271921483070e+01
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 0.00e+00
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 3.43436359545813e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 1.29055494844681e+01
>> >  3) |R| = 1.290554948446808e+01
>> >  2) call Restrict with |r| = 4.109771717986951e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 4.10977171798695e+00
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 0.00e+00
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 1.79415048609144e-01
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 9.01083013948788e-01
>> >  3) |R| = 9.010830139487883e-01
>> >  4) |X| = 2.864698671963022e+02
>> >  5) |x| = 9.76328911783e+02
>> >  6) post smooth |x| = 8.940011621494751e+02
>> >  4) |X| = 8.940011621494751e+02
>> >  5) |x| = 1.005081556495388e+03
>> >  6) post smooth |x| = 1.029043994031627e+03
>> >  1 KSP Residual norm 8.102614049404e+00
>> >  2) call Restrict with |r| = 4.402603749876137e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 4.40260374987614e+00
>> > *MatMultTranspose_MPIAIJ |y in| =
>> > 1.29055494844681e+01
>> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
>> > 0.00e+00
>> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
>> > 1.68544559626318e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
>> > 1.82129824300863e+00
>> >  3) |R| = 1.821298243008628e+00
>> >  2) call Restrict with |r| = 1.068309793900564e+00
>> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
>> > 1.06830979390056e+00
>>

Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-28 Thread Mark Adams via petsc-dev
On Sat, Sep 28, 2019 at 12:55 AM Karl Rupp  wrote:

> Hi Mark,
>
> > OK, so now the problem has shifted somewhat in that it now manifests
> > itself on small cases.


It is somewhat random and anecdotal, but it does happen on the smaller test
problem now. When I try to narrow down when the problem manifests by
reducing the number of GPUs/procs, the problem cannot be too small (i.e., the
bug does not manifest on even smaller problems).

But it is much more stable and there does seem to be only this one problem
with mat-transpose-mult. You made a lot of progress.


> In earlier investigation I was drawn to
> > MatTranspose but had a hard time pinning it down. The bug seems more
> > stable now or you probably fixed what looks like all the other bugs.
> >
> > I added print statements with norms of vectors in mg.c (v-cycle) and
> > found that the diffs between the CPU and GPU runs came in MatRestrict,
> > which calls MatMultTranspose. I added identical print statements in the
> > two versions of MatMultTranspose and see this. (pinning to the CPU does
> > not seem to make any difference). Note that the problem comes in the 2nd
> > iteration where the *output* vector is non-zero coming in (this should
> > not matter).
> >
> > Karl, I zeroed out the output vector (yy) when I come into this method
> > and it fixed the problem. This is with -n 4, and this always works with
> > -n 3. See the attached process layouts. It looks like this comes when
> > you use the 2nd socket.
> >
> > So this looks like an Nvidia bug. Let me know what you think and I can
> > pass it on to ORNL.
>
> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
> I've addressed some of them, but I can't confidently say that all of the
> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
> cuSparse, but rather something we need to fix in PETSc. Note that the
> problem shows up with multiple MPI ranks;


It seems to need to use two sockets. My current test works with 1, 2, and 3
GPUs (one socket) but fails with 4, when you go to the second socket.


> if it were a problem in
> cuSparse, it would show up on a single rank as well.
>

What I am seeing is consistent with CUSPARSE having a race condition in
zeroing out the output vector in some way, but I don't know.
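
(One way to test that hypothesis, sketched below: build a host-side AIJ copy of
the matrix and diff the two transpose products. Hypothetical snippet; x and y
stand for the vectors from the solve, and it assumes MatConvert can produce a
plain AIJ copy of the device matrix.)

Mat            Ahost;
Vec            y_cpu;
PetscReal      diff;
PetscErrorCode ierr;

/* host copy of the (device) matrix */
ierr = MatConvert(A, MATAIJ, MAT_INITIAL_MATRIX, &Ahost);CHKERRQ(ierr);
ierr = VecDuplicate(y, &y_cpu);CHKERRQ(ierr);
ierr = MatMultTranspose(A, x, y);CHKERRQ(ierr);         /* device path */
ierr = MatMultTranspose(Ahost, x, y_cpu);CHKERRQ(ierr); /* host path   */
ierr = VecAXPY(y_cpu, -1.0, y);CHKERRQ(ierr);
ierr = VecNorm(y_cpu, NORM_2, &diff);CHKERRQ(ierr);
ierr = PetscPrintf(PETSC_COMM_WORLD, "||y_host - y_device|| = %g\n", (double)diff);CHKERRQ(ierr);
ierr = VecDestroy(&y_cpu);CHKERRQ(ierr);
ierr = MatDestroy(&Ahost);CHKERRQ(ierr);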


>
> Best regards,
> Karli
>
>
>
>
>
> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun*-n 4 *-a 4 -c 4 -g 1
> > ./ex56 -cells 8,12,16 *-ex56_dm_vec_type cuda -ex56_dm_mat_type
> aijcusparse*
> > [0] 3465 global equations, 1155 vertices
> > [0] 3465 equations in vector, 1155 vertices
> >0 SNES Function norm 1.725526579328e+01
> >  0 KSP Residual norm 1.725526579328e+01
> >  2) call Restrict with |r| = 1.402719214830704e+01
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.40271921483070e+01
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 3.43436359545813e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.29055494844681e+01
> >  3) |R| = 1.290554948446808e+01
> >  2) call Restrict with |r| = 4.109771717986951e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.10977171798695e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 0.00e+00
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.79415048609144e-01
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 9.01083013948788e-01
> >  3) |R| = 9.010830139487883e-01
> >  4) |X| = 2.864698671963022e+02
> >  5) |x| = 9.76328911783e+02
> >  6) post smooth |x| = 8.940011621494751e+02
> >  4) |X| = 8.940011621494751e+02
> >  5) |x| = 1.005081556495388e+03
> >  6) post smooth |x| = 1.029043994031627e+03
> >  1 KSP Residual norm 8.102614049404e+00
> >  2) call Restrict with |r| = 4.402603749876137e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.40260374987614e+00
> > *MatMultTranspose_MPIAIJ |y in| =
> > 1.29055494844681e+01
> > *MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00e+00
> >  *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.68544559626318e+00
> >  MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.82129824300863e+00
> >  3) |R| = 1.821298243008628e+00
> >  2) call Restrict with |r| = 1.068309793900564e+00
> >  MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.06830979390056e+00
> >