So the "bug" is not as ginormous as I originally thought. It will never 
produce incorrect results, but it can produce the errors you received.

   The problem is this call, which is made on only one rank:

    if (row_rank == 0) {
      PetscCall(VecCUDAReplaceArray(v, d_a));
    }

The place/replace array routines are actually collective and need to be called 
by all MPI processes that share the vector, regardless of the local size (even 
if it is zero), so the call should not sit behind a rank test. This is because 
the call can invalidate the previously known norm values that have been cached 
in the vector. If the norm values are invalidated on some MPI processes but not 
others, you will get the error you have seen.
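
To illustrate why a mismatch matters, here is a toy model in plain C. It is 
not PETSc code and makes no MPI calls. All names in it (ToyRankVec, 
toy_replace_array, toy_simulate, and so on) are made up for this sketch, and it 
assumes that a norm call skips the global reduction on any rank whose cached 
norm is still marked valid.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_RANKS 16

/* Toy stand-in for one rank's view of a vector (hypothetical, not a PETSc type).
   Only the "is a norm cached?" flag matters for this illustration. */
typedef struct {
    bool norm_cached;
} ToyRankVec;

/* Models replacing the array on one rank: the cached norm is discarded. */
static void toy_replace_array(ToyRankVec *v)
{
    v->norm_cached = false;
}

/* Models a norm call on one rank. Assumption: a rank with a valid cached
   norm returns it immediately and never enters the global reduction. */
static bool toy_norm_enters_reduction(const ToyRankVec *v)
{
    return !v->norm_cached;
}

/* A collective reduction only works if every rank enters it or every rank
   skips it. Returns true when all ranks make the same choice. */
static bool toy_ranks_agree(const ToyRankVec *ranks, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (toy_norm_enters_reduction(&ranks[i]) !=
            toy_norm_enters_reduction(&ranks[0])) {
            return false;
        }
    }
    return true;
}

/* Runs one scenario: every rank starts with a cached norm, then the array
   is replaced either on rank 0 only or on every rank. Returns true if a
   following norm call would be consistent across ranks. */
static bool toy_simulate(size_t nranks, bool only_rank0_calls)
{
    ToyRankVec ranks[TOY_MAX_RANKS];

    if (nranks == 0 || nranks > TOY_MAX_RANKS) return false;

    for (size_t i = 0; i < nranks; i++) ranks[i].norm_cached = true;

    for (size_t i = 0; i < nranks; i++) {
        if (!only_rank0_calls || i == 0) toy_replace_array(&ranks[i]);
    }
    return toy_ranks_agree(ranks, nranks);
}
```

In this model toy_simulate(4, true) reports a disagreement (rank 0 would enter 
the reduction while ranks 1 to 3 return the cached value), whereas 
toy_simulate(4, false) reports agreement because every rank dropped its cache.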

  Barry

  I will prepare a branch with better documentation and clearer error handling 
for this situation.




> On Nov 16, 2023, at 6:30 PM, Barry Smith <bsm...@petsc.dev> wrote:
> 
> 
>   Congratulations, you have found a ginormous bug in PETSc! Thanks for the 
> detailed information on the problem.
> 
>    I will post a fix shortly.
> 
>    Barry
> 
> 
>> On Nov 16, 2023, at 6:19 PM, Sreeram R Venkat <srven...@utexas.edu> wrote:
>> 
>> I have a program which reads a vector from file into an array, and then uses 
>> that array to create a PETSc Vec object. The Vec is defined on the global 
>> communicator, but not all processes actually contain entries of it. For 
>> example, suppose we have 4 processors, and the vector is of size 10. Rank 0 
>> will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks 2 and 3 
>> will not have any entries of the Vec.
>> 
>> This Vec is then used as an input to other parts of the code, and those work 
>> fine. However, if I try to take the norm of the Vec with VecNorm(), I get 
>> the error
>> 
>> `MPI_Allreduce() called in different locations (code lines) on different 
>> processors`
>> 
>> The stack trace shows that ranks 0 and 1 (from the above example) are still 
>> in the VecNorm() function while ranks 2 and 3 have moved on to a later part 
>> of the code. If I add a PetscBarrier() after the VecNorm(), I find that the 
>> program hangs. 
>> 
>> The funny thing is that part of the code duplicates the Vec with 
>> VecDuplicate() and assigns to the duplicated vector the result of some 
>> computations. The duplicated Vec has the same layout as the original Vec, 
>> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(), 
>> however, the copied Vec also causes VecNorm() to hang. I've printed out the 
>> original Vec, and there are no corrupted/NaN entries.
>> 
>> I have a temporary workaround where I perturb the original Vec slightly 
>> before copying it to another Vec. This causes the program to successfully 
>> terminate.
>> 
>> Any advice on how to get VecNorm() working with the original Vec?
>> 
>> Thanks,
>> Sreeram
> 
