[OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-26 Thread vasilis
Dear Open MPI users, I am trying to develop a code that runs in parallel with Open MPI (version 1.3.2). The code is written in Fortran 90, and I am running it on a cluster. If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get the following error: [compute-2-6.local:18491

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-26 Thread Eugene Loh
vasilis wrote: Dear Open MPI users, I am trying to develop a code that runs in parallel with Open MPI (version 1.3.2). The code is written in Fortran 90, and I am running it on a cluster. If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get the following error: [com
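Eugene's suggestion (distinct tags per variable) works because MPI matches point-to-point messages by source and tag, in the order they were sent: if res, jacob, row, and col all travel with the same tag, a receive posted for a small array can match a larger message and overrun its buffer. A toy Python simulation of that matching rule (not real MPI; `Channel` and its methods are invented purely for illustration):

```python
class Channel:
    """Toy model of MPI point-to-point message matching (illustrative only)."""
    def __init__(self):
        self.pending = []  # messages kept in the order they were sent

    def send(self, payload, tag):
        self.pending.append((tag, payload))

    def recv(self, bufsize, tag):
        # Like MPI, match the EARLIEST pending message with this tag.
        for i, (t, payload) in enumerate(self.pending):
            if t == tag:
                del self.pending[i]
                if len(payload) > bufsize:
                    raise RuntimeError("message truncated: receive buffer overrun")
                return payload
        raise RuntimeError("no matching message")

# Same tag for two different variables: the wrong message can match.
ch = Channel()
ch.send([0.0] * 100, tag=7)   # a large "res"-like array
ch.send([0.0] * 10,  tag=7)   # a small "row"-like array, same tag
try:
    ch.recv(bufsize=10, tag=7)             # wants the small array...
except RuntimeError as err:
    print(err)                             # ...but matches the large one first

# Distinct tags per variable: each receive gets the intended message.
ch = Channel()
ch.send([0.0] * 100, tag=1)   # res
ch.send([0.0] * 10,  tag=2)   # row
print(len(ch.recv(bufsize=10, tag=2)))     # 10
```

The same hazard in real MPI surfaces as the MPI_ERR_TRUNCATE-style error vasilis reported; giving each logical message its own tag makes every receive unambiguous.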

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread vasilis
Thank you, Eugene, for your suggestion. I used different tags for each variable, and now I do not get this error. The problem now is that I get a different solution when I use more than 2 CPUs. I checked the matrices and found that they differ by a very small amount, of the order of 10^(-10

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread Eugene Loh
vasilis wrote: Thank you, Eugene, for your suggestion. I used different tags for each variable, and now I do not get this error. The problem now is that I get a different solution when I use more than 2 CPUs. I checked the matrices and found that they differ by a very small amount of

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread vasilis
> Rank 0 accumulates all the res_cpu values into a single array, res. It starts with its own res_cpu and then adds all other processes. When np=2, that means the order is prescribed. When np>2, the order is no longer prescribed and some floating-point rounding variations can start to occ
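The order sensitivity Eugene describes can be reproduced in a few lines, with no MPI at all: floating-point addition is not associative, so grouping the same partial results differently changes the last bits. A minimal Python illustration:

```python
# Three per-process contributions, summed in two different groupings.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # contributions combined in rank order 0, 1, 2
right = a + (b + c)   # ranks 1 and 2 combined first

print(left)               # 0.6000000000000001
print(right)              # 0.6
print(abs(left - right))  # about 1.1e-16, one unit in the last place
```

With np=2, rank 0 always adds its own res_cpu and then the single remaining contribution, so the grouping is fixed and the result is bit-reproducible; with np>2 the arrival order varies from run to run, and across a whole matrix assembly these last-bit differences can accumulate to the 10^-10 level vasilis observed.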

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread Eugene Loh
vasilis wrote: Rank 0 accumulates all the res_cpu values into a single array, res. It starts with its own res_cpu and then adds all other processes. When np=2, that means the order is prescribed. When np>2, the order is no longer prescribed and some floating-point rounding variations

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread George Bosilca
This is a problem of numerical stability, and there is no solution for such a problem in MPI. Usually, preconditioning the input matrix improves the numerical stability. If you read the MPI standard, there is a __short__ section about what guarantees the MPI collective communications provid

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread Damien Hocking
I've seen this behaviour with MUMPS on shared-memory machines as well using MPI. I use the iterative refinement capability to sharpen the last few digits of the solution (2 or 3 iterations are usually enough). If you're not using that, give it a try; it will probably reduce the noise you're g
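For readers unfamiliar with iterative refinement: solve once, compute the residual r = b - A·x in full precision, solve A·d = r for a correction, and update x ← x + d. A self-contained Python sketch of the idea (the 2×2 system, the chop-to-4-digits stand-in for a limited-accuracy solver, and all names here are invented for illustration; MUMPS provides this internally):

```python
import math

def chop(v, digits=4):
    """Round v to a few significant digits: a stand-in for a solver
    whose answer is only accurate to that many digits."""
    if v == 0.0:
        return 0.0
    scale = 10.0 ** (digits - 1 - math.floor(math.log10(abs(v))))
    return round(v * scale) / scale

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

def approx_solve(A, b):
    # Exact 2x2 solve by Cramer's rule, then chopped to ~4 digits.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (b[0] * A[1][1] - b[1] * A[0][1]) / det
    x1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [chop(x0), chop(x1)]

def refine(A, b, x, iters=3):
    """Iterative refinement: each pass sharpens a few more digits of x."""
    for _ in range(iters):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # residual
        d = approx_solve(A, r)                              # correction
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]                  # exact solution: x = [1/11, 7/11]

x = approx_solve(A, b)          # good to ~4 digits
x = refine(A, b, x)             # now good to near machine precision
```

Each pass multiplies the remaining error by roughly the solver's relative accuracy, which is why 2 or 3 iterations are usually enough, and why refinement tends to wash out the run-to-run noise in the trailing digits.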

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-27 Thread Eugene Loh
George Bosilca wrote: This is a problem of numerical stability, and there is no solution for such a problem in MPI. Usually, preconditioning the input matrix improves the numerical stability. At the level of this particular e-mail thread, the issue seems to me to be different. Results are
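If bitwise reproducibility across runs matters, the usual remedy for the situation Eugene describes is to keep receiving messages in whatever order they arrive, but to buffer the contributions and add them in a fixed (rank) order. A small Python sketch of the idea (no MPI here; arrival order is simulated by shuffling, and all names are illustrative):

```python
import random

def accumulate_arrival_order(arrivals):
    """Add contributions as they arrive: fast, but the grouping varies per run."""
    total = 0.0
    for _rank, value in arrivals:
        total += value
    return total

def accumulate_rank_order(arrivals):
    """Buffer everything, then add in fixed rank order: reproducible."""
    total = 0.0
    for _rank, value in sorted(arrivals):
        total += value
    return total

# One contribution per rank (values are arbitrary).
contributions = [(rank, (-1.0) ** rank / (rank + 3.0)) for rank in range(16)]

random.seed(1)
results = set()
for _ in range(10):
    arrivals = contributions[:]
    random.shuffle(arrivals)          # a different arrival order each "run"
    results.add(accumulate_rank_order(arrivals))

print(len(results))  # 1 -> bitwise identical regardless of arrival order
```

With np=2 the naive arrival-order sum is already deterministic (there is only one other rank), which is consistent with the differences appearing only for np>2. The price of the fixed-order version is buffering all contributions before adding.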

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-28 Thread vasilis
> This is a problem of numerical stability, and there is no solution for such a problem in MPI. Usually, preconditioning the input matrix improves the numerical stability. It could be a numerical stability issue, but this would imply that I have an ill-conditioned matrix. This is not my case. > If

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-28 Thread vasilis
On Wednesday, 27 May 2009 7:47:06 pm Damien Hocking wrote: > I've seen this behaviour with MUMPS on shared-memory machines as well using MPI. I use the iterative refinement capability to sharpen the last few digits of the solution (2 or 3 iterations are usually enough). If you're not using

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-28 Thread vasilis
On Wednesday, 27 May 2009 8:35:49 pm Eugene Loh wrote: > George Bosilca wrote: >> This is a problem of numerical stability, and there is no solution for such a problem in MPI. Usually, preconditioning the input matrix improves the numerical stability. > At the level of this particula

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-28 Thread Eugene Loh
vasilis wrote: On Wednesday, 27 May 2009 8:35:49 pm Eugene Loh wrote: At the level of this particular e-mail thread, the issue seems to me to be different. Results are added together in some arbitrary order and there are variations on the order of 10^-10. This is not an issue of nu

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-29 Thread vasilis
> The original issue, still reflected by the subject heading of this e-mail, was that a message overran its receive buffer. That was fixed by using tags to distinguish different kinds of messages (res, jacob, row, and col). > I thought the next problem was the small (10^-10) variations in

Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU

2009-05-29 Thread Eugene Loh
vasilis wrote: The original issue, still reflected by the subject heading of this e-mail, was that a message overran its receive buffer. That was fixed by using tags to distinguish different kinds of messages (res, jacob, row, and col). I thought the next problem was the small (10^-