Hello,
Thank you. I tried revision r21172 and it works. But how can I distribute the checkpoints across different storage nodes? It is too costly for all the compute nodes to write to a single storage node.




Josh Hursey wrote:
I just realized that not all of the FileM fixes made it to the trunk in my previous commit. Sorry about that :( I just committed the remainder of the changes in r21167 if you wanted to try them out.

Cheers,
Josh

On May 4, 2009, at 8:48 AM, Josh Hursey wrote:

The command line looks fine. Can you send the output generated by the verbose arguments (there was no file attached to the last email)?

The version of the trunk that I was referring to was r21131, and can be downloaded via SVN or a nightly snapshot tarball from the links below:
http://www.open-mpi.org/svn/
http://www.open-mpi.org/nightly/trunk/

Best,
Josh

On May 4, 2009, at 3:44 AM, Bouguerra mohamed slim wrote:

Hello,
This is the full command that I use to run the program:

/home/grenoble/msbouguerra/install/ompi-1.3.2/cr/bin/mpirun -mca orte_base_help_aggregate 0 -mca filem_rsh_rcp scp -mca filem_rsh_verbose 99 -mca filem_base_verbose 99 -mca snapc_base_verbose 1 -mca ompi_cr_verbose 1 -mca orte_cr_verbose 1 -mca opal_cr_verbose 1 -mca snapc_base_global_snapshot_dir /tmp/stable -mca snapc_base_store_in_place 0 -mca snapc_base_global_snapshot_ref ompi_global_snapshot_09_30 -np 20 -am ft-enable-cr -hostfile ./hostfile_04_05 --mca btl '^mx' ./nqueen 16
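When sharing the MCA settings from a long command line like the one above, it can help to list them one per line. The following is only a minimal sketch: the helper names `extract_mca_params` and `format_params` are made up for illustration, and it assumes every parameter is given as `-mca NAME VALUE` or `--mca NAME VALUE`. The `name = value` output follows the layout Open MPI accepts in an MCA parameter file such as `$HOME/.openmpi/mca-params.conf`.

```python
import shlex


def extract_mca_params(command):
    """Return (name, value) pairs for each -mca/--mca option in a command line.

    Hypothetical helper: it only recognizes the "-mca NAME VALUE" form.
    """
    tokens = shlex.split(command)  # honors shell quoting, e.g. '^mx'
    params = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-mca", "--mca") and i + 2 < len(tokens):
            params.append((tokens[i + 1], tokens[i + 2]))
            i += 3
        else:
            i += 1
    return params


def format_params(params):
    """Render the pairs one per line as "name = value"."""
    return "\n".join(f"{name} = {value}" for name, value in params)


# Shortened version of the command above, used only as sample input.
sample = "mpirun -mca filem_rsh_rcp scp -mca filem_rsh_verbose 99 --mca btl '^mx' -np 20 ./nqueen 16"
print(format_params(extract_mca_params(sample)))
```

Options that are not MCA parameters (`-np`, `-am`, `-hostfile`) are skipped on purpose, so they would still have to be quoted separately.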

I still get the same problem every time; the error stack is in the file.

Finally, can you tell me exactly which version of the development trunk you mean?

Thank you,



Josh Hursey wrote:

This typically means that one or more of the rcp/scp or rsh/ssh commands failed. FileM should print an error message when one of the copy commands fails. Try turning up the verbose level to 10 to see if it indicates any problems:
-mca filem_rsh_verbose 10
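One way to rule out basic remote-shell problems before rerunning is to check that every host in the hostfile accepts a non-interactive ssh login. This is only a rough sketch and not part of Open MPI: the function names are invented for illustration, the hostfile is assumed to hold one host per line (optionally followed by `slots=N`, with `#` comments), and the check runs only from the machine where the script is started, so it will not catch a failure between two other nodes.

```python
import subprocess


def parse_hostfile(text):
    """Return the host names from hostfile text, skipping blanks and comments."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])  # drop trailing "slots=N" etc.
    return hosts


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_hosts(hosts, run=subprocess.run):
    """Return the hosts for which the non-interactive ssh check failed."""
    failed = []
    for host in hosts:
        result = run(
            ssh_check_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(host)
    return failed
```

To use it, read the hostfile, pass its contents to `parse_hostfile`, and hand the result to `check_hosts`; any host it returns is worth testing by hand with ssh and scp.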

Can you send me the MCA parameters that you are setting? That may help narrow down the problem as well. Also I cleaned up some of the filem (and snapc) error reporting in the development trunk if you want to give that a try.

Let me know what you find out.

Best,
Josh

On Apr 30, 2009, at 6:40 AM, Bouguerra mohamed slim wrote:

Hello,
I have a problem with the FileM module when I try to checkpoint to a remote host without a shared file system. I am using the new Open MPI 1.3.2, and the problem is the same as in version 1.3.1. When I use an NFS file system it works, so I suspect the problem is in FileM.

[azur-6.fr:23223] filem:rsh: wait_all(): Wait failed (-1)
[azur-6.fr:23223] [[48784,0],0] ORTE_ERROR_LOG: Error in file /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c at line 1054

--
Best regards,
Mohamed-Slim BOUGUERRA    PhD student INRIA-Grenoble / Projet MOAIS
ENSIMAG - antenne de Montbonnot
ZIRST 51, avenue Jean Kuntzmann
38330 MONTBONNOT SAINT MARTIN France
Tel :+33 (0)4 76 61 20 79
Fax :+33 (0)4 76 61 20 99

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_1.ckpt) translated to (/tmp/opal_snapshot_1.ckpt)
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_5.ckpt) translated to (/tmp/opal_snapshot_5.ckpt)
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_9.ckpt) translated to (/tmp/opal_snapshot_9.ckpt)
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_13.ckpt) translated to (/tmp/opal_snapshot_13.ckpt)
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:base: process_get_remote_path_cmd: [[52993,0],0] -> [[52993,0],0]: Filename Requested (/tmp/opal_snapshot_17.ckpt) translated to (/tmp/opal_snapshot_17.ckpt)
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/stable/ompi_global_snapshot_09_30/0
Host: sol-7.sophia.grid5000.fr

Will continue attempting to launch the process.

--------------------------------------------------------------------------
[sol-7.sophia.grid5000.fr:04545] filem:rsh: wait_all(): Wait failed (-1)
[sol-7.sophia.grid5000.fr:04545] [[52993,0],0] ORTE_ERROR_LOG: Error in file /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c at line 1054
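
Verbose output like the above is long and repetitive, so a short summary can make it easier to see what actually failed. The sketch below is an assumption-laden illustration, not an Open MPI tool: the function name `summarize_log` is invented, and the patterns are written only against the message formats that appear in this output.

```python
import re

# Patterns taken from the messages shown in the log above.
WAIT_FAILED = re.compile(r"filem:rsh: wait_all\(\): Wait failed \((-?\d+)\)")
ERROR_LOG = re.compile(r"ORTE_ERROR_LOG: (.+?) in file (\S+) at line (\d+)")
PRELOAD_WARNING = "WARNING: Could not preload specified file"


def summarize_log(text):
    """Count preload warnings and collect FileM wait failures and error locations."""
    return {
        "preload_warnings": text.count(PRELOAD_WARNING),
        "wait_failures": [int(code) for code in WAIT_FAILED.findall(text)],
        "error_locations": [
            (message, path, int(line))
            for message, path, line in ERROR_LOG.findall(text)
        ],
    }
```

Calling `summarize_log` on the saved output returns the number of preload warnings, the return codes reported by `wait_all()`, and the source file and line of each `ORTE_ERROR_LOG` entry.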





