OK, I have resolved the issue. The cause of the problem is that my SGE cluster is set up such that csh is the default shell rather than bash. For reasons that I don't understand, when a script is submitted to SGE, SGE ignores the shebang line and instead runs the script using the shell that SGE is configured to use.
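One way to picture this, offered as a sketch and not as a description of SGE internals: a shebang line is only consulted when the operating system executes the file directly. When some other program hands the script to a shell of its own choosing, the shebang is just a comment to that shell. The file name demo.sh and the messages below are made up for illustration, and the snippet runs entirely locally with bash; it does not talk to SGE.

```shell
#!/bin/bash
# Illustration only: a shebang is ignored when the interpreter is named explicitly.
# demo.sh is a hypothetical script; nothing here involves SGE.
tmpdir=$(mktemp -d)

cat > "$tmpdir/demo.sh" <<'EOF'
#!/bin/sh
# The shebang above asks for sh, but it is only a comment
# to whichever shell is actually told to read this file.
if [ -n "$BASH_VERSION" ]; then
  echo "interpreted by bash"
else
  echo "interpreted by some other shell"
fi
EOF

# Naming the interpreter explicitly bypasses the shebang:
# bash reads the file even though the first line requests sh.
explicit=$(bash "$tmpdir/demo.sh")
echo "bash demo.sh -> $explicit"

rm -rf "$tmpdir"
```

In the situation described here the roles are reversed: the script expects bash, and the shell that ends up reading it is csh.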
This is a problem for EMS. In the steps directory, EMS creates a script for each task to be run. The first item in that script is a line that looks like this:

    PATH="/path/to/foo:/path/to/bar"

This is invalid in csh. When any of the EMS-created scripts that start with that line is run by SGE under csh, the following error message is printed:

    PATH=/path/to/foo:/path/to/bar File not found

So, the above error is the first line in *every* STDERR file. This is not a problem until we reach the run-giza and run-giza-inverse steps. Those two steps ask experiment.perl to look for the message "not found" in the STDERR log, and if it finds that message, experiment.perl believes that giza died. This happens even though, in this case, giza actually ran successfully.

The solution is to add the string "-S /bin/bash" to qsub-settings in the EMS config file. Doing so tells SGE to launch scripts using bash instead of csh.

Might I suggest that, since the scripts EMS creates require bash, the -S /bin/bash flag be added as a default qsub flag by EMS? Would anyone have any objections if I were to make this change?

Cheers,
Lane

On Thu, Sep 20, 2012 at 1:52 PM, Lane Schwartz <[email protected]> wrote:
> The relevant digest files (steps/1/TRAINING_run-giza.1.STDERR.digest
> and steps/1/TRAINING_run-giza-inverse.1.STDERR.digest) each contain
> one line:
>
> not found
>
> The STDERR files for run-giza and run-giza-inverse when EMS crashes
> while running via SGE are (modulo time-stamp messages) identical to
> the respective STDERR files created for those steps when it
> successfully executes when run locally (without the -cluster flag).
>
> I did a grep in the ems scripts directory for the message "not found"
> - it appears in experiment.meta under the run-giza and
> run-giza-inverse steps, but I don't know enough about EMS to know why
> that error is being triggered.
>
> Any ideas for what else I should look for?
>
> Thanks,
> Lane
>
>
> On Thu, Sep 20, 2012 at 2:09 AM, Barry Haddow
> <[email protected]> wrote:
>> Hi Lane
>>
>> If ems failed on a given step, then there should be a message in the digest
>> file for that step. What exactly does ems report?
>>
>> Cheers - Barry
>>
>> Sent from my ZX81
>>
>> ----- Reply message -----
>> From: "Lane Schwartz" <[email protected]>
>> Date: Wed, Sep 19, 2012 20:18
>> Subject: [Moses-support] EMS, mgiza, and SGE
>> To: <[email protected]>
>>
>> I'm trying to get up to speed using EMS. I have a small dataset (IWSLT
>> 2008) that I am using to train, tune, and test using EMS.
>>
>> I am able to reliably run EMS on my data on a single machine.
>>
>> My config file specifies jobs=10 and qsub-settings="-l
>> hostname=*machinesA*|*machinesB*|*machinesC*" where the hostname
>> patterns match machine names in my grid.
>>
>> When I run experiment.perl with the -cluster flag, the experiment
>> runs, but it consistently dies while running run-giza and
>> run-giza-inverse. Strangely, when I look in the steps directory and
>> the training directory, it appears that mgiza has run successfully in
>> both directions. I don't see any error messages. Does anyone have any
>> idea what might be going on here?
>>
>> I am using the exact same config file, and it runs successfully when I
>> launch experiment.perl without the -cluster flag. When I use the
>> -cluster flag, everything runs successfully until it gets to the giza
>> steps, which it appears to run, and then EMS dies.
>>
>> Thanks,
>> Lane Schwartz
>>
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.

--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
