Re: [Wien] PBS run

2017-09-08 Thread Gavin Abo
You might have a look at the "WIEN2k-notes of the University of Texas"
document (slide 7) at:

http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html

The line

echo -n 'lapw0:' > .machines

writes only this to the .machines file:

lapw0:

However, you need to have it write the machine names after it, so
something like:

lapw0:gamma:2 delta:2 epsilon:4

However, your script has it write:

lapw0:granularity:1

Hence the ssh error about not being able to resolve and connect to a
hostname called "granularity".
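For what it's worth, a minimal sketch of the corrected .machines generation (hedged: 'r8n3' and the mock node file only stand in for whatever $PBS_NODEFILE contains on a real cluster, and one core per host is assumed):

```shell
#!/bin/bash
# Sketch of the corrected .machines generation.  Assumptions: one core per
# host, and a mock node file ('r8n3') so it can be run outside PBS.
nodefile=mock_nodefile               # on the cluster: nodefile=$PBS_NODEFILE
printf 'r8n3\n' > "$nodefile"

rm -f .machines
echo '#' > .machines
echo -n 'lapw0:' >> .machines            # >> so the '#' line is not clobbered
# append "host1:1 host2:1 ..." after 'lapw0:', taken from the node file
awk '{printf "%s%s:1", (NR == 1 ? "" : " "), $1}' "$nodefile" >> .machines
echo '' >> .machines
echo 'granularity:1' >> .machines        # an option line, not a host name
awk '{print "1:"$1":1"}' "$nodefile" >> .machines
echo 'extrafine:1' >> .machines
cat .machines
```

With the one-host mock file this produces a .machines with real host names after lapw0: (lines `#`, `lapw0:r8n3:1`, `granularity:1`, `1:r8n3:1`, `extrafine:1`), instead of "granularity" being misread as a host.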


On 9/8/2017 2:41 AM, Subrata Jana wrote:

Hi Gavin Abo,
It looks like I am facing the same problem.

##

#!/bin/bash
#PBS -N wien2k
#PBS -o out.log
#PBS -j oe
#PBS -l nodes=1:ppn=1

# Load Intel environment
source /apps/intel_2016_u2/compilers_and_libraries_2016.2.181/linux/bin/compilervars.sh intel64

export OMP_NUM_THREADS=1
cd /home/sjana/WIEN2k/PBE/C_pbe
rm -f .machines

#source /apps/intel_2016_u2/compilers_and_libraries_2016.2.181/linux/bin/compilervars.sh intel64


cd /home/sjana/WIEN2k/PBE/C_pbe
rm -f .machines
echo '#' > .machines
echo -n 'lapw0:' > .machines
echo 'granularity:1' >>.machines
#awk '{print "1:"$1":1"}' $PBS_NODEFILE >>.machines
awk '{print "1:"$1":1"}' "$PBS_NODEFILE" >>.machines
echo 'extrafine:1' >>.machines
#/home/sjana/WIEN2k_14.2/run_lapw -p -i 40 -ec .0001 -I
run_lapw -p -i 40 -ec .0001 -I
#

My .machines file looks like


lapw0:granularity:1
1:r8n3:1
extrafine:1


out.log

ssh: Could not resolve hostname granularity: Name or service not known^M

>   stop error

Regards,
S. Jana
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] PBS run

2017-09-08 Thread Gavin Abo
It looks like something is wrong with this line [ 
https://stackoverflow.com/questions/26816605/awk-fatal-cannot-open-file-for-reading-no-such-file-or-directory 
]:


awk '{print "1:"$1":1"}' $PBS_NODEFILE >>.machines

Maybe quotes are needed around the $PBS_NODEFILE:

awk '{print "1:"$1":1"}' "$PBS_NODEFILE" >>.machines
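A quick stand-alone check of the quoted form (illustrative only; the mktemp file stands in for the real PBS node file):

```shell
#!/bin/sh
# Illustration of the quoted form: the node-file path reaches awk as exactly
# one argument, so awk reads one file and prints the .machines line.
nodefile=$(mktemp)
printf 'r8n3\n' > "$nodefile"        # mock node list; PBS provides the real one
PBS_NODEFILE=$nodefile
line=$(awk '{print "1:"$1":1"}' "$PBS_NODEFILE")
echo "$line"
rm -f "$nodefile"
```

This prints 1:r8n3:1. If the variable ever expanded to several words (which the stray `-np` in the awk error hints at), the unquoted form would hand awk each word as a separate file name to open.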

On 9/8/2017 2:20 AM, Subrata Jana wrote:

Hi Gavin Abo,
I changed my job script as follows:

#
#!/bin/bash
#PBS -N wien2k
#PBS -o out.log
#PBS -j oe
#PBS -l nodes=1:ppn=1

# Load Intel environment
source /apps/intel_2016_u2/compilers_and_libraries_2016.2.181/linux/bin/compilervars.sh intel64


cd /home/sjana/WIEN2k/PBE/C_pbe
rm -f .machines
echo '#' > .machines
*echo -n 'lapw0:' > .machines*
echo 'granularity:1' >>.machines
awk '{print "1:"$1":1"}' $PBS_NODEFILE >>.machines
echo 'extrafine:1' >>.machines
/home/sjana/WIEN2k_14.2/run_lapw -p -i 40 -ec .0001 -I
###

Now the out.log

awk: cmd. line:1: fatal: cannot open file `-np' for reading (No such 
file or directory)

ssh: Could not resolve hostname granularity: Name or service not known^M

>   stop error

Regards,
S. Jana











Re: [Wien] PBS run

2017-09-08 Thread Gavin Abo

Does lapw0 exist in your WIEN2k directory (/home/sjana/WIEN2k_14.2)?

Maybe #PBS -V is needed [ 
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15985.html 
].


On 9/8/2017 1:42 AM, Subrata Jana wrote:

Dear All,
 I am trying to run WIEN2k in parallel. My shell script looks like 
this. However, the out.log file shows

--
lapw0: Command not found.

>   stop error
---

please help.

### script ##

#!/bin/bash
#PBS -N wien2k
#PBS -o out.log
#PBS -j oe
#PBS -l nodes=1:ppn=1

# Load Intel environment
source /apps/intel_2016_u2/compilers_and_libraries_2016.2.181/linux/bin/compilervars.sh intel64


cd /home/sjana/WIEN2k/PBE/C_pbe
rm -f .machines
echo '#' > .machines
echo 'granularity:1' >>.machines
awk '{print "1:"$1":1"}' $PBS_NODEFILE >>.machines
echo 'extrafine:1' >>.machines
/home/sjana/WIEN2k_14.2/run_lapw -p -i 40 -ec .0001 -I

Regards,
S. Jana





[Wien] PBS

2012-01-07 Thread Yundi Quan
Thank you all for helping to tackle this problem. Actually, my system
administrator seems to have done something which makes my life much easier.
Now everything is done automatically. When the job is killed, I get
the following:
 .machine0 : 80 processors
 Child id   1 Process termination signal received
 Child id   2 Process termination signal received
 Child id   3 Process termination signal received
 Child id   4 Process termination signal received
 Child id   5 Process termination signal received
 Child id   6 Process termination signal received
 Child id   7 Process termination signal received
 Child id  19 Process termination signal received
 Child id  24 Process termination signal received
 Child id  35 Process termination signal received
 Child id  40 Process termination signal received
 Child id  77 Process termination signal received
 Child id  53 Process termination signal received
 Child id  59 Process termination signal received
 Child id  69 Process termination signal received
 Child id  20 Process termination signal received
 Child id  28 Process termination signal received
 Child id  37 Process termination signal received
 Child id  42 Process termination signal received
 Child id  72 Process termination signal received
 Child id  48 Process termination signal received
 Child id  57 Process termination signal received
 Child id  70 Process termination signal received
 Child id  21 Process termination signal received
 Child id  25 Process termination signal received
 Child id  32 Process termination signal received
 Child id  46 Process termination signal received
 Child id  73 Process termination signal received
 Child id  49 Process termination signal received
 Child id  60 Process termination signal received
 Child id  64 Process termination signal received
 Child id  23 Process termination signal received
 Child id  26 Process termination signal received
 Child id  33 Process termination signal received
 Child id  41 Process termination signal received
 Child id  76 Process termination signal received
 Child id  50 Process termination signal received
 Child id  56 Process termination signal received
 Child id  68 Process termination signal received
 Child id  17 Process termination signal received
 Child id  27 Process termination signal received
 Child id  38 Process termination signal received
 Child id  47 Process termination signal received
 Child id  78 Process termination signal received
 Child id  51 Process termination signal received
 Child id  62 Process termination signal received
 Child id  65 Process termination signal received
 Child id  18 Process termination signal received
 Child id  30 Process termination signal received
 Child id  34 Process termination signal received
 Child id  43 Process termination signal received
 Child id  74 Process termination signal received
 Child id  52 Process termination signal received
 Child id  58 Process termination signal received
 Child id  66 Process termination signal received
 Child id  22 Process termination signal received
 Child id  31 Process termination signal received
 Child id  39 Process termination signal received
 Child id  44 Process termination signal received
 Child id  75 Process termination signal received
 Child id  54 Process termination signal received
 Child id  61 Process termination signal received
 Child id  67 Process termination signal received
 Child id  16 Process termination signal received
 Child id  29 Process termination signal received
 Child id  36 Process termination signal received
 Child id  45 Process termination signal received
 Child id  79 Process termination signal received
 Child id  55 Process termination signal received
 Child id  63 Process termination signal received

Yundi

On Fri, Jan 6, 2012 at 7:35 AM, Florent Boucher  wrote:

> Dear Laurence,
> your last lines are exactly what we need !
> Thank you for this.
>
>> set remote = "/bin/csh $WIENROOT/pbsh"
>>
>> $WIENROOT/pbsh is just
>> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
>>
> I will try it, but I am pretty sure that it will work fine.
> Regards
> Florent
>
> Le 05/01/2012 20:16, Laurence Marks a écrit :
>
>  I gave a slightly jetlagged response -- for certain WIEN2k style works
>> fine with all queuing systems.
>>
>> But...it may not fit how the queuing system has been designed and
>> admins may not be accommodating. My understanding (second hand) is 

[Wien] PBS

2012-01-06 Thread Florent Boucher
Dear Laurence,
your last lines are exactly what we need !
Thank you for this.
> set remote = "/bin/csh $WIENROOT/pbsh"
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
I will try it, but I am pretty sure that it will work fine.
Regards
Florent

Le 05/01/2012 20:16, Laurence Marks a écrit :
> I gave a slightly jetlagged response -- certainly, the WIEN2k style works
> fine with all queuing systems.
>
> But...it may not fit how the queuing system has been designed and
> admins may not be accommodating. My understanding (second hand) is that
> torque is designed to work well with openmpi for accounting, and by
> default knows nothing about tasks created by ssh. When the users time
> has elapsed it will terminate those tasks it knows about (the main one
> plus anything using mpirun) and ignore anything else. Hence for
> clusters where killing an ssh on node A does not propagate a kill to
> children on node B (which depends upon the ssh) one is left with
> processes that can run forever. There is something called an epilog
> script which maybe can do this, but it would need WIEN2k to create one
> every time it launches a set of tasks. Possible, but not trivial.
>
> Note: this is not just a WIEN2k problem. One of the admins at NU's
> large cluster is a friend and he tells me that every now and then he
> goes around and tries to clean up tasks left running like this on
> nodes from all sorts of software. Sometimes he has to reboot nodes
> since if torque believes there is nothing running on a node it will
> merrily create more tasks on it which can lead to heavy
> oversubscription and hang the node.
>
> And...just to make life more fun, torque knows nothing about MKL
> threading so on an 8-core node can easily start 8 different non-mpi
> jobs and if they all want 8 threads...
>
> Probably too long a response. Below is the parallel_options file that
> I use on a system with moab (similar, perhaps worse than pbs) where I
> try and be a "gentleman" and set the mkl threading as well as use
> mpirun to launch tasks.
>
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
> -machinefile _HOSTS_ _EXEC_"
> set a=`grep -e "1:" .machines | grep -v lapw0 | head -1 | cut -f 3 -d:
> | cut -c 1-2`
> setenv MKL_NUM_THREADS $a
> setenv OMP_NUM_THREADS $a
> setenv MKL_DYNAMIC FALSE
> if (-e local_options ) source local_options
> set remote = "/bin/csh $WIENROOT/pbsh"
> set delay   = 0.25
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
>
> With this at least I don't create problems (hopefully).
>
> On Thu, Jan 5, 2012 at 7:19 AM, Peter Blaha
>   wrote:
>> It is NOT true that queuing systems cannot do the "WIEN2k style".
>>
>> We have two big clusters and run on them all three types of jobs,
> i) only ssh (k-parallel), ii) only mpi-parallel (no ssh), and also
> the mixed type.
>>
>> And of course the administrators configured the "sun grid engine" so that it
>> makes sure that there are no processes running when a job finishes and
>> eventually
>> kill all processes of a batch job on all the assigned nodes after it has
>> finished.
>>
> It's just a matter of whether the system programmers are willing (or able ??)
> to reconfigure the queuing system...
>>
> PS: If you are running mpi-parallel, use   setenv MPI_REMOTE 0 in
>> $WIENROOT/parallel_options and ssh will not be used anyway.
>>
>> Am 05.01.2012 13:17, schrieb Laurence Marks:
>>> As Florent said, this is a known issue with some (not all) versions ofssh,
>>> and it is also a torque bug. What you have to do is use mpiruninstead of ssh
>>> to launch jobs which I think you can do by setting theMPI_REMOTE/USE_REMOTE
>>> switches. I think I posted how to do this sometime ago, so please search the
>>> mailing list. (I am in China and canprovide more information next week when
>>> I return if this is notenough, which it probably is not.)
>>> N.B., in case anyone wonders with torque (PBS) you are not "supposedto"
>>> use ssh to communicate the way Wien2k does. They are not going tomove on
>>> this so this is "WIen2k's fault". I've looked in to this quitea bit and
>>> there is no solution except to avoid ssh (or live withzombie processes).
>>> Indeed, torque has the weakness of leavingprocesses around if a code does
>>> anything more adventurous than justrun a single mpirun -- so it goes.
>>> On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha
>>>   wrote:>I've never done this myself, but as far as I know one can 
>>> define
>>> a>"prolog" script in all those queuing systems and this prolog script>
>>>   should ssh to all assigned nodes and kill all remaining jobs of this
>>> user.>>>Am 05.01.2012 10:17, schrieb Florent Boucher:>>>Dear 
>>> Yundi,>>
>>>   this is a known limitation of ssh and rsh that does not pass the
>>> interrupt>>signal to the remote host.>>Under LSF I had in the past a
>>> solution. 

[Wien] PBS

2012-01-05 Thread Peter Blaha
It is NOT true that queuing systems cannot do the "WIEN2k style".

We have two big clusters and run all three types of jobs on them:
i) only ssh (k-parallel), ii) only mpi-parallel (no ssh), and also
the mixed type.

And of course the administrators configured the "sun grid engine" so that it
makes sure there are no processes running when a job finishes, and
eventually kills all processes of a batch job on all the assigned nodes
after it has finished.

It's just a matter of whether the system programmers are willing (or able ??)
to reconfigure the queuing system...

PS: If you are running mpi-parallel, use   setenv MPI_REMOTE 0 in
$WIENROOT/parallel_options and ssh will not be used anyway.

Am 05.01.2012 13:17, schrieb Laurence Marks:
> As Florent said, this is a known issue with some (not all) versions of ssh,
> and it is also a torque bug. What you have to do is use mpirun instead of ssh
> to launch jobs, which I think you can do by setting the MPI_REMOTE/USE_REMOTE
> switches. I think I posted how to do this some time ago, so please search the
> mailing list. (I am in China and can provide more information next week when I
> return if this is not enough, which it probably is not.)
> N.B., in case anyone wonders, with torque (PBS) you are not "supposed to" use
> ssh to communicate the way Wien2k does. They are not going to move on this, so
> this is "Wien2k's fault". I've looked into this quite a bit and there is no
> solution except to avoid ssh (or live with zombie processes). Indeed, torque
> has the weakness of leaving processes around if a code does anything more
> adventurous than just running a single mpirun -- so it goes.
> On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha wrote:
>> I've never done this myself, but as far as I know one can define a
>> "prolog" script in all those queuing systems and this prolog script
>> should ssh to all assigned nodes and kill all remaining jobs of this user.
>> Am 05.01.2012 10:17, schrieb Florent Boucher:
>>> Dear Yundi,
>>> this is a known limitation of ssh and rsh that does not pass the interrupt
>>> signal to the remote host.
>>> Under LSF I had in the past a solution. It was a specific rshlsf for doing
>>> this.
>>> Actually I use either SGE or PBS on two different clusters and the problem
>>> exists.
>>> You will see that you are not even able to suspend a running job.
>>> If someone has a solution, I will also appreciate it.
>>> Regards
>>> Florent
>>> Le 04/01/2012 21:57, Yundi Quan a écrit :
>>>> I'm working on a cluster using the torque queue system. I can directly
>>>> ssh to any nodes without using a password. When I use qdel (or canceljob)
>>>> jobid to terminate a running job, the job will be terminated in the queue
>>>> system. However, when I ssh to the nodes, the jobs are still running.
>>>> Does anyone know how to avoid this?

-- 

   P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300

[Wien] PBS

2012-01-05 Thread Laurence Marks
I gave a slightly jetlagged response -- certainly, the WIEN2k style works
fine with all queuing systems.

But...it may not fit how the queuing system has been designed and
admins may not be accommodating. My understanding (second hand) is that
torque is designed to work well with openmpi for accounting, and by
default knows nothing about tasks created by ssh. When the users time
has elapsed it will terminate those tasks it knows about (the main one
plus anything using mpirun) and ignore anything else. Hence for
clusters where killing an ssh on node A does not propagate a kill to
children on node B (which depends upon the ssh) one is left with
processes that can run forever. There is something called an epilog
script which maybe can do this, but it would need WIEN2k to create one
every time it launches a set of tasks. Possible, but not trivial.

Note: this is not just a WIEN2k problem. One of the admins at NU's
large cluster is a friend and he tells me that every now and then he
goes around and tries to clean up tasks left running like this on
nodes from all sorts of software. Sometimes he has to reboot nodes
since if torque believes there is nothing running on a node it will
merrily create more tasks on it which can lead to heavy
oversubscription and hang the node.

And...just to make life more fun, torque knows nothing about MKL
threading so on an 8-core node can easily start 8 different non-mpi
jobs and if they all want 8 threads...

Probably too long a response. Below is the parallel_options file that
I use on a system with moab (similar, perhaps worse than pbs) where I
try to be a "gentleman" and set the MKL threading as well as use
mpirun to launch tasks.

setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
-machinefile _HOSTS_ _EXEC_"
set a=`grep -e "1:" .machines | grep -v lapw0 | head -1 | cut -f 3 -d:
| cut -c 1-2`
setenv MKL_NUM_THREADS $a
setenv OMP_NUM_THREADS $a
setenv MKL_DYNAMIC FALSE
if (-e local_options ) source local_options
set remote = "/bin/csh $WIENROOT/pbsh"
set delay   = 0.25

$WIENROOT/pbsh is just
mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "

With this at least I don't create problems (hopefully).
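As an aside for readers puzzling over the grep/cut line in the parallel_options above: fed a made-up .machines, it extracts the per-host core count from the first k-point line (the value assigned to MKL_NUM_THREADS/OMP_NUM_THREADS). A stand-alone illustration:

```shell
#!/bin/sh
# Made-up .machines file, only to show what the pipeline computes.
cat > machines.example <<'EOF'
lapw0:gamma:2 delta:2
1:gamma:8
1:delta:8
extrafine:1
EOF
# same pipeline as in parallel_options, pointed at the example file:
# keep "1:" lines, drop lapw0, take the first, grab field 3 (the core count)
a=$(grep -e "1:" machines.example | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2)
echo "$a"
rm -f machines.example
```

Here it prints 8; the trailing `cut -c 1-2` merely caps the value at two digits (e.g. 16).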

On Thu, Jan 5, 2012 at 7:19 AM, Peter Blaha
 wrote:
> It is NOT true that queuing systems cannot do the "WIEN2k style".
>
> We have two big clusters and run on them all three types of jobs,
> i) only ssh (k-parallel), ii) only mpi-parallel (no ssh), and also
> the mixed type.
>
> And of course the administrators configured the "sun grid engine" so that it
> makes sure that there are no processes running when a job finishes and
> eventually
> kill all processes of a batch job on all the assigned nodes after it has
> finished.
>
> It's just a matter of whether the system programmers are willing (or able ??)
> to reconfigure the queuing system...
>
> PS: If you are running mpi-parallel, use   setenv MPI_REMOTE 0 in
> $WIENROOT/parallel_options and ssh will not be used anyway.
>
> Am 05.01.2012 13:17, schrieb Laurence Marks:
>>
>> As Florent said, this is a known issue with some (not all) versions of ssh,
>> and it is also a torque bug. What you have to do is use mpirun instead of
>> ssh to launch jobs, which I think you can do by setting the
>> MPI_REMOTE/USE_REMOTE switches. I think I posted how to do this some time
>> ago, so please search the mailing list. (I am in China and can provide more
>> information next week when I return if this is not enough, which it
>> probably is not.)
>> N.B., in case anyone wonders, with torque (PBS) you are not "supposed to"
>> use ssh to communicate the way Wien2k does. They are not going to move on
>> this, so this is "Wien2k's fault". I've looked into this quite a bit and
>> there is no solution except to avoid ssh (or live with zombie processes).
>> Indeed, torque has the weakness of leaving processes around if a code does
>> anything more adventurous than just running a single mpirun -- so it goes.
>> On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha wrote:
>>> I've never done this myself, but as far as I know one can define a
>>> "prolog" script in all those queuing systems and this prolog script
>>> should ssh to all assigned nodes and kill all remaining jobs of this user.
>>> Am 05.01.2012 10:17, schrieb Florent Boucher:
>>>> Dear Yundi,
>>>> this is a known limitation of ssh and rsh that does not pass the
>>>> interrupt signal to the remote host.
>>>> Under LSF I had in the past a solution. It was a specific rshlsf for
>>>> doing this.
>>>> Le 04/01/2012 21:57, Yundi Quan a écrit :
>>>>> I'm working on a cluster using the torque queue system. I can directly
>>>>> ssh to any nodes without using a password. When I use qdel (or
>>>>> canceljob) j
>
>

[Wien] PBS

2012-01-05 Thread Peter Blaha
I've never done this myself, but as far as I know one can define a
"prolog" script in all those queuing systems and this prolog script
should ssh to all assigned nodes and kill all remaining jobs of this user.
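As a rough illustration of that suggestion (purely an assumption about how such a script might look; the user-name argument and the /proc scan are placeholders, not the actual prolog/epilog interface of any queuing system):

```shell
#!/bin/sh
# Hypothetical sketch of what a prolog/epilog could do on each assigned node:
# find every process owned by the job's user and, on a real cluster, kill it.
# The kill is left commented out, and a real script would also exclude its
# own PID before killing.
job_user=${1:-$(id -un)}
uid=$(id -u "$job_user")
leftovers=""
for d in /proc/[0-9]*; do
    # the owner UID is the 2nd field of the "Uid:" line in /proc/<pid>/status
    owner=$(awk '/^Uid:/ { print $2; exit }' "$d/status" 2>/dev/null)
    [ "$owner" = "$uid" ] && leftovers="$leftovers ${d#/proc/}"
done
echo "processes owned by $job_user:$leftovers"
# a real epilog would now do (excluding its own PID):  kill -9 $leftovers
```

Run without arguments it just lists the current user's processes, which makes the selection logic easy to inspect before arming the kill.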


Am 05.01.2012 10:17, schrieb Florent Boucher:
> Dear Yundi,
> this is a known limitation of ssh and rsh that does not pass the interrupt 
> signal to the remote host.
> Under LSF I had in the past a solution. It was a specific rshlsf for doing 
> this.
> Actually I use either SGE or PBS on two different clusters and the problem 
> exists.
> You will see that you are not even able to suspend a running job.
> If someone has a solution, I will also appreciate it.
> Regards
> Florent
>
> Le 04/01/2012 21:57, Yundi Quan a écrit :
>> I'm working on a cluster using the torque queue system. I can directly ssh to 
>> any nodes without using a password. When I use qdel (or canceljob) jobid to 
>> terminate a running job, the
>> job will be terminated in the queue system. However, when I ssh to the 
>> nodes, the jobs are still running. Does anyone know how to avoid this?
>>
>>
>>
>> ___
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
>
> --
>   -
> | Florent BOUCHER||
> | Institut des Matériaux Jean Rouxel |Mailto:Florent.Boucher at cnrs-imn.fr  |
> | 2, rue de la Houssinière   | Phone: (33) 2 40 37 39 24  |
> | BP 32229   | Fax:   (33) 2 40 37 39 95  |
> | 44322 NANTES CEDEX 3 (FRANCE)  |http://www.cnrs-imn.fr  |
>   -
>
>
>
> ___
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien

-- 

   P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--



[Wien] PBS

2012-01-05 Thread Florent Boucher
Dear Yundi,
this is a known limitation of ssh and rsh: they do not pass the interrupt 
signal to the remote host.
Under LSF I had a solution in the past. It was a specific rshlsf for doing this.
Actually I use either SGE or PBS on two different clusters and the problem 
exists.
You will see that you are not even able to suspend a running job.
If someone has a solution, I will also appreciate it.
Regards
Florent

  Le 04/01/2012 21:57, Yundi Quan a écrit :
> I'm working on a cluster using the torque queue system. I can directly ssh to 
> any nodes without using a password. When I use qdel (or canceljob) jobid to 
> terminate a running job, the job will be terminated in the queue system. 
> However, when I ssh to the nodes, the jobs are still running. Does anyone 
> know how to avoid this?
>
>
>
> ___
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien


-- 
  -
| Florent BOUCHER||
| Institut des Matériaux Jean Rouxel | Mailto:Florent.Boucher at cnrs-imn.fr |
| 2, rue de la Houssinière   | Phone: (33) 2 40 37 39 24  |
| BP 32229   | Fax:   (33) 2 40 37 39 95  |
| 44322 NANTES CEDEX 3 (FRANCE)  | http://www.cnrs-imn.fr |
  -




[Wien] PBS

2012-01-05 Thread Laurence Marks
As Florent said, this is a known issue with some (not all) versions of
ssh, and it is also a torque bug. What you have to do is use mpirun
instead of ssh to launch jobs which I think you can do by setting the
MPI_REMOTE/USE_REMOTE switches. I think I posted how to do this some
time ago, so please search the mailing list. (I am in China and can
provide more information next week when I return if this is not
enough, which it probably is not.)

N.B., in case anyone wonders, with torque (PBS) you are not "supposed
to" use ssh to communicate the way Wien2k does. They are not going to
move on this, so this is "Wien2k's fault". I've looked into this quite
a bit and there is no solution except to avoid ssh (or live with
zombie processes). Indeed, torque has the weakness of leaving
processes around if a code does anything more adventurous than just
run a single mpirun -- so it goes.

On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha
 wrote:
> I've never done this myself, but as far as I know one can define a
> "prolog" script in all those queuing systems and this prolog script
> should ssh to all assigned nodes and kill all remaining jobs of this user.
>
>
> Am 05.01.2012 10:17, schrieb Florent Boucher:
>
>> Dear Yundi,
>> this is a known limitation of ssh and rsh that does not pass the interrupt
>> signal to the remote host.
>> Under LSF I had in the past a solution. It was a specific rshlsf for doing
>> this.
>> Actually I use either SGE or PBS on two different clusters and the problem
>> exists.
>> You will see that you are not even able to suspend a running job.
>> If someone has a solution, I will also appreciate it.
>> Regards
>> Florent
>>
>> Le 04/01/2012 21:57, Yundi Quan a écrit :
>>>
>>> I'm working on a cluster using the torque queue system. I can directly ssh to
>>> any nodes without using a password. When I use qdel (or canceljob) jobid to
>>> terminate a running job, the
>>> job will be terminated in the queue system. However, when I ssh to the
>>> nodes, the jobs are still running. Does anyone know how to avoid this?
>>>
>>>
>>>
>>> ___
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>> --
>>  -
>> | Florent BOUCHER                    |                                      |
>> | Institut des Matériaux Jean Rouxel |Mailto:Florent.Boucher at cnrs-imn.fr |
>> | 2, rue de la Houssinière           | Phone: (33) 2 40 37 39 24            |
>> | BP 32229                           | Fax:   (33) 2 40 37 39 95            |
>> | 44322 NANTES CEDEX 3 (FRANCE)      |http://www.cnrs-imn.fr                |
>>  -
>>
>>
>>
>> ___
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
>
> --
>
>                                      P.Blaha
> --
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WWW:
> http://info.tuwien.ac.at/theochem/
> --
>
>
> ___
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi


[Wien] PBS

2012-01-04 Thread Yundi Quan
I'm working on a cluster using the torque queue system. I can directly ssh to
any nodes without using a password. When I use qdel (or canceljob) jobid to
terminate a running job, the job will be terminated in the queue system.
However, when I ssh to the nodes, the jobs are still running. Does anyone
know how to avoid this?