Roc Wang <[email protected]> writes:

>> From: [email protected]
>> To: [email protected]
>> CC: [email protected]
>> Subject: RE: [petsc-users] Scalability of PETSc on vesta.alcf
>> Date: Mon, 20 Jan 2014 10:32:32 -0700
>> 
>> Roc Wang <[email protected]> writes:
>> >   I tried c16 for 1024 ranks and 2048 ranks, but the job cannot run
>> >   successfully. It seems the job was started but the program didn't
>> >   execute. Please take a look at the attached log file for 1024 with
>> >   c16 mode. Is this because I didn't set some environment parameters
>> >   correctly? The same program only runs with 1024 ranks in c1, c2,
>> >   c32 and c64 modes, and with 2048 ranks in c64 mode.
>> 
>> You have non-scalable "Generate Vector" and VecView (the latter maybe
>> because you don't use MPI-IO?).  It is probably failing at this step.
>> 
>> | qsub -A SUGAR -t 00:10:00 -n 512 --proccount 2048 --mode script ./vesta.job
>> 
>> I thought you said you were trying c16?
>
> Yes, I said so. But I tried it both ways: qsub with the executable and
> qsub with a script. The command is like this:

I was trying to reconcile the inconsistency between "-n 512 --proccount
2048" and using c16.  Anyway, I suspect something is going wrong in your
I/O, or possibly you are tripping over a system bug (we've seen a few
lately) that causes something to hang.
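
To make the mismatch concrete, here is a minimal sketch of the
arithmetic, assuming ranks per node is simply the total rank count
divided by the node count.  The helper name ranks_per_node is made up
for illustration; it is not part of Cobalt or runjob.

```shell
#!/bin/bash
# Hypothetical helper: ranks per node implied by a node count and a total
# rank count. Assumes ranks are spread evenly across nodes.
ranks_per_node() {
  local nodes=$1 ranks=$2
  if (( nodes <= 0 || ranks % nodes != 0 )); then
    echo "inconsistent"
    return 1
  fi
  echo $(( ranks / nodes ))
}

ranks_per_node 512 2048   # prints 4
ranks_per_node 64 1024    # prints 16
ranks_per_node 16 1024    # prints 64
```

Under that assumption, "-n 512 --proccount 2048" works out to 4 ranks
per node rather than 16, while "-n 64" with 1024 ranks matches c16 and
"-n 16" with 1024 ranks matches c64.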

You should start reducing the code/problem size and perhaps profiling
with more wall time to see what is taking so long.  Find something that
doesn't crash/hang/exceed wall time, then bisect to identify the
underlying problem.  We can't say more with the information you have
provided.
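
The bisection could be sketched roughly as below.  This is only an
illustration: try_run is a hypothetical stand-in for "submit the job at
this size and check whether it finished within the wall time", stubbed
out here so the loop runs anywhere, and the threshold of 300 is made up.

```shell
#!/bin/bash
# Hypothetical stand-in for a real job submission plus a check that it
# completed. This stub pretends every size above 300 hangs.
try_run() { (( $1 <= 300 )); }

# Narrow the gap between a size known to work and a size known to fail.
bisect_size() {
  local good=$1 bad=$2 mid
  while (( bad - good > 1 )); do
    mid=$(( (good + bad) / 2 ))
    if try_run "$mid"; then
      good=$mid    # still works, so the problem appears above mid
    else
      bad=$mid     # fails, so the problem appears at or below mid
    fi
  done
  echo "$good $bad"
}

bisect_size 64 1024   # prints "300 301" with the stub above
```

The same loop applies whether "size" means the rank count, the grid
size, or which stages of the code are enabled; only try_run changes.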

> qsub -n 64 -t 10 --mode c16 -O p1024_c16 --env "F00=a:BAR=b" ./x.r -ksp_type 
> bcgsl -ksp_bcgsl_ell 1 -sub_pc_type ilu -sub_pc_factor_levels 3 -sub_ksp_type 
> preonly -my_ksp_monitor true -ksp_view -log_summary
>
> the script:
>
> #!/bin/bash
>
> proN=1024
>
> preName=p$proN
>
> echo "Script JOB with Jobid COBALT_JOBID="$preName
>
>
> qsub -A SUGAR -t 00:10:00 -n 64   --proccount $proN  --mode script ./vesta.job
>
>
> and vesta.job:
>
> #!/bin/sh
> Nrank=1024
> echo Starting Cobalt job script
>
> LOCARGS="--block $COBALT_PARTNAME ${COBALT_CORNER:+--corner} $COBALT_CORNER 
> ${COBALT_SHAPE:+--shape} $COBALT_SHAPE"
>
> runjob $LOCARGS -n $Nrank -p 16 :  x.r -ksp_type bcgsl -ksp_bcgsl_ell 1 
> -sub_pc_type ilu -sub_pc_factor_levels 3 -sub_ksp_type preonly 
> -my_ksp_monitor true -ksp_view -log_summary
>
> echo End of jobscript.sh
>
> exit 0
>
> Neither of them runs the program successfully. In both cases, the
> runtime log shows that the job started, but nothing was written to the
> stdout file.
>
> I just ran the same program with:
> qsub -n 16 -t 10 --mode c64 -O n1024_c64 --env "F00=a:BAR=b" ./x.r -ksp_type 
> bcgsl -ksp_bcgsl_ell 1 -sub_pc_type ilu -sub_pc_factor_levels 3 -sub_ksp_type 
> preonly -my_ksp_monitor true -ksp_view -log_summary
>
> The job ran, and the stdout file showed all the runtime output.
> If there is a non-scalable "Generate Vector" and VecView (the latter maybe
> because you don't use MPI-IO?), why is c64 mode able to run? It seems
> strange to me. Thanks.
>                                         
