Do the "thousands upon thousands" of sub-processes have dependencies among
them or are they fully independent of each other?  Is it necessary to spawn
them using srun; i.e., are you using srun to provide job step accounting or
to make them subject to scheduling policies or what?  Just trying to
understand your context.

Gary D. Brown
Adaptive Computing


On Thu, Apr 14, 2016 at 4:02 PM, Pyramid Bioengineering <
[email protected]> wrote:

> Hi All,
>
> Our team is using Slurm to distribute tasks across a cluster, but our
> implementation may be a little different than what the typical person is
> doing... maybe?
>
> We'll submit a very simple sbatch, like so:
>
> ```
> #!/bin/bash
> #SBATCH --error=/tmp/error.log
> #SBATCH --output=/tmp/output.log
> execute_algorithm arg1 arg2
> ```
>
> `execute_algorithm` is where things get a bit funny: it can be some
> variant of a complex C algorithm of ours that may spawn thousands upon
> thousands of subprocess invocations. Each of these subprocesses is
> launched with `srun`, and Slurm is successfully recognizing them as job
> steps. It should also be noted that we wait for all of these
> subprocesses to exit before `execute_algorithm` itself exits.
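>
> To make the pattern concrete, here is a rough shell sketch of what
> `execute_algorithm` does internally. `do_subtask` is a made-up stand-in;
> in reality each call is an `srun` job step (something along the lines of
> `srun --ntasks=1 --exclusive <subtask binary>`):
>
> ```
> #!/bin/bash
> # Rough sketch of the fan-out inside execute_algorithm.
> # do_subtask is hypothetical; in our real code each call is an
> # srun job step launched from the C program.
> do_subtask() { echo "subtask $1 done"; }
>
> for i in $(seq 1 8); do
>     do_subtask "$i" &   # every step is fired off immediately, no throttle
> done
> wait                    # block until all steps have exited
> ```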
>
> The question here is: can Slurm handle this sort of srun task
> allotment, with all of the steps fired off at once?
>
> In testing, this has worked on small jobs that are just above the
> limits of our node resources. I can see entries in the job error log
> showing that Slurm recognizes it is hitting task capacity and waits:
>
> `srun: Job step creation temporarily disabled, retrying`
>
> We have yet to get to a point where we can run the "thousands" of tasks
> that I speak of, but that will be coming up at the end of the month, and
> frankly I'm skeptical.
>
> Is this a common approach? If we are stuck with it and there is no
> other way to do this, do we just build some internal scheduling logic
> into `execute_algorithm`?
>
> Thanks!
>
