Alright. I didn't see that option for GNU parallel. Retrying a task that failed 
for good reason (e.g. due to OOM) may not make much sense. And if the farming 
job timed out, GNU parallel does not resume from its former state on restart, 
does it? I guess the bookkeeping is an extra issue, which is probably why 
Magnus also used a server with a database or something similar.

But ok. GNU parallel's documentation is indeed quite vast. I'll try to work 
through its other/new features (it is also still under active development ... ).
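Regarding the bookkeeping point above: GNU parallel's --joblog output is a plain tab-separated file, so a passed/failed overview can be scraped from it. A minimal sketch, using a made-up sample log in place of a real one (a real log would come from something like `parallel --joblog run.log ...`); the column layout assumed here is GNU parallel's default joblog header:

```python
# Sketch: summarising a GNU parallel --joblog file into pass/fail counts.
# Assumes the default joblog layout: one header line, then tab-separated
# rows with the exit value in the "Exitval" column.
import csv
import io

# Hypothetical sample joblog content, for illustration only.
sample = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t0\t2.0\t0\t0\t0\t0\t./task 1\n"
    "2\t:\t0\t2.1\t0\t0\t1\t0\t./task 2\n"
    "3\t:\t0\t2.2\t0\t0\t0\t0\t./task 3\n"
)

def joblog_stats(text):
    """Count tasks with exit value 0 (passed) vs non-zero (failed)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    passed = sum(1 for r in rows if r["Exitval"] == "0")
    return {"passed": passed, "failed": len(rows) - passed}

print(joblog_stats(sample))  # {'passed': 2, 'failed': 1}
```

The failed entries can then be rerun with `parallel --joblog run.log --resume-failed ...`, or retried automatically up to N times with `--retries N`.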


Concerning Dask ... I've heard of it, but never tried it (because Intel 
advertised it ... 😏 ).

Maybe I should reconsider that.
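For reference, a rough sketch of what task farming on Slurm looks like with Dask via the dask-jobqueue package. This is a configuration sketch, not something runnable outside a Slurm system; the partition name, core and memory figures are placeholders, and it assumes `dask`, `distributed` and `dask-jobqueue` are installed:

```python
# Hedged sketch: a Dask task farm where workers are Slurm jobs.
# Requires a reachable Slurm controller; values below are placeholders.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

def task(x):
    return x * x  # stand-in for the real per-task work

cluster = SLURMCluster(
    cores=4,              # cores per Slurm job
    memory="8GB",         # memory per Slurm job
    walltime="01:00:00",
    queue="batch",        # placeholder partition name
)
cluster.scale(jobs=10)    # submit 10 Slurm jobs as Dask workers

client = Client(cluster)
futures = client.map(task, range(1000))
results = client.gather(futures)  # a failed task raises here
```

The distributed scheduler tracks per-task state, and the dashboard mentioned below (including a Jupyter integration) is reachable via `client.dashboard_link`, which would cover the pass/fail overview I was missing.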


Thank you for this input!

KR, Martin


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ward 
Poelmans <ward.poelm...@vub.be>
Sent: Wednesday, 18 January 2023 15:35
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] srun jobfarming hassle question

On 18/01/2023 15:22, Ohlerich, Martin wrote:
> But Magnus (thanks for the link!) is right. This is still far from a 
> feature-rich job- or task-farming concept, where at least some overview of 
> the passed/failed/missing task statistics is available, etc.

GNU parallel has log output and options to retry failed jobs.

If you want really fancy stuff, maybe look at Dask combined with Slurm plugins? 
It has dashboards for Jupyter, I believe.

Ward
