On Sat, May 15, 2010 at 06:24:54AM -0400, Andrew Piskorski wrote:

> 1. I have lots of embarrassingly parallel tree-structured jobs which I
> dynamically generate and submit from top-level user code (which
> happens to be written in R). E.g., my user code generates 10 or 100
> or 1000 jobs, and each of those jobs might itself generate N jobs.
> Any given job cannot complete until all its children complete.

Condor's "MW" master-worker API and DAGMan both sound potentially
useful for my tree-structured jobs. However...

Does MW support multiple levels of masters and workers? (That's what
I need.) The docs never mention it, not even when discussing the
scalability limitations of a single master process, so I presume it
does not. MW also requires both Condor and Condor-PVM.

  http://www.cs.wisc.edu/condor/mw/
  http://www.cs.wisc.edu/condor/mw/overview.html
  http://www.cs.wisc.edu/condor/pvm/

Since Condor does not itself understand inter-job dependencies at all,
it seems that two MW master programs running at the same time could
readily deadlock each other. At least, I don't see anything in either
MW or Condor proper that would prevent or ameliorate that risk.

From its docs, DAGMan is purely static: it has to know about all the
jobs ahead of time, before any of them start, and cannot dynamically
submit new jobs (no good for me). It sits as a separate layer above
Condor; Condor itself does not understand inter-job dependencies at
all. DAGMan's docs also say it has no way to recover if even a single
one of its jobs fails; it aborts the entire DAG. That seems strange,
as I'd have thought that Condor itself must support some sort of job
restart when a node goes down (or is otherwise removed from the Condor
pool). Does it really not?

  http://www.cs.wisc.edu/condor/dagman/
  http://www.cs.wisc.edu/condor/manual/v6.1/2_11Inter_job_Dependencies.html

The DAGMan stuff sounds like a research hack that's not really fully
supported by Condor. AFAICT MW and DAGMan are also entirely unrelated
to each other. Does anybody actually use either of those tools?

And of course, it's not clear whether Condor in general would really
meet the needs I laid out earlier.
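
To make the deadlock worry above concrete, here is a toy Python sketch.
It is not Condor or MW code; the `simulate` function, its parameters,
and the FIFO slot policy are all assumptions made up for illustration.
It models a scheduler that knows nothing about dependencies and simply
hands free slots to queued jobs, while each parent job holds its slot
until all of its children have finished.

```python
from collections import deque


def simulate(num_slots, num_parents, children_per_parent):
    """Return True if every job finishes, False if the pool deadlocks.

    Hypothetical toy model: a dependency-unaware scheduler gives free
    slots to queued jobs in FIFO order. A parent keeps its slot until
    all of its children are done; a child runs for one tick.
    """
    queue = deque(("parent", p) for p in range(num_parents))
    free = num_slots
    waiting = {}           # running parent id -> children still unfinished
    running_children = []  # parent ids of children currently holding a slot

    while queue or waiting or running_children:
        progressed = False

        # Scheduler: hand free slots to queued jobs in FIFO order.
        while free > 0 and queue:
            kind, parent = queue.popleft()
            free -= 1
            progressed = True
            if kind == "parent":
                # The parent starts, submits its children, then blocks.
                waiting[parent] = children_per_parent
                queue.extend(("child", parent)
                             for _ in range(children_per_parent))
            else:
                running_children.append(parent)

        # Every running child finishes and releases its slot.
        for parent in running_children:
            waiting[parent] -= 1
            free += 1
            progressed = True
        running_children = []

        # Parents whose children are all done finish and release slots.
        for parent in [p for p, left in waiting.items() if left == 0]:
            del waiting[parent]
            free += 1
            progressed = True

        if not progressed:
            # Jobs remain, but nothing can start or finish: deadlock.
            return False
    return True
```

In this model, with 4 slots, `simulate(4, 2, 3)` finishes, but
`simulate(4, 4, 3)` reports a deadlock: all four slots are held by
parents waiting on children that can never start.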

--
Andrew Piskorski <[email protected]>
http://www.piskorski.com/
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
