Hi

Over the last few days, I have encountered some very bizarre behavior of the internal Galaxy job scheduler. It all started with using the API to populate Data Libraries:

Instead of creating individual Data Libraries when requested, I decided to make all HiSEQ and MiSEQ data produced in our institute over the last two years available.

I used the following call from BioBlend

upload_from_galaxy_filesystem(library_id, filesystem_paths,
                              folder[0]["id"], type, dbkey='?',
                              link_data_only='link_to_files', roles='')

to link ~19000 (fastq and metadata) files into several sub-folders of either the 'HiSEQ' or the 'MiSEQ' Data Library. That all worked very well - by the way, a big thank you to all the BioBlend developers!
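In case it helps, the loop looked roughly like this (a simplified sketch, not the actual script: the server URL, the API key, the 'runs' dictionary mapping run names to file paths, and the 'fastqsanger' file type are placeholders):

from bioblend.galaxy import GalaxyInstance

# connect to our development server (URL and API key are placeholders)
gi = GalaxyInstance(url='https://galaxy-dev.example.org', key='MY_API_KEY')

# one Data Library per sequencer, e.g. 'HiSEQ'
library = gi.libraries.get_libraries(name='HiSEQ')[0]

for run_name, paths in runs.items():
    # create a sub-folder per run; create_folder() returns a list of folder dicts
    folder = gi.libraries.create_folder(library['id'], run_name)

    # filesystem_paths is a newline-separated string of server-side paths;
    # link_data_only='link_to_files' links the files instead of copying them
    filesystem_paths = "\n".join(paths)
    gi.libraries.upload_from_galaxy_filesystem(
        library['id'], filesystem_paths, folder[0]['id'], 'fastqsanger',
        dbkey='?', link_data_only='link_to_files', roles='')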

Using the "Data libraries Beta" page I could nicely follow how my script is working down all the files.

Unfortunately, I realized too late that, although the files were showing up correctly (i.e. with the right path to the original file) on the "Data libraries Beta" page, the actual 'upload' jobs had not finished. So, when my script was done, I ended up with ~16000 unfinished jobs waiting in the queue.


We use the internal scheduler, and the limit in job_conf.xml was set to <limit type="registered_user_concurrent_jobs">2</limit>. At the beginning, the 'upload' jobs were running one after the other. However, the more jobs were in the queue, the longer the gap between the start of two consecutive jobs became. At the height, two jobs were started only every ~60 minutes.

During that hour, nothing happened and no job was set to "running". Even if someone else was using the Galaxy server, their job had to wait up to an hour to be executed. Luckily, I did all this on our development server, so no actual users were affected.

I changed the settings in the job_conf.xml file to allow 100 concurrent jobs per user, with a total of 105 concurrent jobs. I restarted the server, and now 100 'upload' jobs were executed every hour. But again, there was a gap of about 60 minutes in between during which nothing happened.
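For completeness, the <limits> section in our job_conf.xml now looks roughly like this (the destination id 'local' and the use of 'destination_total_concurrent_jobs' for the overall cap of 105 are my rendering; the details may differ in your setup):

<limits>
    <!-- per-user limit, raised from 2 to 100 -->
    <limit type="registered_user_concurrent_jobs">100</limit>
    <!-- overall cap of 105 concurrent jobs on the (single) local destination -->
    <limit type="destination_total_concurrent_jobs" id="local">105</limit>
</limits>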

I also played with the 'cache_user_job_count' setting ("True"/"False"), but that didn't change anything.
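(For reference, that is the option in the main Galaxy config file - universe_wsgi.ini or galaxy.ini, depending on the release - which is supposed to cache the per-user job count instead of re-querying the database for every waiting job:)

# in the [app:main] section of the Galaxy config file
cache_user_job_count = True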

With 100 jobs executed every hour, the queue eventually became smaller and smaller. At about 5000 jobs to go, the gap shrank to ~30 minutes; at about 2000 jobs to go, the waiting time was about 10 minutes; and eventually it went down to zero again.

Has anyone else seen such behavior before?


Thank you very much for any help or suggestions.
Regards, Hans-Rudolf



PS: I am now modifying the script to include a call to the database that
    checks whether all jobs have finished before making the next call to
    upload more files to the Data Libraries.
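A minimal sketch of the kind of check I mean, assuming a PostgreSQL backend and direct read access to the Galaxy 'job' table (the connection details, the 'upload1' tool id and the polling interval are assumptions):

import time
import psycopg2

def wait_for_uploads(dsn, poll_interval=60):
    """Block until no 'upload1' jobs are left in a non-terminal state."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True   # read-only polling, no open transaction needed
    try:
        while True:
            with conn.cursor() as cur:
                # count upload jobs that have not reached a terminal state yet
                cur.execute(
                    "SELECT count(*) FROM job "
                    "WHERE tool_id = 'upload1' "
                    "AND state IN ('new', 'queued', 'running')")
                (pending,) = cur.fetchone()
            if pending == 0:
                return
            time.sleep(poll_interval)
    finally:
        conn.close()

# called between two batches, before uploading the next set of files:
# wait_for_uploads("dbname=galaxy user=galaxy host=localhost")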




--



Hans-Rudolf Hotz, PhD
Bioinformatics Support

Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66
4058 Basel/Switzerland