Hi Intikhab,

intikhab alam wrote:
> Hi everybody,
>
> I have a dataset of 347899 protein sequences which I want to compare
> to each other (all-against-all blast). I have access to the compute
> cluster which is running Score (version 5.8.4.r3) as an mpi
> environment and have 25 nodes, each with 4 cores and 8 GB of RAM.
>
> We have the latest version of mpiblast installed. I started a mpiblast
> job, (for comparing 347899 sequences against each other), on 44
> processors using the following commandline:
>
> mpiformatdb -i 36FungalJGIanigNbcin_M40 --nfrags=42 -p T --skip-reorder
>
> mpisub 44 /usr/local/mpiblast_tool/bin/mpiblast -p blastp -d
> 36FungalJGIanigNbcin_M40 -i
> /users/zzalssn4/scratch/mpiblast/work/36FungalJGIanigNbcin_M40 -m 8 -e
> 1e-5 -o
> /users/zzalssn4/scratch/mpiblast/work/36FungalJGIanigNbcin_M40.outF42C44
>
> This job was running for about 12 days and only 22% or 10122202
> matches of the total 47342483 known significant matches were received,
> still all the processes running to the full (>90% usage) on all
> specified processors.
>
> The same all-against-all blast job using standard blast on 36
> processors, where I made 36 chunks of the dataset and blasted each
> chunk against the complete dataset on a single processor, got
> completed in less than 24hrs, resulting in 47342483 significant
> sequence matches.
>
> May be I am missing something in running mpiblast properly, so here I
> need some help in whether I could improve the running time of mpiblast
> on the size of the datasets mentioned above.
>
> Hope to hear from you soon.
>
> Regards,
>
> Intikhab
>

I can think of two reasons why mpiBLAST may be suffering on this compute job. I'll start with the less-likely problem first:

If the database and query sets contain amino acid sequences and are large, mpiblast 1.4.0 can take a long time to compute the effective search space required for exact e-value calculation. If that's the problem, then you would find just one mpiblast process consuming 100% cpu on the rank 0 node for hours or days, without any output. The trouble is that 1.4.0 doesn't parallelize the effective search space calculation. I've cobbled a workaround for this issue, which may be satisfactory if you can tolerate some discrepancy in e-values between mpiblast and NCBI blast. See this e-mail:
http://www.mail-archive.com/[email protected]/msg00177.html

The more likely limiting factor is load imbalance on the cluster. If some database fragments happen to have a large number of hits and others have few, and the database is distributed as one fragment per node, then the computation may be heavily imbalanced and may run quite slowly. CPU consumption as given by a CPU monitoring tool may not be indicative of useful work being done on the nodes since workers can do a timed spin-wait for new work.

I can suggest two avenues to achieve better load balance with mpiblast 1.4.0. First, partition the database into more fragments, possibly two or three times as many as you currently have. Second, use the --db-replicate-count option to mpiblast. The default value for the db-replicate-count is 1, which indicates that mpiblast will distribute a single copy of your database across worker nodes. For your setup, each node was probably getting a single fragment. By setting --db-replicate-count to something like 5, each fragment would be copied to five different compute nodes, and thus five nodes would be available to search fragments that happen to have lots of hits. In the extreme case you could set --db-replicate-count equal to the number of fragments, which would be fine if per-node memory and disk space is substantially larger than the total size of the formatted database.

I just noticed that our documentation at mpiblast.lanl.gov doesn't include mention of the --db-replicate-count parameter. I'm fairly certain it had been documented but lost when the mpiblast.org web server crashed and burned. In any case, that command-line parameter allows you to control the degree of redundancy and load balancing that mpiblast will use on your cluster.

In your particular situation, it may also help to randomize the order of sequences in the database to minimize "fragment hotspots" which could result from a database self-search. At the moment mpiblast doesn't have code to accomplish such a feat, but I think others (Jason Gans?)
have written code for this in the past.

-Aaron

_______________________________________________
Mpiblast-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mpiblast-users
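
For the sequence-order randomization suggested above, here is a minimal sketch in Python. It is not part of mpiblast and is not the code alluded to in the message. The function names (read_fasta, shuffle_fasta, shuffle_fasta_file) are hypothetical, and it assumes the whole FASTA file fits in memory, which seems plausible for a few hundred thousand protein sequences but is an assumption.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence_lines) records from an iterable of FASTA lines."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None and line:
            # Non-empty sequence line belonging to the current record.
            body.append(line)
    if header is not None:
        yield header, body


def shuffle_fasta(lines, seed=None):
    """Return the FASTA lines with whole records in random order."""
    records = list(read_fasta(lines))
    # A private Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(records)
    out = []
    for header, body in records:
        out.append(header)
        out.extend(body)
    return out


def shuffle_fasta_file(in_path, out_path, seed=None):
    """Read a FASTA file, shuffle its records, and write the result."""
    with open(in_path) as src:
        shuffled = shuffle_fasta(src, seed=seed)
    with open(out_path, "w") as dst:
        dst.write("\n".join(shuffled) + "\n")
```

The shuffled file would then be formatted and fragmented in place of the original, for example after calling shuffle_fasta_file("36FungalJGIanigNbcin_M40", "36FungalJGIanigNbcin_M40.shuffled", seed=42), where the output file name is only illustrative. Whether this actually evens out the per-fragment hit counts would need to be checked on the real data.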
