Short version:
I'm starting to suspect that the ncbi patch does not alter a code path
related to -l gi filtering.
Long version:
I've been unable to pursue this full time, but I have gotten some
encouragement from Aaron Darling off-list, so I'll add a bit more info
to see if it rings any bells.
The blast on the master seems to be working way too hard. I've
temporarily removed refseq_genomic, and am working only with the 'nt'
database. As a result I no longer crash mpiblast as often. Despite that
concession, my master still works for hours while my workers sit idle
99% of the time. I've tracked this down to the master's call to
    runBLAST( COLLECT_STATS_MODE, 0, query_count - 1 );
in main().
I don't know the specifics, but I'd expect that COLLECT_STATS_MODE
should be quick and shallow. Instead this task is taking 99% of my time.
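For anyone who wants to confirm the same thing on their own cluster, here is a minimal sketch of the kind of timing wrapper I mean. It is not mpiBLAST code: time_phase() and demo_phase() are hypothetical names, and demo_phase() only stands in for the real runBLAST() call, whose exact signature I'm not reproducing here. It uses time()/difftime(), so resolution is one second, which is plenty for a phase that runs for hours.

```c
#include <stdio.h>
#include <time.h>

/* Run one phase of the program, log how long it took in wall-clock
 * seconds to stderr, and return the elapsed time.
 * Hypothetical helper, not part of mpiBLAST or the NCBI toolbox. */
double time_phase(const char *label, void (*phase)(void *), void *arg)
{
    time_t start = time(NULL);
    phase(arg);
    double elapsed = difftime(time(NULL), start);
    fprintf(stderr, "[timing] %s: %.0f s\n", label, elapsed);
    return elapsed;
}

/* Stand-in for the real work. In mpiblast this is where the master's
 * stats-collection call would go; here it only counts invocations. */
void demo_phase(void *arg)
{
    int *calls = (int *)arg;
    ++*calls;
}
```

Wrapping each phase in main() this way and comparing the logged numbers should show whether the stats pass really dominates, or whether the time is going somewhere else.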
Looking in blast_hooks.c, COLLECT_STATS_MODE seems only to
trigger:
    options->calculate_statistics_and_exit = TRUE;
This in turn is a prominent feature of the MPIBlast NCBI patch, but
never occurs in the unpatched NCBI code. So it seems that one of the
significant changes MPIBlast makes to ncbi blast is to introduce this
additional stats gathering mode.
I know from past discussions on this list that I may be one of the few
(only?) people here using blast's '-l' switch to filter out certain GIs
from the database, and that this sort of filtering would affect the
apparent database size, which would in turn affect the statistics.
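To make the size dependence concrete: in the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), the E-value of a given hit is directly proportional to the database length n. The sketch below rescales a reported E-value from the size the statistics assumed to the size actually searched. rescale_evalue() and print_rescaled() are hypothetical helpers of mine, not toolbox functions, and the sketch ignores the effective-length corrections that real BLAST applies, so treat the result as a rough estimate only.

```c
#include <stdio.h>

/* Karlin-Altschul: E = K * m * n * exp(-lambda * S).
 * For one hit (fixed K, lambda, m and S) E is proportional to n, so an
 * E-value computed for a database of n_assumed residues can be rescaled
 * to a database of n_actual residues by the ratio of the two sizes.
 * Hypothetical helper; ignores BLAST's effective-length adjustment. */
double rescale_evalue(double evalue, double n_assumed, double n_actual)
{
    return evalue * (n_actual / n_assumed);
}

/* Print the reported and rescaled E-values side by side. */
void print_rescaled(double evalue, double n_assumed, double n_actual)
{
    printf("reported E = %g, rescaled E = %g\n",
           evalue, rescale_evalue(evalue, n_assumed, n_actual));
}
```

So if the GI list keeps only a quarter of the residues but the statistics are computed against the whole database, every E-value comes out about four times too large. That is a constant factor, which is part of why I don't care much about it.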
So I'm starting to suspect that the ncbi patch does not alter a code
path related to -l gi filtering. Ironically I don't really care all that
much about exact statistics.
Knowing more is going to require digging into the ncbi toolbox. And
2am seems like a bad time to start doing that.
Michael Cariaso wrote:
At blast_hooks.c:1763 I see the following:
    /* mpiBLAST: some NCBI functions do not use the
     * SeqMgr's bioseq fetch function (e.g. MuskSeqIdWrite)
     * In order to allow such functions to look up our bioseqs
     * we need to preload them into the SeqMgr
     * This should probably get reported as a bug to NCBI...
     */
    prune = BlastPruneHitsFromSeqAlign(curr_seqalign,
                                       number_of_alignments, NULL);
    curr_seqannot->data = prune->sap;
    indexSubjectBioseqs(curr_seqannot->data);
I've determined that this loop is running on the master, while my worker
nodes are doing the heavy lifting. For a particular query (I've found
many such) the master will crash due to a failed memory allocation
several layers below indexSubjectBioseqs() in the ncbi codebase. In
fairness it's trying to alloc 230M, so I can't blame it. It does this
after printing the header and the table of scores, but before printing
the alignments.
On previous passes through the loop I've noticed lots of swap activity
here. I would guess that the preloading for SeqMgr is the source, but
maybe someone here can help me to better understand.
Since some queries trigger this and others don't, I'm guessing that the
issue may be related to the specific hits. My blastable DB has all of
refseq_genomic and all of nt. Some of the seqs in refseq_genomic are
large single segments (250M or more).
Could these be the origin of the problem?
How might this be fixed / avoided?
Has this been reported to NCBI?
I see there have been several revisions of the toolbox since mpiblast
was last synchronized with it. Does this suggest any possibilities?
_______________________________________________
Mpiblast-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mpiblast-users