On 1/30/21 6:13 PM, methonash wrote:
Greetings all,
Many thanks for sharing your collective perspective and advice thus far!
It has been very helpful and instructive. I return bearing live data and
a minimally complete, compilable, and executable program to experiment
with and potentially optimize. The dataset can be pulled from here:
https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
Running "cksum" on this file:
1477520542 2199192 seqs.fasta.gz
Naturally, you'll need to gunzip this file. The decompressed file
contains a sequence string on every even-numbered line; the strings
have already been de-duplicated down to the unique subset and sorted
by descending length, then alphabetically.
From my initial post, the focus is now entirely on step #4:
finding/removing strings that can be entirely absorbed (substringed) by
their largest possible parent.
And now for the code:
import std.stdio : writefln, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
    // Collect sequences from stdin, skipping FASTA header lines
    // (and any blank lines, which would otherwise crash line[ 0 ]).
    string[] seqs;
    foreach( line; stdin.byLine() )
    {
        if( line.length == 0 || line[ 0 ] == '>' ) continue;
        seqs ~= to!string( line );
    }

    // seqs is sorted by descending length, so any parent of a
    // string must appear before it.
    foreach( i; 0 .. seqs.length )
    {
        if( seqs[ i ].length == 0 ) continue; // already absorbed
        foreach( j; i + 1 .. seqs.length )
        {
            // Skip absorbed entries; equal-length strings are
            // distinct (already de-duplicated), so neither can
            // contain the other.
            if( seqs[ j ].length == 0 ||
                seqs[ i ].length == seqs[ j ].length )
                continue;
            if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
            {
                seqs[ j ] = ""; // mark as absorbed
                writefln( "%s contains %s", i, j );
            }
        }
    }
}
Compile the source, then run the executable with stdin redirected:
./substr < seqs.fasta
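For example, an optimized build might look like this (the exact
compiler and flags are up to you; these are just common choices):

dmd -O -release -inline substr.d

or, with LDC:

ldc2 -O2 -release substr.d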
See any straightforward optimization paths here?
For curiosity, I experimented with string[] and ubyte[][] and several
functions (indexOf, canFind, countUntil) to see which combination
performed best. My off-the-cuff results:
string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes
As a result, the code above uses string[] with indexOf(). Tests were
performed on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
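For reference, the alternative containment tests looked roughly like
this (a self-contained sketch with made-up sample strings, just to
show the shape of each call):

import std.algorithm.searching : canFind, countUntil;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    // Made-up stand-ins for one parent/child pair from seqs.
    string parent = "GATTACA";
    string child = "TTAC";
    // representation() views a string as immutable(ubyte)[],
    // standing in for the ubyte[][] variant of the data.
    auto parentB = parent.representation;
    auto childB = child.representation;

    writeln( canFind( parent, child ) );            // string with canFind
    writeln( canFind( parentB, childB ) );          // ubyte[] with canFind
    writeln( countUntil( parentB, childB ) > -1 );  // ubyte[] with countUntil
}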
I have additional questions/concerns/confusion surrounding the foreach()
syntax I have had to apply above, but performance remains my chief
immediate concern.
The code looks pretty minimal.
I'd suggest trying it in reverse. If you have the sequences "cba",
"ba", and "a", then determining that "a" is in "ba" is probably
cheaper than determining that "a" is in "cba".
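Something like this (an untested sketch, just to illustrate the
idea): since the input is sorted by descending length, all potential
parents of a string precede it, so walking them from nearest
(shortest) to farthest (longest) finds a containing parent with the
cheapest possible searches. For removal purposes any containing
parent suffices, even if it isn't the largest one:

import std.stdio : writefln, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
    string[] seqs;
    foreach( line; stdin.byLine() )
    {
        if( line.length == 0 || line[ 0 ] == '>' ) continue;
        seqs ~= to!string( line );
    }

    foreach( j; 1 .. seqs.length )
    {
        // Walk candidate parents from shortest to longest and stop
        // at the first hit: searching a short parent is cheaper.
        foreach_reverse( i; 0 .. j )
        {
            if( seqs[ i ].length == seqs[ j ].length ) continue;
            if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
            {
                writefln( "%s contains %s", i, j );
                break; // absorbed; no need to test longer parents
            }
        }
    }
}

The reported pairs differ from your version (each absorbed string is
paired with its smallest parent rather than its largest), but since
containment is transitive, the set of absorbed strings should be the
same.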
Are you still convinced that it's possible to do it in under 2 seconds?
That would seem a huge discrepancy. If not, what specifically are you
looking for in terms of performance?
-Steve