On 1/30/21 6:13 PM, methonash wrote:
Greetings all,

Many thanks for sharing your collective perspective and advice thus far! It has been very helpful and instructive. I return bearing live data and a minimally complete, compilable, and executable program to experiment with and potentially optimize. The dataset can be pulled from here:

https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg

Running "cksum" on this file:

1477520542 2199192 seqs.fasta.gz

Naturally, you'll need to gunzip this file. The decompressed file contains strings on every even-numbered line; these have already been reduced to the unique, de-duplicated subset and sorted by descending length and then alphabetically.
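(For anyone unfamiliar with FASTA, the layout is just alternating header/sequence lines. A made-up two-record example, purely to illustrate the format and the kind of absorption step #4 removes:

>seq_000001
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>seq_000002
TAYIAKQRQISFVKSHFSRQ

Here the second sequence is a substring of the first, so it would be removed.)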

From my initial post, the focus is now entirely on step #4: finding/removing strings that can be entirely absorbed (substringed) by their largest possible parent.

And now for the code:


import std.stdio : writefln, File, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
        // Read sequences from stdin, skipping FASTA header lines (and any
        // blank lines, which would otherwise trip the '>' test).
        string[] seqs;

        foreach( line; stdin.byLine() )
        {
                if( line.length == 0 || line[ 0 ] == '>' ) continue;
                else seqs ~= to!string( line );
        }

        // seqs is already sorted by descending length, so any string that
        // can absorb seqs[ j ] must sit at an earlier index i < j.
        foreach( i; 0 .. seqs.length )
        {
                if( seqs[ i ].length == 0 ) continue;

                foreach( j; i + 1 .. seqs.length )
                {
                        // Skip strings already absorbed, and equal-length
                        // strings (the input is de-duplicated, so those can
                        // never be substrings of one another).
                        if( seqs[ j ].length == 0 ||
                            seqs[ i ].length == seqs[ j ].length ) continue;

                        if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                        {
                                // Mark the absorbed string as removed.
                                seqs[ j ] = "";

                                writefln( "%s contains %s", i, j );
                        }
                }
        }
}


Compile the source and then run the executable by redirecting stdin:

./substr < seqs.fasta

See any straightforward optimization paths here?

Out of curiosity, I experimented with string[] and ubyte[][] combined with several search functions (indexOf, canFind, countUntil) to find the best performer. My off-the-cuff results:

string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes

As a result, the code above uses string[] with indexOf(). Tests were performed on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
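For reference, the alternative tests look roughly like this (a sketch only; "parent"/"child" are placeholder names, and the second variant assumes the lines were stored as ubyte[][] instead of string[]):

import std.algorithm.searching : canFind, countUntil;

// string[] variant: canFind() returns a bool directly.
bool absorbs( string parent, string child )
{
        return canFind( parent, child );
}

// ubyte[][] variant: countUntil() returns the offset, or -1 if absent.
bool absorbsBytes( const(ubyte)[] parent, const(ubyte)[] child )
{
        return countUntil( parent, child ) > -1;
}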

I also have questions/concerns/confusion about the foreach() syntax I had to use above, but performance remains my chief immediate concern.

The code looks pretty minimal.

I'd suggest trying it in reverse. If you have the sequences "cba", "ba", "a", then determining "a" is in "ba" is probably cheaper than determining "a" is in "cba".
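Something like this (untested, just to illustrate the idea, reusing the seqs array and imports from your program):

// Scan candidates shortest-first, and for each one try the shortest
// possible parents first, since searching a short haystack is cheaper.
// Assumes the same descending-length sort as above.
foreach_reverse( j; 0 .. seqs.length )
{
        if( seqs[ j ].length == 0 ) continue;

        foreach_reverse( i; 0 .. j )
        {
                if( seqs[ i ].length == seqs[ j ].length ) continue;

                if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
                {
                        seqs[ j ] = "";
                        writefln( "%s contains %s", i, j );
                        break; // one parent is enough
                }
        }
}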

Are you still convinced that it's possible to do it in under 2 seconds? That would seem a huge discrepancy. If not, what specifically are you looking for in terms of performance?

-Steve
