On 1/30/21 6:13 PM, methonash wrote:
Greetings all,
Many thanks for sharing your collective perspective and advice thus far!
It has been very helpful and instructive. I return bearing live data and
a minimally complete, compilable, and executable program to experiment
with and potentially optimize. The dataset can be pulled from here:
https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
Running "cksum" on this file:
1477520542 2199192 seqs.fasta.gz
Naturally, you'll need to gunzip this file. The decompressed file
contains a sequence string on every even-numbered line; the strings
have already been de-duplicated down to the unique subset and sorted
by descending length, then alphabetically.
From my initial post, the focus is now entirely on step #4:
finding/removing strings that can be entirely absorbed (substringed) by
their largest possible parent.
And now for the code:
import std.stdio : writefln, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
    // Collect sequences from stdin, skipping FASTA header lines
    // (and any blank lines, which would otherwise crash line[ 0 ]).
    string[] seqs;
    foreach( line; stdin.byLine() )
    {
        if( line.length == 0 || line[ 0 ] == '>' ) continue;
        seqs ~= to!string( line );
    }

    // seqs is sorted by descending length, so any parent of a
    // string must appear before it.
    foreach( i; 0 .. seqs.length )
    {
        if( seqs[ i ].length == 0 ) continue; // already absorbed
        foreach( j; i + 1 .. seqs.length )
        {
            // Skip absorbed entries; equal-length strings are
            // distinct (already de-duplicated), so neither can
            // contain the other.
            if( seqs[ j ].length == 0 ||
                seqs[ i ].length == seqs[ j ].length )
                continue;
            if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
            {
                seqs[ j ] = ""; // mark as absorbed
                writefln( "%s contains %s", i, j );
            }
        }
    }
}
Compile the source, then run the executable with stdin redirected:
./substr < seqs.fasta
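For example, an optimized build might look like this (the exact
compiler and flags are up to you; these are just common choices):

dmd -O -release -inline substr.d

or, with LDC:

ldc2 -O2 -release substr.d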
See any straightforward optimization paths here?
For curiosity, I experimented with string[] and ubyte[][] and several
functions (indexOf, canFind, countUntil) to see which combination
performed best. My off-the-cuff results:
string[] with indexOf() :: 26.5-27 minutes
string[] with canFind() :: >28 minutes
ubyte[][] with canFind() :: 27.5 minutes
ubyte[][] with countUntil() :: 27.5 minutes
As a result, the code above uses string[] with indexOf(). Tests were
performed on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
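For reference, the alternative containment tests looked roughly like
this (a self-contained sketch with made-up sample strings, just to
show the shape of each call):

import std.algorithm.searching : canFind, countUntil;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    // Made-up stand-ins for one parent/child pair from seqs.
    string parent = "GATTACA";
    string child = "TTAC";
    // representation() views a string as immutable(ubyte)[],
    // standing in for the ubyte[][] variant of the data.
    auto parentB = parent.representation;
    auto childB = child.representation;

    writeln( canFind( parent, child ) );            // string with canFind
    writeln( canFind( parentB, childB ) );          // ubyte[] with canFind
    writeln( countUntil( parentB, childB ) > -1 );  // ubyte[] with countUntil
}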
I have additional questions/concerns/confusion surrounding the foreach()
syntax I have had to apply above, but performance remains my chief
immediate concern.
The code looks pretty minimal.
I'd suggest trying it in reverse. If you have the sequences "cba",
"ba", and "a", then determining that "a" is in "ba" is probably
cheaper than determining that "a" is in "cba".
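Something like this (an untested sketch, just to illustrate the
idea): since the input is sorted by descending length, all potential
parents of a string precede it, so walking them from nearest
(shortest) to farthest (longest) finds a containing parent with the
cheapest possible searches. For removal purposes any containing
parent suffices, even if it isn't the largest one:

import std.stdio : writefln, stdin;
import std.conv : to;
import std.string : indexOf;

void main()
{
    string[] seqs;
    foreach( line; stdin.byLine() )
    {
        if( line.length == 0 || line[ 0 ] == '>' ) continue;
        seqs ~= to!string( line );
    }

    foreach( j; 1 .. seqs.length )
    {
        // Walk candidate parents from shortest to longest and stop
        // at the first hit: searching a short parent is cheaper.
        foreach_reverse( i; 0 .. j )
        {
            if( seqs[ i ].length == seqs[ j ].length ) continue;
            if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
            {
                writefln( "%s contains %s", i, j );
                break; // absorbed; no need to test longer parents
            }
        }
    }
}

The reported pairs differ from your version (each absorbed string is
paired with its smallest parent rather than its largest), but since
containment is transitive, the set of absorbed strings should be the
same.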
Are you still convinced that it's possible to do it in under 2 seconds?
That would seem a huge discrepancy. If not, what specifically are you
looking for in terms of performance?
-Steve