Re: Storing and Searching large text lists

Nicholas Wilson via Digitalmars-d-learn Thu, 31 Dec 2015 18:25:34 -0800

On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:

I have a large list, B, of string items. For each item in that large list, I need to see if it is in the smaller list, A.
I have been using a simple string array for the storage of A

string[] A
and then using foreach to go through all the items of B and check they are in A
foreach(string;B)
/* this looks hacky but wasn't working without the !=0 bit */
    if(find(A,string) != 0)
        writeln("Found a line: ", string);
While this works for small datasets, when either A or B gets large (A could be up to 150k records, B in the millions) it takes quite a while to run.
I'd like to know what is the best way to store lists of text for searching? Is there a better container than a simple array? Neither A nor B need to be ordered for my purpose, but would sorting help the search? Would it help enough to be worth the CPU expense?
Regards
B


Your problem is that your algorithm is O(m*n), where m is the length of B and n is the length of A: every item of B triggers a linear scan of A.

A simple speedup would be to sort A once (O(n*log(n))) and then use a binary search for each item of B (O(m*log(n)) in total).
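
As a rough sketch of the sort-then-binary-search idea (the helper name itemsFoundIn and the sample data are made up for illustration, not taken from your code): std.algorithm's sort returns a SortedRange, and its contains method does a binary search.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;

/// Hypothetical helper: returns the items of `b` that also occur in `a`.
/// Note that `a` is sorted in place.
string[] itemsFoundIn(string[] a, string[] b)
{
    auto sortedA = sort(a);            // one-off O(n*log(n))
    string[] hits;
    foreach (item; b)
        if (sortedA.contains(item))    // binary search, O(log(n)) per item
            hits ~= item;
    return hits;
}

void main()
{
    string[] A = ["pear", "apple", "fig"];         // illustrative data
    string[] B = ["fig", "kiwi", "apple", "fig"];
    foreach (item; itemsFoundIn(A, B))
        writeln("Found a line: ", item);
}
```

If A must keep its original order, sort a copy (A.dup) instead.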

(Partially sorting B may also help, but only for cache locality.)

Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does).


Changing A to a trie is another possibility.
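
For what a trie could look like, here is a minimal sketch, assuming a plain byte-wise trie built on associative arrays. TrieNode, trieInsert and trieContains are hypothetical names, and a real implementation would probably want a more compact node layout. A lookup costs time proportional to the length of the string, regardless of how many items A holds.

```d
import std.stdio : writeln;

/// Hypothetical minimal trie node, keyed on UTF-8 code units.
final class TrieNode
{
    TrieNode[char] children;  // one child per next code unit
    bool isEnd;               // true if a stored string ends here
}

/// Adds `s` to the trie rooted at `root`.
void trieInsert(TrieNode root, string s)
{
    auto node = root;
    foreach (char c; s)
    {
        if (auto p = c in node.children)
            node = *p;
        else
        {
            auto next = new TrieNode;
            node.children[c] = next;
            node = next;
        }
    }
    node.isEnd = true;
}

/// Returns true only if `s` was inserted as a whole string.
bool trieContains(TrieNode root, string s)
{
    auto node = root;
    foreach (char c; s)
    {
        auto p = c in node.children;
        if (p is null)
            return false;
        node = *p;
    }
    return node.isEnd;
}

void main()
{
    auto root = new TrieNode;
    foreach (item; ["pear", "apple", "fig"])          // stands in for A
        trieInsert(root, item);
    foreach (item; ["fig", "kiwi", "apple", "app"])   // stands in for B
        if (trieContains(root, item))
            writeln("Found a line: ", item);
}
```

Whether this beats the sorted array depends a lot on the data, since each node carries its own associative array.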

But as always it helps to know your data. How long are the strings (are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are duplicates allowed in B?

Nic
