On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
I have a large list, B, of string items. For each item in that
large list, I need to see if it is in the smaller list, A.
I have been using a simple string array for the storage of A
string[] A
and then using foreach to go through all the items of B and
check they are in A
foreach(string;B)
/* this looks hacky but wasn't working without the !=0 bit )
if(find(A,string) != 0)
writeln("Found a line: ", string);
While this works for small datasets, but when either A or B get
large (A could be up to 150k records, B in the millions) it
takes quite a while to run.
I'd like to know what is the best way to store lists of text
for searching? Is there a better container than a simply array?
Neither A nor B need to be ordered for my purpose, but would
sorting help the search? Would it help enough to be worth the
CPU expense?
Regards
B
Your problem is that your algorithm is O(m*n). (B is m ,A is n)
A simple speedup would be to sort A and then use a binary search
(O(m*log(n)))
(partially sorting B may also help but only for cache)
Fully sorting may be worth doing if you have other steps after
that also benefit in speed when working on sorted data (like
searching does).
Changing A to a trie is another possibility.
But as always it help to know your data. How long are the
strings(are they of the same length)? Are there any similarities
(e.g. are they all email addresses)? Are there duplicates allowed
in B?
Nic