Caligo:

> I'm going to ignore the C version because it's ugly and uses a hash.

Some of the others use a hash too. You can write nice-looking code in C as well, but you need more skill :-)

> I'm also going to ignore the fastest C++ version because it uses a digital
> trie
> (it's very fast but extremely memory hungry; the complexity is constant over
> the size of the input and linear over the length of the word being searched
> for).

The fastest C++ version uses more memory, but if you need more performance it can become the right choice.

> I just wanted to focus on the language and the std library and not
> have to implement a data structure.

One of the few advantages of D over Python is that in D you can implement efficient custom data structures without leaving the D language itself :-)

> For D I pretty much used the example from TDPL. As far as I can tell, the
> associative array used is closer to std::map (or maybe std::unordered_map?)

D's built-in AAs are a hash map, but they use comparisons to resolve collisions. This makes D AAs robust against malicious hash-collision attacks. Python dicts are faster, but they are a pure hash map.

> than std::unordered_set, but I don't know of any other data structures in D
> for this (I'm still learning).

An unordered_set is not present in std.collections yet; an AA with a dummy value type is normally used instead (there is a tiny sketch of this in the P.S. below).

> Here are the measurements (average of 3 runs):

Your timings lack information about the CPU, the compilation switches, and the C++ compiler version used. Are those really averages?

> I'm interested to see a better D version than the one I posted.

If you use only the built-ins and the standard library, I don't think you can improve your code much. To go faster you need to go lower level (see the second sketch after my signature).

Regarding your code, break and continue statements are not Structured Programming, so it's better to avoid them when possible. I'd write your code like this:

import std.stdio, std.string;

void main() {
    size_t[string] dictionary;

    foreach (line; stdin.byLine())
        foreach (word; line.strip().splitter())
            if (word !in dictionary)
                dictionary[word.idup] = 1;

    writeln("Words: ", dictionary.length);
}

This Python 2 version is as fast as the D-DMD version on an 8.7 MB file that contains about 120_000 words:

from sys import stdin
import psyco

def main():
    dictionary = {}

    for line in stdin:
        for word in line.split():
            if word not in dictionary:
                dictionary[word] = 1

    print "Words:", len(dictionary)

psyco.bind(main)
main()

Bye,
bearophile
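P.S. A tiny illustration of emulating an unordered_set with a built-in AA, since Phobos has no hash set type yet. This is only a sketch: any throwaway value type works, bool is used here just for clarity.

import std.stdio;

void main() {
    // An AA with a dummy value type plays the role of an unordered_set.
    bool[string] seen;

    foreach (w; ["the", "quick", "brown", "the"])
        seen[w] = true;      // re-inserting a duplicate key changes nothing

    writeln(seen.length);    // prints 3: the duplicate is counted once
}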
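P.P.S. A rough, untested sketch of what I mean by "lower level": read the whole input at once and key the AA with slices of that single immutable buffer, so no idup per word is needed. The file name and the use of readText (which assumes UTF-8 text) are only for illustration; this is not a tuned or benchmarked program.

import std.stdio, std.file, std.array;

void main() {
    // One big read; split() returns slices of this immutable buffer,
    // so they can be used as AA keys directly, without idup.
    string text = readText("input.txt");   // example file name

    size_t[string] dictionary;
    foreach (word; text.split())           // split() with no argument splits on whitespace
        dictionary[word] = 1;

    writeln("Words: ", dictionary.length);
}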