Caligo:

> I'm going to ignore the C version because it's ugly and uses a hash.

Some of the others too use a hash. You can write nice looking code in C too, 
but you need more skills :-)


> I'm also going to ignore the fastest C++ version because it uses a digital 
> trie
> (it's very fast but extremely memory hungry; the complexity is constant over
> the size of the input and linear over the length of the word being searched
> for).

The fastest C++ version uses more memory, but sometimes if you need more 
performance it may become the right choice.


> I just wanted to focus on the language and the std library and not
> have to implement a data structure.

One of the few advantages of D over Python is that in D you are able to 
implement efficient and custom data structures without leaving the D language 
itself :-)


> For D I pretty much used the example from TDPL.  As far as I can tell, the
> associate array used is closer to std::map (or maybe std::unordered_map ?)

D built-in AAs are a hash map, but they use comparisons to resolve collisions. 
This makes D AAs strong against malicious attacks. Python dicts are faster but 
they are a pure hash map.


> than std::unordered_set, but I don't know of any other data structures in D
> for this (I'm still learning).

A unordered_set is not present in stc.collections yet.


> Here are the measurements (average of 3 runs):

Your timings lack information about the CPU, compilation switches used, and C++ 
compiler version used.
Are those really averages?


> I'm interested to see a better D version than the one I posted.

If you want to use only the built-ins and std lib I think you can't improve 
your code a lot. To go faster you need to go lower level.

Regarding your code, break and continue statements are not Structured 
Programming, so it's better to avoid them when possible. I write your code like 
this:

import std.stdio, std.string;

void main() {
    size_t[string] dictionary;

    foreach (line; stdin.byLine())
        foreach (word; line.strip().splitter())
            if (word !in dictionary)
                dictionary[word.idup] = 1;

    writeln("Words: ", dictionary.length);
}


This Python2 version is as fast as the D-DMD version with a 8.7 MB file that 
contains about 120_000 words:

from sys import stdin
import psyco

def main():
    dictionary = {}
    for line in stdin:
        for word in line.split():
            if word not in dictionary:
                dictionary[word] = 1
    print "Words:", len(dictionary)

psyco.bind(main)
main()

Bye,
bearophile

Reply via email to