This is the page I'm referring to:
http://unthought.net/c++/c_vs_c++.html

I'm going to ignore the C version because it's ugly and uses a hand-rolled
hash table.  I'm also going to ignore the fastest C++ version because it
uses a digital trie (it's very fast but extremely memory hungry; lookup
time is constant with respect to the size of the input and linear in the
length of the word being searched for).  I just wanted to focus on the
language and the standard library and not have to implement a data
structure.

Here is the C++ code:

#include <unordered_set>
#include <string>
#include <iostream>
#include <stdio.h>

int main(int argc, char* argv[]){

  using namespace std;
  char buf[8192];  // fixed-size scanf buffer; assumes no word longer than 8191 chars
  unordered_set<string> wordcount;
  // scanf("%s") reads whitespace-delimited tokens; insert() ignores duplicates
  while( scanf("%s", buf) != EOF ) wordcount.insert(buf);
  cout << "Words: " << wordcount.size() << endl;

  return 0;
}

For D I pretty much used the example from TDPL.  As far as I can tell, the
associative array I used is closer to std::map (or maybe std::unordered_map?)
than to std::unordered_set, but I don't know of any other data structure in
D that I could use for this (I'm still learning).
Here is the D code:

import std.stdio;
import std.string;
import std.algorithm;  // splitter (needed here in newer Phobos versions)

void main(){

  size_t[string] dictionary;
  foreach(line; stdin.byLine()){
    foreach(word; splitter(strip(line))){
      // byLine() reuses its buffer, so the key must be duplicated with idup
      if(word in dictionary) continue;
      dictionary[word.idup] = 1;
    }
  }
  writeln("Words: ", dictionary.length);
}
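
To make the comparison above concrete, here is roughly what the D version's
associative-array approach looks like when translated back into C++ terms,
using std::unordered_map.  This is just a sketch to illustrate the
map-vs-set point, not one of the versions I benchmarked:

#include <unordered_map>
#include <string>
#include <iostream>
#include <stdio.h>

int main(){

  using namespace std;
  char buf[8192];
  // Like the D associative array: map each word to a value and
  // count the number of distinct keys at the end.
  unordered_map<string, size_t> dictionary;
  while( scanf("%s", buf) != EOF ) dictionary[buf] = 1;
  cout << "Words: " << dictionary.size() << endl;

  return 0;
}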

Here are the measurements (average of 3 runs):

C++
===
Data size: 990K with 23K unique words
real    0m0.055s
user   0m0.046s
sys     0m0.000s

Data size: 9.7M with 23K unique words
real    0m0.492s
user   0m0.470s
sys     0m0.013s

Data size: 5.1M with 65K unique words
real    0m0.298s
user   0m0.277s
sys     0m0.013s

Data size: 51M with 65K unique words
real    0m2.589s
user   0m2.533s
sys     0m0.070s


DMD D 2.051
===
Data size: 990K with 23K unique words
real    0m0.064s
user   0m0.053s
sys     0m0.006s

Data size: 9.7M with 23K unique words
real    0m0.513s
user   0m0.487s
sys     0m0.013s

Data size: 5.1M with 65K unique words
real    0m0.305s
user   0m0.287s
sys     0m0.007s

Data size: 51M with 65K unique words
real    0m2.683s
user   0m2.590s
sys     0m0.103s


GDC D 2.051
===
Data size: 990K with 23K unique words
real    0m0.146s
user   0m0.140s
sys     0m0.000

Data size: 9.7M with 23K unique words
Segmentation fault

Data size: 5.1M with 65K unique words
Segmentation fault

Data size: 51M with 65K unique words
Segmentation fault


GDC fails for some reason with a large number of unique words and/or a
large input.  Also, GDC doesn't always give correct results; the word
count is usually off by a few hundred.

D and C++ are very close.  Reading with iostreams instead of scanf() makes
the C++ version almost twice as slow, and switching from std::set to
std::unordered_set almost doubles the performance.
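
For reference, the slower variant I'm alluding to reads with operator>>
and stores the words in an ordered std::set, roughly like this (a sketch
of that approach, not the exact code from the linked page):

#include <set>
#include <string>
#include <iostream>

int main(){

  using namespace std;
  // cin >> word replaces scanf(), and std::set replaces unordered_set;
  // both changes cost noticeable time on this workload.
  set<string> wordcount;
  string word;
  while( cin >> word ) wordcount.insert(word);
  cout << "Words: " << wordcount.size() << endl;

  return 0;
}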

I'm interested to see a better D version than the one I posted.

P.S.
No flame wars please.
