Nothing against discussions on how to write fast code, but I don’t believe that 
this is normally necessary.

About 20 years ago, I was counting words, not just the total, but how many of
each distinct word, on gigabytes of text.
(Full text US patents for two years.)  I did it in Java (with the JIT compiler
on), and it was plenty fast enough.

I did it using the Java StringTokenizer:
https://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

which takes a string of delimiter characters (it is String.split that takes a
regular expression).  Then each word found was either added to a Hashtable,
or its count was incremented.
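
The core loop was roughly like this (a minimal sketch from memory, not the
original code; the file handling and the delimiter set are just placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Hashtable;
    import java.util.StringTokenizer;

    public class WordCount {
        public static void main(String[] args) throws IOException {
            Hashtable<String, Integer> counts = new Hashtable<>();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // The second argument is a set of delimiter characters.
                    StringTokenizer st = new StringTokenizer(line, " \t.,;:()");
                    while (st.hasMoreTokens()) {
                        String word = st.nextToken();
                        Integer n = counts.get(word);
                        counts.put(word, n == null ? 1 : n + 1);
                    }
                }
            }
            System.out.println(counts.size() + " distinct words");
        }
    }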

As computers are much faster now, the same approach should be able to handle
terabytes of text today.

There was one non-obvious thing about the Java code, though.  It seems that,
at least back then, Java implemented substrings as a reference into the whole
character array, which in my case was an entire line of text.  That filled up
memory much faster than it should have.  Calling new String() on each word
fixed the problem.  (Only the actual entry in the hash table needs the copy.)
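
In the loop above, the only change needed is to copy the token before it goes
into the table (a sketch; on JDK 7u6 and later, substring already copies, so
this mostly matters on older runtimes):

    String word = st.nextToken();
    // On older JVMs the token shared the whole line's char[];
    // copying keeps only the word's own characters in the table.
    String key = new String(word);
    Integer n = counts.get(key);
    counts.put(key, n == null ? 1 : n + 1);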

But if you do have exabytes of text, then there might be a need for an
assembly speed-up.
Well, OK, petabytes would be enough.

Oh, you might also look at the Unix wc command, which counts words.  (More
specifically, the GNU utilities version, with source available.)

About 25 years ago, I compiled the GNU utilities (as they then existed) to run
on my OS/2 system.  (That was before Linux and the like were as convenient as
they are today.)
