Nothing against discussions on how to write fast code, but I don't believe it is normally necessary.
About 20 years ago, I was counting words, not just the total but how many of each word, on gigabytes of text (the full text of US patents for two years). I did it in Java (with the JIT compiler on), and it was plenty fast enough.

I used the Java StringTokenizer: https://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html which takes a string of delimiter characters. Each word found was either added to a Hashtable, or the count for it was incremented. As computers are much faster now, the same approach should handle terabytes of text today.

There was one non-obvious thing about the Java code, though. At the time, Java implemented substrings with a reference to the whole underlying character array, which in my case was a line of text. That filled up memory faster than it should have. Using new String() on each word fixed that problem. (I only did that for the actual entry in the hash table.) A sketch of the approach appears at the end of this comment.

But if you do have exabytes of text, then there might be a need for assembly speed-up. Well, OK, petabytes are enough.

Oh, you might also look at the Unix wc command, which counts words (more specifically, the GNU utilities version, with source available). About 25 years ago, I compiled the GNU utilities (as they then existed) to run on my OS/2 system. That was before Linux and the like were as convenient as they are today.
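Here is a rough sketch of that approach, not the original code; the delimiter set, file handling, and class name are just illustrative, and I've used a HashMap rather than the synchronized Hashtable since only one thread is counting:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    // Tokenize each line, count occurrences in a hash map, and copy each
    // new key with new String() so the map entry does not pin the whole
    // line's character array in memory (an issue on older JVMs, where
    // substring() shared the backing array).
    public class WordCount {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // StringTokenizer takes a set of delimiter characters, not a regex.
                    StringTokenizer tok = new StringTokenizer(line, " \t.,;:()");
                    while (tok.hasMoreTokens()) {
                        String word = tok.nextToken();
                        Integer n = counts.get(word);
                        if (n == null) {
                            // Copy the token so the map key does not hold a
                            // reference to the entire line.
                            counts.put(new String(word), 1);
                        } else {
                            counts.put(word, n + 1);
                        }
                    }
                }
            }
            counts.forEach((w, n) -> System.out.println(w + "\t" + n));
        }
    }

Run it as "java WordCount somefile.txt" and it prints each word with its count; on a current JVM the new String() copy is no longer strictly needed, but it does no harm.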