You could consider using a StreamTokenizer:
(import '(java.io StreamTokenizer BufferedReader FileReader))

(defn wordfreq [filename]
  (with-local-vars [words {}]
    ;; with-open closes the reader once tokenizing is done
    (with-open [rdr (BufferedReader. (FileReader. filename))]
      (let [st (StreamTokenizer. rdr)]
        (loop [tt (.nextToken st)]
          (when (not= tt StreamTokenizer/TT_EOF)
            ;; only word tokens are counted; numbers and punctuation are skipped
            (when (= tt StreamTokenizer/TT_WORD)
              (let [w (.toLowerCase (.sval st))]
                (var-set words (assoc @words w (inc (@words w 0))))))
            (recur (.nextToken st))))))
    ;; print [count word] pairs, highest count first
    (println (reverse (sort (map (fn [[k v]] [v k]) @words))))))

For me it was faster (even ignoring output):
user=> (time (wordfreq "wordfreq.txt"))
"Elapsed time: 444.171796 msecs"
user=> (time (top-words "wordfreq.txt" "out.txt"))
"Elapsed time: 618.196978 msecs"
Obviously, if you wanted to take this approach, you could rework it to
use your existing printer for a fairer comparison.
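
As a rough sketch of that rework (not something I have timed), the
counting can be split from the printing so the result can be handed to
whatever printer you already have. The names word-counts and
wordfreq-sorted are made up for illustration, and word-counts takes any
java.io.Reader purely so it can be tried on a string:

```clojure
(import '(java.io StreamTokenizer BufferedReader FileReader StringReader))

;; Hypothetical helper (not from the code above): counts lower-cased
;; TT_WORD tokens read from any java.io.Reader and returns a map.
(defn word-counts [rdr]
  (let [st (StreamTokenizer. rdr)]
    (loop [tt (.nextToken st), counts {}]
      (cond
        (= tt StreamTokenizer/TT_EOF) counts
        (= tt StreamTokenizer/TT_WORD)
        (let [w (.toLowerCase (.sval st))]
          (recur (.nextToken st) (assoc counts w (inc (get counts w 0)))))
        ;; numbers, quoted strings and punctuation are ignored
        :else (recur (.nextToken st) counts)))))

;; Hypothetical wrapper: returns [word count] entries, highest count first,
;; instead of printing them.
(defn wordfreq-sorted [filename]
  (with-open [rdr (BufferedReader. (FileReader. filename))]
    (sort-by (comp - val) (word-counts rdr))))
```

Something like (doseq [[w n] (wordfreq-sorted "wordfreq.txt")] (println n w))
would then print it, or you could pass the sequence to your own printer.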
Interestingly, when I compared 3 implementations:
1) by Chouser here:
http://groups.google.com/group/clojure/browse_thread/thread/d03e75812de6c6e2/5c47c243474c999d?lnk=gst&q=sort+by+value#5c47c243474c999d
2) top-words as described
3) Using a StreamTokenizer
I get 3 different histograms using a test file! All very similar but
slightly different. It is probably largely related to my test file
having newlines from the opposite platform... which shows that word
counting is not necessarily a cut-and-dried thing! Hahahaha, so just
how many words are in this file??? :)
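
Apart from newlines, the tokenizers themselves can disagree. With its
default settings StreamTokenizer parses numbers as TT_NUMBER, treats '
and " as string quotes, and treats / as a comment character that skips
the rest of the line, so none of that text comes back as TT_WORD. Here
is a small sketch comparing it with a regex split; both function names
are made up for illustration, and the \w+ pattern is just one plausible
splitter, not necessarily what the other implementations use:

```clojure
(import '(java.io StreamTokenizer StringReader))

;; Hypothetical helper: lower-cased TT_WORD tokens from a string,
;; using StreamTokenizer's default syntax table.
(defn tokenizer-words [s]
  (let [st (StreamTokenizer. (StringReader. s))]
    (loop [tt (.nextToken st), acc []]
      (cond
        (= tt StreamTokenizer/TT_EOF) acc
        (= tt StreamTokenizer/TT_WORD)
        (let [w (.toLowerCase (.sval st))]
          (recur (.nextToken st) (conj acc w)))
        :else (recur (.nextToken st) acc)))))

;; Hypothetical helper: a regex-based splitter for comparison.
(defn regex-words [s]
  (vec (re-seq #"\w+" (.toLowerCase ^String s))))

(tokenizer-words "a 42 b")     ; => ["a" "b"]
(regex-words "a 42 b")         ; => ["a" "42" "b"]
(tokenizer-words "x/y z\nw")   ; => ["x" "w"]
(regex-words "x/y z\nw")       ; => ["x" "y" "z" "w"]
```

So even on identical input the two approaches can give different word
lists, before line endings come into it at all.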
Regards,
Tim.