Hi Jeremy,

On Monday 05 March 2007 21:05, Jeremy C. Reed wrote:
> On Mon, 5 Mar 2007, Steve Litt wrote:
> > In preparation to create my index for my book, I created a Ruby program
> > to list every word in a file (in this case the .lyx file).
> >
> > Now of course this could be done with a simple one-liner using sed and
> > sort -u, but my program lists the words in 2 different orders, first in
> > alpha order, which of course could be done by the 1 liner, and then in
> > descending order of occurrence, which can't be.
>
> fmt -1 | sort | uniq -c | sort -rn

Knowing that would have saved me two hours :-) I wasn't familiar with the fmt 
and the uniq commands. Thanks.

> Also could add some tr and sed to clean out junk spacing and to lowercase
> everything.

Yes. Here's my final answer, merging everything into lower case, and blowing 
off leading and trailing space and punctuation:

fmt -1 < tsjustfacts.txt | sed -e 's/^[[:space:][:punct:]]*//' |
sed -e 's/[[:space:][:punct:]]*$//' | tr '[:upper:]' '[:lower:]' | sort |
uniq -c | sort -rn

That's sweet. Simpler than the Ruby, and probably faster, especially on 
multicore/multiprocessor machines, since each stage of the pipe runs as its 
own process. Thanks!

The one thing this doesn't do is, upon final sort, sort by count descending 
but name ascending. Can you think of a way to do that with standard Linux 
commands?
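
One possibility I can sketch, assuming a POSIX sort: give it two keys, the 
count as a reversed numeric key and the word as an ordinary ascending key. 
The function name count_words is made up for illustration, and LC_ALL=C is my 
assumption to make the alphabetical order predictable across machines.

```shell
# Hypothetical helper: read one word per line on stdin and print
# "count word" lines, highest count first, ties in ascending word order.
count_words() {
    # uniq -c needs identical lines to be adjacent, hence the first sort.
    # -k1,1nr : field 1 (the count), numeric, reversed (descending)
    # -k2     : field 2 onward (the word), ascending
    # LC_ALL=C is an assumption: plain byte-order comparison of the words.
    sort | uniq -c | LC_ALL=C sort -k1,1nr -k2
}

# apple and pear (2 each) come out before fig and kiwi (1 each).
printf '%s\n' pear apple pear fig apple kiwi | count_words
```

If that holds up, it would simply replace the trailing "sort | uniq -c | 
sort -rn" in the chain above.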

Another filter that might be useful in this chain is:

grep -v '^[[:space:][:digit:][:punct:]]*$'

In other words, if a line consists of nothing but space, digits and 
punctuation (or is empty), it's probably not an index candidate and can be 
deleted, saving future processing and reducing extraneous output.
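
A quick way to see what that filter keeps and drops, as a sketch only. The 
function name drop_non_words and the sample lines are made up for 
illustration; the pattern is quoted so the shell can't expand the brackets.

```shell
# Hypothetical helper: drop lines made up only of whitespace, digits and
# punctuation (including empty lines); keep anything containing a letter.
drop_non_words() {
    grep -v '^[[:space:][:digit:][:punct:]]*$'
}

# "2007", "---" and the empty line are dropped; "index" and "lyx2" survive.
printf '%s\n' 'index' '2007' '---' '' 'lyx2' | drop_non_words
```

It probably belongs right after the sed stages, before the sort, so the sort 
has fewer lines to chew on.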

I'm not sure whether it's a good idea to lowercase everything. I think 
sometimes case serves as a reminder of the meaning of a word. To keep the 
original case, simply remove the tr stage from the pipeline.

>
> By the way, I did something similar when doing some indexing.

Indexing is the most distasteful, boring, and tedious part of writing a book. 
Making word lists like this at least makes it a brainless activity.

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/
