On Mon, 26 Apr 2004 02:44:13 +0300
robin <[EMAIL PROTECTED]> wrote:

> This looks promising, but it only works on lines; I need words (sorry,
> I should have been more specific when I said "strings"). What I'm
> trying 

What you're probably looking for can be done with either awk or perl,
using one of the neat features included in both -- associative arrays.
An associative array, basically, is a structure whose index can be any
arbitrary text, rather than a numeric index as in most programming
languages. The downside to this method is that it can use gobs of
memory (I tried it on a newsgroup once, just to see how much RAM it
would take), but if you have enough RAM that shouldn't be a concern,
depending of course on the size of your texts. With one, you can query
the number of occurrences of the word "blue" with an expression like
count["blue"]. Reading in each word, you increment its associated
counter; you could then do the same counting pass on the second file
and compare the two word lists.
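A one-liner along these lines shows the idea (a sketch; the word "blue"
and the sample text are just made up here):

```shell
# Count every word on stdin in the associative array num, then look up
# one word by name -- the index is the text of the word itself.
echo "blue fish blue sky red" |
awk '{ for (i = 1; i <= NF; i++) num[$i]++ }
     END { print "blue occurs", num["blue"], "times" }'
```

That prints "blue occurs 2 times".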

I happen to have a sample awk script that counts frequencies of words
(the one I tried on a Usenet newsgroup, just for grins). Just make it
executable and run it against a text file. Note that the counting is
done inside the short for loop, which just counts each word that comes
in.
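Stripped to its essentials (and skipping the punctuation cleanup the
real script does first), the pipeline amounts to this sketch:

```shell
# Tally each word, dump "word count" pairs, and sort for readability
# (awk's for-in loop visits the array in no particular order).
printf 'the quick fox the\n' |
awk '{ for (i = 1; i <= NF; i++) num[$i]++ }
     END { for (word in num) print word, num[word] }' |
sort
```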

This isn't exactly what you're looking for, but I guess it could be
modified. The difficulty, I would surmise, is how the data is going to
be presented. You want to do it on a word basis, but say you have two
texts:

text 1
This is a test

text 2
A test this is, of the emergency broadcast system

A word-count based approach would tell me (if we ignore capitalization
and punctuation) that the texts share the words "this", "is", "a", and
"test", but it would not tell me that the two texts are quite
different. If I use diff, which looks at lines, then I definitely see
that the two texts differ. If, for instance, text 1 read "This is a
test of the emergency broadcast system", then diff would see two
distinct lines, but you would not be able to identify the common
substring "emergency broadcast system" in the two texts. If I used a
pure word-based approach, I'd end up concluding that the word
"emergency" occurred in both texts, which is not altogether useful.
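One way to do the "compare the two word lists" step, as a rough sketch
(text1.txt and text2.txt are hypothetical file names): lowercase
everything, split into one word per line, sort each list uniquely, and
let comm print only the words both lists share.

```shell
# Build a sorted, unique, lowercased word list from a file.
words() { tr 'A-Z' 'a-z' < "$1" | tr -cs 'a-z' '\n' | sort -u; }

words text1.txt > words1.list
words text2.txt > words2.list
# -1 -2 suppress the words unique to each file, leaving the common ones.
comm -12 words1.list words2.list
```

Of course this throws away counts and word order entirely, which is
exactly the limitation described above.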

Two suggestions: first, get to a CPAN site and look around. There may
already be existing Perl scripts that you can use as-is or adapt. I
don't grok Perl :(.

Second, Usenet newsgroups comp.unix.questions and/or comp.unix.shell
might garner some good feedback.

hth

------------- wordfreq
#! /bin/sh
###     wordfreq - count number of occurrences of each word in input
###     Usage: wordfreq [-i] [files]
##
##      wordfreq COUNTS THE NUMBER OF OCCURRENCES OF EACH WORD IN ITS INPUT.
##      IF YOU GIVE IT FILES, IT READS FROM THEM; OTHERWISE IT READS stdin.
##      THE -i OPTION FOLDS UPPER CASE INTO LOWER CASE (CAPITALIZED LETTERS
##      WILL COUNT THE SAME AS LOWER-CASE).
##
##      Modified to work with gawk.
##      To use plain awk instead, replace "gawk --" with just "awk".

awkscr='{
        for (i = 1; i <= NF; i++)
                num[$i]++
}
END {
        for (word in num)
                print word, num[word]
}'

# sed EXPRESSION TO TAKE OFF PUNCTUATION BEFORE AND AFTER WORDS
# (ACTUALLY, AT SPACES, BEGINS AND ENDS OF LINES), SO PUNCTUATION WON'T
# TRASH WORD COUNTS. THE "-" GOES LAST IN THE BRACKET EXPRESSION SO IT
# IS TAKEN LITERALLY (".-?" WOULD BE A CHARACTER RANGE):
strippunc='s/[,.?!)"-]* / /g
s/[,.?!)"-]*$//
s/ ["(]/ /g
s/^["(]//'

case "$1" in
-i)     shift
        sed "
        y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
        $strippunc
        " ${1+"$@"} |
        gawk -- "$awkscr"
        ;;
*)      sed "$strippunc" ${1+"$@"} | gawk -- "$awkscr" ;;
esac

------------ end script


> Sir Robin

-- 
------------------------------------------------------------------------
David E. Fox                              Thanks for letting me
[EMAIL PROTECTED]                            change magnetic patterns
[EMAIL PROTECTED]               on your hard disk.
-----------------------------------------------------------------------
