On Mon, 26 Apr 2004 02:44:13 +0300 robin <[EMAIL PROTECTED]> wrote:
> This looks promising, but it only works on lines; I need words (sorry,
> I should have been more specific when I said "strings"). What I'm
> trying

What you're probably looking for could be done using either awk or
perl, via one of the neat features included in both -- associative
arrays. An associative array, basically, is a structure whose index can
be any arbitrary text, rather than the numeric index most programming
languages use. Thus you can query something like the number of
occurrences of the word "blue" with an expression like count["blue"].
Reading in each word, you bump its associated counter; then you can do
a similar counting pass over the second file and compare the two word
lists (a rough sketch of that comparison appears below, just before the
script). The downside to this method is that it can use gobs of memory
(I tried it on a newsgroup once, just to see how much RAM it would
take), but if you have enough RAM that shouldn't be a concern,
depending of course on the size of your texts.

I happen to have a sample awk script that counts frequencies of words
(the one I tried on a Usenet newsgroup, just for grins); it's appended
at the end of this message. Just make it executable and run it against
a text file. Note that the counting is done inside the short for loop,
which just counts each word that comes in. This isn't exactly what
you're looking for, but I guess it could be modified.

The difficulty, I would surmise, is how the data is going to be
presented. You want to do it on a word basis, but say you have two
texts:

text 1: This is a test
text 2: A test this is, of the emergency broadcast system

A word-count based approach would tell me (if we ignore capitalization
and punctuation) that the common words are "this", "is", "a", and
"test", but it would not tell me that the two texts are quite
different. If I use diff, which looks at lines, then I definitely see
that the two texts have differences. If, for instance, text 1 read
"This is a test of the emergency broadcast system", then diff would
still see two distinct lines (sample output below), and you would not
be able to identify the common substring "emergency broadcast system"
in the two texts. If I used a pure word-based approach, I'd end up
concluding only that the word "emergency" occurred in both texts, which
is not altogether useful.

Two suggestions: first, get to a CPAN site and look around. There might
be already extant perl scripts that you can use as is or adapt (I don't
grok Perl, though :(). Second, the Usenet newsgroups
comp.unix.questions and/or comp.unix.shell might garner some good
feedback.
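For the compare step, something like this might do (untested, off the
top of my head; file1 and file2 are placeholder names, and it skips the
punctuation and case cleanup that the script below does):

awk '
# First file: count each word into num1.
FNR == NR { for (i = 1; i <= NF; i++) num1[$i]++; next }
# Second file: count each word into num2.
{ for (i = 1; i <= NF; i++) num2[$i]++ }
# At the end, print every word seen in both files, with both counts.
END {
    for (word in num1)
        if (word in num2)
            print word, num1[word], num2[word]
}
' file1 file2

Words that appear in only one of the files are silently dropped; adjust
the END loop if you want those reported as well.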
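And to make the diff point concrete: with the longer version of text 1
above, and each text in its own file (text1 and text2 here are just
illustrative names), diff's default output flags the whole lines as
changed:

$ diff text1 text2
1c1
< This is a test of the emergency broadcast system
---
> A test this is, of the emergency broadcast system

The common tail "emergency broadcast system" never shows up as such;
all diff can say is that line 1 changed.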
hth

------------- wordfreq
#! /bin/sh
### wordfreq - count number of occurrences of each word in input
### Usage: wordfreq [-i] [files]
##
## wordfreq COUNTS THE NUMBER OF OCCURRENCES OF EACH WORD IN ITS INPUT.
## IF YOU GIVE IT FILES, IT READS FROM THEM; OTHERWISE IT READS stdin.
## THE -i OPTION FOLDS UPPER CASE INTO LOWER CASE (CAPITALIZED LETTERS
## WILL COUNT THE SAME AS LOWER-CASE).
##
## Modified to work with gawk.
## To use plain awk, replace "gawk --" with awk.

# THE awk SCRIPT: BUMP A COUNTER FOR EACH WORD, THEN DUMP THE COUNTS.
awkscr='{ for (i = 1; i <= NF; i++) num[$i]++ }
END { for (word in num) print word, num[word] }'

# sed EXPRESSION TO TAKE OFF PUNCTUATION BEFORE AND AFTER WORDS
# (ACTUALLY, AT SPACES, BEGINS AND ENDS OF LINES), SO PUNCTUATION WON'T
# TRASH WORD COUNTS. (THE HYPHEN GOES LAST IN THE BRACKETS SO sed
# TREATS IT AS A LITERAL CHARACTER, NOT A RANGE.)
strippunc='s/[,.?!)"-]* / /g
s/[,.?!)"-]*$//g
s/ ["(]/ /g
s/^["(]//g'

case "$1" in
-i) shift
    sed "
    y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
    $strippunc
    " ${1+"$@"} | gawk -- "$awkscr" ;;
*)  sed "$strippunc" ${1+"$@"} | gawk -- "$awkscr" ;;
esac
------------ end script

> Sir Robin

--
------------------------------------------------------------------------
David E. Fox                          Thanks for letting me
[EMAIL PROTECTED]                     change magnetic patterns
[EMAIL PROTECTED]                     on your hard disk.
-----------------------------------------------------------------------