Re: word similarity measure

Telemachus Mon, 02 Mar 2009 03:55:08 -0800

On Fri Feb 27 2009 @  8:24, Susan wrote:
> If my data looks like this:
> 
> word 1: 100    101     101    102    102     102    106    106
> word 2: 101    104     106    110    113     129    131    148
> word 3: 101    153     175    180    381
> word 4: 106    110     113    122    131     137    142    148
> word 5: 120    165     169
> 
> where word 1,2,3,4,5 represent different words, numbers represent
> different attributes of words.
> 
> How can I calculate similarity between words?


Like John and Chas, I'm not entirely sure what you mean by similarity. My
guess is that you want to see what attributes the various words share. So,
for example, word 1 and word 2 both have attributes 101 and 106. However,
I'm not sure why a word can have the same attribute twice or even more.
That is, I don't know if that means anything (word 1, for example, has 101
twice, 102 three times and 106 twice - whatever that means). Still, it's a
snow day for me here, and Chas's version made me think more about this, so
here's a stab at it. 

This compares words two at a time, and I wonder if you actually wanted to 
compare them against each other all at once. If so, it might be better to
turn your data inside out - make the attributes the keys and create lists
of words that have that attribute. Also, I didn't clean up the "results"
print out, so right now you get each result twice (word 1 and word 5
share...and later word 5 and word 1 share...).


    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %word_attributes;

    while (<DATA>) {
      chomp;
      $_ =~ s/^(.*)://;
      next unless $1;
      my $word = $1;
      $_ =~ s/^\s+//;

      $word_attributes{$word} = [ split /\s+/, $_ ];
    }

    foreach my $key (sort keys %word_attributes) {
      my @outer = @{ $word_attributes{$key} };
      @outer    = remove_dups(@outer);

      foreach my $key2 (sort keys %word_attributes) {
        next if $key2 eq $key;
        my @inner = @{ $word_attributes{$key2} };
        @inner    = remove_dups(@inner);

        my (@intersection, %count);

        foreach my $attribute (@outer, @inner) { $count{$attribute}++ }
        foreach my $attribute (keys %count) {
          if ($count{$attribute} > 1) {
            push @intersection, $attribute;
          }
        }

        print "$key and $key2 share ";
        if (@intersection) {
          @intersection = sort { $a <=> $b } @intersection;  
          print "these attributes: @intersection\n";
        }
        else {
          print "nada\n";
        }
      }
      print "\n";
    }

    sub remove_dups {
      my @attributes = @_;
      my %seen;
      @attributes = grep { !$seen{$_}++ } @attributes;
    }

    __DATA__
    word 1: 100    101     101    102    102     102    106    106
    word 2: 101    104     106    110    113     129    131    148
    word 3: 101    153     175    180    381
    word 4: 106    110     113    122    131     137    142    148
    word 5: 120    165     169

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

Re: word similarity measure

Reply via email to