On Fri Feb 27 2009 @ 8:24, Susan wrote:
> If my data looks like this:
>
> word 1: 100 101 101 102 102 102 106 106
> word 2: 101 104 106 110 113 129 131 148
> word 3: 101 153 175 180 381
> word 4: 106 110 113 122 131 137 142 148
> word 5: 120 165 169
>
> where word 1,2,3,4,5 represent different words, numbers represent
> different attributes of words.
>
> How can I calculate similarity between words?
Like John and Chas, I'm not entirely sure what you mean by similarity. My
guess is that you want to see what attributes the various words share. So,
for example, word 1 and word 2 both have attributes 101 and 106. However,
I'm not sure why a word can have the same attribute twice or even more.
That is, I don't know if that means anything (word 1, for example, has 101
twice, 102 three times and 106 twice - whatever that means). Still, it's a
snow day for me here, and Chas's version made me think more about this, so
here's a stab at it.
This compares words two at a time, and I wonder if you actually wanted to
compare them against each other all at once. If so, it might be better to
turn your data inside out - make the attributes the keys and create lists
of words that have that attribute. Also, I didn't clean up the "results"
print out, so right now you get each result twice (word 1 and word 5
share...and later word 5 and word 1 share...).
#!/usr/bin/env perl
use strict;
use warnings;
my %word_attributes;
while (<DATA>) {
chomp;
$_ =~ s/^(.*)://;
next unless $1;
my $word = $1;
$_ =~ s/^\s+//;
$word_attributes{$word} = [ split /\s+/, $_ ];
}
foreach my $key (sort keys %word_attributes) {
my @outer = @{ $word_attributes{$key} };
@outer = remove_dups(@outer);
foreach my $key2 (sort keys %word_attributes) {
next if $key2 eq $key;
my @inner = @{ $word_attributes{$key2} };
@inner = remove_dups(@inner);
my (@intersection, %count);
foreach my $attribute (@outer, @inner) { $count{$attribute}++ }
foreach my $attribute (keys %count) {
if ($count{$attribute} > 1) {
push @intersection, $attribute;
}
}
print "$key and $key2 share ";
if (@intersection) {
@intersection = sort { $a <=> $b } @intersection;
print "these attributes: @intersection\n";
}
else {
print "nada\n";
}
}
print "\n";
}
sub remove_dups {
my @attributes = @_;
my %seen;
@attributes = grep { !$seen{$_}++ } @attributes;
}
__DATA__
word 1: 100 101 101 102 102 102 106 106
word 2: 101 104 106 110 113 129 131 148
word 3: 101 153 175 180 381
word 4: 106 110 113 122 131 137 142 148
word 5: 120 165 169
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/