On Fri Feb 27 2009 @ 8:24, Susan wrote: > If my data looks like this: > > word 1: 100 101 101 102 102 102 106 106 > word 2: 101 104 106 110 113 129 131 148 > word 3: 101 153 175 180 381 > word 4: 106 110 113 122 131 137 142 148 > word 5: 120 165 169 > > where word 1,2,3,4,5 represent different words, numbers represent > different attributes of words. > > How can I calculate similarity between words?
Like John and Chas, I'm not entirely sure what you mean by similarity. My guess is that you want to see what attributes the various words share. So, for example, word 1 and word 2 both have attributes 101 and 106. However, I'm not sure why a word can have the same attribute twice or even more. That is, I don't know if that means anything (word 1, for example, has 101 twice, 102 three times and 106 twice - whatever that means). Still, it's a snow day for me here, and Chas's version made me think more about this, so here's a stab at it. This compares words two at a time, and I wonder if you actually wanted to compare them against each other all at once. If so, it might be better to turn your data inside out - make the attributes the keys and create lists of words that have that attribute. Also, I didn't clean up the "results" print out, so right now you get each result twice (word 1 and word 5 share...and later word 5 and word 1 share...). #!/usr/bin/env perl use strict; use warnings; my %word_attributes; while (<DATA>) { chomp; $_ =~ s/^(.*)://; next unless $1; my $word = $1; $_ =~ s/^\s+//; $word_attributes{$word} = [ split /\s+/, $_ ]; } foreach my $key (sort keys %word_attributes) { my @outer = @{ $word_attributes{$key} }; @outer = remove_dups(@outer); foreach my $key2 (sort keys %word_attributes) { next if $key2 eq $key; my @inner = @{ $word_attributes{$key2} }; @inner = remove_dups(@inner); my (@intersection, %count); foreach my $attribute (@outer, @inner) { $count{$attribute}++ } foreach my $attribute (keys %count) { if ($count{$attribute} > 1) { push @intersection, $attribute; } } print "$key and $key2 share "; if (@intersection) { @intersection = sort { $a <=> $b } @intersection; print "these attributes: @intersection\n"; } else { print "nada\n"; } } print "\n"; } sub remove_dups { my @attributes = @_; my %seen; @attributes = grep { !$seen{$_}++ } @attributes; } __DATA__ word 1: 100 101 101 102 102 102 106 106 word 2: 101 104 106 110 113 129 131 148 word 3: 101 153 175 180 381 word 4: 106 110 113 122 131 137 142 148 word 5: 120 165 169 -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/