[Senseclusters-users] using SenseClusters for word clustering/synonym finding

ted pedersen Sun, 30 Jan 2005 18:15:06 -0800


We have come up with a small set of changes to discriminate.pl that will
allow SenseClusters to be used to do word clustering. I am sure that we
will integrate this functionality into subsequent releases of
SenseClusters, but for now you can hack this into place yourself, or we'd
be happy to provide you with a modified version of discriminate.pl that
will take care of synonym identification.


Before getting into the details of how this works, a great deal of credit
must be given to Amruta - she figured this out a long time ago and it's
only recently that I've been able to catch up with her. She's been very
helpful lately in explaining all this to me and helping me figure this
out, so I give her a big thank you here.

Synonym finding is supported if you use 2nd order vectors and bigram or
co-occurrence features. This is because we need to create a word by word
co-occurrence matrix to do word clustering, and neither 1st order vectors
nor unigrams support this.

When using bigram features, a bigram matrix is created, where the rows
are the first words in a bigram and the columns are the second words in
a bigram. With co-occurrence features the matrix is symmetric and simply
indicates whether the two words occur together, without respect to their
order. The cells of the bigram or co-occurrence matrix can hold binary
values, frequency counts, or measures of association. Once the bigram or
co-occurrence matrix is created, you can optionally perform SVD on that
matrix to reduce the dimensionality of the columns (the rows are
unaffected). Thereafter the matrix is passed to cluto, where the rows are
clustered. In this case each row represents a word, so this is how the
word clustering is carried out.
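
To make the difference between the two matrices concrete, here is a
minimal sketch in Python. It is not SenseClusters code: the function
names are made up for illustration, only adjacent word pairs are
counted (an assumption; the real feature selection is more flexible),
and the cells hold plain frequency counts.

```python
from collections import defaultdict

def bigram_matrix(tokens):
    """Rows are first words, columns are second words, cells are counts."""
    matrix = defaultdict(lambda: defaultdict(int))
    for first, second in zip(tokens, tokens[1:]):
        matrix[first][second] += 1
    return matrix

def cooccurrence_matrix(tokens):
    """Symmetric variant: the order of the two words is ignored."""
    matrix = defaultdict(lambda: defaultdict(int))
    for first, second in zip(tokens, tokens[1:]):
        matrix[first][second] += 1
        if first != second:
            # Mirror the count so that matrix[a][b] == matrix[b][a].
            matrix[second][first] += 1
    return matrix

tokens = "the cat sat on the mat".split()
bigrams = bigram_matrix(tokens)
cooc = cooccurrence_matrix(tokens)
```

In the bigram matrix the row for "the" has a count in the "cat" column
but the row for "cat" has nothing in the "the" column, while in the
co-occurrence matrix both cells are filled. Either way, each row is the
vector for one word, and those rows are what get clustered.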

Supporting this in SenseClusters turns out to be very simple. We merely
omit order2vec.pl from the sequence of processing a 2nd order feature
space. order2vec.pl finds all the word vectors in the bigram or
co-occurrence matrix that are associated with the words in a context to
be clustered, and then averages them together to create a representation
of that context. However, in word clustering we just want to cluster the
words together rather than finding an averaged vector to cluster contexts,
so we disable order2vec.pl and have the wordvec.pl output serve as the
input to the clustering process (rather than the order2vec.pl output).
Note that wordvec.pl is the program that creates the bigram or
co-occurrence matrix.
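
The contrast between the two modes can be sketched as follows. This is
an illustration in Python, not the actual order2vec.pl logic: the word
vectors are invented, and skipping words that have no vector is an
assumption made to keep the example short.

```python
def average_vectors(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical rows of a word-by-word matrix (one vector per word).
word_vectors = {
    "bank":  [2.0, 0.0, 1.0],
    "money": [4.0, 0.0, 1.0],
    "river": [0.0, 3.0, 1.0],
}

# Context clustering: look up the vectors of the words in a context
# and average them into a single vector that represents the context.
context = ["bank", "money", "unknown"]
known = [word_vectors[word] for word in context if word in word_vectors]
context_vector = average_vectors(known)

# Word clustering: skip the averaging step and cluster the rows as is.
rows_to_cluster = [word_vectors[word] for word in sorted(word_vectors)]
```

In the first case the clustering input has one row per context; in the
second it has one row per word, which is all that disabling
order2vec.pl changes.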

In order to allow cluto to cluster the output of wordvec.pl, we must make
sure to provide cluto an --rlabel file that specifies the names of the
words to be clustered. Fortunately, that --rlabel file is already created
by wordvec.pl, and we simply need to rename it for cluto.
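
As a rough sketch of what a row label file amounts to, the Python below
writes one word per line in the same order as the matrix rows and then
checks that the label count matches the row count. The one-label-per-line
layout is my understanding of what cluto expects and should be treated
as an assumption; the file name and helper functions are hypothetical.

```python
import os
import tempfile

def write_row_labels(words, path):
    """Write one label per line, in the same order as the matrix rows."""
    with open(path, "w") as handle:
        for word in words:
            handle.write(word + "\n")

def read_row_labels(path):
    """Read the labels back, one per line."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]

# Hypothetical matrix with one row per word, in this order.
words = ["bank", "money", "river"]
matrix_rows = [[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]]

path = os.path.join(tempfile.mkdtemp(), "example.rlabel")
write_row_labels(words, path)
labels = read_row_labels(path)

# The i-th label must name the i-th row, so the counts have to agree.
labels_match_rows = len(labels) == len(matrix_rows)
```

The point is only that the labels and the rows must stay aligned, which
is why reusing the file that wordvec.pl already writes is enough.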

In effect then, to carry out word clustering, we simply have cluto process
the output of wordvec.pl rather than order2vec.pl.

We must also make a minor change to format_clusters.pl, since it will not
be displaying contexts in clusters. Instead, each word to be clustered is
treated as an instance id, and the options to format_clusters.pl are
changed so that they show the instance ids in each cluster.
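
The idea of showing instance ids per cluster can be sketched like this
in Python. It does not reproduce the output format of
format_clusters.pl; it assumes a clustering solution that gives one
cluster id per row, in the same order as the row labels, and the
function name is made up.

```python
from collections import defaultdict

def words_by_cluster(labels, assignments):
    """Group row labels (here, words) by their assigned cluster id."""
    if len(labels) != len(assignments):
        raise ValueError("need exactly one cluster id per row label")
    clusters = defaultdict(list)
    for word, cluster_id in zip(labels, assignments):
        clusters[cluster_id].append(word)
    return dict(clusters)

# Hypothetical row labels and a matching clustering solution.
labels = ["bank", "money", "river", "stream"]
assignments = [0, 0, 1, 1]
clusters = words_by_cluster(labels, assignments)
```

Because the instance ids are the words themselves, listing the
instances in a cluster is the same as listing the word cluster.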

So that's how it's done. The required code modifications are minor, and
they are described below.

There are two places in discriminate.pl where changes need to be made.
Below I show those changes as made within discriminate.pl. The changes
start at lines 947 and 1145...

The following starts at line 947 of discriminate.pl ....

        # --------------------------
        # Creating Context Vectors
        # --------------------------

        if(defined $opt_verbose)
        {
                print STDERR "Building 2nd Order Context Vectors ...\n";
        }
        $context_string="--token $token --rlabel $rlabel ";
        if(defined $opt_svd)
        {
                $context_string.="--dense ";
        }
        if(defined $opt_binary)
        {
                $context_string.="--binary ";
        }
## --------------------------------------------------------------------
### changes to support synonym finding/word clusters in senseclusters
### by tdp, jan 30, 2005

### bypass the use of order2vec.pl
### must create vectors and rlabel files

#       $status=system("order2vec.pl --format $format $context_string $rclass_s$
#       die "Error while running order2vec.pl on $test_context.\n" unless $stat$

## use the wordvec.pl output as input to cluto rather than order2vec.pl
## wordvec is a word co-occurrence matrix

        $status=system("mv $wordvec $vectors");
        die "Error while creating $vectors.\n" unless $status==0;

## words now serve as row labels/instance ids

        $status=system("mv $features $rlabel");
        die "Error while creating $rlabel.\n" unless $status==0;

### end of these modifications, must also modify format clusters
### --------------------------------------------------------------------

  $status=system("mv $dims $clabel");
  die "Error while creating $clabel.\n" unless $status==0;

======================================================================
...Later in discriminate.pl...at line 1145
======================================================================

# formatting clustering solution, show instances in each cluster

$clusters="$prefix.clusters";

## --------------------------------------------------------------------
## modifications to support word clustering...tdp jan 30, 2005

## remove senseval2 formatted test file from format_clusters, we only
## want to display the words in each cluster
## note that each word is considered and displayed as an instance id, so
## if you look at the instances that make up a cluster, you will see the
## word cluster

##$status=system("format_clusters.pl $cluster_solution $rlabel --senseval2 $tes$
$status=system("format_clusters.pl $cluster_solution $rlabel > $clusters");

## --------------------------------------------------------------------


--
Ted Pedersen
http://www.d.umn.edu/~tpederse

