Hi Paula,

> Dear Ted,
>
> I have been using SenseClusters and there are some things that I cannot
> understand.
> So here are my questions,
>
> - how can I construct a training file? What type of sentences should I
> choose to put in the training file for SenseClusters?
>

The training file should be plain text, and it can be whatever text
you think is representative of the data you wish to process. A
training file is not required, however; you also have the option of
getting your features directly from the test data you are clustering.

In general, if your test data is fairly large (a few hundred contexts
perhaps) then you may not need a training file. If however your test
data is small (10 or 20 contexts, for example) then you may not have
enough data there to acquire meaningful features.

Here's an example from one of my domains. Sometimes we cluster person
names that we find in web contexts. Suppose we have 10 web contexts
that contain the name George Miller. There is probably not enough text
in 10 web contexts to acquire very many features, so in this case we
might use a few million lines of New York Times text as training data,
thinking that we'll get generally useful features from newspaper text,
and plenty of it. Then, when it comes to representing those 10
contexts, we'll have pretty rich features available.

For example, suppose one of the 10 contexts in the test data is something like:

Princeton professor George Miller is a brilliant man.

In those 10 contexts we might not have any other mentions of
Princeton. However, in the New York Times Princeton is probably
mentioned many times, and we can build a rich co-occurrence vector for
Princeton based on that information, which is then used to replace the
word Princeton in the context above. If we are able to do that with
all of the content words in the context above and the other 9
contexts, maybe we can then cluster those 10 contexts more
successfully.
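If it helps to see the idea concretely, here is a rough sketch of that
replacement step (this is just an illustration of second-order
co-occurrence with made-up toy data, not the actual SenseClusters code):

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(training_lines, window=2):
    """Build a co-occurrence vector (word -> count map) for every word
    in the training text, using a small sliding window."""
    vectors = defaultdict(Counter)
    for line in training_lines:
        words = line.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][words[j]] += 1
    return vectors

def context_vector(context, vectors):
    """Represent a test context by averaging the co-occurrence vectors
    of its words -- each word is 'replaced' by what it co-occurs with
    in the training data (second-order co-occurrence)."""
    total = Counter()
    n = 0
    for w in context.lower().split():
        if w in vectors:
            total.update(vectors[w])
            n += 1
    return {k: v / n for k, v in total.items()} if n else {}

# Toy training lines standing in for the NYT text mentioned above.
training = [
    "Princeton is a university in New Jersey",
    "professors at Princeton teach students",
]
vecs = cooccurrence_vectors(training)
rep = context_vector("Princeton professor George Miller is a brilliant man", vecs)
```

Even though the test context never mentions "university", the context's
representation picks it up through the training co-occurrences of words
like "Princeton" -- which is what makes the small test set clusterable.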

So, the choice of whether or not to use training data depends very
much on how much test data you have, and on whether you have a good or
reasonable choice of training data.

Please note that the training data is never clustered; it is only used
for feature selection. The test portion of the data is all that is
ever clustered.

> - I have been using Target Word Clustering with head marked files. I noticed
> that the clusters work better if I use lemmatized sentences instead of the
> original ones (I think this is because of the variety of morphological forms
> in the Portuguese language). Can I trust the results of SenseClusters if I
> use the lemmas?

Yes, using lemmatized forms should be no problem. The only point to
make here would be that if you use a separate set of training data
then that should probably be lemmatized too.

> - how can I tell SenseClusters to make as many clusters as it needs to,
> instead of specifying an initial number of clusters?

You have the option in the Web interface to select a cluster stopping
measure; otherwise it will default to 2 clusters. You will see this on
the second screen of the web interface, where you must select the
option to "use cluster stopping measures".

You might want to start with PK2, which generally works well. PK2
simply compares the criterion function values for successively
increasing numbers of clusters, and stops the clustering when there is
no further improvement in those values.

If you are using the command line script discriminate.pl, you can
specify the --cluststop option. Here's the online help for that, just
to give you an idea of what it looks like and a very quick summary of
the cluster stopping measures:

Cluster-Stopping Options:

--cluststop CS
        Specify the cluster stopping measure to be used to predict the
        number of clusters.

        The possible option values:

        pk1 - Use PK1 measure
        [PK1[m] = (crfun[m] - mean(crfun[1...deltaM])) / std(crfun[1...deltaM])]
        pk2 - Use PK2 measure
        [PK2[m] = (crfun[m]/crfun[m-1])]
        pk3 - Use PK3 measure
        [PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1]))]
        gap - Use Adapted Gap Statistic.
        pk  - Use all the PK measures.
        all - Use all the four cluster stopping measures.

        More about these measures can be found in the documentation of
        Toolkit/clusterstop/clusterstopping.pl
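To make those formulas a bit more concrete, here is a small sketch
that computes the PK measures from a list of criterion function
values. The crfun values here are made up for illustration; in
practice they come from your clustering runs:

```python
from statistics import mean, stdev

# Hypothetical criterion function values for k = 1..7 clusters.
crfun = [0.40, 0.62, 0.75, 0.90, 0.91, 0.915, 0.916]

def pk1(m, delta_m=len(crfun)):
    # PK1[m] = (crfun[m] - mean(crfun[1..deltaM])) / std(crfun[1..deltaM])
    return (crfun[m - 1] - mean(crfun[:delta_m])) / stdev(crfun[:delta_m])

def pk2(m):
    # PK2[m] = crfun[m] / crfun[m-1]; values near 1 mean no improvement.
    return crfun[m - 1] / crfun[m - 2]

def pk3(m):
    # PK3[m] = (2 * crfun[m]) / (crfun[m-1] + crfun[m+1])
    return 2 * crfun[m - 1] / (crfun[m - 2] + crfun[m])

# PK2 drops toward 1 once adding clusters stops helping (around k = 4
# in this made-up example).
for m in range(2, len(crfun) + 1):
    print(m, round(pk2(m), 3))
```

This is only meant to show the shape of the measures; SenseClusters
computes and applies them for you when you select a stopping measure.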

> Thanks in advance for your patience. If there is something that contains
> this information and that I can read, please tell me.

You might just want to look at the following for more info about
cluster stopping:

http://senseclusters.sourceforge.net/Toolkit_Docs/clusterstop/clusterstopping.html

If you are trying to use the command line interface (via
discriminate.pl) you might want to check out some of the examples in
the Demos directory to get an idea of how those are set up.

Do let me know if you have any questions about any of this!

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
