One thing to keep in mind about SenseClusters is that you can achieve tremendous variation in your experiments simply by changing around the tokenization a little.
Now, this is our standard tokenization file, and you get this automatically:

  /<head[^<]*>\s*\w+\s*<\/head>/
  /<sat.*>\s*\w+\s*<\/sat>/
  /\w+'\w+/
  /\w+/

These are Perl regular expressions, and they are basically saying that strings of the form

  <head>word</head>
  <sat>word</sat>
  word
  word's

will be identified as tokens. This corresponds to our standard view of words as space separated tokens. So if you have a context like

  His boat's are by the <head>water</head>

then you get the following tokens:

  His boat's are by the <head>water</head>

This is all well and good...but what about the following tokenization scheme?

  /<head[^<]*>\s*\w+\s*<\/head>/
  /\w\w\w/

What is this doing? Well, it considers tokens to be <head>word</head> tags and also 3 character sequences! We call this a poor man's stemmer. :) So in the above example, we'd end up with the following tokens:

  His boa are the <head>water</head>

Now, what is interesting about this is that you would then go on to identify features based on these three character sequences (for example, if you used bigrams they would be two three character sequences that occurred in order).

Here I did an experiment with the Mexico-Brazil data where I used 4 character sequences as tokens, and then made up bigrams of these four character tokens:

  http://marimba.d.umn.edu/SC-htdocs/gram4-mexico-brazil1123242792/

What is fairly interesting about this is that I got a rather nice result of F-measure 71% using these features:

  http://marimba.d.umn.edu/SC-htdocs/gram4-mexico-brazil1123242792/gram4-mexico-brazil.report

If I do "normal" tokenization, then the result drops to 61%!

  http://marimba.d.umn.edu/SC-htdocs/user1123243053/

To be clear, the only difference between the experiment above and this one is how the tokenization was performed!

  http://marimba.d.umn.edu/SC-htdocs/user1123243053/user.report

Now, you can also go in the other direction, and have tokens be represented by more than one word.
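If you want to play with this idea outside of SenseClusters, here is a minimal sketch of the behavior in Python (SenseClusters itself is Perl; the `tokenize` helper below is just an illustration I made up, scanning left to right with the earlier patterns taking priority):

```python
import re

context = "His boat's are by the <head>water</head>"

# The standard token definitions, in priority order.
standard = [r"<head[^<]*>\s*\w+\s*</head>", r"<sat.*>\s*\w+\s*</sat>",
            r"\w+'\w+", r"\w+"]

# The "poor man's stemmer": <head> tags plus 3 character sequences.
stemmer = [r"<head[^<]*>\s*\w+\s*</head>", r"\w\w\w"]

def tokenize(text, patterns):
    # Join the patterns into one alternation so that earlier patterns
    # win at each position, then collect non-overlapping matches.
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))
    return combined.findall(text)

print(tokenize(context, standard))
# -> ['His', "boat's", 'are', 'by', 'the', '<head>water</head>']
print(tokenize(context, stemmer))
# -> ['His', 'boa', 'are', 'the', '<head>water</head>']
```

Note how the stemmer scheme drops "by" entirely (it is only two characters) and clips "boat's" down to "boa", which is exactly why it acts like a crude stemmer.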
For example, suppose you use the following tokenization scheme:

  /<head[^<]*>\s*\w+\s*<\/head>/
  /\w+\s+\w+/

So here your tokens are two word sequences, so if you find bigrams, for example, they are made up of two two word sequences that occur in a particular order. Now, it seems like for this kind of feature you need a lot more data, so I ran on a larger set of the Mexico-Brazil data, and I used a lower frequency cutoff. The results aren't good at all (52%), but the idea here is simply to give an example of how different the feature space looks. Here is the directory of output files:

  http://marimba.d.umn.edu/SC-htdocs/user1123243701/

and here you can see the "bigrams of bigrams" features...

  http://marimba.d.umn.edu/SC-htdocs/user1123243701/user.bigrams

Now we have a 2nd order representation where we have a bigram by bigram matrix, and the context is represented by an averaged set of bigram vectors that represent which bigrams each bigram in the context co-occurs with.

Is this a good idea? Well, that I don't know. But it's certainly a different representation than you might expect, and in certain sorts of text it might be a nice choice.

In any case, I find this to be very interesting, and strongly encourage you to contemplate the use of alternative tokenization schemes. I'm especially encouraged by the results with the 4 character tokens, which are quite good really.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
