Hi Anagha,

Thanks for your THRC.xml file and the token.regex file. The problem you reported had to do with features discovered by SenseClusters not being found in the text you were seeking to cluster, largely due to differences in tokenization. Among your features you might have a bigram like
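To make the mismatch concrete, here is a small Python sketch (not part of SenseClusters, purely illustrative; the regex below only stands in for whatever your token.regex actually defines) of why a bigram feature can fail to match when the test text is tokenized differently:

```python
import re

text = 'I saw "Beautiful Mind" yesterday.'

# Naive whitespace tokenization, as if no tokenization were applied:
raw_tokens = text.split()

# Tokenization in the spirit of a token file that separates punctuation
# (your actual token.regex may differ -- this is only an illustration):
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

def has_bigram(tokens, w1, w2):
    """True if w1 and w2 occur as adjacent tokens."""
    return any(a == w1 and b == w2 for a, b in zip(tokens, tokens[1:]))

print(has_bigram(raw_tokens, "Beautiful", "Mind"))    # False: '"Beautiful' != 'Beautiful'
print(has_bigram(regex_tokens, "Beautiful", "Mind"))  # True: quotes become separate tokens
```

The bigram is present in both versions of the text, but in the untokenized version the quote character is glued to the word, so an exact token match fails.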
Beautiful Mind, but your test data might have adjacent punctuation, such as "Beautiful Mind" (with the quotes), and as a result the feature would not be matched.

So, before I get into a very long explanation, here is the solution, I think: you need to run preprocess.pl on your input text, to get it tokenized properly, prior to using discriminate.pl or the web interface.

The feature identification programs that come from NSP (count and statistic) do tokenization with your given --token file. order1vec and order2vec (which do feature matching) do not; order1vec takes the input file as given and does the feature matching. So, in effect, you have NSP providing you with features that have been identified using your token.regex tokenization scheme, while order1vec is matching features in text that has been tokenized differently (or not at all)!

So, if you run

  preprocess.pl --token token.regex THRC.xml

you will get two files, named

  T_R.xml
  T_R.count

They will be tokenized identically; the .xml file will be in sval2 format, and the .count file will be in plain text format. You could run NSP on the .count file to get features, and then match those features in T_R.xml, and I think all will be well. Or, if you are using discriminate.pl or the web interface, you should use T_R.xml as your input and essentially discard T_R.count. A count-style file will be created for you again, simply by stripping away the xml tags.

I went back into discriminate.pl and verified that preprocess.pl is not called. I also verified that order1vec.pl does not accept a --token file as an option, and therefore must not be doing tokenization. So I am nearly certain of my explanation and solution, but you should double check, of course.
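As a rough sketch of what "stripping away the xml tags" amounts to (the real SenseClusters code is Perl and handles the sval2 format in more detail; the line of data below is made up for illustration), in Python:

```python
import re

# A made-up fragment in the spirit of sval2-style markup (illustrative only):
sval2_line = '<context> I saw <head>line</head> again </context>'

# Removing anything that looks like an xml tag leaves just the plain tokens,
# which is roughly how a count-style file is recovered from the .xml file.
plain = re.sub(r"<[^>]+>", "", sval2_line)
plain = " ".join(plain.split())  # normalize the leftover whitespace
print(plain)  # -> 'I saw line again'
```

Because the tags are simply removed and the tokens are untouched, the .xml file and the derived count-style text remain tokenized identically, which is exactly what feature matching needs.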
As a matter of procedure, then, I would suggest the following: when using discriminate.pl or the web interface, think of the --token file as representing how your test and training data have **already been** tokenized, and if they have not been tokenized in that fashion, then you should use preprocess.pl beforehand to get your data into the right format. In general, I would strongly recommend running preprocess.pl on any data that is going to be processed by SenseClusters, making sure to provide whatever tokenization file you use with preprocess.pl to SenseClusters as well.

There is of course a question here: why doesn't discriminate.pl or the web interface call preprocess.pl? The reason is that there are so many options with preprocess.pl that it just didn't seem feasible, and I think we wanted users to take responsibility for their tokenization choices, as they have a profound impact on later processing. We should probably make this clearer in the documentation, hence this rather lengthy note. :)

I hope this all makes sense, and that it actually fixes your problem. Let me know in either case!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
