[Senseclusters-users] tokenization of languages other than English

ted pedersen Sat, 06 Aug 2005 06:18:44 -0700

A question that comes up from time to time is whether SenseClusters
can process languages other than English. In fact, it is very easy
to change the kinds of text SenseClusters can handle - you simply
need to change the tokenization file.


The default tokenization file uses \w+ as the main definition of a
token. \w matches the characters 0-9, A-Z, a-z and _, and the +
means one or more of them in a row. It does not match accented
characters and the like. So, if you are working with a language
that uses, for example, extended ASCII characters, you can modify
the token file to look more like this:

/<head[^<]*>\s*[\w\x80-\xff]+\s*<\/head>/
/[\w\x80-\xff]+/

This will recognize head tags and strings made up of the standard
\w class of characters plus those in the range of hex 80 to hex ff
in the extended ASCII table. You can see what those characters
consist of by looking at a table like this:

http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm
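
To see the effect of the wider character class, here is a rough
sketch in Python (not part of SenseClusters, which is written in
Perl). It assumes the text is in a single-byte encoding such as
Latin-1 and treats it as raw bytes, where \w covers only 0-9, A-Z,
a-z and _. The names DEFAULT_TOKEN, EXTENDED_TOKEN and tokenize are
made up for this illustration, and joining the two token definitions
with | only approximates trying them in order. The class is written
as [\w\x80-\xff] because a + inside the brackets would be taken as a
literal plus sign.

```python
import re

# Assumption: input is a single-byte encoding (e.g. Latin-1), handled as bytes.
# In a bytes pattern, \w matches only 0-9, A-Z, a-z and _.
DEFAULT_TOKEN = re.compile(rb"\w+")

# Head-tagged target word first, then any run of \w or 0x80-0xff bytes.
EXTENDED_TOKEN = re.compile(
    rb"<head[^<]*>\s*[\w\x80-\xff]+\s*</head>|[\w\x80-\xff]+"
)


def tokenize(data: bytes, pattern: re.Pattern) -> list:
    """Return all non-overlapping matches of pattern, scanning left to right."""
    return pattern.findall(data)


# "café naïve" encoded as Latin-1: é is byte 0xe9, ï is byte 0xef.
plain = "caf\u00e9 na\u00efve".encode("latin-1")
print(tokenize(plain, DEFAULT_TOKEN))   # [b'caf', b'na', b've']
print(tokenize(plain, EXTENDED_TOKEN))  # [b'caf\xe9', b'na\xefve']

# A head-tagged target word stays together as one token.
tagged = "le <head>caf\u00e9</head> ferm\u00e9".encode("latin-1")
print(tokenize(tagged, EXTENDED_TOKEN))
```

With the default pattern the accented bytes act as token breaks, so
each accented word is split into pieces. With the extended pattern
it comes out whole.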

Now, it's probably the case that we should make the extended ASCII
token file the default for SenseClusters, so that will be coming in
a future release. But until then it's very easy to adjust things to
process your favorite extended ASCII text.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

