Hello,

On 2012-08-13 12:17, Mike Unwalla wrote:
> Hello,
>
> Examples of characters that cause tokenization: space . ! "  { } [ /
> Examples of characters that do not cause tokenization:  # $ % ^ _ + = * @ ~
>
> I looked on languagetool.wikidot.com and on http://www.languagetool.org/, but 
> I did not find a list of the characters that cause tokenization. What 
> characters cause tokenization?

It depends on the language.

This is how it's defined in the code (EnglishWordTokenizer):

final StringTokenizer st = new StringTokenizer(text,
         "\u0020\u00A0\u115f\u1160\u1680"
         + "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007"
         + "\u2008\u2009\u200A\u200B\u200c\u200d\u200e\u200f"
         + "\u2028\u2029\u202a\u202b\u202c\u202d\u202e\u202f"
         + "\u205F\u2060\u2061\u2062\u2063\u206A\u206b\u206c\u206d"
         + "\u206E\u206F\u3000\u3164\ufeff\uffa0\ufff9\ufffa\ufffb"
         + "—,.;()[]{}!?:\"'’‘„“”…\\/\t\n", true);

So the list also includes Unicode whitespace and formatting characters,
not only punctuation. Actually, I don't think that _ should break a
word, and neither should @, $ or %. For other symbols, I'm not so sure.
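
To see the effect, here is a minimal sketch that is not the actual LanguageTool code. The class name TokenizeDemo, the method tokenize and the shortened DELIMS string are made up for illustration; DELIMS is only a subset of the characters quoted above. Because the third argument of StringTokenizer is true, each delimiter comes back as its own token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Hypothetical, shortened delimiter set: a subset of the list quoted above.
    static final String DELIMS = "\u0020\u00A0,.;()[]{}!?:\"'\\/\t\n";

    static List<String> tokenize(String text) {
        // returnDelims = true: every delimiter is returned as a one-character token.
        StringTokenizer st = new StringTokenizer(text, DELIMS, true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One token per line, in brackets so that whitespace tokens are visible.
        for (String token : tokenize("user_name@example costs $5 (50%)!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With this subset, user_name@example, $5 and 50% each stay in one piece, because _, @, $ and % are not in the delimiter string, while the spaces, the parentheses and the exclamation mark are returned as separate tokens.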

Regards,
Marcin

>
> Regards,
>
> Mike Unwalla
> Contact: www.techscribe.co.uk/techw/contact.htm
>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
>

