Marcin Miłkowski <[email protected]> wrote:

> Hello,
>
> On 2012-08-13 12:17, Mike Unwalla wrote:
>> Hello,
>>
>> Examples of characters that cause tokenization: space . ! "  { } [ /
>> Examples of characters that do not cause tokenization:  # $ % ^ _ + = * @ ~
>>
>> I looked on languagetool.wikidot.com and on http://www.languagetool.org/, 
>> but I did not find a list of the characters that cause tokenization. What 
>> characters cause tokenization?
>
> It depends on the language.
>
> This is how it's defined in the code (EnglishWordTokenizer):
>
> final StringTokenizer st = new StringTokenizer(text,
>          "\u0020\u00A0\u115f\u1160\u1680"
>          + "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007"
>          + "\u2008\u2009\u200A\u200B\u200c\u200d\u200e\u200f"
>          + "\u2028\u2029\u202a\u202b\u202c\u202d\u202e\u202f"
>          + "\u205F\u2060\u2061\u2062\u2063\u206A\u206b\u206c\u206d"
>          + "\u206E\u206F\u3000\u3164\ufeff\uffa0\ufff9\ufffa\ufffb"
>          + "—,.;()[]{}!?:\"'’‘„“”…\\/\t\n", true);
>
> So there are also Unicode characters. Actually, I don't think that _
> should break a word, and neither should @, $ or %. For other symbols,
> I'm not so sure.
>
> Regards,
> Marcin


I would prefer the star * to be treated as a word delimiter.
It is quite common to use *stars* to emphasize words in text files.
Some markup languages, such as reStructuredText and Markdown,
use it for that purpose.

Right now, checking the above paragraph with LT (language en-US)
reports "*stars*" as a spelling mistake.

I would also split words at least at the backtick ` and the pipe |.
I don't really see any disadvantages in splitting at those characters.
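
To illustrate the effect, here is a minimal sketch, not the actual
LanguageTool code. It reuses the StringTokenizer call quoted above, but
with a shortened, ASCII-only delimiter string to keep it readable. The
class name TokenizerSketch, the constants BASE_DELIMS and EXTENDED_DELIMS
and the helper tokenize() are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerSketch {

    // Shortened stand-in for the delimiter set quoted above (assumption:
    // only ASCII delimiters are kept, the Unicode ones are left out).
    static final String BASE_DELIMS = " ,.;()[]{}!?:\"'\\/\t\n";

    // Hypothetical extension: also split at star, backtick and pipe.
    static final String EXTENDED_DELIMS = BASE_DELIMS + "*`|";

    // Splits text the way the quoted snippet does: the third constructor
    // argument is true, so each delimiter is returned as its own token.
    static List<String> tokenize(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "use *stars* here";
        // With the base set, "*stars*" stays a single token.
        System.out.println(tokenize(text, BASE_DELIMS));
        // With the extended set, the stars become separate tokens.
        System.out.println(tokenize(text, EXTENDED_DELIMS));
    }
}
```

With the base set, "*stars*" reaches the speller as one token, which
would explain the reported mistake. With the extended set, the speller
would only see "stars".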

Dominique
