Marcin Miłkowski <[email protected]> wrote:

> Hello,
>
> On 2012-08-13 12:17, Mike Unwalla wrote:
>> Hello,
>>
>> Examples of characters that cause tokenization: space . ! " { } [ /
>> Examples of characters that do not cause tokenization: # $ % ^ _ + = * @ ~
>>
>> I looked on languagetool.wikidot.com and on http://www.languagetool.org/,
>> but I did not find a list of the characters that cause tokenization. What
>> characters cause tokenization?
>
> It depends on the language.
>
> This is how it's defined in the code (EnglishWordTokenizer):
>
>   final StringTokenizer st = new StringTokenizer(text,
>       "\u0020\u00A0\u115f\u1160\u1680"
>       + "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007"
>       + "\u2008\u2009\u200A\u200B\u200c\u200d\u200e\u200f"
>       + "\u2028\u2029\u202a\u202b\u202c\u202d\u202e\u202f"
>       + "\u205F\u2060\u2061\u2062\u2063\u206A\u206b\u206c\u206d"
>       + "\u206E\u206F\u3000\u3164\ufeff\uffa0\ufff9\ufffa\ufffb"
>       + "—,.;()[]{}!?:\"'’‘„“”…\\/\t\n", true);
>
> So there are also Unicode characters. Actually, I don't think that _
> should break a word, and neither should @, $ or %. For other symbols,
> I'm not so sure.
>
> Regards,
> Marcin
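
For anyone who wants to try the quoted delimiter set outside LanguageTool, here is a minimal sketch. It is not the actual EnglishWordTokenizer: the class name TokenizerDemo, the abbreviated BASE_DELIMS string (only the ASCII part of the quoted set plus the no-break space) and the extraDelims parameter are all assumptions made for illustration. It only relies on java.util.StringTokenizer with returnDelims set to true, as in the quoted code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {

    // Abbreviated, illustrative delimiter set: space, no-break space and the
    // ASCII punctuation from the quoted string. The real string is longer.
    static final String BASE_DELIMS = "\u0020\u00A0,.;()[]{}!?:\"'\\/\t\n";

    // extraDelims is a hypothetical knob for experimenting with additional
    // delimiter characters; it is not part of LanguageTool.
    static List<String> tokenize(String text, String extraDelims) {
        // returnDelims = true: every delimiter comes back as its own token.
        StringTokenizer st =
                new StringTokenizer(text, BASE_DELIMS + extraDelims, true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With the base set, "*stars*" stays a single token.
        System.out.println(tokenize("use *stars* here", ""));
        // With * ` | added, the stars become separate one-character tokens.
        System.out.println(tokenize("use *stars* here", "*`|"));
    }
}
```

With the base set the sample text yields five tokens ("use", a space, "*stars*", a space, "here"). Passing "*`|" as extraDelims yields seven, with each star as its own token.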

I would prefer that the star * be treated as a word delimiter.
It is quite common to use *stars* to emphasize words in text files,
and some markup languages, such as reStructuredText and Markdown,
use this convention. Right now, checking the above paragraph with LT
(using language en-US) reports "*stars*" as a spelling mistake.

I would also split words at the backtick ` and the pipe |, at least.
I don't really see any disadvantage in splitting at those characters.

Dominique

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
