Re: [Languagetool] What characters cause tokenization?

Marcin Miłkowski Tue, 14 Aug 2012 05:33:45 -0700

Hi Mike,

On 2012-08-14 10:30, Mike Unwalla wrote:
> For my projects, the only delimiter that I want is space. All other 
> delimiters can cause problems. A term can contain any character, including 
> space. (I know that at least one character must be a delimiter.)
>
> The following one-word terms are in my dictionaries. All the terms contain 
> full-stops or other symbols:
> .NET
> ASP.NET
> add-on
> Model-View-Controller
> project_directory/Client_routines.pm
> Bottom_hole_pressure
> %filechooser%
> Diagnostics>Qualities
> InfoPlus+
> Yahoo!
>
> Here are some multi-word terms that contain non-alphabetic characters:
> Yahoo! Babel Fish
> ISO/IEC 26514:2008
> <meta> tag
>
> Email addresses and website addresses contain these characters (and others):
> . @ / #
>
> Product names, company names, variables, values... they can contain almost 
> any characters.
>
> Wish list: give users the option to specify which characters are delimiters.


As Dominique already mentioned, this is not a good solution. You are 
really trying to solve a different problem with it: all you need is a 
way to turn multiword expressions into sequences of elements for our 
rules. Your dictionaries, if they are static (and I believe they don't 
change during the checking of a single document), can be converted into 
lists of individual elements to be matched by a rule in LT. This can be 
achieved in numerous ways, the easiest of which is to:

(1) tokenize the terms using LT;
(2) generate the rules for LT based on the tokenized elements.
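
A rough sketch of these two steps, in Python purely for illustration. 
Note the assumptions: simple_tokenize is a hypothetical stand-in that 
keeps runs of word characters together and isolates every other 
non-space character; it is not LT's real tokenizer, which may split 
differently. term_to_tokens is likewise an invented helper name. For real 
dictionaries, step (1) should be done with LT itself so that the 
generated elements match what LT actually sees.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical stand-in for step (1). Runs of word characters (letters,
# digits, underscore) stay together; any other non-space character becomes
# a token of its own. This is an assumption, not LT's actual behaviour.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(term):
    """Split a dictionary term into a list of token strings."""
    return _TOKEN_RE.findall(term)


def term_to_tokens(term):
    """Step (2): emit one <token> element per tokenized element."""
    # escape() turns &, < and > into XML entities so that terms such as
    # "<meta> tag" still produce well-formed XML.
    return "\n".join(
        "<token>" + escape(tok) + "</token>" for tok in simple_tokenize(term)
    )


if __name__ == "__main__":
    for term in ["Yahoo! Babel Fish", "ASP.NET", "<meta> tag"]:
        print(term_to_tokens(term))
        print()
```

The output still has to be wrapped in whatever rule markup your rule 
file uses; the sketch only produces the token elements.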

You could also have a special Java rule that reads the dictionary and 
builds a simple text-matching rule based on it.
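
The matching logic of such a dictionary-driven rule might look roughly 
like the sketch below. It is written in Python rather than Java and does 
not use LT's rule API at all; simple_tokenize and find_terms are 
hypothetical names, and the tokenizer is a simplified assumption, not 
LT's own. It only shows the idea: tokenize each term once, then look for 
the same token sequence in the tokenized text.

```python
import re

# Simplified stand-in tokenizer (an assumption, not LT's tokenizer):
# word-character runs stay together, other non-space characters are split off.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into a list of token strings."""
    return _TOKEN_RE.findall(text)


def find_terms(text, dictionary):
    """Return (token_index, term) for every dictionary term found in text."""
    tokens = simple_tokenize(text)
    # Tokenize each term once, up front; this works because the
    # dictionary is static while a document is being checked.
    term_tokens = [(term, simple_tokenize(term)) for term in dictionary]
    matches = []
    for i in range(len(tokens)):
        for term, seq in term_tokens:
            # Compare the token window starting at i with the term's tokens.
            if seq and tokens[i:i + len(seq)] == seq:
                matches.append((i, term))
    return matches


if __name__ == "__main__":
    terms = ["Yahoo! Babel Fish", "ASP.NET"]
    print(find_terms("Try Yahoo! Babel Fish with ASP.NET today", terms))
```

Because the term and the text go through the same tokenizer, it does not 
matter which characters act as delimiters, as long as both sides are 
split the same way.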

In general, I don't think that we should split anything like 
%filechooser%, as this is a variable in a text, not a word, and all we 
could do is to immunize it from spell-checking. But it should not be 
split: there is no point, since filechooser is not an English word 
anyway. The same goes for paths such as 'project_directory': they do 
not need to be spelled correctly.

I do not see a reason to include spaces at all. You simply write:

<token>Yahoo</token>
<token>!</token>
<token>Babel</token>
<token>Fish</token>

What's the problem with this?

Regards,
Marcin

>
> Regards,
>
> Mike Unwalla
> Contact: www.techscribe.co.uk/techw/contact.htm
>
> -----Original Message-----
> From: Dominique Pellé [mailto:[email protected]]
> <snip>
>
> I would prefer if the star * was considered as a word delimiter.
> It's quite frequent to use *stars* to emphasize words in text files.
> Some markup languages use it, such as reStructuredText or
> Markdown.
>
> Right now, checking the above paragraph with LT says that
> "*stars*" is a spelling mistake (using language en-US).
>
> I would also split words with at least the backticks ` and pipe |.
> I don't really see any disadvantage in splitting at those characters.
>
> <snip>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


