Re: [Languagetool] What characters cause tokenization?

R.J. Baars Wed, 15 Aug 2012 03:40:12 -0700

Maybe you would need spacebefore="no" by default, to check that the
"non-token" tokens (the pieces of a single term) are in fact concatenated.
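
A rough sketch of the idea in Python. This is not LanguageTool code: the
regular expression is only an assumed stand-in for the real tokenizer, and
the function names are hypothetical. It splits a term into elements and adds
spacebefore="no" to every element that had no space in front of it:

```python
import re
from xml.sax.saxutils import escape

# Assumption: a crude approximation of the tokenizer. A run of word
# characters is one token; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def term_to_tokens(term):
    """Return (text, glued) pairs; glued is True when the token
    directly follows the previous one with no space in between."""
    pairs = []
    prev_end = None
    for match in TOKEN_RE.finditer(term):
        glued = prev_end is not None and match.start() == prev_end
        pairs.append((match.group(), glued))
        prev_end = match.end()
    return pairs


def term_to_pattern(term):
    """Render one <token> element per token, one per line."""
    lines = []
    for text, glued in term_to_tokens(term):
        attr = ' spacebefore="no"' if glued else ""
        lines.append("<token%s>%s</token>" % (attr, escape(text)))
    return "\n".join(lines)


print(term_to_pattern("Yahoo! Babel Fish"))
```

For "Yahoo! Babel Fish" only the "!" element gets spacebefore="no", so a
pattern built this way should not match "Yahoo ! Babel Fish". The real
token boundaries should come from LT itself, as Marcin suggests below.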


Ruud

> Hi Marcin,
>
> Thanks for your comments.
>
>         I do not see a reason to include spaces at all. You simply write:
>         <token>Yahoo</token>
>         <token>!</token>
>         <token>Babel</token>
>         <token>Fish</token>
>
>         what's the problem with this?
>
> Yes. That is what I do.
>
> Many of the terms that I am interested in contain non-alphabetic
> characters. Creating the rules was tiresome. Now that I have a list of
> delimiters, rule creation will be much simpler for me.
>
> I was not thinking clearly.  Thanks for helping me to clarify my thinking!
>
> Regards,
>
> Mike
>
>
> -----Original Message-----
> From: Marcin Miłkowski [mailto:[email protected]]
> Sent: 14 August 2012 13:33
> To: [email protected]
> Subject: Re: [Languagetool] What characters cause tokenization?
>
> Hi Mike,
>
> On 2012-08-14 10:30, Mike Unwalla wrote:
>> For my projects, the only delimiter that I want is space. All other
>> delimiters can cause problems. A term can contain any character,
>> including space. (I know that at least one character must be a
>> delimiter.)
>>
> <snip>
>> Wish list: give users the option to specify which characters are
>> delimiters.
>
> As Dominique already mentioned, this is not a good solution. But you're
> trying to solve a different problem with it. Namely, all you need to
> have is a way to process multiword expressions into strings of elements
> for our rules. Your dictionaries, if they are static (and I believe they
> don't change during the checking of a single document) can be converted
> into a list of individual elements to be matched by a rule in LT. This
> can be achieved in numerous ways, the easiest of which is to:
>
> (1) tokenize the terms using LT;
> (2) generate the rules for LT based on the tokenized elements.
>
> You could also have a special Java rule that reads the dictionary and
> builds a simple text-matching rule based on it.
>
> In general, I don't think that we should split anything like
> %filechooser%, as this is a variable in a text, not a word, and all we
> could do is to immunize it from spell-checking. But it should not be
> split (what for? filechooser is not an English word anyway, the same
> with paths such as 'project_directory' - they do not need to be spelled
> correctly).
>
> I do not see a reason to include spaces at all. You simply write:
>
> <token>Yahoo</token>
> <token>!</token>
> <token>Babel</token>
> <token>Fish</token>
>
> what's the problem with this?
>
> Regards,
> Marcin
>
> <snip>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


