On 11 July 2014 22:29, Daniel Naber <daniel.na...@languagetool.org> wrote:

>
> Then, we will need to add a sentence tokenizer that detects sentence
> boundaries. This is described at
> http://wiki.languagetool.org/customizing-sentence-segmentation-in-srx-rules.
> Can you work with that?
>

I've created the rules below, but I'm not really sure I'm on the right track.
I'm also not sure whether "\b" works for Unicode characters: as I understand
it, \b is defined in terms of \w, which only recognizes ASCII.

<languagerules>
  <languagerule languagerulename="Tamil">
    <rule break="no">
      <beforebreak>\b(ஜன|பிப்|மார்|ஏப்|ஆக|செப்|அக்|நவ|டிச)\.\s</beforebreak>
      <afterbreak></afterbreak>
    </rule>
    <rule break="no">
      <beforebreak>\b(ரூ|ரி\.ம|பக்)\.\s</beforebreak>
      <afterbreak>\p{N}</afterbreak>
    </rule>
    <rule break="no">
      <beforebreak>\b(ஐ\.நா|தி\.மு\.க|அ\.இ\.அ\.தி\.மு\.க|அ\.தி\.மு\.க|ம\.தி\.மு\.க|ம\.இ\.கா|இ\.ஆ\.ப|ஐ\.ஏ\.எஸ்|எம்\.பி|எம்\.எல்\.ஏ|எம்\.ஜி\.ஆர்|டி\.எம்\.எஸ்)\.\s</beforebreak>
      <afterbreak></afterbreak>
    </rule>
    <rule break="no">
      <beforebreak>\b(கி\.பி|கி\.மு)\.\s</beforebreak>
      <afterbreak>\p{N}</afterbreak>
    </rule>
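
One way to sidestep the "\b" question raised above, sketched here under the
assumption that the SRX patterns are evaluated as Java regular expressions:
replace \b with an explicit negative lookbehind on the Unicode categories
\p{L} (letters) and \p{M} (combining marks, which include Tamil vowel signs and
the pulli). The class name TamilAbbrevCheck and the method suppressesBreak are
hypothetical, and only the first two month abbreviations (ஜன and பிப்) are
included, written as Unicode escapes. This is an illustration of the idea, not
a tested replacement for the rules.

```java
import java.util.regex.Pattern;

class TamilAbbrevCheck {
    // Hypothetical alternative to \b: the abbreviation must not be preceded
    // by a letter (\p{L}) or a combining mark (\p{M}).
    // Alternatives: U+0B9C U+0BA9 and U+0BAA U+0BBF U+0BAA U+0BCD, i.e. the
    // first two month abbreviations from the rule, kept as escapes so the
    // source file stays ASCII-only.
    static final Pattern MONTH_ABBREV = Pattern.compile(
        "(?<![\\p{L}\\p{M}])(\u0B9C\u0BA9|\u0BAA\u0BBF\u0BAA\u0BCD)\\.\\s");

    // Returns true if the text contains one of the abbreviations followed by
    // a period and whitespace, i.e. a place where a break="no" rule would apply.
    static boolean suppressesBreak(String text) {
        return MONTH_ABBREV.matcher(text).find();
    }

    public static void main(String[] args) {
        // Abbreviation at the start of the text.
        System.out.println(suppressesBreak("\u0B9C\u0BA9. 5"));
        // Same characters directly after another Tamil letter (U+0B85).
        System.out.println(suppressesBreak("\u0B85\u0B9C\u0BA9. 5"));
    }
}
```

The first call is expected to find a match and the second is not, because the
lookbehind rejects a preceding letter. In the SRX file the same idea would be
written as (?<![\p{L}\p{M}]) in place of \b inside <beforebreak>.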