Re: [lingu-dev] Anyone familiar with the ICU?

Németh László Wed, 10 Jun 2009 09:05:24 -0700

2009/6/10 Thomas Lange - Sun Germany - ham02 - Hamburg <thomas.la...@sun.com>:
>
> Hello László,  :-)


Hello Thomas,

Glad to hear about the word breaking fixes. :) Unfortunately, I have
not had time to follow Issue 64400 yet. I have also found a
relevant bug in Hunspell 1.2.8. I fixed it in OpenOffice.org at
the last minute before the OOo 3.1 code freeze, but not yet for the
OpenOffice.org distributions that use an external Hunspell 1.2.8. The
bug is quite specific: words with these dashes at the ends cause a
segmentation fault when the thesaurus is used (the improved thesaurus
uses Hunspell for stemming), but I didn't want to force a fix for this
issue before the Hunspell 1.2.9 release.

>> Hi,
>>
>> See extended ALetter definitions of the Hungarian word breaking rules:
>>
>> http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/dict_word_hu.txt
>> http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/edit_word_hu.txt
>>
>> By the way, it also contains numbers and other special signs, because
>> Hungarian uses their affixed forms. (For example, "with 25%" is
>> "25%-kal" in Hungarian, and not the frequent bad form "25%-al"):
>>
>> $ALetter   = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
>>                 [:name = PERCENT SIGN:] [:name = PER MILLE SIGN:] [:name = PER TEN THOUSAND SIGN:]
>>                 [:name = SECTION SIGN:] [:name = DEGREE SIGN:] [:name = EURO SIGN:]
>>                 [:name = HYPHEN-MINUS:] [:name = EN DASH:] [:name = EM DASH:]
>>                 [:name = DIGIT ZERO:]
>>                 [:name = DIGIT ONE:]
>>                 [:name = DIGIT TWO:]
>>                 [:name = DIGIT THREE:]
>>                 [:name = DIGIT FOUR:]
>>                 [:name = DIGIT FIVE:]
>>                 [:name = DIGIT SIX:]
>>                 [:name = DIGIT SEVEN:]
>>                 [:name = DIGIT EIGHT:]
>>                 [:name = DIGIT NINE:]
>>                            - $Ideographic
>>                            - $Katakana
>>                            - $Hangul
>>                            - [:Script = Thai:]
>>                            - [:Script = Lao:]
>>                            - [:Script = Hiragana:]];
>>
>
> I tried something similar. I did the following changes:
>
> $ALetter   = [\u0002 [:name = HYPHEN-MINUS:] [:name = EN DASH:]
> [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
>                           - $Ideographic
>                           - $Katakana
>                           - $Hangul
>                           - [:Script = Thai:]
>                           - [:Script = Lao:]
>                           - [:Script = Hiragana:]];
> ...
> $SufixLetter = [:name= FULL STOP:] [:name = HYPHEN-MINUS:] [:name = EN DASH:];
>
> Basically it worked, but an unwanted side effect was that multiple
> dashes got accepted at the start or end of the word. That is, "---water"
> and "river---" were each regarded as one word. Whereas if I use text like
> "...water" and "river...", only one of the dots is ever included
> with the word. Thus I am wondering if the same could be done for the
> dashes...
> Also, since I'm completely new to the ICU, I don't know if my attempt
> above has any unwanted side effects.
>
> Do you have any clues for me?

It seems that ICU uses a regex-like syntax, so a similar definition may help:

$attheend = [\u0002 [:name = HYPHEN-MINUS:] [:name = EN DASH:]];

And modify the first line of the $LetterSequence definition to allow
the optional dashes:

$LetterSequence = $attheend? $ALetterEx ($FormatEx* $MidLetterEx?
$FormatEx* $ALetterEx $attheend?)*;
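
To illustrate the intended effect, here is a rough regular-expression
analogue of that rule in Python. It is only a sketch and not ICU itself:
DASH, LETTER and MID are simplified stand-ins I chose for $attheend,
$ALetterEx and $MidLetterEx, the \u0002 member and $FormatEx are left out,
and real ICU rule chaining may behave differently.

```python
import re

# Simplified stand-ins (assumptions, not the real ICU sets):
DASH = r"[-\u2013]"    # $attheend: HYPHEN-MINUS or EN DASH
LETTER = r"[^\W\d_]"   # $ALetterEx: any Unicode letter
MID = r"'"             # $MidLetterEx: apostrophe only

# $LetterSequence = $attheend? $ALetterEx ($MidLetterEx? $ALetterEx $attheend?)*;
LETTER_SEQUENCE = re.compile(
    rf"{DASH}?{LETTER}(?:{MID}?{LETTER}{DASH}?)*"
)


def words(text):
    """Return the substrings that the sketched rule accepts as words."""
    return LETTER_SEQUENCE.findall(text)


print(words("---water"))    # only the dash next to the word is kept
print(words("river---"))
print(words("well-known"))  # a single inner dash stays inside the word
```

In this analogue at most one dash is attached at either end. Note that
a one-letter word does not pick up a trailing dash, because the trailing
optional dash sits inside the repeated group.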

Regards,
László

>
> Regards,
> Thomas
>
>
>

