Hello László,  :-)
> Hi,
>
> See extended ALetter definitions of the Hungarian word breaking rules:
>
> http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/dict_word_hu.txt
> http://svn.services.openoffice.org/ooo/branches/OOO310/i18npool/source/breakiterator/data/edit_word_hu.txt
>
> By the way, it also contains numbers and other special signs, because
> Hungarian uses their affixed forms. (For example, "with 25%" is
> "25%-kal" in Hungarian, and not the frequent bad form "25%-al"):
>
> $ALetter   = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:]
>                 [:name= HEBREW PUNCTUATION GERESH:]
>                 [:name = PERCENT SIGN:] [:name = PER MILLE SIGN:]
>                 [:name = PER TEN THOUSAND SIGN:]
>                 [:name = SECTION SIGN:] [:name = DEGREE SIGN:]
>                 [:name = EURO SIGN:]
>                 [:name = HYPHEN-MINUS:] [:name = EN DASH:] [:name = EM DASH:]
>                 [:name = DIGIT ZERO:]
>                 [:name = DIGIT ONE:]
>                 [:name = DIGIT TWO:]
>                 [:name = DIGIT THREE:]
>                 [:name = DIGIT FOUR:]
>                 [:name = DIGIT FIVE:]
>                 [:name = DIGIT SIX:]
>                 [:name = DIGIT SEVEN:]
>                 [:name = DIGIT EIGHT:]
>                 [:name = DIGIT NINE:]
>                            - $Ideographic
>                            - $Katakana
>                            - $Hangul
>                            - [:Script = Thai:]
>                            - [:Script = Lao:]
>                            - [:Script = Hiragana:]];
>   

I tried something similar and made the following changes:

$ALetter   = [\u0002 [:name = HYPHEN-MINUS:] [:name = EN DASH:]
[:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
                           - $Ideographic
                           - $Katakana
                           - $Hangul
                           - [:Script = Thai:]
                           - [:Script = Lao:]
                           - [:Script = Hiragana:]];
...
$SufixLetter = [:name= FULL STOP:] [:name = HYPHEN-MINUS:] [:name = EN DASH:];

This basically worked, but it had an unwanted side effect: any number of
dashes is now accepted at the start or end of a word, so "---water" and
"river---" are each treated as a single word. By contrast, with text like
"...water" and "river...", only one of the dots is ever included with the
word. So I am wondering whether the dashes could be handled the same
way...
Also, since I'm completely new to ICU, I don't know whether my attempt
above has any other unwanted side effects.
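
To make the difference concrete, here is a rough Python sketch of the two
behaviours. It uses ordinary regular expressions rather than ICU break
rules, so it only imitates the effect. The function names and patterns
are made up for this illustration and are not part of ICU or the OOo rule
files.

```python
import re

# Hypothetical stand-ins for the ICU character sets; these are ordinary
# Python regular expressions, not ICU break rules.
DASH = r"\-\u2013"  # HYPHEN-MINUS and EN DASH

# Behaviour I get now: dashes count as letters, so any run of them
# sticks to the word.
DASH_AS_LETTER = re.compile(rf"[\w{DASH}]+")

# Behaviour I would like: at most one dash before and one after the word,
# with single dashes still allowed between letters.
SINGLE_AFFIX = re.compile(rf"[{DASH}]?\w+(?:[{DASH}]\w+)*[{DASH}]?")


def words_dash_as_letter(text):
    """Return the words found when dashes are treated like letters."""
    return DASH_AS_LETTER.findall(text)


def words_single_affix(text):
    """Return the words found when only one leading/trailing dash is kept."""
    return SINGLE_AFFIX.findall(text)


for sample in ("---water", "river---", "well-known"):
    print(sample, words_dash_as_letter(sample), words_single_affix(sample))
```

With the first pattern "---water" comes back whole, while the second
keeps only "-water", which is the behaviour I already see for the dots.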

Do you have any clues for me?

Regards,
Thomas

