Hello!

Sorry for bringing up this old subject, but I have been facing some difficulties improving the en_GB dictionary (at the moment for Mozilla).

I have posted in the Mozilla ML:
Hello!

It seems the en_GB .aff file has some bugs.

I have been testing it in the field.

For example, in the suffix "ing":
"modeling" suggests "modelling" (with two "l"s)

The "G" code is a mess in the original .aff file.

As you can see, it uses nonsense rules like: [^aeio][aeiou]b

Could someone suggest a fix?:
SFX G Y 24
SFX G e ing [^eioy]e
SFX G 0 ing [eoy]e
SFX G ie ying ie
SFX G 0 bing [^aeio][aeiou]b
SFX G 0 king [^aeio][aeiou]c
SFX G 0 ding [^aeio][aeiou]d
SFX G 0 fing [^aeio][aeiou]f
SFX G 0 ging [^aeio][aeiou]g
SFX G 0 king [^aeio][aeiou]k
SFX G 0 ling [^aeio][eiou]l
SFX G 0 ing [aeio][eiou]l
SFX G 0 ling [^aeo]al
SFX G 0 ing [aeo]al
SFX G 0 ming [^aeio][aeiou]m
SFX G 0 ning [^aeio][aeiou]n
SFX G 0 ping [^aeio][aeiou]p
SFX G 0 ring [^aeio][aeiou]r
SFX G 0 sing [^aeio][aeiou]s
SFX G 0 ting [^aeio][aeiou]t
SFX G 0 ving [^aeio][aeiou]v
SFX G 0 zing [^aeio][aeiou]z
SFX G 0 ing [aeio][aeiou][bcdfgkmnprstvz]
SFX G 0 ing [^aeiou][bcdfgklmnprstvz]
SFX G 0 ing [^ebcdfgklmnprstvz]

Someone replied to this rule:
[^aeio][aeiou]b
I'm not really familiar with the details of the .aff file, but I assume this is a regular-expression-like sequence, meaning that the letter before the "b" must be one of "aeiou", and the letter before THAT must be something OTHER than "aeio" (because this is a negated set, [^aeio]).

So the two character sets in this rule correspond to the TWO letters preceding the "b"; they're not conflicting requirements for the SAME letter.
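
To make that reading concrete, here is a minimal sketch in Python. It assumes Python's re module is a close enough stand-in for the condition syntax of the .aff file, and the helper name condition_matches is only illustrative, not part of Hunspell.

```python
import re


def condition_matches(word, condition):
    """Return True if the suffix condition matches the end of the word."""
    # Anchor the condition at the end of the word, as a suffix rule would.
    return re.search("(?:" + condition + ")$", word) is not None


# "grab": 'r' is outside [aeio] and 'a' is inside [aeiou], so the rule applies.
print(condition_matches("grab", "[^aeio][aeiou]b"))   # True
# "daub": 'a' is inside [aeio], so the negated set rejects the word.
print(condition_matches("daub", "[^aeio][aeiou]b"))   # False
# "model" matches the condition of the "ling" rule, hence "modelling".
print(condition_matches("model", "[^aeio][eiou]l"))   # True
```

Each bracketed set consumes exactly one letter, so the condition above looks at the last three letters of the word and nothing else.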


Can Ricardo or someone help?

Thanks!

Kind regards,
        >Marco A.G.Pinto
          -----------------------


On 13/08/2013 11:11, Ricardo Palomares Martínez wrote:
On 13/08/13 10:30, Marco A.G.Pinto wrote:
> Hello!
> Thanks for your suggestion Andrea,

Please, don't duplicate threads. People have lives and may need more
than an hour to reply to you. If in doubt whether your message has
reached the list, either configure your subscription to get your own
posts (and then don't use Gmail, as it eats them), or check the list
archive:

http://mail-archives.apache.org/mod_mbox/openoffice-l10n/201308.mbox/browser


> but I haven't been able to find in the archive any file with the
> 80 examples.

It is right there, inside the /tests folder of the .tar.gz Andrea
pointed you to. Each test seems to be made of several files which
share the same name but end with different extensions (this is the first
time I've seen them, too):

- aff: a sample affixes file for the test
- dic: a sample dic file for the test
- sug: a file with the list of expected suggestions for Hunspell to offer
- test: a Unix shell file to launch the test
- wrong: a list of misspelled words for which Hunspell, using the test
dictionary, should offer suggestions (which should match the list in the
.sug file). Depending on the test, this file may not exist
- good: a list of correctly written words that Hunspell, using the test
dictionary and related affixes file, shouldn't flag as incorrect.

Keep in mind that the tests are intended to test Hunspell itself, so
if anyone makes changes to the Hunspell source code and wants to be
sure his/her changes don't break Hunspell, he/she can run all tests
and check that the modified version still passes all the tests.

Put another way: if you're creating a tool able to parse dic and
aff files and derive all the words, it should pass all these tests.
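
As a rough illustration of that idea, here is a sketch in Python of how such a tool could be checked against a .good and a .wrong list. It assumes both files hold one word per line, and the names check_word_lists and derived are hypothetical; the word lists are inline stand-ins, not copies of the real test files.

```python
def check_word_lists(is_accepted, good_words, wrong_words):
    """Return the words on which the checker disagrees with the test lists."""
    # Words from the .good list that the tool fails to accept.
    rejected_good = [w for w in good_words if not is_accepted(w)]
    # Words from the .wrong list that the tool accepts by mistake.
    accepted_wrong = [w for w in wrong_words if is_accepted(w)]
    return rejected_good, accepted_wrong


# Hypothetical stand-in for the words a tool derived from a .dic/.aff pair.
derived = {"abdicate", "abdicates", "abdicating"}

# Stand-ins for the contents of a .good and a .wrong file.
good = "abdicate\nabdicates\n".split()
wrong = "abdicats\n".split()

print(check_word_lists(derived.__contains__, good, wrong))  # ([], [])
```

Two empty lists mean the tool agrees with both files for that test.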


> So, I have edited a word from the en_GB dictionary of Thunderbird
> and I am posting here the PFX/SFX codes in the hope that someone can
> explain how they work.
>
> Please see the image here which has basic coding and shows wrong
> results: http://i.imgur.com/JEIxOmv.png
>
> Here is the .AFF code for the word: *abdicate/DNGSn*
> Please notice that this word only has suffixes. Please give an
> example with prefixes too, if it isn't too much trouble.
> (...)

I've learned by example while editing the dictionaries, so I may be
wrong. Also, there are some keywords in the aff test files that I
don't recognize. I've taken the shortest suffix rule you've posted, so
you can get the idea:

*S:*
*SFX S Y 9*

SFX -> It is a suffix (PFX would mean a prefix)
S   -> It is the suffix identifier
Y   -> Y for Yes (cross product). It means that the rule can be
combined with affixes of the other kind: a suffix with prefixes, a
prefix with suffixes. If N, the rule can't be combined that way.
9   -> The number of lines related to this rule
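
If it helps, here is a minimal Python sketch of reading such a block and checking the announced count. It assumes fields are separated by whitespace, uses the nine S rules from your message as input, and parse_affix_block is just a name I made up.

```python
def parse_affix_block(lines):
    """Parse a header line plus its rule lines into a small dictionary."""
    kind, flag, cross, count = lines[0].split()
    rules = [line.split() for line in lines[1:] if line.strip()]
    # The last header field must agree with the number of rule lines.
    if len(rules) != int(count):
        raise ValueError("header announces %s rules, found %d" % (count, len(rules)))
    return {"kind": kind, "flag": flag, "cross_product": cross == "Y", "rules": rules}


block = """SFX S Y 9
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxz]
SFX S 0 es [cs]h
SFX S 0 s [^cs]h
SFX S 0 s [ae]u
SFX S 0 x [ae]u
SFX S 0 s [^ae]u
SFX S 0 s [^hsuxyz]
""".strip().splitlines()

parsed = parse_affix_block(block)
print(parsed["kind"], parsed["flag"], parsed["cross_product"], len(parsed["rules"]))
# SFX S True 9
```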


SFX S y ies [^aeiou]y

SFX -> It is a suffix (PFX would mean a prefix)
S   -> It is the suffix identifier
y   -> For a suffix, it is the letter(s) that must be removed from the
end of the word. If it were a prefix, the letter(s) would be removed
from the beginning of the word.
ies -> For a suffix, it is the letter(s) that must be added to the end
of the stripped word (once you've applied the previous field's stripping).
[^aeiou]y -> Condition in regexp notation. In this case, this rule
would only be applied to words ending in "y" whose next-to-last
letter is not a, e, i, o or u.

In your sample with "abdicate", the last letter is not a "y", so this
rule wouldn't apply.
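
Here is how I picture a single rule being applied, as a small Python sketch. It is my reading of the fields, not Hunspell's actual code; apply_suffix_rule is a made-up name and "pony" is just a sample word of mine.

```python
import re


def apply_suffix_rule(word, strip, add, condition):
    """Apply one SFX rule; return the derived word, or None if it does not apply."""
    # The condition is checked against the end of the unmodified word.
    if re.search("(?:" + condition + ")$", word) is None:
        return None
    if strip != "0":
        if not word.endswith(strip):
            return None
        word = word[:-len(strip)]
    # A "0" in the add field means that nothing is appended.
    return word + ("" if add == "0" else add)


print(apply_suffix_rule("pony", "y", "ies", "[^aeiou]y"))      # ponies
print(apply_suffix_rule("abdicate", "y", "ies", "[^aeiou]y"))  # None
```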

SFX S 0 s [aeiou]y
SFX -> It is a suffix (PFX would mean a prefix)
S   -> It is the suffix identifier
0   -> 0 means that no character should be stripped.
s   -> In this case, the letter "s" would be added to the word.
[aeiou]y -> Condition in regexp notation. In this case, this rule
would only be applied to words ending in "y" whose next-to-last
letter IS a, e, i, o or u.


SFX S 0 es [sxz]
This rule applies when the word ends in s, x or z, and would add "es"
to the main word, without removing any letter.


SFX S 0 es [cs]h
SFX S 0 s [^cs]h

Rules for words with this suffix ending in "h": "es" is added after
"ch" or "sh", and a plain "s" after any other letter followed by "h".


SFX S 0 s [ae]u
SFX S 0 x [ae]u
Words with this suffix ending in "u" whose next-to-last letter is
a or e can create derivative words by adding an "s" or an "x".


SFX S 0 s [^ae]u
Words with this suffix ending in "u" whose next-to-last letter is not
a or e can create derivative words by adding an "s" only.


SFX S 0 s [^hsuxyz]

If you've reached this far, you should have deduced this rule is the
one to be applied to the word "abdicate" (since the word does not end
in h, s, u, x, y or z). The rule would not strip any letter and add
"s" to create "abdicates".

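Putting the nine S rules together, a Python sketch of the whole lookup could look like this. It assumes Python's re module is an adequate stand-in for the condition syntax; derive is a hypothetical helper and the sample words other than "abdicate" are mine.

```python
import re

# (strip, add, condition) for each S rule, in the order discussed above.
S_RULES = [
    ("y", "ies", "[^aeiou]y"),
    ("0", "s", "[aeiou]y"),
    ("0", "es", "[sxz]"),
    ("0", "es", "[cs]h"),
    ("0", "s", "[^cs]h"),
    ("0", "s", "[ae]u"),
    ("0", "x", "[ae]u"),
    ("0", "s", "[^ae]u"),
    ("0", "s", "[^hsuxyz]"),
]


def derive(word, rules):
    """Return every form produced by the rules whose condition matches the word."""
    forms = []
    for strip, add, condition in rules:
        if re.search("(?:" + condition + ")$", word):
            # "0" means nothing is stripped before the suffix is added.
            stem = word if strip == "0" else word[:len(word) - len(strip)]
            forms.append(stem + add)
    return forms


for word in ("abdicate", "pony", "day", "box", "church", "bureau"):
    print(word, "->", derive(word, S_RULES))
```

With these rules, "abdicate" yields only "abdicates", while "bureau" matches both [ae]u rules and yields "bureaus" and "bureaux".
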
Prefixes work the same, but at the beginning of the words.
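
Since you asked for a prefix example, here is a Python sketch under the same assumptions as before. The two rules are illustrative only ("PFX A 0 re ." and a hypothetical "PFX I 0 im [bmp]"); I have not checked them against the en_GB file, and apply_prefix_rule is a made-up name.

```python
import re


def apply_prefix_rule(word, strip, add, condition):
    """Apply one PFX rule; return the derived word, or None if it does not apply."""
    # Prefix conditions are checked against the beginning of the word.
    if re.match(condition, word) is None:
        return None
    if strip != "0":
        if not word.startswith(strip):
            return None
        word = word[len(strip):]
    return add + word


# Illustrative rule "PFX A 0 re ." (the "." condition accepts any first letter).
print(apply_prefix_rule("write", "0", "re", "."))          # rewrite
# Hypothetical rule "PFX I 0 im [bmp]" (only before b, m or p).
print(apply_prefix_rule("possible", "0", "im", "[bmp]"))   # impossible
print(apply_prefix_rule("correct", "0", "im", "[bmp]"))    # None
```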

I guess you know by now that thesaurus and hyphenation each use
their own files, separate from the aff/dic files. I don't know anything
about hyphenation, and I also learned about the thesaurus by (broken)
examples.

Good luck with your software. To be honest, I was thinking of writing
a Java program for the aff/dic part, but I'm lazy and I thought the
first thing to do would be to port Hunspell to Java, which is a huge
task for me (I'd rather not use JNI).


