On Thu, 26 Aug 2010 at 03:35 +0100, Jimmy O'Regan wrote:
> On 25 August 2010 21:00, Francis Tyers <[email protected]> wrote:
> > Hey all,
> >
> > Tihomir and myself are proud to announce the first release of
> > apertium-mk-bg (Macedonian and Bulgarian). This is the first Slavic
> > language pair in Apertium -- and hopefully the first of many!
> >
>
> Well, Slavic Lite - all the Slavic flavour, low in cholesterol and
> inflection. :D
>
> > You can try it online now.
> >
> > Here are some stats:
> >
> > ==Linguistic data==
> >
> > * Macedonian morphology: 4,010
> > * Macedonian morphology: 4,364
> > * Bilingual dictionary: 4,083
> >
>
> 4,083? The poor thing needs a good meal :)

Haha, well, actually the coverage is pretty good. Tihomir worked from a
frequency list from SETimes, so there may be only ~4,000 entries, but
they are high-frequency words in the news domain.

> > * Disambiguation rules Macedonian->Bulgarian: 9
> >
> > * Transfer Macedonian->Bulgarian: 19
> > * Transfer Bulgarian->Macedonian: 18
> >
> > ==Coverage==
> >
> > * Bulgarian Wikipedia: Total: 9834480, Known: 7391855 (75.16%)
> >
> > (other numbers pending, but on SETimes should be around 80%)
> >
>
> I'd be interested in how the numbers look on the Bulgarian portion of
> the JRC Acquis.

I'll calculate this and send the result to the list.

> > ==Accuracy==
> >
> > We have so far only tested the accuracy from Macedonian to Bulgarian.
> > Approximately 1,000 words were taken from the SETimes corpus, translated
> > using the system and then post-edited. Unknown words were allowed.
> >
> > -------------------------------------------------------
> > Edit distance: 292
> > Word error rate (WER): 26.67 %
> > Number of position-independent word errors: 278
> > Position-independent word error rate (PER): 25.39 %
> > -------------------------------------------------------
> >
>
> Seems a little on the high side, considering how closely related the
> languages are, but I'd imagine unknowns contribute greatly to that. Do
> you have numbers for sentences without unknowns?

Well, it is comparable with (i.e. sits somewhere in between)
Swedish--Danish and Norwegian Nynorsk--Bokmål, so it seems about right
to me.

There are still a lot of transfer rules that can be written, though.
For example, at the moment we don't deal with the difference in how the
future tense is formed. There is also more we could do with clitic
positioning.
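For anyone who wants to check the arithmetic behind the coverage and WER
figures quoted above, here is a minimal Python sketch. It is not the
evaluation script used for the release: the function names are invented
for this example, and it assumes that WER is the word-level edit distance
divided by the number of words in the post-edited reference.

```python
def coverage(known: int, total: int) -> float:
    """Percentage of corpus tokens that received an analysis."""
    return 100.0 * known / total


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # word missing from the hypothesis
                cur[j - 1] + 1,          # extra word in the hypothesis
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent, normalised by reference length (an assumption)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Coverage figure from the announcement: known / total tokens.
print(f"{coverage(7391855, 9834480):.2f}%")  # 75.16%
```

To use it on real data, pass the post-edited text as `reference` and the
raw system output as `hypothesis`.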
Fran
