OK, I have quite a lot more (useful) information about this.
First of all, I created a branch that prints a lot of debug information
(probabilities, etc.) that is only useful for this specific
investigation. It would be worth doing this properly, though, and keeping
some of that information in the existing debug mode.
https://github.com/apertium/apertium/tree/logging-hmm
Now, the data.
On a machine running Debian stretch, where it works properly:
echo '^cumbre/cumbre<n><f><sg>$ ^en/en<pr>$
^Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$' |
apertium/apertium-tagger -gdmf /src/apertium-spa-cat/spa-cat.prob
WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 0.0323806
^cumbre/cumbre<n><f><sg>$END: Word: {NOMF} Word: cumbre
WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 0.23551
^en/en<pr>$END: Word: {PREP} Word: en
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob: 0.000393184
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob: 0.000966503
END: Word: {ANTROPONIM,TOPONIM} Word: Madrid
WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 6.08016e-05
WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 0.000192669
^=Madrid/Madrid<np><loc>/Madrid<np><ant>$^./.<sent>$END: Word:
{TAG_SENT} Word:
.
WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 0.00133422
In this case, tag 43 is PREP, tag 36 is ANTROPONIM (np.ant), tag 37 is
TOPONIM (np.loc), and tag 4 is SENT. The two Madrid lines (TAGSET: 36,43
and TAGSET: 37,43) show that the probability of prep + np.loc is about
2.5x the probability of prep + np.ant. Similarly, np.loc + sent is about
3x the probability of np.ant + sent (TAGSET: 4,37 vs. TAGSET: 4,36).
Overall, this makes apertium-tagger's choice an easy one: np.loc over
np.ant.
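As a quick sanity check on those ratios, here is a minimal Python sketch.
The probabilities are copied from the Debian output above; the dictionary
keys and the `ratio` helper are just names I made up for illustration and
are not part of apertium-tagger.

```python
# Probabilities copied from the Debian stretch debug output above.
# Keys are (current tag, previous tag) labels chosen for readability only.
debian = {
    ("np.ant", "prep"): 0.000393184,   # TAGSET: 36,43
    ("np.loc", "prep"): 0.000966503,   # TAGSET: 37,43
    ("sent", "np.ant"): 6.08016e-05,   # TAGSET: 4,36
    ("sent", "np.loc"): 0.000192669,   # TAGSET: 4,37
}


def ratio(numerator, denominator):
    """Return how many times larger one logged probability is than another."""
    return debian[numerator] / debian[denominator]


prep_ratio = ratio(("np.loc", "prep"), ("np.ant", "prep"))
sent_ratio = ratio(("sent", "np.loc"), ("sent", "np.ant"))

print(f"prep + np.loc vs. prep + np.ant: {prep_ratio:.2f}x")
print(f"np.loc + sent vs. np.ant + sent: {sent_ratio:.2f}x")
```

Both ratios favour np.loc, which matches the tagger's output on this
machine.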
Now, the same test on a machine running Ubuntu 18.04 (bionic). Just to
make sure: both machines are running the latest lttoolbox (from the
nightly package) and the latest apertium-tagger (built from source), with
the same probability file.
$ echo '^cumbre/cumbre<n><f><sg>$ ^en/en<pr>$
^Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$' |
apertium/apertium-tagger -gdmf ~/src/apertium/apertium-spa-cat/spa-cat.prob
WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 4.53636e-12
^cumbre/cumbre<n><f><sg>$END: Word: {NOMF} Word: cumbre
WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 2.56179e-11
^en/en<pr>$END: Word: {PREP} Word: en
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob: 0.000561191
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob: 5.13191e-12
END: Word: {ANTROPONIM,TOPONIM} Word: Madrid
WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 8.67821e-15
WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 1.02303e-22
^=Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$END: Word:
{TAG_SENT} Word:
.
WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 1.33422e-13
In this case, the rows that make the tagger prefer np.ant over np.loc are
the Madrid and sentence-end lines (TAGSET: 36,43 vs. 37,43, and TAGSET:
4,36 vs. 4,37). The np.ant probabilities are many orders of magnitude
higher, so the decision is also clear.
But it is very weird that the probabilities are different with the same
input and the same .prob file. And not only for this word: we can see the
same thing for every single probability computed by the tagger. *All of
them are different.*
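To compare the two runs mechanically rather than by eye, something like
the following Python sketch could be used. It assumes the debug lines keep
the `TAGSET: a,b - Prob: x` shape printed by my logging-hmm branch; the
function names and the tolerance are my own choices for illustration. The
two logs embedded here are just the WORD lines from the outputs above.

```python
import math
import re

# Matches debug lines such as:
#   WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 0.23551
LINE_RE = re.compile(r"TAGSET: (\d+),(\d+) - Prob: (\S+)")


def parse_probs(log_text):
    """Map each tag pair, as printed after TAGSET:, to its probability."""
    probs = {}
    for match in LINE_RE.finditer(log_text):
        pair = (int(match.group(1)), int(match.group(2)))
        probs[pair] = float(match.group(3))
    return probs


def differing_pairs(a, b, rel_tol=1e-6):
    """Return the tag pairs present in both logs whose probabilities differ."""
    return [
        pair
        for pair in sorted(a.keys() & b.keys())
        if not math.isclose(a[pair], b[pair], rel_tol=rel_tol)
    ]


DEBIAN_LOG = """
WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 0.0323806
WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 0.23551
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob: 0.000393184
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob: 0.000966503
WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 6.08016e-05
WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 0.000192669
WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 0.00133422
"""

UBUNTU_LOG = """
WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 4.53636e-12
WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 2.56179e-11
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob: 0.000561191
WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob: 5.13191e-12
WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 8.67821e-15
WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 1.02303e-22
WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 1.33422e-13
"""

debian_probs = parse_probs(DEBIAN_LOG)
ubuntu_probs = parse_probs(UBUNTU_LOG)
mismatches = differing_pairs(debian_probs, ubuntu_probs)

for pair in mismatches:
    print(pair, debian_probs[pair], ubuntu_probs[pair])
print(f"{len(mismatches)} of {len(debian_probs)} probabilities differ")
```

In practice the two logs would be read from files captured on each
machine instead of being pasted into the script.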
I'm not sure how we should proceed with this, but IMHO it is quite
concerning to have this type of instability (can we call it a "bug"? 😊)
in the core of Apertium's pipeline.
--
< Xavi Ivars >
< http://xavi.ivars.me >