Re: [Apertium-stuff] Ask for help on HMM unsupervised training

Gang Chen Thu, 06 Jun 2013 22:52:29 -0700

Hi Felipe, Fran,

I think there are two kinds of words that cause the "A new ambiguity class"
error.


(1) The first kind are words like "Mar", which are valid both with and
without a trailing dot:

I ran the command:
lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" | grep -v ":<:" |
awk 'BEGIN{FS=":>:|:"}{print $1 ".";}'  | apertium-destxt >es.dic.expanded

The file es.dic.expanded then contains lines like these:
abyectas.[
]abyecta.[
]abyectos.[
]abyecto.[
...

and there is a line:
]Mar.[
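
To make the effect of the grep/awk stage easier to see, here is a small
self-contained sketch that runs it on a hand-made sample. The sample lines
are assumptions that only imitate the shape of lt-expand output
(surface:lemma<tags>, with ":>:" and ":<:" marking direction-restricted
entries); they are not copied from the real dictionary.

```shell
# Hypothetical sample imitating lt-expand output (not the real dictionary).
sample='abyectas:abyecto<adj><f><pl>
Mar:Mar<n><mf><sg>
Mar:Mar<np><ant><f><sg>
__REGEXP__[0-9]+:__REGEXP__[0-9]+<num>
foo:<:bar<n><m><sg>
baz:>:qux<n><m><sg>'

# Same filters as in the command above: drop regex entries and ":<:" entries,
# keep the surface form (field 1) and append a dot to it.
surface_forms=$(printf '%s\n' "$sample" \
  | grep -v "__REGEXP__" \
  | grep -v ":<:" \
  | awk 'BEGIN{FS=":>:|:"}{print $1 ".";}')

printf '%s\n' "$surface_forms"
```

Every surviving surface form gets a "." appended, so "Mar" reaches lt-proc
as "Mar." (once per analysis, since nothing deduplicates the list).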

Next, I ran the command:
lt-proc -a es-en.automorf.bin <es.dic.expanded > es.dic.expanded.morph
(before using "apertium-filter-ambiguity")

The file es.dic.expanded.morph contains lines like these:
^abyectas/abyecto<adj><f><pl>$^./.<sent>$[
]^abyecta/abyecto<adj><f><sg>$^./.<sent>$[
]^abyectos/abyecto<adj><m><pl>$^./.<sent>$[
]^abyecto/abyecto<adj><m><sg>$^./.<sent>$[
... ...

However, for the line "]Mar.[", the output is:
]^Mar./Mar.<n><m><sg>$[

So it seems the morphological analyser has recognized "Mar." as a single
word instead of two words, "Mar" and ".". The morphological analysis of the
bare word "Mar" is special (it is ambiguous):
"^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$". No other word shares this
ambiguity class, so "Mar" is the only entry that could have contributed it
to the dictionary file, and it never did because "Mar." was analysed as a
single unit.
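
A quick way to look for other entries with the same problem is to list the
lines where the appended dot was not split off as its own "^./.<sent>$"
unit. The sketch below does this on the sample lines quoted above; on the
real data the same grep could be pointed at es.dic.expanded.morph instead
(that usage is my assumption, I have only tried it on this sample).

```shell
# Sample lines taken from the es.dic.expanded.morph excerpt above.
morph_sample='^abyectas/abyecto<adj><f><pl>$^./.<sent>$[
]^abyecta/abyecto<adj><f><sg>$^./.<sent>$[
]^Mar./Mar.<n><m><sg>$[
]^abyecto/abyecto<adj><m><sg>$^./.<sent>$['

# -F treats the pattern as a fixed string, -v keeps the lines that lack it,
# i.e. the lines where the dot was absorbed into the preceding word.
absorbed=$(printf '%s\n' "$morph_sample" | grep -F -v '^./.<sent>$')

printf '%s\n' "$absorbed"
```

Only the "Mar." line is reported for this sample.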

I guess it may be because of the *dot* in the awk command "awk
'BEGIN{FS=":>:|:"}{print $1 ".";}'".  Why do we need the dot here?

(2) The second kind are words that may also be recognized by the regular
expressions, just as Fran says.

As far as I have tried, these numerals cause the "A new ambiguity class"
error:
VI
ID
MI
... ...

However, not all Roman numerals cause the error (e.g. II is OK). Only
Roman numerals that also have other lexical forms do, for example:
^VI/VI<num><mf><sg>/VER<vblex><ifi><p1><sg>$
^ID/ID<num><mf><sg>/IR<vblex><imp><p2><pl>$
^MI/MI<num><mf><sg>/MÍO<det><pos><mf><sg>$

The ambiguity classes of these words do not appear in the dic file, which
is what causes the error.
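
To compare such words at a glance, the sketch below prints, for each lexical
unit, the surface form, the number of analyses, and the first tag of each
analysis. This is only a rough proxy: the real ambiguity classes are built
from the coarse tags defined in the .tsx file, not from the first tag. The
three numeral lines are the ones shown above, and the "abyectas" line is
added as an unambiguous contrast.

```shell
units='^VI/VI<num><mf><sg>/VER<vblex><ifi><p1><sg>$
^ID/ID<num><mf><sg>/IR<vblex><imp><p2><pl>$
^MI/MI<num><mf><sg>/MÍO<det><pos><mf><sg>$
^abyectas/abyecto<adj><f><pl>$'

signatures=$(printf '%s\n' "$units" | awk '{
  sub(/^\^/, "")                 # drop the leading ^ of the lexical unit
  sub(/\$$/, "")                 # drop the trailing $
  n = split($0, parts, "/")      # parts[1] = surface form, parts[2..n] = analyses
  sig = ""
  for (i = 2; i <= n; i++) {
    if (match(parts[i], /<[^>]+>/)) {
      tag = substr(parts[i], RSTART + 1, RLENGTH - 2)   # first tag, without < >
      sig = sig (sig == "" ? "" : ",") tag
    }
  }
  printf "%s\t%d\t%s\n", parts[1], n - 1, sig
}')

printf '%s\n' "$signatures"
```

Units with more than one analysis are the candidates for an ambiguity class
that the expanded dictionary never produced.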

What can we do to avoid this?


Many thanks :)

Gang



2013/6/6 Francis Tyers <fty...@prompsit.com>

> On Thu, 06 Jun 2013 at 11:03 +0200, Felipe Sánchez Martínez wrote:
> > Hello Gang,
> >
> > I put the Apertium list in copy, just in case someone want to add
> > something.
> >
> >
> > >     I experiment with the en-es language pair
> > >     (https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-en-es),
> > >     and the newest "lt-toolbox" and "apertium" in the svn trunk. After
> > >     compiling successfully, I go to the "apertium-en-es/es-tagger-data"
> > >     directory, and copy the "es-tagged.txt" to "es.crp.txt", and use the
> > >     latter file as the corpus for unsupervised training.
> > >
> > >     executing the command:
> > >     make -f es-en-unsupervised.make
> > >
> > >     and got the following log:
> > >
> > >     ================= log begin=========================
> > >     Generating es-tagger-data/es.dic
> > >     This may take some time. Please, take a cup of coffee and come back
> > >     later.
> > >     apertium-validate-dictionary apertium-en-es.es.dix
> > >     apertium-validate-tagger apertium-en-es.es.tsx
> > >     lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" | grep -v
> > >     ":<:" |\
> > >              awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt
> > >      >es.dic.expanded
> > >     lt-proc -a es-en.automorf.bin <es.dic.expanded | \
> > >              apertium-filter-ambiguity apertium-en-es.es.tsx >
> > >     es-tagger-data/es.dic
> > >     rm es.dic.expanded;
> > >     apertium-destxt < es-tagger-data/es.crp.txt | lt-proc
> > >     es-en.automorf.bin > es-tagger-data/es.crp
> > >     apertium-validate-tagger apertium-en-es.es.tsx
> > >     apertium-tagger -t 8 \
> > >                                 es-tagger-data/es.dic \
> > >                                 es-tagger-data/es.crp \
> > >                                 apertium-en-es.es.tsx \
> > >                                 es-en.prob;
> > >     Calculating ambiguity classes...
> > >
> > >     106 states and 335 ambiguity classes
> > >     Kupiec's initialization of transition and emission probabilities...
> > >     Error: A new ambiguity class was found. I cannot continue.
> > >     Word 'Mar' not found in the dictionary.
> > >     New ambiguity class: {NOMMF,ANTROPONIM}
> > >     Take a look at the dictionary and at the training corpus. Then, retrain.
> > >     make: *** [es-en.prob] error 1
> > >     ================= log end=========================
> > >
> > >     I debugged the word "Mar" with lt-proc:
> > >     echo "Mar" | lt-proc es-en.automorf.bin
> > >
> > >     with the output:
> > >     ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$
> >
> >
> > Normally this happens when you are not regenerating the file with the
> > dictionary from which the ambiguity classes are obtained.
> >
> > Please check if ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$ appears as a
> > result of the expansion of the dictionary:
> >
> > $ lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" |\
> >    grep -v ":<:" | awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' |\
> >    apertium-destxt | lt-proc -a es-en.automorf.bin > es.expand.dic
> >
> > And if it appears after filtering the ambiguity file:
> > $ apertium-filter-ambiguity apertium-en-es.es.tsx < es.expand.dic > es.dic
>
> I think the problem is that the extra analyses are added by regular
> expressions which are not covered in the expansion.
>
> Fran
>
>
