Re: [Apertium-stuff] Ask for help on HMM unsupervised training

Felipe Sánchez Martínez Fri, 07 Jun 2013 01:15:30 -0700

Hello:

The problem with "Mar" is not the dot but the upper-case letter. One way 
to avoid this is to expand the dictionary and duplicate those entries 
with an initial upper-case letter, so that you have both "Mar" and 
"mar". The drawback is that this will produce some unknown words, which 
you will have to rule out.
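
A minimal sketch of that duplication step, assuming the expanded surface
forms arrive one per line. The helper name add_case_variants is
hypothetical (it is not an Apertium tool), and awk's toupper/tolower may
leave non-ASCII initials such as "á" unchanged, depending on the awk
implementation and locale:

```shell
#!/bin/sh
# Sketch only: add_case_variants is a hypothetical helper, not part of Apertium.
# Reads one surface form per line on stdin. For each form it prints the form
# itself, plus the variant whose first character has the opposite case
# (only when that variant differs from the original).
add_case_variants() {
    awk '{
        first = substr($0, 1, 1)
        rest  = substr($0, 2)
        upper = toupper(first) rest
        lower = tolower(first) rest
        print
        if (upper != $0) print upper
        if (lower != $0) print lower
    }'
}

# Example run on two ASCII forms.
printf 'mar\nabyecto\n' | add_case_variants
```

One possible place for it is the expansion pipeline, just before
apertium-destxt; that placement is an assumption, not something tested
here. The unknown words mentioned above would still have to be filtered
out afterwards.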


Cheers
--
Felipe

On 07/06/13 07:51, Gang Chen wrote:
> Hi Felipe, Fran,
>
> I think there are two kinds of words that cause the "A new ambiguity
> class" error.
>
> (1) The first kind are words like "Mar", which are valid both with and
> without a trailing dot:
>
> I ran the command:
> lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" | grep -v ":<:" |
> awk 'BEGIN{FS=":>:|:"}{print $1 ".";}'  | apertium-destxt >es.dic.expanded
>
> in the file es.dic.expanded, the contents are like these:
> abyectas.[
> ]abyecta.[
> ]abyectos.[
> ]abyecto.[
> ...
>
> and there is a line:
> ]Mar.[
>
> Next, I ran the command:
> lt-proc -a es-en.automorf.bin <es.dic.expanded > es.dic.expanded.morph
>    (before using "apertium-filter-ambiguity")
>
> in the file es.dic.expanded.morph, the contents are like these:
> ^abyectas/abyecto<adj><f><pl>$^./.<sent>$[
> ]^abyecta/abyecto<adj><f><sg>$^./.<sent>$[
> ]^abyectos/abyecto<adj><m><pl>$^./.<sent>$[
> ]^abyecto/abyecto<adj><m><sg>$^./.<sent>$[
> ... ...
>
> however, for the line "]Mar.[", the content is:
> ]^Mar./Mar.<n><m><sg>$[
>
> So it seems the morphological analyser has recognized "Mar." as a single
> word instead of two words, "Mar" and ".". The morphological analysis of
> the word "Mar" on its own is special (it is ambiguous):
> "^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$". No other word shares this
> ambiguity class, so "Mar" should have been the first word to contribute
> this ambiguity class.
>
> I guess it may be because of the *dot* in the awk command "awk
> 'BEGIN{FS=":>:|:"}{print $1".";}'".  Why do we need the dot here?
>
> (2) The second kind are words that may be recognized by the regular
> expressions, just as Fran says.
>
> As far as I have tried, these numerals cause the "A new ambiguity
> class" error:
> VI
> ID
> MI
> ... ...
>
> However, not all Roman numerals cause the error (e.g. II is OK). Only
> Roman numerals that also have other lexical forms do, for example:
> ^VI/VI<num><mf><sg>/VER<vblex><ifi><p1><sg>$
> ^ID/ID<num><mf><sg>/IR<vblex><imp><p2><pl>$
> ^MI/MI<num><mf><sg>/MÍO<det><pos><mf><sg>$
>
> The ambiguity classes of these words did not appear in the dic file, so
> the error was raised.
>
> What can we do to avoid this?
>
>
> Many thanks :)
>
> Gang
>
>
>
> 2013/6/6 Francis Tyers <fty...@prompsit.com>
>
>     On Thu, 06 Jun 2013 at 11:03 +0200, Felipe Sánchez Martínez wrote:
>      > Hello Gang,
>      >
>      > I put the Apertium list in copy, just in case someone wants to
>      > add something.
>      >
>      >
>      > >     I experimented with the en-es language pair
>      > >     (https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-en-es)
>      > >     and the newest "lt-toolbox" and "apertium" from the svn trunk.
>      > >     After compiling successfully, I went to the
>      > >     "apertium-en-es/es-tagger-data" directory, copied
>      > >     "es-tagged.txt" to "es.crp.txt", and used the latter file as
>      > >     the corpus for unsupervised training.
>      > >
>      > >     I executed the command:
>      > >     make -f es-en-unsupervised.make
>      > >
>      > >     and got the following log:
>      > >
>      > >     ================= log begin=========================
>      > >     Generating es-tagger-data/es.dic
>      > >     This may take some time. Please, take a cup of coffee and come back later.
>      > >     apertium-validate-dictionary apertium-en-es.es.dix
>      > >     apertium-validate-tagger apertium-en-es.es.tsx
>      > >     lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
>      > >              awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >es.dic.expanded
>      > >     lt-proc -a es-en.automorf.bin <es.dic.expanded | \
>      > >              apertium-filter-ambiguity apertium-en-es.es.tsx > es-tagger-data/es.dic
>      > >     rm es.dic.expanded;
>      > >     apertium-destxt < es-tagger-data/es.crp.txt | lt-proc es-en.automorf.bin > es-tagger-data/es.crp
>      > >     apertium-validate-tagger apertium-en-es.es.tsx
>      > >     apertium-tagger -t 8 \
>      > >                                 es-tagger-data/es.dic \
>      > >                                 es-tagger-data/es.crp \
>      > >                                 apertium-en-es.es.tsx \
>      > >                                 es-en.prob;
>      > >     Calculating ambiguity classes...
>      > >
>      > >     106 states and 335 ambiguity classes
>      > >     Kupiec's initialization of transition and emission probabilities...
>      > >     Error: A new ambiguity class was found. I cannot continue.
>      > >     Word 'Mar' not found in the dictionary.
>      > >     New ambiguity class: {NOMMF,ANTROPONIM}
>      > >     Take a look at the dictionary and at the training corpus. Then, retrain.
>      > >     make: *** [es-en.prob] error 1
>      > >     ================= log end=========================
>      > >
>      > >     I debugged the word "Mar" with lt-proc:
>      > >     echo "Mar" | lt-proc es-en.automorf.bin
>      > >
>      > >     with the output:
>      > >     ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$
>      >
>      >
>      > Normally this happens when you are not regenerating the file with the
>      > dictionary from which the ambiguity classes are obtained.
>      >
>      > Please check if ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$ appears as a
>      > result of the expansion of the dictionary:
>      >
>      > $ lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" |\
>      >    grep -v ":<:" | awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' |\
>      >    apertium-destxt | lt-proc -a es-en.automorf.bin > es.expand.dic
>      >
>      > And if it appears after filtering the ambiguity file:
>      > $ apertium-filter-ambiguity apertium-en-es.es.tsx < es.expand.dic > es.dic
>
>     I think the problem is that the extra analyses are added by regular
>     expressions which are not covered in the expansion.
>
>     Fran
>
>

-- 
Felipe Sánchez Martínez
Dep. de Llenguatges i Sistemes Informàtics
Universitat d'Alacant, E-03071 Alacant (Spain)
Tel.: +34 965 903 400, ext: 2966 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez

