On Thu, 6 June 2013 at 11:03 +0200, Felipe Sánchez Martínez wrote:
> Hello Gang,
>
> I put the Apertium list in copy, just in case someone wants to add something.
>
> > I experimented with the en-es language pair
> > (https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-en-es),
> > and the newest "lt-toolbox" and "apertium" in the svn trunk. After
> > compiling successfully, I went to the "apertium-en-es/es-tagger-data"
> > directory, copied "es-tagged.txt" to "es.crp.txt", and used the
> > latter file as the corpus for unsupervised training.
> >
> > I executed the command:
> > make -f es-en-unsupervised.make
> >
> > and got the following log:
> >
> > ================= log begin=========================
> > Generating es-tagger-data/es.dic
> > This may take some time. Please, take a cup of coffee and come back
> > later.
> > apertium-validate-dictionary apertium-en-es.es.dix
> > apertium-validate-tagger apertium-en-es.es.tsx
> > lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" | grep -v ":<:" |\
> > awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' | apertium-destxt >es.dic.expanded
> > lt-proc -a es-en.automorf.bin <es.dic.expanded | \
> > apertium-filter-ambiguity apertium-en-es.es.tsx > es-tagger-data/es.dic
> > rm es.dic.expanded;
> > apertium-destxt < es-tagger-data/es.crp.txt | lt-proc es-en.automorf.bin > es-tagger-data/es.crp
> > apertium-validate-tagger apertium-en-es.es.tsx
> > apertium-tagger -t 8 \
> > es-tagger-data/es.dic \
> > es-tagger-data/es.crp \
> > apertium-en-es.es.tsx \
> > es-en.prob;
> > Calculating ambiguity classes...
> >
> > 106 states and 335 ambiguity classes
> > Kupiec's initialization of transition and emission probabilities...
> > Error: A new ambiguity class was found. I cannot continue.
> > Word 'Mar' not found in the dictionary.
> > New ambiguity class: {NOMMF,ANTROPONIM}
> > Take a look at the dictionary and at the training corpus. Then, retrain.
> > make: *** [es-en.prob] error 1
> > ================= log end=========================
> >
> > I debugged the word "Mar" with lt-proc:
> > echo "Mar" | lt-proc es-en.automorf.bin
> >
> > with the output:
> > ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$
>
> Normally this happens when you are not regenerating the file with the
> dictionary from which the ambiguity classes are obtained.
>
> Please check whether ^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$ appears as a
> result of the expansion of the dictionary:
>
> $ lt-expand apertium-en-es.es.dix | grep -v "__REGEXP__" |\
>   grep -v ":<:" | awk 'BEGIN{FS=":>:|:"}{print $1 ".";}' |\
>   apertium-destxt | lt-proc -a es-en.automorf.bin > es.expand.dic
>
> And whether it still appears after filtering the ambiguity file:
>
> $ apertium-filter-ambiguity apertium-en-es.es.tsx < es.expand.dic > es.dic
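
A minimal sketch of how both checks could be scripted, assuming es.expand.dic and es.dic were produced by the two commands quoted above and that both files still contain lexical units in the ^form/analysis$ format. The helper name has_unit is hypothetical (it is not an Apertium tool); it only wraps a fixed-string grep so that the characters ^, $, < and > are matched literally:

```shell
# Hypothetical helper, not part of Apertium: report whether a literal
# string (for example a whole lexical unit) occurs in a file.
has_unit() {
    # $1 = literal string to look for, $2 = file to search
    # -F: treat the pattern as a fixed string, -q: no output, status only
    if grep -qF -- "$1" "$2"; then
        echo "found in $2"
        return 0
    else
        echo "missing from $2"
        return 1
    fi
}

# Intended use, once the two files exist:
#   unit='^Mar/Mar<n><mf><sg>/Mar<np><ant><f><sg>$'
#   has_unit "$unit" es.expand.dic
#   has_unit "$unit" es.dic
```

If the unit is missing from es.expand.dic, the dictionary expansion never produced that pair of analyses; if it is present there but missing from es.dic, it was dropped at the filtering step.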

I think the problem is that the extra analyses are added by regular expressions which are not covered in the expansion.

Fran

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff