On Wed, 28 Mar 2012 at 12:43 +0200, Orosz György wrote:
> Thanks for clarifying things.
> 
>         > It is clear. I am wondering about the supervised training: is it
>         > possible to train the tagger (in a supervised manner) without
>         > creating all the lexical resources used by the MT system? What is
>         > not obvious to me is why these parameters are needed:
>         > "apertium-tagger[-d] -s=n DIC CRP TSX TAGGER_DATA HTAG UNTAG"
>         
>         
>         And FILES are:
>          DIC:         full expanded dictionary file
>          CRP:         training text corpus file
>          TSX:         tagger specification file, in XML format
>          TAGGER_DATA: tagger data file, built in the training and used
>                       while tagging
>          HTAG:        hand-tagged text corpus
>          UNTAG:       untagged text corpus, morphological analysis of
>                       HTAG corpus to use both jointly with -s option
>         
>         
>         For Hungarian, "DIC" is not going to be possible as it relies on
>         dictionary expansion;[1] the rest is possible (you just need to
>         convert the resources you already have).
>         
>         Felipe: What is the dictionary expansion file used for when
>         training the tagger, and could it be approximated in some way?
>         
>         Fran
>         
>         1. Well, you could just analyse the corpus with your morphological
>         analyser, and then convert the set of analyses from the corpus to
>         an Apertium .dix file, then expand it. This would be useless for
>         most purposes but would allow you to train the tagger.
>         
>         
> 
> 
> Can you please confirm whether this is the training process or not?
> For tagging we need an untagged corpus (UNTAG), a disambiguated one
> (HTAG), and one which has all the possible analyses for each word
> (CRP). We also need a dictionary which has a huge number of
> wordform-analysis pairs (DIC). (Is it a simulated morphological
> analyzer?)

It isn't used for morphological analysis; the morphological analyser is
used for that. I believe that the expansion is used, along with the TSX
file, to calculate the ambiguity classes, but someone else might know
better.
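
To make "ambiguity classes" concrete: an ambiguity class is the set of tags
that a given surface form can receive. The Python sketch below is only an
illustration of that idea, not what apertium-tagger actually does
internally. It assumes a simplified expansion format of one
"surface:lemma<tag>..." pair per line, and it treats the whole tag string
as the coarse tag (i.e. an identity mapping instead of a real TSX file).
The function name and the sample lines are made up for the example.

```python
from collections import defaultdict


def ambiguity_classes(expanded_lines):
    """Map each surface form to the set of tag strings it can receive.

    Assumes one 'surface:lemma<tag1><tag2>...' entry per line, which is a
    simplification of real dictionary expansion output.
    """
    tags_by_form = defaultdict(set)
    for line in expanded_lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        surface, analysis = line.split(":", 1)
        start = analysis.find("<")
        if start == -1:
            continue  # no tags on this analysis
        # Identity mapping: the full fine-tag string stands in for the coarse tag.
        tags_by_form[surface].add(analysis[start:])
    return {form: frozenset(tags) for form, tags in tags_by_form.items()}


# Hypothetical expansion lines, for illustration only.
sample = [
    "books:book<n><pl>",
    "books:book<vblex><pri><p3><sg>",
    "the:the<det><def><sp>",
]
classes = ambiguity_classes(sample)
for form in sorted(classes):
    print(form, sorted(classes[form]))
```

Here "books" ends up in a two-tag class (noun or verb) and "the" in a
one-tag class, which is the kind of information the tagger would need for
every word form it might see.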

> TAGGER_DATA is created during the training, and TSX contains the
> mapping between the tags of the MA and those of the tagger. (One more
> question: is it possible to use the identity relation as the mapping,
> since the tagset we use is the one that the MA generates?)

Yes, you can make a TSX file which just has the same coarse tags as fine
tags.

Fran

