On Tue, 27/03/2012 at 06:37 +0530, Sambhav Jain wrote:
> Hi,
> 
> I have shaped the idea for the "Corpus-based lexicalised feature
> transfer". Feedback is welcome.
> 
> http://apertium.codepad.org/F9AlEyZS  (better formatted text)
> 
>  Apertium
>  ========
> 
>   Corpus-based lexicalised feature transfer
>   -----------------------------------------
>   Problem Statement (as on wiki): Sometimes we get really inadequate
> translations: output that you'd never hear from a native speaker. One
> such case is when we output something as definite when it is never
> used as definite. One way of dealing with this is a lot of rules and
> lists in transfer, but those are hard to write. So, how about looking
> at a corpus for information about features like definiteness,
> aspect, evidentiality, impersonal/reflexive pronoun use in Romance
> languages, etc.?
> 
>   DEFINITENESS
>   ------------
> 
>   I propose a module, a "Definiteness Adapter", which will sit just
> before morphological generation in the Apertium pipeline.
> 
> 
>  1) What does the module do?
>     ========================
>     - Removes an explicit definiteness marker where the target language does not expect one.
>     - Introduces an explicit definiteness marker where the target
> language expects one.
> 
>  2) Approach
>     ========
>     I am planning to use a hybrid module which primarily uses a
> machine learning approach to decide whether to remove or insert
> explicit 'definite' markers.
> 
>     On top of that (if required) sits a rule-based layer, which can
> include PROHIBITION rules or others; a rough sketch of how the two
> layers combine follows.
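> 
>     (All names below are hypothetical, and the model is assumed to
> expose a per-token predict() call; this is only a sketch of the idea,
> not a fixed design.)
> 
>     # Hybrid decision for one token: the classifier proposes, the
>     # rule layer may veto (a PROHIBITION rule) or force a label.
>     def decide_definiteness(model, rules, sentence, i):
>         # 1 = an explicit definiteness marker is expected before
>         # token i, 0 = it is not.
>         label = model.predict(sentence, i)      # hypothetical API
>         for rule in rules:
>             if rule.matches(sentence, i):       # hypothetical API
>                 return rule.label               # rule overrides the model
>         return label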

Good so far...

>  3) Architecture
>     ============
>     The task would require two modules.
>     1) Module 1 uses the trained model to make predictions and sits
> just before the morphological generator in the Apertium pipeline. It
> will contain a rule-based sub-module enabling the user to add rules,
> based on error analysis, which are applied to the machine-predicted
> output.
>     2) Module 2 is a stand-alone module which builds the model from
> a raw language corpus. (A sketch of Module 1 as a stream filter
> follows the diagrams below.)
> 
> 
>                           Trained Lang. Model
>                                    |
>                            ________V________
>      sequence+feature --->|    Module 1     |---> sequence'+feature
>                           |_________________|
> 
> 
>                      _________________                               _________________
>      Raw Corpus --->|      Morph      |---> Morph output stream --->|    Module 2     |---> Trained Lang. Model
>                     |_________________|                             |_________________|
> 
>  4) Learn from Corpus
>     =================
>     The idea is to learn a model which can predict the presence or
> absence of an explicit 'definite' marker. This clearly demands a
> large training corpus.

Why does it clearly demand a large training corpus ? 

>     Since for this task we only need to predict the definiteness
> markers, developing a training corpus is comparatively easy:
>         1. Every language has a finite number of definiteness markers.
>         2. Raw text is easy to get.

True

>  5) Corpus Preparation
>     ==================
>     Procedure for preparing the training corpus:
>     1. Choose a representative raw corpus (Wikipedia is a good option)
> for the language.
>     2. Run the Apertium morphological analyser for that language on the corpus.
>     3. Select and populate the training features and the prediction
> label, and arrange them in a format that can be fed to a classifier.

For people not used to machine learning jargon, this might be a bit
impenetrable. Try and write as if you were writing for linguists, or
someone who hadn't heard about machine learning before.

>     E.g. for English, where the definiteness marker is "the", the
> file would look something like this:
> 
>     Sentence: At the beginning of the year 2009
>     Class Label: 1 if a definiteness marker occurs before the token, 0 otherwise
> 
>     LEMMA        FEATURE1 (say POS)   FEATURE2 (...)   CLASS
>     -----        ------------------   --------------   -----
>     at           pr                   X                0
>     beginning    vb                   Y                1
>     of           pr                   Z                0
>     year         n                    P                1
>     2009         num                  N                0
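> 
>     A minimal sketch of step 3 for this example (the tuple format is
> illustrative, not the actual Apertium stream):
> 
>     # Turn one analysed sentence into classifier training rows.
>     # tokens: (lemma, POS) pairs; the class label records whether a
>     # definiteness marker occurred immediately before the token.
>     def make_training_rows(tokens, markers=("the",)):
>         rows, marker_seen = [], False
>         for lemma, pos in tokens:
>             if lemma.lower() in markers:
>                 marker_seen = True        # remember it; emit no row
>                 continue
>             rows.append((lemma.lower(), pos, 1 if marker_seen else 0))
>             marker_seen = False
>         return rows
> 
>     sent = [("At", "pr"), ("the", "det"), ("beginning", "vb"),
>             ("of", "pr"), ("the", "det"), ("year", "n"), ("2009", "num")]
>     # make_training_rows(sent) reproduces the table above.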
> 
> 
>  6) Choice of Classifier
>     ====================
>     I plan to use a CRF (conditional random field) as the classifier
> to predict the presence of the 'definite' marker. CRFs have an
> established reputation for sequence-labelling tasks. In case accuracy
> suffers, other available options are SVMs, naive Bayes, etc.
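> 
>     For illustration, training and tagging with one freely available
> CRF toolkit (python-crfsuite here; CRF++ or Wapiti would do the same
> job, and the features shown are just the two columns from the table
> above):
> 
>     import pycrfsuite
> 
>     # One sequence per sentence: token features and 0/1 labels.
>     xseq = [["lemma=at", "pos=pr"], ["lemma=beginning", "pos=vb"],
>             ["lemma=of", "pos=pr"], ["lemma=year", "pos=n"],
>             ["lemma=2009", "pos=num"]]
>     yseq = ["0", "1", "0", "1", "0"]
> 
>     trainer = pycrfsuite.Trainer(verbose=False)
>     trainer.append(xseq, yseq)            # call once per training sentence
>     trainer.train("definiteness.crfsuite")
> 
>     tagger = pycrfsuite.Tagger()
>     tagger.open("definiteness.crfsuite")
>     print(tagger.tag(xseq))               # predicted 0/1 per token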

How does "sequence labelling" apply to the task of choosing if something
is definite or not ? Is this a typical "sequence labelling" task ? 

>  7) Choice of Features
>     ==================
>     1) LEMMA - the lemma in lower case.
>                Lower-casing reduces the vocabulary; case information
> is kept as a separate feature instead. This limits the vocabulary
> size and hence the sparsity.
>     2) POS   - POS tag of the token; use UNKNOWN if the POS tag is unknown.
>     3) CASE  - alphabetic case of the surface form (one possible
> coding is sketched after this list):
>                a. Initial Capital (IC)
>                      e.g. England, Stanford
>                b. All Upper (AU)
>                      e.g. MIT, CPU
>                c. All Lower (AL)
>                      e.g. core, pen
>                d. Number_Digits (Ni)
>                      e.g. 2009 (N4), 100 (N3)
>                e. Alpha Numeric (AN)
>                      e.g. F16, b12, 99ace
>                f. Others (O)
>                      e.g. . , - @ + '
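> 
>     One possible realisation of the CASE classes (lumping mixed-case
> words like iPod into O is my assumption, not part of the scheme):
> 
>     def case_class(token):
>         if token.isdigit():
>             return "N%d" % len(token)   # Number_Digits: 2009 -> N4
>         if token.isalpha():
>             if token.isupper():
>                 return "AU"             # All Upper: MIT, CPU
>             if token.islower():
>                 return "AL"             # All Lower: core, pen
>             if token[0].isupper() and token[1:].islower():
>                 return "IC"             # Initial Capital: England
>             return "O"                  # mixed case falls through
>         if token.isalnum():
>             return "AN"                 # Alpha Numeric: F16, b12, 99ace
>         return "O"                      # Others: punctuation etc.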

Where did you get these from ? What makes you think that capitalisation
has any effect on definiteness ? 

>  8) Learning Template
>     =================
>     A sensible baseline is to start with a state-of-the-art feature
> template for chunking, then tweak it to increase accuracy. This is
> hard to turn into a fixed method and needs some trial and error; I
> have done it before for other problems. (A written-out example of
> such a template follows.)
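> 
>     Written out in code, a chunking-style template amounts to taking
> features from a small window around each token (the window size and
> the POS bigram below are starting points to tweak, not fixed choices):
> 
>     # sent: list of (lemma, pos, case) triples for one sentence.
>     def window_features(sent, i):
>         feats = []
>         for off in (-2, -1, 0, 1, 2):
>             j = i + off
>             if 0 <= j < len(sent):
>                 lemma, pos, case = sent[j]
>                 feats.append("lemma[%d]=%s" % (off, lemma))
>                 feats.append("pos[%d]=%s" % (off, pos))
>                 feats.append("case[%d]=%s" % (off, case))
>         if i > 0:   # POS bigram, as chunking templates usually add
>             feats.append("pos[-1]|pos[0]=%s|%s" % (sent[i-1][1], sent[i][1]))
>         return feats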
> 
> 
>  9) Post Editing - A rule based approach
>     ====================================
>     This sub-module will be a rule base and will allow the system's
> predictions to be overridden. Rules can be written by doing error
> analysis and finding frequent cases where the model makes mistakes;
> see the sketch below.
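> 
>     A sketch of how such rules could override the classifier (the
> example rule is invented for illustration, not a linguistic claim):
> 
>     # tokens: (lemma, pos) pairs; labels: the classifier's 0/1 output.
>     def apply_rules(tokens, labels):
>         out = list(labels)
>         for i, (lemma, pos) in enumerate(tokens):
>             # PROHIBITION: never insert a definiteness marker before
>             # a proper noun, whatever the model predicted.
>             if pos == "np":
>                 out[i] = 0
>         return out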
> 
> 
> 10) Comments
>     ========
>     1. The above description is for definiteness, but the module can
> easily be ported to other features such as aspect.
>     2. I feel language-specific linguistic features could be added.

I think that it might be a good idea to study the problem in more detail
before deciding on which approach to take. I believe that Hindi and
English don't mark definiteness in the same way, so perhaps you could do
the study based on these languages.

Regards, hope the comments are useful!

Fran

