On Tue, 27 Mar 2012 at 06:37 +0530, Sambhav Jain wrote:
> Hi,
>
> I have shaped the idea for the "Corpus-based lexicalised feature
> transfer". Feedback is welcome.
>
> http://apertium.codepad.org/F9AlEyZS (better formatted text)
>
> Apertium
> ========
>
> Corpus-based lexicalised feature transfer
> -----------------------------------------
>
> Problem statement (as on the wiki): Sometimes we get really inadequate
> translations: output you'd never hear from a speaker. One of those
> things is when we output something as definite when it is never used
> as definite. One way of dealing with this is a lot of rules and lists
> in transfer, but those are hard to write. So, how about looking at a
> corpus for information about features like definiteness, aspect,
> evidentiality, impersonal/reflexive pronoun use in Romance languages,
> etc.?
>
> DEFINITENESS
> ------------
>
> I propose a module, the "Definiteness Adapter", which will sit just
> before morphological generation in the Apertium pipeline.
>
> 1) What does the module do?
> ===========================
>    - Removes an explicit definiteness marker where the language does
>      not expect one.
>    - Introduces an explicit definiteness marker where the language
>      does expect one.
>
> 2) Approach
> ===========
>    I plan to build a hybrid module which primarily uses a machine
> learning approach to decide whether to remove or insert explicit
> 'definite' markers.
>
>    On top of that (if required) sits a rule-based layer, which can
> include PROHIBITION rules, among others.
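To make the pipeline position concrete, here is a minimal sketch of what such an adapter could look like, assuming it reads Apertium's stream format (`^lemma<tags>$` lexical units) and that definiteness is carried by a `<def>` tag; the tag names and the `decide()` policy are invented placeholders for the trained model, not part of the proposal.

```python
import re

# Hypothetical sketch of the "Definiteness Adapter": rewrite lexical
# units in the Apertium stream (^lemma<tag1><tag2>...$) just before
# generation.  The <def> tag and the decide() policy are assumptions
# for illustration only.

LU = re.compile(r'\^([^$]+)\$')  # one lexical unit: ^...$

def decide(tags):
    """Stand-in for the trained classifier: should this unit carry an
    explicit definiteness marker?  Toy rule: common nouns only."""
    return 'n' in tags and 'top' not in tags

def adapt(stream):
    def fix(match):
        parts = match.group(1).split('<')
        lemma, tags = parts[0], [t.rstrip('>') for t in parts[1:]]
        if decide(tags):
            if 'def' not in tags:
                tags.append('def')      # introduce the marker
        elif 'def' in tags:
            tags.remove('def')          # remove the marker
        return '^' + lemma + ''.join('<%s>' % t for t in tags) + '$'
    return LU.sub(fix, stream)

print(adapt('^London<np><top><def>$ ^city<n><sg>$'))
# -> ^London<np><top>$ ^city<n><sg><def>$
```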
Good so far...

> 3) Architecture
> ===============
>    The task requires two modules.
>    1) Module 1 uses the trained model to make predictions and sits
> just before the morphological generator in the Apertium pipeline. It
> contains a rule-based sub-module enabling the user to add rules,
> based on error analysis, which are applied to the machine-predicted
> output.
>    2) Module 2, a stand-alone module which builds the model from a
> raw language corpus.
>
>                       Trained Lang. Model
>                               |
>                        _______V______
>  sequence+feature --->|   Module 1   |---> sequence'+feature
>                       |______________|
>
>
>                 ________                               __________
>  Raw Corpus --->| Morph |---> morph output stream --->| Module 2 |---> Trained Lang. Model
>                 |_______|                             |_________|
>
> 4) Learn from corpus
> ====================
>    The idea is to learn a model which can predict the presence or
> absence of an explicit 'definite' marker. This clearly demands a
> large training corpus.

Why does it clearly demand a large training corpus ?

>    Since for this task we only need to predict the definiteness
> markers, developing a training corpus is comparatively easy:
>    1. Every language has a finite number of definiteness markers.
>    2. Raw text is easy to get.

True

> 5) Corpus preparation
> =====================
>    Procedure for preparing the training corpus:
>    1. Choose a representative raw corpus for the language (Wikipedia
>       is a good option).
>    2. Run the Apertium morphological analyser for that language on
>       the corpus.
>    3. Select and populate the training features and the prediction
>       label, and arrange them in a format that can be fed to a
>       classifier.

For people not used to machine learning jargon, this might be a bit
impenetrable. Try and write as if you were writing for linguists, or
someone who hadn't heard about machine learning before.
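As a concrete reading of step 3: for English, where the marker is "the", one could turn a POS-tagged sentence into rows of (lemma, POS, label), with label 1 when "the" immediately precedes the token. The POS tags below are invented for illustration; the real input would come from the Apertium morphological analyser.

```python
# Sketch of step 3 of "Corpus preparation": from tagged tokens to
# training rows.  The marker word "the" is consumed (not emitted as a
# row) and instead becomes the label on the following token.

def make_rows(tagged):
    rows, saw_the = [], False
    for surface, pos in tagged:
        if surface.lower() == 'the':
            saw_the = True              # remember the marker, don't emit it
            continue
        rows.append((surface.lower(), pos, 1 if saw_the else 0))
        saw_the = False
    return rows

sent = [('At', 'pr'), ('the', 'det'), ('beginning', 'n'), ('of', 'pr'),
        ('the', 'det'), ('year', 'n'), ('2009', 'num')]
print(make_rows(sent))
# -> [('at', 'pr', 0), ('beginning', 'n', 1), ('of', 'pr', 0),
#     ('year', 'n', 1), ('2009', 'num', 0)]
```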
> E.g. for English, where the definiteness marker is "the", the file
> would look something like this:
>
> Sentence: At the beginning of the year 2009
> Class label: 1 if a definiteness marker occurs before the token,
> 0 otherwise
>
> LEMMA      FEATURE1 (say POS)  FEATURE2 (...)  CLASS
> -----      ------------------  --------------  -----
> at         pr                  X               0
> beginning  vb                  Y               1
> of         pr                  Z               0
> year       n                   P               1
> 2009       num                 N               0
>
> 6) Choice of classifier
> =======================
>    I plan to use a CRF as the classifier to predict the presence of
> the 'definite' marker. CRFs have an established reputation for
> sequence labelling tasks. If accuracy suffers, other available
> options are SVMs, Bayes classifiers, etc.

How does "sequence labelling" apply to the task of choosing if
something is definite or not ? Is this a typical "sequence labelling"
task ?

> 7) Choice of features
> =====================
>    1) LEMMA - the lemma in lower case.
>       Lower-casing reduces the vocabulary, and hence sparsity; case
>       information is used as a separate feature instead.
>    2) POS - the POS tag of the token; UNKNOWN if the POS tag is
>       unknown.
>    3) CASE - the orthographic case of the lemma:
>       a. Initial Capital (IC),  e.g. England, Stanford
>       b. All Upper (AU),        e.g. MIT, CPU
>       c. All Lower (AL),        e.g. core, pen
>       d. Number_Digits (Ni),    e.g. 2009 (N4), 100 (N3)
>       e. Alpha Numeric (AN),    e.g. F16, b12, 99ace
>       f. Others (O),            e.g. . , - @ + '

Where did you get these from ? What makes you think that
capitalisation has any effect on definiteness ?

> 8) Learning template
> ====================
>    A sensible baseline would be to start from a state-of-the-art
> learning template for chunking, then tweak it to increase accuracy.
> This is difficult to reduce to a fixed method; it needs some trial
> and error. I have done this before for other problems.
>
> 9) Post-editing - a rule-based approach
> =======================================
>    This sub-module will be a rule base, and will allow the system's
> predictions to be overridden.
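It might help to sketch what such an override layer could look like. The rule format below is invented (the proposal does not yet define one): each rule is a (lemma, POS) pattern mapped to a forced label, with '*' as a wildcard, applied after the classifier.

```python
# Hypothetical post-editing rule layer: PROHIBITION-style rules that
# override the classifier's label, e.g. never predict a definiteness
# marker before a proper noun.  Patterns and labels are illustrative.
RULES = {('*', 'np'): 0, ('first', 'ord'): 1}

def post_edit(predictions):
    """predictions: list of (lemma, pos, predicted_label) triples."""
    out = []
    for lemma, pos, label in predictions:
        for (rl, rp), forced in RULES.items():
            if (rl == '*' or rl == lemma) and (rp == '*' or rp == pos):
                label = forced           # rule overrides the classifier
                break
        out.append((lemma, pos, label))
    return out

print(post_edit([('london', 'np', 1), ('first', 'ord', 0), ('cat', 'n', 1)]))
# -> [('london', 'np', 0), ('first', 'ord', 1), ('cat', 'n', 1)]
```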
> Rules can be written by doing error analysis and finding frequent
> cases where the model makes mistakes.
>
> 10) Comments
> ============
>    1. The description above is for definiteness, but the module can
>       easily be ported to other features, such as aspect.
>    2. I feel that further linguistic features could be added,
>       depending on the language.

I think that it might be a good idea to study the problem in more
detail before deciding on which approach to take. I believe that Hindi
and English don't mark definiteness in the same way, so perhaps you
could do the study based on these languages.

Regards, hope the comments are useful!

Fran

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff