Re: [Apertium-stuff] GSOC Idea - Corpus-based lexicalised feature transfer

Jimmy O'Regan Wed, 28 Mar 2012 19:43:24 -0700

On 27 March 2012 02:07, Sambhav Jain <sam...@gmail.com> wrote:
> Hi,
>
> I have shaped the idea for the "Corpus-based lexicalised feature
> transfer". Feedback is welcome.
>
>   DEFINITENESS
>   ------------
>
>   Proposing a module, "Definiteness Adapter", which will sit just before
> morphological generation in the Apertium pipeline.
>


The idea as listed talks about more features than just definiteness.
Why are you only concentrating on this?

>  6) Choice of Classifier
>     ====================
>     Plan to use a CRF as the classifier to predict the presence of the
> 'definite' marker. CRFs have an established reputation for sequence
> labeling tasks. If accuracy suffers, other available options include
> SVM, Bayes, etc.
>

You're essentially treating this as chunking, so say chunking. We're
not biologists, we don't care too much for sequence labeling :)

>  7) Choice of Features
>     ==================
>     1) LEMMA - lemma in lower case
>                Using lower case to reduce the vocab and using case
> information as a separate feature instead. This will limit the vocab
> and hence sparsity.
>     2) POS   - POS tag of the token, use UNKNOWN if the POS tag is unknown.
>     3) CASE  - alphabetic case of the lemma

Your choice of features would be more convincing if you took the time
to look at the translation stream to see what features are available.
For one thing, we have separate tags for features such as number,
gender, etc. For another, there is the possibility of running this
between interchunk and postchunk, which would mean the target language
chunks produced by the transfer component were still available.
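
To illustrate what is available, here is a minimal sketch that reads
lexical units of the form ^lemma<tag><tag>$ from the stream and turns
them into features. It assumes the simplified post-tagger form (no
surface forms, no multiword queues), and that the first tag is the POS.
The feature names (LEMMA, POS, CASE, TAGS) and the helper functions are
hypothetical, chosen only to mirror the list quoted above.

```python
import re

# One lexical unit, e.g. ^house<n><sg>$ (simplified: no surface form, no '#' queue).
LU_RE = re.compile(r"\^([^<$/]*)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def lexical_units(stream):
    """Yield (lemma, tags) for each lexical unit found in the stream."""
    for match in LU_RE.finditer(stream):
        yield match.group(1), TAG_RE.findall(match.group(2))


def case_of(word):
    """Coarse alphabetic case of a word: 'upper', 'title', or 'lower'."""
    if word.isupper():
        return "upper"
    if word[:1].isupper():
        return "title"
    return "lower"


def features(lemma, tags):
    """Build a feature dict for one unit (feature names are hypothetical)."""
    word = lemma.lstrip("*")  # a leading '*' marks an unknown word
    return {
        "LEMMA": word.lower(),
        "POS": tags[0] if tags else "UNKNOWN",  # assumes the first tag is the POS
        "CASE": case_of(word),
        "TAGS": tags[1:],  # remaining tags: number, gender, definiteness, ...
    }


stream = "^the<det><def><sp>$ ^House<n><sg>$ ^*Zork$"
feats = [features(lemma, tags) for lemma, tags in lexical_units(stream)]
```

The point is that number, gender and similar information arrives as
separate tags, so it can be used as separate features rather than being
folded into a single POS string.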

>  8) Learning Template
>     =================
>     A sensible baseline is to start with a state of the art learning
> template for chunking, then tweak it to increase accuracy. This is
> difficult to formalize and needs some trial and error. I have done it
> earlier for other problems.
>

If by 'state of the art' you mean 'existing' you should tell us what
you intend to use as a starting point.

>  9) Post Editing - A rule based approach
>     ====================================
>     This sub-module will be a rule base that allows the system's
> predictions to be overridden. Rules can be written by doing error
> analysis and finding frequent cases where the model makes mistakes.
>

This is quite weak and needs to be expanded upon. 'Error analysis' is
somewhat vague, and I think we would all prefer a more precise
description.
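
As an example of the level of detail that would help, here is a minimal
sketch of one shape such a rule base could take. Everything in it is an
assumption made for illustration: the rule signature, the labels "DEF"
and "NONE", the token dictionaries, and the example rule itself are
hypothetical and do not come from the proposal.

```python
def apply_rules(tokens, predictions, rules):
    """Return a copy of predictions with the first matching rule applied per token."""
    output = list(predictions)
    for i in range(len(tokens)):
        for rule in rules:
            override = rule(tokens, i, output[i])
            if override is not None:
                output[i] = override
                break  # first matching rule wins
    return output


def no_marker_on_proper_noun(tokens, i, label):
    """Hypothetical rule: never predict a definite marker on a proper noun."""
    if tokens[i]["POS"] == "np" and label == "DEF":
        return "NONE"
    return None  # rule does not apply; keep the classifier's label


tokens = [{"POS": "np"}, {"POS": "n"}]
predicted = ["DEF", "DEF"]
corrected = apply_rules(tokens, predicted, [no_marker_on_proper_noun])
```

A description along these lines, together with how the rules would be
derived from observed errors, would make this section much easier to
evaluate.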

>
> 10) Comments
>     ========
>     1. Above description is provided for definiteness but this module
> can be easily ported to other concepts like aspect etc.

Have you given any thought to that? It would be better, if you have,
to let us know what your thoughts are, rather than concentrating on a
single application.


-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
