On 7/22/19 9:32 PM, Francis Tyers wrote:
> On 2019-07-21 22:50, Amr Mohamed Hosny Anwar wrote:
>> Dear Francis, Nick, Tommi,
>>
>> Hope this mail finds you well.
>> I would like to share with you the blog posts that I have used to
>> document the project's progress.
>> Firstly, the scores for the implemented methods, computed using a
>> custom script
>> (https://github.com/apertium/lttoolbox/pull/55/files#diff-4791d142daa5e6d636af9488c64ef69a),
>> can be found here: https://ak-blog.herokuapp.com/posts/7/
>>
>> Secondly, I have done my best to search for relevant publications
>> related to keywords such as "morphological disambiguation".
>> All the methods are supervised in one way or another.
>> I have documented my notes for the list of relevant publications here:
>> https://ak-blog.herokuapp.com/posts/9/
>>
>> Finally, I have made some tweaks to the supervised model and implemented
>> a model based on the length of the analyses.
>> This model seems to be equivalent to the one that assigns the same weight
>> to all the analyses, which I believe is a result of the way the
>> lt-proc command works.
>> You can check my explanation/findings here:
>> https://ak-blog.herokuapp.com/posts/10/
>>
>> Looking forward to reading your advice on how to proceed with the
>> project.
>> Additionally, do you think we could make use of a parallel corpus for
>> two languages in some way?
>> I know a parallel corpus is also a form of supervision, but my intuition
>> is that finding or developing parallel corpora is easier than
>> finding or developing a tagged corpus.
>>
>> Note: The blog is hosted on Heroku's free tier, so the first time
>> you access a page it might take some time to load :)
>>
>
> How about using BPE to weight the possible analyses?
>
> e.g.
>
> 1) BPE will give you the segmentation it prefers for a word,
>    "arabasız>lar>da"
>
> 2) the analyser will give you various segmentations:
>      araba>sız>lar>da, arabasız>lar>da
>
> 3) you weight each analysis that disagrees with BPE higher, adding a
>      penalty for each boundary that isn't predicted by BPE
>
>
> F.
>
Hi Francis,

I have checked the BPE segmentation paper.
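
As a sanity check on my understanding, the weighting you describe could
look roughly like the sketch below (a minimal sketch in Python; the
function names and the penalty constant are placeholders of mine, and I
assume both segmentations are '>'-separated strings over the same surface
form, as in your example):

def boundaries(segmentation):
    """Character offsets of the morph boundaries,
    e.g. 'araba>sız>lar>da' -> {5, 8, 11}."""
    offsets, pos = set(), 0
    for morph in segmentation.split('>')[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets

def disagreement_weight(analysis_seg, bpe_seg, penalty=1.0):
    """Weight an analysis by how many of its boundaries BPE does not
    predict (a higher weight means less preferred)."""
    extra = boundaries(analysis_seg) - boundaries(bpe_seg)
    return penalty * len(extra)

bpe = 'arabasız>lar>da'
for seg in ('araba>sız>lar>da', 'arabasız>lar>da'):
    print(seg, disagreement_weight(seg, bpe))
# araba>sız>lar>da -> 1.0 (the araba|sız boundary is not predicted by BPE)
# arabasız>lar>da  -> 0.0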

The idea is easy to grasp, but the morphological analyser's output has a
special format: to compare it against BPE I will need to drop the analysis
tags such as "<n>" and "<sg>".
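
For example, the tags can be dropped with a small regex (a minimal sketch;
I am assuming lexical units arrive in the usual
'^surface/analysis1/analysis2/...$' stream format, and strip_tags is just
a name of mine):

import re

def strip_tags(lu):
    """Drop <...> tags from each analysis of a lexical unit,
    keeping only the bare segments."""
    surface, *analyses = lu.strip('^$').split('/')
    return [re.sub(r'<[^>]*>', '', a) for a in analyses]

print(strip_tags('^oscillating/oscillating<adj>/oscillate<vblex><pprs>'
                 '/oscillate<vblex><subs>/oscillate<vblex><ger>$'))
# -> ['oscillating', 'oscillate', 'oscillate', 'oscillate']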
To check whether BPE might actually be beneficial, I computed some
statistics from the tiny English corpus that I am using (the counting
logic is sketched just after the list):

* Corpus size: 9098 tokens
* Unambiguous tokens (a single analysis per token): 5830 tokens (64%)
* Ambiguous tokens: 3268 tokens (36%)
    * Tokens whose analyses have different segments: 533 tokens
      (5.86% of the corpus, 16.3% of the ambiguous tokens)
      Example:
        * Surface token: oscillating
        * Analyses: oscillating<adj>/oscillate<vblex><pprs>/oscillate<vblex><subs>/oscillate<vblex><ger>
        * Segments: oscillating/oscillate/oscillate/oscillate
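
The counting itself is straightforward; roughly (a sketch, assuming the
corpus has already been run through lt-proc and each lexical unit arrives
as a (surface, analyses) pair; the helper name is mine):

import re

def ambiguity_stats(lexical_units):
    total = unambiguous = differing = 0
    for surface, analyses in lexical_units:
        total += 1
        if len(analyses) == 1:
            unambiguous += 1
            continue
        # compare analyses with the tags stripped, as above
        segments = {re.sub(r'<[^>]*>', '', a) for a in analyses}
        if len(segments) > 1:   # e.g. {'oscillating', 'oscillate'}
            differing += 1
    return total, unambiguous, total - unambiguous, differing

units = [('oscillating', ['oscillating<adj>', 'oscillate<vblex><pprs>',
                          'oscillate<vblex><subs>', 'oscillate<vblex><ger>'])]
print(ambiguity_stats(units))   # (1, 0, 1, 1)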

Thus BPE could help discriminate between analyses for only 5.86% of the
tokens.
Additionally, even among these tokens, most of the analyses share the same
segmentation (in the example above, three of the four analyses reduce to
"oscillate"), so BPE cannot tell them apart.
So I am not very encouraged to use BPE in this way; I believe it won't
make a big difference.
Do you think these statistics would differ for languages such as German,
Turkish, and Finnish, which seem to have more complex compounding than
English?

Amr

