Hey Apertiumers,
This mail is regarding an ongoing project to eliminate dictionary trimming.
The project idea can be found here
<https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Eliminate_trimming>.
The project description was to work around everything in Why we trim
<https://wiki.apertium.org/wiki/Why_we_trim>.

In the current state of the project, it was proposed that using secondary
tags to propagate surface information in the pipe and weighting the
monodix with the bidix would preserve most of the benefits of trimming
while getting rid of the disadvantages. However, it doesn't seem like we
all agree on the importance of the aforementioned benefits and
disadvantages. I
am a potential language developer with Apertium, but I haven't made a pair
yet, which makes me anything but an authority on what's actually
beneficial. When I wrote the proposal, I weighed the pros and cons based on
my knowledge of linguistics, MT systems, and Apertium.

However, Francis has made some really good points about the benefits of
trimming that I wasn't aware of, which has now led to a dilemma. I will do
my best to list these advantages and disadvantages here, and you guys can
decide among three options: *keeping trimming as the norm with the option
of disabling it, making untrimmed dictionaries the norm with the option of
enabling trimming, or trimming only partially, as discussed later.*

*Disadvantages of eliminating trimming:*
1. The monodix has some erroneous analyses - wrong surface forms, wrong
analyses, or even MWEs that aren't really MWEs and can be translated word
by word. These are currently removed since bidixes are more carefully
maintained. If trimming is eliminated, and none of the analyses of a word
are in the bidix, then one of the analyses will be chosen, and there is a
chance that it is erroneous. If it's an MWE that doesn't exist in the
bidix, it won't be translated word by word, even though that would have
been fine.
2. If your monodix is used by lots of other pair developers, you don't want
*your* pair to get messed up because someone somewhere decided "take
precautions" should be an MWE, and suddenly where your old output had "ta
forholdsregler" you now get "*take precautions".
- Unhammer
3. Trimming lets you control the monodix through the bidix of your language
pair. This ability isn't entirely lost, because we would still be weighting
the monodix with the bidix, but a word with none of its analyses in the
bidix, which used to be discarded, would now be retained.
4. Weighting the monodix will take more compile time than just trimming it.
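
To make the contrast between trimming and weighting concrete, here is a
minimal Python sketch. It is a toy model, not Apertium or lttoolbox code:
plain dicts stand in for compiled transducers, the bidix is reduced to a set
of source lemmas, and the names `trim`, `weight`, `MONODIX`, `BIDIX_LEMMAS`
and the `penalty` value are all assumptions made for illustration. It also
assumes that a lower weight is preferred.

```python
from typing import Dict, List, Set, Tuple

Analysis = Tuple[str, str]            # (lemma, tags)
Monodix = Dict[str, List[Analysis]]   # surface form -> possible analyses


def trim(monodix: Monodix, bidix_lemmas: Set[str]) -> Monodix:
    """Drop every analysis whose lemma has no bidix entry."""
    trimmed: Monodix = {}
    for surface, analyses in monodix.items():
        kept = [a for a in analyses if a[0] in bidix_lemmas]
        if kept:  # a surface form with no surviving analysis disappears
            trimmed[surface] = kept
    return trimmed


def weight(
    monodix: Monodix, bidix_lemmas: Set[str], penalty: float = 1.0
) -> Dict[str, List[Tuple[float, Analysis]]]:
    """Keep every analysis, but penalise those without a bidix entry."""
    weighted = {}
    for surface, analyses in monodix.items():
        scored = [(0.0 if a[0] in bidix_lemmas else penalty, a) for a in analyses]
        # Stable sort: analyses backed by the bidix come first.
        weighted[surface] = sorted(scored, key=lambda pair: pair[0])
    return weighted


# Hypothetical toy data: only "see" has a bidix entry.
MONODIX: Monodix = {
    "saw": [("saw", "n.sg"), ("see", "vblex.past")],
    "hung": [("hang", "vblex.past")],
}
BIDIX_LEMMAS = {"see"}

print(trim(MONODIX, BIDIX_LEMMAS))
print(weight(MONODIX, BIDIX_LEMMAS))
```

In this toy, trimming removes "hung" altogether, while weighting keeps its
only analysis with a penalty and still prefers the bidix-backed reading of
"saw".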

*Advantages of eliminating trimming:*
1. Since the monodix will be weighted with the bidix (if that is what we
decide), whenever at least one analysis of a word exists in the bidix, that
analysis will be chosen, so a lot of erroneous analyses will still be
ignored.
2. If none of the analyses of a surface form are in the bidix, then the
most likely one will be chosen and used for context disambiguation and
transfer rules, thereby giving a more comprehensible and more post-editable
output. It even enables us to do something like:

*Basque to English "Andonik izarak izeki zuen" ('Andoni hung up the
sheets') → 'Andoni *izeki-ed the sheets'?*

*Doing this won't be possible without eliminating dictionary trimming.*
3. Philosophically, it would make sense to not discard the knowledge of a
word's analysis if we can use it, even though we cannot translate it.
4. It helps with debugging, as we can now distinguish errors caused by a
word missing from the bidix from errors caused by a word missing from the
monodix.
5. Previously, without trimming, we got @source_lemma, which harms the
translated output. Now we can get @source_surfaceform, which shows the same
surface form as we get with trimming (*source_surfaceform), but with the
benefits of disambiguation and transfer.
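
Points 4 and 5 can be illustrated with a small Python sketch of how an
untranslatable word would surface in the output under each setup. This is a
toy, not the real pipeline: each surface form has a single lemma, there is no
tagging or generation, and the function name `render`, the mode labels and
the dictionaries are hypothetical.

```python
from typing import Dict


def render(surface: str, monodix: Dict[str, str], bidix: Dict[str, str], mode: str) -> str:
    """Return the output for one word.

    mode is one of "trimmed", "untrimmed", "proposed" (labels invented here).
    """
    lemma = monodix.get(surface)
    if mode == "trimmed" and lemma not in bidix:
        lemma = None  # trimming removed this analysis at compile time
    if lemma is None:
        return "*" + surface      # unknown to the analyser
    if lemma in bidix:
        return bidix[lemma]       # translatable (no inflection in this toy)
    # Analysed but not in the bidix: old behaviour shows the lemma,
    # the proposed behaviour shows the original surface form.
    return "@" + (lemma if mode == "untrimmed" else surface)


# Hypothetical toy dictionaries.
MONODIX = {"houses": "house", "hung": "hang"}
BIDIX = {"house": "casa"}

for mode in ("trimmed", "untrimmed", "proposed"):
    print(mode, [render(w, MONODIX, BIDIX, mode) for w in ("houses", "hung", "blorp")])
# trimmed   ['casa', '*hung', '*blorp']
# untrimmed ['casa', '@hang', '*blorp']
# proposed  ['casa', '@hung', '*blorp']
```

With trimming, "hung" (missing from the bidix) and "blorp" (missing from the
monodix) look identical; in the other two modes the marker tells them apart.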

If there are any advantages or disadvantages I missed, please add them here
so people can make a more informed decision. *Another possible solution
that was discussed is to continue trimming multiwords, since they present a
unique disadvantage, but to eliminate trimming for words without spaces.*
Any other possible solution can be discussed as well.
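
As a rough illustration of the partial-trimming idea, the Python sketch below
trims only multiword entries (lemmas containing a space) and leaves single
words alone. It is an assumption-laden toy: the function `partial_trim`, the
dict-based dictionaries and the space test for "multiword" are invented here
and say nothing about how this would actually be done in lttoolbox.

```python
from typing import Dict, List, Set


def partial_trim(monodix: Dict[str, List[str]], bidix_lemmas: Set[str]) -> Dict[str, List[str]]:
    """Trim multiword lemmas that lack a bidix entry; keep all single words."""
    result: Dict[str, List[str]] = {}
    for surface, lemmas in monodix.items():
        kept = [l for l in lemmas if " " not in l or l in bidix_lemmas]
        if kept:
            result[surface] = kept
    return result


# Hypothetical toy data: the MWE and "hang" are both missing from the bidix.
MONODIX = {
    "take precautions": ["take precautions"],
    "take": ["take"],
    "precautions": ["precaution"],
    "hung": ["hang"],
}
BIDIX_LEMMAS = {"take", "precaution"}

print(partial_trim(MONODIX, BIDIX_LEMMAS))
# {'take': ['take'], 'precautions': ['precaution'], 'hung': ['hang']}
```

Here the multiword entry is dropped, so "take precautions" can again be
translated word by word, while "hung" keeps its analysis even though the
bidix cannot translate it.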

Hopefully after an informed discussion we can come to an acceptable
conclusion.

Thanks and Regards,
Tanmai Khanna

-- 
*Khanna, Tanmai*