On Mon, May 25, 2020 at 03:10:28PM +0530, Tanmai Khanna wrote:

> *Disadvantages:*
> 1. The monodix has some erroneous analyses - wrong surface forms, wrong
> analyses, or even MWEs that aren't really MWEs and can be translated word
> by word. These are currently removed since bidixes are more carefully
> maintained. If trimming is eliminated, and none of the analyses of a word
> are in the bidix, then one of the analyses will be chosen, and there is a
> chance that it is erroneous. If it's an MWE that doesn't exist in the
> bidix, it won't be translated word by word even though that was ok.

I think the only argument here is that we want to keep having bad stuff
in monodixes. From a software engineering standpoint I find this
argument really problematic: hindering further development of systems
because we want to keep bad, low-quality data in monodixes is not good.
As I've curated and maintained a bunch of stuff myself, though, I can
relate to the sentiment; linguistic data collection, dictionaries
included, is not really a software project that will ever reach a
complete and correct version 1.0. But I do think Apertium needs to move
towards more quality control and more continuous testing for monodixes,
especially for the esteemed release-quality languages. However, I don't
think I have seen a non-MWE example of how we lose the ability to
reconstruct the trimmed output of the whole pipeline; I am still under
the impression that we mostly just add data to the stream, so we would
have more possibilities than before to programmatically output either
the input surface form, the bad monodix lemma, or something else?

> 2. If your monodix is used by lots of other pair developers, you don't want
> *your* pair to get messed up because someone somewhere decided "take
> precautions" should be an MWE, and suddenly where your old output had "ta
> forholdsregler" you now get "*take precautions".
> - Unhammer

This MWE problem I do agree is bad and relevant. I usually develop with
a hacked untrimmed setup by default, and doing eng→fin was not
particularly fun like that. My solution to that would be to never add
"take precautions" to apertium-eng in the first place, and to build more
automated and social control around it; that's how other software
projects keep their codebases clean enough. But I don't know how to go
about it here.

Wasn't there a "separable"-based solution that looked good though?

> 3. Having trimming gives the ability to control the monodix using the bidix
> in your language pair. This ability isn't lost, because we're still
> weighting the monodix, but if the bidix has none of the analyses for a
> word, earlier it was discarded and now it will be retained.

We can still discard it with just a bit more hacking, surely? Am I
missing something here? The stream will contain the surface form as well
as the bad monodix analysis, plus information, however encoded (nonzero
weight, secondary tag, etc.), that it wasn't in the bidix?
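To make the idea concrete, here is a toy sketch of what such post-filtering could look like. It assumes an Apertium-style stream where each lexical unit is `^surface/analysis1/analysis2$`, and it assumes a hypothetical convention where analyses absent from the bidix carry a marker tag; the `<notbidix>` tag name is my invention, not actual Apertium output:

```python
import re

def trim_unit(unit):
    """Keep only in-bidix analyses of one lexical unit; if none
    remain, emit *surface, mimicking what trimming would give."""
    surface, *analyses = unit.split("/")
    # Hypothetical marker: analyses not in the bidix are tagged <notbidix>.
    kept = [a for a in analyses if "<notbidix>" not in a]
    if kept:
        return "^" + "/".join([surface] + kept) + "$"
    return "*" + surface

def trim_stream(text):
    # Rewrite every ^...$ lexical unit in the stream.
    return re.sub(r"\^([^$]*)\$", lambda m: trim_unit(m.group(1)), text)
```

For example, `trim_stream("^take/take<vblex><inf>/take<n><sg><notbidix>$")` drops only the flagged analysis, while a unit whose analyses are all flagged falls back to the surface form, just as if the word had been trimmed out of the monodix.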

> 4. Weighting the monodix will take more compile time than just trimming it.

Some numbers would be interesting; I think both are quite heavy, and we
don't do much further processing in finite-state algebra (/HFST space),
so the weighted models won't blow up. In any case, people seem happy in
2020 to wait 70 hours for some neural stuff; a few minutes for weighted
automata won't be too bad ;-)


-- 
Regards, Flammie <https://flammie.github.io>
(Please note that I will often include my replies inline instead of
at the top or bottom of the mail)


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
