On Mon, May 25, 2020 at 03:10:28PM +0530, Tanmai Khanna wrote:
> *Disadvantages:*
> 1. The monodix has some erroneous analyses - wrong surface forms, wrong
> analyses, or even MWEs that aren't really MWEs and can be translated word
> by word. These are currently removed since bidixes are more carefully
> maintained. If trimming is eliminated, and none of the analyses of a word
> are in the bidix, then one of the analyses will be chosen, and there is a
> chance that it is erroneous. If it's an MWE that doesn't exist in the
> bidix, it won't be translated word by word even though that was ok.
I think the only argument here is that we want to keep having bad stuff in monodixes. From a software engineering standpoint I find this argument really problematic: hindering further development of systems because we want to keep bad, low-quality data in monodixes is not good. As I've curated and maintained a bunch of stuff myself, though, I can relate to the sentiment; linguistic data collection, including dictionaries, is not really a software project that will ever have a complete and correct version 1.0. But I do think apertium does need to move towards more quality control and more continuous testing for monodixes, especially for the esteemed release-quality languages.

However, I don't think I have seen a non-MWE example of how we lose the ability to reconstruct the trimmed output of the whole pipeline. I am still under the impression that we mostly just add data to the stream, and there will then be more possibilities than before to output either the input surface form, the bad monodix lemma, or something else programmatically?

> 2. If your monodix is used by lots of other pair developers, you don't want
> *your* pair to get messed up because someone somewhere decided "take
> precautions" should be an MWE, and suddenly where your old output had "ta
> forholdsregler" you now get "*take precautions".
> - Unhammer

This MWE problem I do agree is bad and relevant. I usually develop with hacked untrimmed dictionaries by default, and doing eng→fin was not particularly fun like that. My solution would be to never add "take precautions" to apertium-eng in the first place, and to build more automated and social control for that; that's how other software projects keep codebases clean enough, but I don't know how to go about it here. Wasn't there a "separable"-based solution that looked good, though?

> 3. Having trimming gives the ability to control the monodix using the bidix
> in your language pair.
> This ability isn't lost, because we're still
> weighting the monodix, but if the bidix has none of the analyses for a
> word, earlier it was discarded and now it will be retained.

We can still discard it with just a bit more hacking, surely? Am I missing something here? The stream will contain the surface form as well as the bad monodix analysis, plus information, however encoded (nonzero weight, secondary tag, etc.), that it wasn't in the bidix?

> 4. Weighting the monodix will take more compile time than just trimming it.

Some numbers would be interesting. I think both are quite heavy, but we don't do much further processing in finite-state algebra (/HFST space), so the weighted models won't blow up. In any case, people seem to be happy in 2020 to wait 70 hours for some neural stuff; a few minutes for weighted automata won't be too bad ;-)

--
Regards,
Flammie <https://flammie.github.io>
(Please note that I will often include my replies inline instead of at the top or bottom of the mail)
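P.S. To make the "bit more hacking" above concrete, here's a minimal sketch of a downstream stream filter. It assumes the weighted compile gives monodix-only analyses a nonzero weight that shows up in the stream as a `<W:x>` tag; that encoding is a guess on my part, it could just as well be a secondary tag, and the sketch ignores escaping of special characters. With that assumption, the old trimmed behaviour is reconstructable after the fact:

```python
import re

# Hypothetical encoding: analyses absent from the bidix carry a
# nonzero weight, visible in the stream as a <W:x> tag.
WEIGHT = re.compile(r"<W:([0-9.]+)>")

def mimic_trimming(lu, max_weight=0.0):
    """Given a stream-format lexical unit ^surface/analysis1/analysis2$,
    drop every analysis heavier than max_weight; if none survive, fall
    back to the unknown-word form ^surface/*surface$, i.e. what a
    trimmed analyser would have produced."""
    m = re.fullmatch(r"\^([^/]+)/(.+)\$", lu)
    if m is None:
        return lu  # not a lexical unit, pass it through untouched
    surface, analyses = m.group(1), m.group(2).split("/")
    kept = []
    for a in analyses:
        w = WEIGHT.search(a)
        if w is None or float(w.group(1)) <= max_weight:
            kept.append(WEIGHT.sub("", a))  # keep, weight tag stripped
    if not kept:
        return "^{0}/*{0}$".format(surface)
    return "^" + surface + "/" + "/".join(kept) + "$"

# The bidix analysis survives, the monodix-only one is discarded:
print(mimic_trimming("^takes/take<vblex><pri><p3><sg>/take<n><pl><W:1.0>$"))
# And the MWE case falls back to *surface, like trimmed output did:
print(mimic_trimming("^take precautions/take# precautions<vblex><inf><W:1.0>$"))
```

The point being that trimming becomes one possible policy among several, applied at runtime instead of at compile time.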
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff