I apologise, it seems the link got removed when the message was sent. Here it is: http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming

Thanks
Tanmai

On Sun, Mar 29, 2020 at 3:11 PM Tanmai Khanna <khanna.tan...@gmail.com> wrote:

> Hey guys,
> Here's a draft proposal for this project. Any comments will be
> appreciated :)
>
> Thanks,
> Tanmai
>
> On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <khanna.tan...@gmail.com>
> wrote:
>
>> Hi Hèctor,
>> A fundamental motivation for this proposal is the possibility of giving
>> each program the power to use and propagate as much information as it
>> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>>
>>> You should see how much secondary information VISL's streams have. Noun
>>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>>> any information along makes many things possible, often things you can't
>>> imagine because of current limitations.
>>>
>>
>> The example in my original message was just that, an example. The idea is
>> that you can add as much information as you want, in the language
>> modules or even the translation modules. It's also not just about English
>> words, but about all languages. This is not to say that we have to add case
>> information to every English word. It is optional information which can be
>> added if needed for the translation task.
>>
>> With this proposal we're trying to prepare the apertium stream for the
>> future. Today we realised that we need the surface form in the stream, and
>> tomorrow we might need semantic tags, sentiment tags, etc. *If we don't
>> do this now, we will have to modify all the parsers in the pipeline each
>> time we need more information in the pipe.* This is why it's a good idea
>> to modify the parsers so that they can handle an arbitrary amount of
>> information.
>>
>> Lastly, one point we should discuss is this idea about how any secondary
>> information I add in the monodix would be available for everyone who uses
>> that information.
>> There are several things to say about this:
>>
>>    - As long as the information is correct, I don't really see why
>>    redundant secondary information should bother anyone. It will be available
>>    for anyone who wishes to use it for their task, and if you don't want to
>>    use it the programs will ignore it.
>>    - Another idea is that secondary information could be put in a
>>    separate dix; however, this would lead to an unnecessary increase in
>>    complexity.
>>
>> Unless the developers of Apertium feel that redundant information in
>> the stream will be a huge problem, this will allow each program to access a
>> lot more information and open up possibilities that we haven't even thought
>> of yet. At the very least, it will help us to eliminate trimming.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <hectora...@gmail.com>
>> wrote:
>>
>>> Hi Tanmai,
>>>
>>> I am surprised by this proposal. It involves some very important changes
>>> that should be better justified. I don't quite understand when one should
>>> define the "optional secondary information" in addition to the current
>>> morphological fields. Will it be in the language module (apertium-xxx) or
>>> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
>>> may be in the example. I can't imagine why information on case should be
>>> added to every English word (any more than, say, information about
>>> belonging, which is common in Turkic languages). Will this kind of
>>> information, unnecessary for everybody or almost everybody, be found
>>> in every language pair using, say, English, if someone wants to add it
>>> for his or her specific purposes? As far as I understand, the
>>> given project needs to add the surface form of the word. This seems
>>> quite logical. Moreover, this information may be useful for e.g. lexical
>>> selection and structural transfer. But anything more than that seems too
>>> obscure to me.
>>>
>>> Best,
>>> Hèctor
>>>
>>> On Sat, Mar 28, 2020 at 11:51 PM Tanmai Khanna <khanna.tan...@gmail.com>
>>> wrote:
>>>
>>>> Hey guys,
>>>> As part of the project to eliminate trimming, I had to come up with a
>>>> way to include the surface form in the lexical unit, and hence to modify
>>>> the apertium stream format. To do this I would have to modify the parsers of
>>>> every program in the pipeline, and if that has to happen, we discussed on
>>>> the IRC that *it might be a good idea to modify the stream in such a
>>>> way that we can include an arbitrary amount of information in a lexical
>>>> unit, and each program can use whatever information it needs.*
>>>>
>>>> The current information in the lexical unit would be primary
>>>> information, and then we would have optional secondary information which
>>>> could contain the surface form, but also literally anything you can think
>>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>>> possibilities for each program, and it would strengthen the apertium stream
>>>> format considerably.
>>>>
>>>> We discussed several possible syntaxes for this new stream format, and
>>>> the one that seems the best is something like this:
>>>>
>>>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>>>
>>>> This doesn't mess with the current stream format too much. The number
>>>> of tags is already arbitrary so that helps. The secondary tags contain a
>>>> ":" that would help distinguish them from primary tags.
>>>>
>>>> To implement this, a modification to all the parsers would still be
>>>> needed, but the benefits far outweigh the amount of work needed to pull
>>>> this off.
>>>>
>>>> Since this would be a major fundamental change to Apertium, I request
>>>> you all to contribute your views, any pros, cons, suggestions - to the
>>>> idea, to the syntax, anything.
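
To make the proposed syntax concrete, here is a rough Python sketch of how a parser might separate primary tags from secondary ones using the ":" rule described above. This is only an illustration under some assumptions: the names Reading, parse_reading and parse_lexical_unit are made up for this example and are not part of any Apertium tool, and the sketch ignores the escaping rules of the real stream format.

```python
import re
from dataclasses import dataclass, field

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]*)>")


@dataclass
class Reading:
    """Hypothetical container for one side of a lexical unit."""
    lemma: str
    primary: list = field(default_factory=list)     # tags without ":"
    secondary: dict = field(default_factory=dict)   # prefix -> value


def parse_reading(text):
    # Everything before the first "<" is taken to be the lemma.
    lemma, _, _ = text.partition("<")
    reading = Reading(lemma=lemma)
    for tag in TAG_RE.findall(text):
        if ":" in tag:
            # Secondary tag of the form <prefix:value>.
            prefix, _, value = tag.partition(":")
            reading.secondary[prefix] = value
        else:
            reading.primary.append(tag)
    return reading


def parse_lexical_unit(unit):
    # Assumption: no escaped "^", "$" or "/" inside the unit.
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("lexical unit must be delimited by ^ and $")
    return [parse_reading(part) for part in unit[1:-1].split("/")]


example = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
           "/patata<n><f><pl><more:other>$")
readings = parse_lexical_unit(example)

for r in readings:
    print(r.lemma, r.primary, r.secondary)
```

A program that does not care about secondary information would simply read the primary list and skip the secondary dictionary, which is the "ignore what you don't need" behaviour discussed in this thread.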
>>>>
>>>> Thanks and Regards,
>>>> Tanmai Khanna

--
*Khanna, Tanmai*
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff