Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tanmai Khanna Sun, 29 Mar 2020 02:44:00 -0700

I apologise, it seems like the link got removed when the message was sent.
Here it is:
http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming


Thanks
Tanmai

On Sun, Mar 29, 2020 at 3:11 PM Tanmai Khanna <[email protected]>
wrote:

> Hey guys,
> Here's a draft proposal for this project. Any comments will be
> appreciated :)
>
> Thanks,
> Tanmai
>
> On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <[email protected]>
> wrote:
>
>> Hi Hèctor,
>> A fundamental motivation for this proposal is the possibility of giving
>> the power to each program to use and propagate as much information as it
>> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>>
>>> You should see how much secondary information VISL's streams have. Noun
>>> semantics, verb frames, dependency, markup tags, etc. Being able to carry
>>> any information along makes many things possible, often things you can't
>>> imagine because of current limitations.
>>>
>>
>> The example in my original message was, well, just an example, but the idea
>> is that you can add as much information as you want, in the language
>> modules or even the translation modules. It's also not just about English
>> words, but about all languages. This is not to say that we have to add case
>> information to every English word. It is optional information which can be
>> added if needed for the translation task.
>>
>> With this proposal we're trying to prepare the apertium stream for the
>> future. Today we realised that we need the surface form in the stream, and
>> tomorrow we might need semantic tags, sentiment tags, etc. *If we don't
>> do this now, we will have to modify all the parsers in the pipeline each
>> time we need more information in the pipe.* This is why it's a good idea
>> to modify the parsers so that they can handle an arbitrary amount of
>> information.
>>
>> Lastly, one point we should discuss is this idea about how any secondary
>> information I add in the monodix would be available for everyone who uses
>> that information. There are several things to say about this:
>>
>>    - As long as the information is correct, I don't really see why
>>    redundant secondary information should bother anyone. It will be available
>>    for anyone who wishes to use it for their task, and if you don't want to
>>    use it the programs will ignore it.
>>    - Another idea is that secondary information could be put in a
>>    separate dix; however, this would lead to an unnecessary increase in
>>    complexity.
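
A minimal sketch of the "programs will ignore it" idea from the first point, assuming the `<prefix:value>` syntax proposed in the original message quoted further down. The function names `strip_secondary` and `get_secondary` are hypothetical and not part of any Apertium tool, and the sketch does not handle the backslash escapes of the real stream format.

```python
import re

# Assumption: a secondary tag is any <...> tag whose content contains ":",
# as in the syntax proposed in this thread.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]*:[^<>]*>")


def strip_secondary(stream):
    """Drop every <key:value> tag and leave primary tags untouched."""
    return SECONDARY_TAG_RE.sub("", stream)


def get_secondary(stream, key):
    """Return the values of all <key:...> tags, for a program that wants them."""
    return re.findall(r"<" + re.escape(key) + r":([^<>]*)>", stream)


lu = "^potato<n><pl><case:aa><sf:potatoes>/patata<n><f><pl><more:other>$"
print(strip_secondary(lu))
print(get_secondary(lu, "sf"))
```

A program that does not care about secondary information could apply something like `strip_secondary` on input and see the stream it already understands, while a program that needs the surface form would ask for the `sf` key only.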
>>
>> Unless the developers of Apertium feel that redundant information in
>> the stream will be a huge problem, this will allow each program to access a
>> lot more information and open up possibilities that we haven't even thought
>> of yet. At the very least, it will help us to eliminate trimming.
>>
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <[email protected]>
>> wrote:
>>
>>> Hi Tanmai,
>>>
>>> I am surprised by this proposal. It involves some very important changes
>>> that should be better justified. I don't quite understand when one should
>>> define the "optional secondary information" in addition to the current
>>> morphological fields. Will it be in the language module (apertium-xxx) or
>>> in each of the translation modules (apertium-xxx-yyy)? Part of the problem
>>> may be in the example. I can't imagine why information on case should be
>>> added to every English word (any more than, say, information about
>>> belonging, which is common for Turkic languages). Will this kind of
>>> information, unnecessary for everybody or almost everybody, be found
>>> in every language pair using, say, English, if someone wants to add it
>>> for his or her specific purposes? As far as I understand, the given
>>> project needs the surface form of the word to be added. This seems
>>> quite logical. Moreover, this information may be useful for e.g. lexical
>>> selection and structural transfer. But more than that seems to me too
>>> obscure.
>>>
>>> Best,
>>> Hèctor
>>>
>>> On Sat, Mar 28, 2020 at 11:51 PM Tanmai Khanna <[email protected]>
>>> wrote:
>>>
>>>> Hey guys,
>>>> As part of the project to eliminate trimming, I had to come up with a
>>>> way to include the surface form in the lexical unit and hence modify the
>>>> apertium stream format. To do this I would have to modify the parsers of
>>>> every program in the pipeline, and if that has to happen, we discussed on
>>>> the IRC that *it might be a good idea to modify the stream in such a
>>>> way that we can include an arbitrary amount of information in a lexical
>>>> unit, and each program can use whatever information it needs.*
>>>>
>>>> The current information in the lexical unit would be primary
>>>> information, and then we would have optional secondary information which
>>>> could contain the surface form, but also literally anything you can think
>>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>>> possibilities for each program, and it would strengthen the apertium stream
>>>> format considerably.
>>>>
>>>> We discussed several possible syntaxes for this new stream format, and
>>>> the one that seems the best is something like this:
>>>>
>>>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>>>
>>>> This doesn't mess with the current stream format too much. The number
>>>> of tags is already arbitrary so that helps. The secondary tags contain a
>>>> ":" that would help distinguish them from primary tags.
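
As an illustration of the syntax above, here is a minimal sketch of how a parser might separate primary tags from secondary `<prefix:value>` tags. The helpers `parse_reading` and `parse_lexical_unit` are hypothetical names, not existing Apertium functions. The sketch assumes that a ":" inside a tag always marks secondary information, that values contain no "/" or angle brackets, and it ignores the backslash escapes of the real stream format.

```python
import re

# Hypothetical helpers, not part of any Apertium tool.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading such as 'potato<n><pl><sf:potatoes>' into
    (lemma, primary_tags, secondary_dict)."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Secondary tag: everything before the first ":" is the key.
            key, value = tag.split(":", 1)
            secondary[key] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(lu):
    """Parse '^source/target$' into a list of parsed readings."""
    body = lu.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    return [parse_reading(r) for r in body[1:-1].split("/")]


example = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
           "/patata<n><f><pl><more:other>$")
for lemma, primary, secondary in parse_lexical_unit(example):
    print(lemma, primary, secondary)
```

Because secondary tags end up in a dictionary, a program can look up only the keys it knows about (for instance `sf`) and pass the rest along untouched.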
>>>>
>>>> Implementing this would still require modifying all the parsers, but
>>>> the benefits far outweigh the amount of work needed to pull this off.
>>>>
>>>> Since this would be a major fundamental change to Apertium, I request
>>>> you all to contribute your views: any pros, cons, or suggestions about
>>>> the idea, the syntax, anything.
>>>>
>>>> Thanks and Regards,
>>>> Tanmai Khanna
>>>>
>>>> --
>>>> *Khanna, Tanmai*
>>>> _______________________________________________
>>>> Apertium-stuff mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>
>>>
>>
>>
>> --
>> *Khanna, Tanmai*
>>
>
>
> --
> *Khanna, Tanmai*
>


-- 
*Khanna, Tanmai*
