Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tanmai Khanna Sun, 29 Mar 2020 11:30:09 -0700

Instead of looking at this as modifying or extending the apertium stream
format, we could look at it as making tags more versatile by creating a
new kind of tag that carries a feature:value pair. That's all there is to
it, really. In effect, it allows us to pass an arbitrary amount of info in
the pipe.
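
As a rough illustration only (not an agreed syntax and not existing Apertium
code), here is a minimal Python sketch of how a consumer could separate
ordinary tags from feature:value tags, using the example lexical unit quoted
in this thread. The names Reading, parse_lexical_unit and strip_secondary are
hypothetical, and the sketch assumes no backslash-escaped "/", "<", ">" or ":"
appears inside a lemma or value.

```python
import re
from dataclasses import dataclass, field

# Matches one <...> tag and captures its contents.
TAG_RE = re.compile(r"<([^<>]+)>")
# Matches only tags that contain a ":" (the proposed secondary tags).
SECONDARY_TAG_RE = re.compile(r"<[^<>:]*:[^<>]*>")


@dataclass
class Reading:
    """One side of a lexical unit (hypothetical container)."""
    lemma: str
    primary: list = field(default_factory=list)     # e.g. ["n", "pl"]
    secondary: dict = field(default_factory=dict)   # e.g. {"sf": "potatoes"}


def parse_reading(text):
    """Split 'lemma<tag>...' into lemma, primary tags and feature:value tags."""
    first = text.find("<")
    lemma = text if first == -1 else text[:first]
    reading = Reading(lemma)
    for tag in TAG_RE.findall(text):
        if ":" in tag:
            feature, _, value = tag.partition(":")
            reading.secondary[feature] = value
        else:
            reading.primary.append(tag)
    return reading


def parse_lexical_unit(unit):
    """Parse '^a<..>/b<..>$' into a list of Reading objects.

    Assumption: escaped slashes are not handled.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    return [parse_reading(part) for part in unit[1:-1].split("/")]


def strip_secondary(unit):
    """Drop every feature:value tag, leaving only the current (primary) format."""
    return SECONDARY_TAG_RE.sub("", unit)


if __name__ == "__main__":
    unit = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
            "/patata<n><f><pl><more:other>$")
    for reading in parse_lexical_unit(unit):
        print(reading.lemma, reading.primary, reading.secondary)
    print(strip_secondary(unit))  # ^potato<n><pl>/patata<n><f><pl>$
```

A program that does not care about secondary tags could apply something like
strip_secondary and then treat the unit exactly as it does today, which is one
way to picture the "primary stays unchanged, secondary is optional" idea.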


Tanmai

On Sun, Mar 29, 2020 at 8:53 PM Tanmai Khanna <khanna.tan...@gmail.com>
wrote:

> Hi Mikel,
>
>> (0) No change should be made without proper regression testing. I think
>> we all agree on that!
>>
> Definitely, and this is something I'll add in the proposal.
>
>> (1) I still believe that the functionality should be proven without
>> rewriting the (critical) format parsing portions in Apertium's modules. In
>> particular, finite-state lexical-processing modules written in C++. I'd
>> like to hear (for instance) from Sergio first.  There is always the
>> possibility of having a complete series of modules that allow for
>> "extended" formats, one per current module, so that each pair works with
>> what it needs.
>>
> As mentioned in the last mail, the functionality will definitely be proven
> before even one line of code is written.
>
>> (2) Extending *is* changing. The payload associated to each lexical unit
>> would be heavier and will be running down the pipe, where it is necessary
>> and where it isn't (but see (4)). This does not come without a
>> computational burden. It is just a matter of comparing it with the burden
>> of working around the current modules. I would not underestimate the
>> speed of Unix utilities, pipes, etc. This is code that has been around for
>> 40 years.
>>
> I have regularly seen programs work with larger streams, but if
> computational burden is the main issue then yes, Tino's idea of a reduced
> reference payload works. Either way, I'd see that as more of an optimisation
> than a different idea.
>
>> (99) What's the driving force behind avoiding trimming? I confess I don't
>> quite get it.
>>
> I explain this in
> http://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020Proposal_Trimming#Which_of_the_published_tasks_am_I_interested_in.3F_What_do_I_plan_to_do.3F
> , but the gist is that if you trim away source words and their analyses
> because they aren't present in the bidix, you lose out on better lexical
> selection, and transfer rules that would otherwise have matched no longer
> match because the word is unknown. If we can avoid trimming, we can
> improve disambiguation and transfer. Monodixes can often be improved
> individually using several sources and methods, but none of that would
> matter if we keep trimming all words that aren't in the bidix as well. Just
> because we cannot translate a word shouldn't mean we can't use it for our
> benefit.
>
> On a more fundamental level, this was also about not trimming away
> information in general, and propagating important information in the pipe
> so that each program can use it as it'd like. This would need us to have
> the surface form in the pipe, and it felt like a more general and future
> ready solution would be to allow arbitrary information in the pipe. For
> this project, we can limit this to surface form so that none of the files
> are modified, and every program works as it did before. But it gives the
> developers a powerful option for propagating several features, as Tino
> listed already.
>
> Hope this made the rationale behind avoiding trimming and modifying the stream
> clear. It would be great to hear from Sergio, as these discussions are how
> we're going to make this project a success for Apertium.
>
> Thanks and Regards,
> Tanmai Khanna
>
>
> On 29/3/20 at 13:31, Tanmai Khanna wrote:
>>
>> Mikel,
>> This is a preliminary idea and a suggestion that we discussed only
>> yesterday, but I assure you that it will be justified in an uncontestable
>> way before even one line of code is written. Ensuring backwards
>> compatibility is of utmost importance, and because of this, in the proposal
>> to modify the stream, the current stream format is called primary, whereas
>> anything else you can add would be secondary and optional. It will also be
>> presented in a level of formality comparable to that used in the
>> documentation.
>>
>> And of course, as Xavi mentioned, it won't be merged until backwards
>> compatibility is ensured.
>>
>> About the idea of a DAG and multiple streams: before we discuss it, I just
>> want to understand whether your main concern is that we should leave
>> the main stream untouched, or that information shouldn't have to go
>> through programs that don't need it.
>>
>> With all your comments and suggestions, I will make a robust proposal and
>> it will be formalised before it is implemented.
>> Thanks and Regards,
>> Tanmai Khanna
>>
>> On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada <m...@dlsi.ua.es> wrote:
>>
>>> Folks:
>>>
>>> The elders in Apertium will not be surprised if I voice my opposition
>>> to changing the formats used between different
>>> modules of the pipeline. In any case, this affects the core
>>> functionality of Apertium in many ways and its need should be justified in
>>> an uncontestable way so that the PMC makes a decision to have a new version
>>> of Apertium which should inevitably have paths to backward compatibility so
>>> that legacy languages and language pairs work identically and without any
>>> loss of performance. I believe we are far from "uncontestability", but that
>>> is just my personal opinion.
>>>
>>> Currently, modes are linear pipelines. Any functionality requiring
>>> information that is currently ingested by one module and not passed ahead
>>> could be sent to later modules by teeing (
>>> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We
>>> would have a directed acyclic graph, much as in tools like make, snakemake,
>>> dgsh (https://github.com/dspinellis/dgsh/wiki) etc.
>>>
>>> Any justification should first prove that the functionality is robust
>>> and needed by working around the current format and modules, and be
>>> presented in a level of formality which is comparable to that used
>>> currently in our documentation.
>>>
>>> Having said that, no one can oppose people forking and testing. If
>>> the new thing works, Apertium could bless the fork and merge it (depending
>>> on how the fork handles provisions for legacy Apertium workflows). But, as
>>> I said, this seems premature to me. But I am usually very conservative.
>>>
>>> Cheers
>>>
>>> Mikel
>>>
>>>
>>> On 29/3/20 at 11:41, Tanmai Khanna wrote:
>>>
>>> Hey guys,
>>> Here's a draft proposal for this project. Any comments will be
>>> appreciated :)
>>>
>>> Thanks,
>>> Tanmai
>>>
>>> On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna <khanna.tan...@gmail.com>
>>> wrote:
>>>
>>>> Hi Hèctor,
>>>> A fundamental motivation for this proposal is the possibility of giving
>>>> the power to each program to use and propagate as much information as it
>>>> needs in the pipeline. In our discussion on the IRC, Tino Didriksen said:
>>>>
>>>>> You should see how much secondary information VISL's streams have.
>>>>> Noun semantics, verb frames, dependency, markup tags, etc. Being able to
>>>>> carry any information along makes many things possible, often things you
>>>>> can't imagine because of current limitations.
>>>>>
>>>>
>>>> The example in my original mail was, well, just an example, but the idea is
>>>> that you can add as much information as you want, in the language
>>>> models or even the translation modules. It's also not just about English
>>>> words, but about all languages. This is not to say that we have to add case
>>>> information to every English word. It is optional information which can be
>>>> added if needed for the translation task.
>>>>
>>>> With this proposal we're trying to prepare the apertium stream for the
>>>> future. Today we realised that we need the surface form in the stream, and
>>>> tomorrow we might need semantic tags, sentiment tags, etc. *If we
>>>> don't do this now, we will have to modify all the parsers in the pipeline
>>>> each time we need more information in the pipe.* This is why it's a
>>>> good idea to modify the parsers so that they can handle an arbitrary amount
>>>> of information.
>>>>
>>>> Lastly, one point we should discuss is this idea about how any
>>>> secondary information I add in the monodix would be available for everyone
>>>> who uses that information. There are several things to say about this:
>>>>
>>>>    - As long as the information is correct, I don't really see why
>>>>    redundant secondary information should bother anyone. It will be
>>>>    available for anyone who wishes to use it for their task, and if you
>>>>    don't want to use it the programs will ignore it.
>>>>    - Another idea is that secondary information could be put in a
>>>>    separate dix, however this would lead to an unnecessary increase in
>>>>    complexity.
>>>>
>>>> Unless the developers of Apertium feel that redundant information in
>>>> the stream will be a huge problem, this will allow each program to access a
>>>> lot more information and open up possibilities that we haven't even thought
>>>> of yet. At the very least, it will help us to eliminate trimming.
>>>>
>>>> Thanks and Regards,
>>>> Tanmai Khanna
>>>>
>>>> On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font <
>>>> hectora...@gmail.com> wrote:
>>>>
>>>>> Hi Tanmai,
>>>>>
>>>>> I am surprised by this proposal. It involves some very important
>>>>> changes that should be better justified. I don't quite understand when
>>>>> one should define the "optional secondary information" in addition to the
>>>>> current morphological fields. Will it be in the language module
>>>>> (apertium-xxx) or in each of the translation modules (apertium-xxx-yyy)?
>>>>> Part of the problem may be in the example. I can't imagine why information
>>>>> on case should be added to every English word (any more than, say,
>>>>> information about belonging, which is common in Turkic languages). Will
>>>>> this kind of information, unnecessary for everybody or almost everybody,
>>>>> be found in every language pair using, say, English, if someone wants to
>>>>> add it for his or her specific purposes? As far as I understand, the
>>>>> given project needs to add the surface form of the word. This
>>>>> seems quite logical. Moreover, this information may be useful for e.g.
>>>>> lexical selection and structural transfer. But anything more than that
>>>>> seems too obscure to me.
>>>>>
>>>>> Best,
>>>>> Hèctor
>>>>>
>>>>> Message from Tanmai Khanna <khanna.tan...@gmail.com> on Sat., 28
>>>>> March 2020 at 23:51:
>>>>>
>>>>>> Hey guys,
>>>>>> As part of the project to eliminate trimming, I had to come up with a
>>>>>> way to include the surface form in the lexical unit, and hence to modify
>>>>>> the apertium stream format. To do this I would have to modify the parsers of
>>>>>> every program in the pipeline, and if that has to happen, we discussed on
>>>>>> the IRC that *it might be a good idea to modify the stream in such a
>>>>>> way that we can include an arbitrary amount of information in a lexical
>>>>>> unit, and each program can use whatever information it needs.*
>>>>>>
>>>>>> The current information in the lexical unit would be primary
>>>>>> information, and then we would have optional secondary information which
>>>>>> could contain the surface form, but also literally anything you can think
>>>>>> of (case, sentiment, pragmatic info, etc.). This would open up a lot of
>>>>>> possibilities for each program, and it would strengthen the apertium
>>>>>> stream format considerably.
>>>>>>
>>>>>> We discussed several possible syntaxes for this new stream format, and
>>>>>> the one that seems the best is something like this:
>>>>>>
>>>>>> ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
>>>>>>
>>>>>> This doesn't mess with the current stream format too much. The number
>>>>>> of tags is already arbitrary so that helps. The secondary tags contain a
>>>>>> ":" that would help distinguish them from primary tags.
>>>>>>
>>>>>> Implementing this would still require modifying all the
>>>>>> parsers, but the benefits far outweigh the amount of work needed to pull
>>>>>> this off.
>>>>>>
>>>>>> Since this would be a major fundamental change to Apertium, I request
>>>>>> you all to contribute your views, any pros, cons, suggestions - to
>>>>>> the idea, to the syntax, anything.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Tanmai Khanna
>>>>>>
>>>>>> --
>>>>>> *Khanna, Tanmai*
>>>>>> _______________________________________________
>>>>>> Apertium-stuff mailing list
>>>>>> Apertium-stuff@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Khanna, Tanmai*
>>>>
>>>
>>>
>>> --
>>> *Khanna, Tanmai*
>>>
>>>
>>>
>>> --
>>> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
>>> Departament de Llenguatges i Sistemes Informàtics
>>> Universitat d'Alacant
>>> E-03690 Sant Vicent del Raspeig
>>> Spain
>>> Office: +34 96 590 9776
>>>
>>>
>>
>>
>> --
>> *Khanna, Tanmai*
>>
>>
>>
>
>
> --
> *Khanna, Tanmai*
>


-- 
*Khanna, Tanmai*

