Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Mikel L. Forcada Sun, 29 Mar 2020 07:46:01 -0700

Folks:

A quick round of comments after the responses by Tino, Xavi, Tanmai, and Fran. Did I miss anyone?

(00) I cannot claim to have thoroughly considered all of the details of the proposal. Therefore, I can change my mind.

(0) No change should be made without proper regression testing. I think we all agree on that!

(1) I still believe that the functionality should be proven without rewriting the (critical) format-parsing portions of Apertium's modules, in particular the finite-state lexical-processing modules written in C++. I'd like to hear (for instance) from Sergio first. There is always the possibility of having a complete series of modules that allow for "extended" formats, one per current module, so that each pair works with what it needs.

(2) Extending *is* changing. The payload associated with each lexical unit would be heavier and would be running down the pipe, where it is necessary and where it isn't (but see (4)). This does not come without a computational burden. It is just a matter of comparing it with the burden of working around the current modules. I would not underestimate the speed of Unix utilities, pipes, etc. This is code that has been around for 40 years. [1]

(3) I was not suggesting external forking. Branches inside Apertium are fine.

(4) I like Tino's idea of a reduced reference payload and a common storage which is referred to.

(99) What's the driving force behind avoiding trimming? I confess I don't quite get it.
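
To make point (4) concrete, here is a minimal sketch of a reduced reference payload, written in Python for brevity. Everything in it is an assumption for illustration: the `<ref:N>` tag, the `SideStore` class and the helper names are hypothetical and are not part of Apertium or of Tino's actual design. Each lexical unit carries only a small reference, and the heavier secondary information sits in a common storage that modules consult only if they need it.

```python
import re

# Hypothetical reference tag; not part of the current Apertium stream format.
REF_RE = re.compile(r"<ref:(\d+)>")


class SideStore:
    """Hypothetical common storage keyed by a small integer reference."""

    def __init__(self):
        self._records = {}
        self._next_ref = 1

    def put(self, info):
        """Store a copy of the secondary information and return its reference."""
        ref = self._next_ref
        self._next_ref += 1
        self._records[ref] = dict(info)
        return ref

    def get(self, ref):
        """Return the stored information, or an empty dict for unknown refs."""
        return self._records.get(ref, {})


def attach_ref(unit, ref):
    """Append a <ref:N> tag just before the closing '$' of a lexical unit."""
    if not unit.endswith("$"):
        raise ValueError("not a lexical unit: %r" % unit)
    return "%s<ref:%d>$" % (unit[:-1], ref)


def lookup(unit, store):
    """Fetch the secondary information for a unit; empty if it has no ref."""
    match = REF_RE.search(unit)
    return store.get(int(match.group(1))) if match else {}


store = SideStore()
ref = store.put({"sf": "potatoes", "case": "aa"})
unit = attach_ref("^potato<n><pl>$", ref)
print(unit)
print(lookup(unit, store))
```

In a real pipeline the modules are separate processes, so the storage would have to be a file, a named pipe or something similar rather than an in-memory dictionary; the sketch only shows how small the in-stream payload can stay.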


Cheers

Mikel

[1] If it weren't for some stubbornness on my part (ask Sergio), Apertium (previously interNOSTRUM) would not be a Unix pipeline, but some monolithic object-oriented thing, or whatever was fashionable then. Yes, I was a grumpy, stubborn boss back then. Now I am not the boss anymore; still grumpy sometimes, and fortunately 3.5 years from voluntary retirement. Which, hey, I have decided to take.


On 29/3/20 at 13:31, Tanmai Khanna wrote:

Mikel,

This is a preliminary idea and a suggestion that we discussed only yesterday, but I assure you that it will be justified in an uncontestable way before even one line of code is written. Ensuring backwards compatibility is of utmost importance, and because of this, in the proposal to modify the stream, the current stream format is called primary, whereas anything else you can add would be secondary and optional. It will also be presented at a level of formality comparable to that used in the documentation.

And of course, as Xavi mentioned, it won't be merged until backwards compatibility is ensured.

About the idea of a DAG and multiple streams: before we discuss it, I just want to understand whether your main concern is that we should leave the main stream untouched, or that information shouldn't have to go through programs that don't need it.

With all your comments and suggestions, I will make a robust proposal and it will be formalised before it is implemented.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada <m...@dlsi.ua.es> wrote:


    Folks:

    The elders in Apertium will not be surprised if I voice my
    opposition to changing the formats used between different
    modules of the pipeline. In any case, this affects the core
    functionality of Apertium in many ways, and its need should be
    justified in an uncontestable way so that the PMC can make a
    decision to have a new version of Apertium, which should
    inevitably have paths to backward compatibility so that legacy
    languages and language pairs work identically and without any loss
    of performance. I believe we are far from "uncontestability", but
    that is just my personal opinion.

    Currently, modes are linear pipelines. Any functionality requiring
    information that is currently ingested by one module and not
    passed ahead could be sent to later modules by teeing
    (https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc.
    We would have a directed acyclic graph, much as in tools like
    make, snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.
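
A toy sketch of the teeing idea, in Python and entirely in-process so it stays self-contained. The module functions and the sample data are hypothetical stand-ins, not real Apertium programs: one branch of the stream goes through a module that discards the surface form, the other branch carries the surface forms around it, and a later module joins the two.

```python
from itertools import tee


def lossy_module(units):
    """Stand-in for a module that ingests the surface form and drops it."""
    for _surface, analysis in units:
        yield analysis


def later_module(analyses, surfaces):
    """Stand-in for a later module that needs the surface form back."""
    for analysis, surface in zip(analyses, surfaces):
        yield "%s{%s}" % (analysis, surface)


def run_dag(units):
    """Tee the stream: one branch through the lossy module, one around it."""
    main, side = tee(units)
    analyses = lossy_module(main)
    surfaces = (surface for surface, _analysis in side)
    return list(later_module(analyses, surfaces))


# Hypothetical sample data: (surface form, analysis) pairs.
sample = [("potatoes", "potato<n><pl>"), ("dogs", "dog<n><pl>")]
print(run_dag(sample))
```

The join assumes the two branches stay aligned one to one. Any module that merges, splits or reorders lexical units would break that assumption, so a real workaround would need some explicit alignment between the branches.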

    Any justification should first prove that the functionality is
    robust and needed by working around the current format and
    modules, and be presented at a level of formality which is
    comparable to that used currently in our documentation.

    Having said that, no one can oppose people forking and testing.
    If the new thing works, Apertium could bless the fork and merge it
    (depending on how the fork handles provisions for legacy Apertium
    workflows). But, as I said, this seems premature to me. But I am
    usually very conservative.

    Cheers

    Mikel


    On 29/3/20 at 11:41, Tanmai Khanna wrote:

Hey guys,
Here's a draft proposal for this project. Any comments will be
appreciated :)

Thanks,
Tanmai

On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna
<khanna.tan...@gmail.com> wrote:

Hi Hèctor,
A fundamental motivation for this proposal is the possibility
of giving the power to each program to use and propagate as
much information as it needs in the pipeline. In our
discussion on the IRC, Tino Didriksen said:

You should see how much secondary information VISL's
streams have. Noun semantics, verb frames, dependency,
markup tags, etc. Being able to carry any information
along makes many things possible, often things you can't
imagine because of current limitations.

The example in my original email was, well, just an example, but
the idea is that you can add as much information as you
want, in the language modules or even the translation modules.
It's also not just about English words, but about all
languages. This is not to say that we have to add case
information to every English word. It is optional
information which can be added if needed for the translation
task.

With this proposal we're trying to prepare the apertium
stream for the future. Today we realised that we need the
surface form in the stream, and tomorrow we might need
semantic tags, sentiment tags, etc. *If we don't do this now,
we will have to modify all the parsers in the pipeline each
time we need more information in the pipe.* This is why it's
a good idea to modify the parsers so that they can handle an
arbitrary amount of information.

Lastly, one point we should discuss is this idea about how
any secondary information I add in the monodix would be
available for everyone who uses that information. There
are several things to say about this:

* As long as the information is correct, I don't really see
why redundant secondary information should bother anyone.
It will be available for anyone who wishes to use it for
their task, and if you don't want to use it the programs
will ignore it.
* Another idea is that secondary information could be put
in a separate dix; however, this would lead to an
unnecessary increase in complexity.

Unless the developers of Apertium feel that redundant
information in the stream will be a huge problem, this will
allow each program to access a lot more information and open
up possibilities that we haven't even thought of yet. At the
very least, it will help us to eliminate trimming.

Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font
<hectora...@gmail.com> wrote:

Hi Tanmai,

I am surprised by this proposal. It involves some very
important changes that should be better justified. I
don't quite understand where one should define the
"optional secondary information" in addition to the
current morphological fields. Will it be in the language
module (apertium-xxx) or in each of the translation
modules (apertium-xxx-yyy)? Part of the problem may be in
the example. I can't imagine why information on case
should be added to every English word (any more than,
say, information about possession, which is common in
Turkic languages). Will this kind of information,
unnecessary for everybody or almost everybody, be
found in every language pair using, say, English just
because someone wants to add it for his or her specific
purposes? As far as I understand, the given project
needs to add the surface form of the word. This seems
quite logical. Moreover, this information may be useful
for e.g. lexical selection and structural transfer. But
anything more than that seems too obscure to me.

Best,
Hèctor

Message from Tanmai Khanna <khanna.tan...@gmail.com>
on Sat., 28 March 2020 at 23:51:

Hey guys,
As part of the project to eliminate trimming, I had
to come up with a way to include the surface form in
the lexical unit, and hence to modify the apertium
stream format. To do this I would have to modify the
parsers of every program in the pipeline, and if that
has to happen, we discussed on the IRC that *it might
be a good idea to modify the stream in such a way
that we can include an arbitrary amount of
information in a lexical unit, and each program can
use whatever information they need.*

The current information in the lexical unit would be
primary information, and then we would have optional
secondary information which could contain the surface
form, but also literally anything you can think of
(case, sentiment, pragmatic info, etc.). This would
open up a lot of possibilities for each program, and
it would strengthen the apertium stream format
considerably.

We discussed several possible syntaxes for this new
stream format, and the one that seems the best is
something like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

This doesn't mess with the current stream format too
much. The number of tags is already arbitrary so that
helps. The secondary tags contain a ":" that would
help distinguish them from primary tags.
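
A minimal Python sketch of how a parser could tell the two kinds of tags apart, assuming only the rule stated above: a tag containing ":" is secondary, anything else is primary. The function names are hypothetical, and the naive split on "/" ignores the escaping rules of the real stream format, so this is an illustration and not a replacement for the existing parsers.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")
# A secondary tag is assumed to be any tag whose body contains ':'.
SECONDARY_RE = re.compile(r"<[^<>:]*:[^<>]*>")


def parse_reading(reading):
    """Split one reading into lemma, primary tags and secondary key/values."""
    first_tag = reading.find("<")
    lemma = reading if first_tag == -1 else reading[:first_tag]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            key, value = tag.split(":", 1)
            secondary[key] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(unit):
    """Parse '^reading/reading$' into a list of parsed readings."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    return [parse_reading(r) for r in body[1:-1].split("/")]


def strip_secondary(unit):
    """Drop every secondary tag, leaving the current (primary) format."""
    return SECONDARY_RE.sub("", unit)


example = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
           "/patata<n><f><pl><more:other>$")
for lemma, primary, secondary in parse_lexical_unit(example):
    print(lemma, primary, secondary)
print(strip_secondary(example))
```

A filter like `strip_secondary` is one way a program that does not understand the secondary tags could still be fed the format it expects today.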

To implement this, all the parsers would still need to
be modified, but the benefits far
outweigh the amount of work needed to pull this off.

Since this would be a major fundamental change to
Apertium, I request you all to contribute with your
views, any pros, cons, suggestions - to the idea, to
the syntax, anything.

Thanks and Regards,
Tanmai Khanna

--*Khanna, Tanmai*

                _______________________________________________
                Apertium-stuff mailing list
                Apertium-stuff@lists.sourceforge.net
                <mailto:Apertium-stuff@lists.sourceforge.net>
                https://lists.sourceforge.net/lists/listinfo/apertium-stuff


-- Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/

    Departament de Llenguatges i Sistemes Informàtics
    Universitat d'Alacant
    E-03690 Sant Vicent del Raspeig
    Spain
    Office: +34 96 590 9776
