Folks:

A quick round of comments after the responses by Tino, Xavi, Tanmai, and Fran. Did I miss anyone?

(00) I cannot claim to have thoroughly considered all of the details of the proposal. Therefore, I can change my mind.

(0) No change should be made without proper regression testing. I think we all agree on that!

(1) I still believe that the functionality should be proven without rewriting the (critical) format-parsing portions of Apertium's modules, in particular the finite-state lexical-processing modules written in C++. I'd like to hear (for instance) from Sergio first. There is always the possibility of having a complete series of modules that allow for "extended" formats, one per current module, so that each pair works with what it needs.

(2) Extending *is* changing. The payload associated with each lexical unit would be heavier and would run down the whole pipe, both where it is needed and where it isn't (but see (4)). This does not come without a computational burden. It is just a matter of comparing it with the burden of working around the current modules. I would not underestimate the speed of Unix utilities, pipes, etc. This is code that has been around for 40 years. [1]

(3) I was not suggesting external forking. Branches inside Apertium are fine.

(4) I like Tino's idea of a reduced reference payload and a common storage which is referred to.

(99) What's the driving force behind avoiding trimming? I confess I don't quite get it.
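
As a rough illustration of point (4), here is a minimal Python sketch of a reduced reference payload: the heavy secondary information lives in a common store, and only a short reference travels in the stream. Everything named here (the `PayloadStore` class, the `<ref:N>` tag, the helper functions) is a hypothetical stand-in for illustration and not part of Apertium or of Tino's actual suggestion.

```python
import re


class PayloadStore:
    """Hypothetical common storage: reference number -> secondary information."""

    def __init__(self):
        self._items = {}

    def put(self, payload):
        """Store a payload dict and return its reference number."""
        ref = len(self._items) + 1
        self._items[ref] = dict(payload)
        return ref

    def get(self, ref):
        return self._items[ref]


# Assumed reference tag syntax; NOT part of the current stream format.
REF_TAG = re.compile(r"<ref:(\d+)>")


def attach_ref(lexical_unit, ref):
    """Append <ref:N> just before the closing '$' of a lexical unit."""
    assert lexical_unit.endswith("$")
    return f"{lexical_unit[:-1]}<ref:{ref}>$"


def lookup(lexical_unit, store):
    """Return the stored payload for a unit, or None if it carries no reference."""
    match = REF_TAG.search(lexical_unit)
    return store.get(int(match.group(1))) if match else None


store = PayloadStore()
ref = store.put({"sf": "potatoes", "case": "aa"})
unit = attach_ref("^potato<n><pl>$", ref)
print(unit)                 # ^potato<n><pl><ref:1>$
print(lookup(unit, store))  # {'sf': 'potatoes', 'case': 'aa'}
```

In this sketch, modules that do not need the extra information only pay for one short tag per lexical unit; a real design would also have to settle how the store is shared between the processes of a pipeline.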

Cheers

Mikel

[1] If it weren't for some stubbornness on my part (ask Sergio), Apertium (previously interNOSTRUM) would not be a Unix pipeline, but some monolithic object-oriented thing, or whatever was fashionable then. Yes, I was a grumpy, stubborn boss back then. Now I am not the boss anymore; I am still grumpy sometimes, but fortunately 3.5 years from voluntary retirement. Which, hey, I have decided to take.

On 29/3/20 at 13:31, Tanmai Khanna wrote:
Mikel,
This is a preliminary idea and a suggestion that we discussed only yesterday, but I assure you that it will be justified in an uncontestable way before even one line of code is written. Ensuring backwards compatibility is of utmost importance, and because of this, in the proposal to modify the stream, the current stream format is called primary, whereas anything else you can add would be secondary and optional. It will also be presented at a level of formality comparable to that used in the documentation.

And of course, as Xavi mentioned, it won't be merged until backwards compatibility is ensured.

About the idea of a DAG and multiple streams: before we discuss it, I just want to understand whether your main concern is that we should leave the main stream untouched, or that information shouldn't have to go through programs that don't need it.

With all your comments and suggestions, I will make a robust proposal and it will be formalised before it is implemented.
Thanks and Regards,
Tanmai Khanna

On Sun, Mar 29, 2020 at 3:52 PM Mikel L. Forcada <m...@dlsi.ua.es> wrote:

    Folks:

    The elders in Apertium will not be surprised if I voice my
    opposition to changing the formats used between the different
    modules of the Apertium pipeline. In any case, this affects the
    core functionality of Apertium in many ways, and its need should
    be justified in an uncontestable way so that the PMC can decide
    to have a new version of Apertium, which should inevitably have
    paths to backward compatibility so that legacy languages and
    language pairs work identically and without any loss of
    performance. I believe we are far from "uncontestability", but
    that is just my personal opinion.

    Currently, modes are linear pipelines. Any functionality requiring
    information that is currently ingested by one module and not
    passed ahead could be sent to later modules by teeing
    (https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc.
    We would have a directed acyclic graph, much as in tools like
    make, snakemake, dgsh (https://github.com/dspinellis/dgsh/wiki) etc.
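
To make the teeing idea concrete, here is a small in-process Python analogue, offered only as a sketch: `itertools.tee` stands in for the Unix `tee` command, and the three stage functions are hypothetical toy stand-ins for pipeline modules, not real Apertium programs. The toy lexicon and the `^*word$` marker for unknown words are likewise invented for the example.

```python
from itertools import tee


def analyse(tokens):
    """Toy module that consumes surface forms and emits only analyses."""
    lexicon = {"potatoes": "^potato<n><pl>$"}  # hypothetical data
    for token in tokens:
        yield lexicon.get(token, f"^*{token}$")


def transfer(units):
    """Toy intermediate module: it never sees the surface forms."""
    for unit in units:
        yield unit  # passes analyses through unchanged


def late_stage(units, surfaces):
    """Later module that rejoins the main stream with the side channel."""
    for unit, surface in zip(units, surfaces):
        yield (unit, surface)


tokens = ["potatoes", "xyzzy"]
main_branch, side_branch = tee(tokens)  # the DAG forks here
result = list(late_stage(transfer(analyse(main_branch)), side_branch))
print(result)
# [('^potato<n><pl>$', 'potatoes'), ('^*xyzzy$', 'xyzzy')]
```

The sketch assumes a one-to-one, order-preserving alignment between the two branches. In a real pipeline, where modules may merge, split, or reorder lexical units, keeping the branches aligned is the part that would need careful design.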

    Any justification should first prove, by working around the
    current format and modules, that the functionality is robust and
    needed, and be presented at a level of formality comparable to
    that currently used in our documentation.

    Having said that, no one can oppose people forking and testing.
    If the new thing works, Apertium could bless the fork and merge it
    (depending on how the fork handles provisions for legacy Apertium
    workflows). But, as I said, this seems premature to me. But I am
    usually very conservative.

    Cheers

    Mikel


    On 29/3/20 at 11:41, Tanmai Khanna wrote:
    Hey guys,
    Here's a draft proposal for this project. Any comments will be
    appreciated :)

    Thanks,
    Tanmai

    On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna
    <khanna.tan...@gmail.com> wrote:

        Hi Hèctor,
        A fundamental motivation for this proposal is the possibility
        of giving the power to each program to use and propagate as
        much information as it needs in the pipeline. In our
        discussion on the IRC, Tino Didriksen said:

            You should see how much secondary information VISL's
            streams have. Noun semantics, verb frames, dependency,
            markup tags, etc. Being able to carry any information
            along makes many things possible, often things you can't
            imagine because of current limitations.


        The example in my original message was, well, just an
        example, but the idea is that you can add as much
        information as you want, in the language modules or even
        the translation modules. It's also not just about English
        words, but about all languages. This is not to say that we
        have to add case information to every English word. It is
        optional information which can be added if needed for the
        translation task.

        With this proposal we're trying to prepare the apertium
        stream for the future. Today we realised that we need the
        surface form in the stream, and tomorrow we might need
        semantic tags, sentiment tags, etc. *If we don't do this now,
        we will have to modify all the parsers in the pipeline each
        time we need more information in the pipe.* This is why it's
        a good idea to modify the parsers so that they can handle an
        arbitrary amount of information.

        Lastly, one point we should discuss is the idea that any
        secondary information I add in the monodix would be
        available to everyone who uses that monodix. There are
        several things to say about this:

          * As long as the information is correct, I don't really see
            why redundant secondary information should bother anyone.
            It will be available for anyone who wishes to use it for
            their task, and if you don't want to use it the programs
            will ignore it.
          * Another idea is that secondary information could be put
            in a separate dix, however this would lead to an
            unnecessary increase in complexity.

        Unless the developers of Apertium feel that redundant
        information in the stream will be a huge problem, this will
        allow each program to access a lot more information and open
        up possibilities that we haven't even thought of yet. At the
        very least, it will help us to eliminate trimming.

        Thanks and Regards,
        Tanmai Khanna

        On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font
        <hectora...@gmail.com> wrote:

            Hi Tanmai,

            I am surprised by this proposal. It involves some very
            important changes that should be better justified. I
            don't quite understand when one should define the
            "optional secondary information" in addition to the
            current morphological fields. Will it be in the language
            module (apertium-xxx) or in each of the translation
            modules (apertium-xxx-yyy)? Part of the problem may be in
            the example. I can't imagine why information on case
            should be added to every English word (any more than,
            say, information about belonging, which is common in
            Turkic languages). Will this kind of information,
            unnecessary for everybody or almost everybody, be found
            in every language pair using, say, English, if someone
            wants to add it for his or her specific purposes? As far
            as I understand, the given project needs the surface
            form of the word to be added. This seems quite logical.
            Moreover, this information may be useful for e.g. lexical
            selection and structural transfer. But anything more than
            that seems too obscure to me.

            Best,
            Hèctor

            Message from Tanmai Khanna <khanna.tan...@gmail.com>
            on Sat., 28 March 2020 at 23:51:

                Hey guys,
                As part of the project to eliminate trimming, I had
                to come up with a way to include the surface form in
                the lexical unit, and hence to modify the apertium
                stream format. To do this I would have to modify the
                parsers of every program in the pipeline, and if that
                has to happen, we discussed on the IRC that *it might
                be a good idea to modify the stream in such a way
                that we can include an arbitrary amount of
                information in a lexical unit, and each program can
                use whatever information it needs.*

                The current information in the lexical unit would be
                primary information, and then we would have optional
                secondary information which could contain the surface
                form, but also literally anything you can think of
                (case, sentiment, pragmatic info, etc.). This would
                open up a lot of possibilities for each program, and
                it would strengthen the apertium stream format
                considerably.

                We discussed several possible syntaxes for this new
                stream format, and the one that seems the best is
                something like this:

                ^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

                This doesn't mess with the current stream format too
                much. The number of tags is already arbitrary so that
                helps. The secondary tags contain a ":" that would
                help distinguish them from primary tags.
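
To show how the ":" convention could be used, here is a minimal Python sketch that splits the example lexical unit above into primary and secondary tags. It is an illustration under simplifying assumptions, not the parser of any Apertium module: the function names are hypothetical, and escaped characters are ignored, so a value containing "/", "<" or ">" would break this naive split.

```python
import re

# Matches one tag such as <n> or <sf:potatoes>; escaping is ignored.
TAG = re.compile(r"<([^<>]+)>")


def split_tags(reading):
    """Split one reading into (lemma, primary_tags, secondary_tags)."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG.findall(reading):
        if ":" in tag:
            key, value = tag.split(":", 1)
            secondary[key] = value  # optional, prefixed information
        else:
            primary.append(tag)     # tags of the current format
    return lemma, primary, secondary


def parse_unit(unit):
    """Parse '^reading/reading$' into a list of split readings."""
    assert unit.startswith("^") and unit.endswith("$")
    return [split_tags(reading) for reading in unit[1:-1].split("/")]


unit = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
        "/patata<n><f><pl><more:other>$")
for lemma, primary, secondary in parse_unit(unit):
    print(lemma, primary, secondary)
# potato ['n', 'pl'] {'case': 'aa', 'sf': 'potatoes', 'other-prefix': 'other-value'}
# patata ['n', 'f', 'pl'] {'more': 'other'}
```

A program that only understands primary tags could use the same split and simply drop the dictionary of secondary tags.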

                To implement this, all the parsers would still need
                to be modified, but the benefits far outweigh the
                amount of work needed to pull this off.

                Since this would be a major fundamental change to
                Apertium, I request you all to contribute with your
                views, any pros, cons, suggestions - to the idea, to
                the syntax, anything.

                Thanks and Regards,
                Tanmai Khanna
-- *Khanna, Tanmai*
                _______________________________________________
                Apertium-stuff mailing list
                Apertium-stuff@lists.sourceforge.net
                <mailto:Apertium-stuff@lists.sourceforge.net>
                https://lists.sourceforge.net/lists/listinfo/apertium-stuff


-- Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
    Departament de Llenguatges i Sistemes Informàtics
    Universitat d'Alacant
    E-03690 Sant Vicent del Raspeig
    Spain
    Office: +34 96 590 9776
