Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Xavi Ivars Sun, 29 Mar 2020 04:17:09 -0700

Message from Mikel L. Forcada <m...@dlsi.ua.es> on Sun, 29 March 2020 at
12:22:


> Folks:
>
> The elders in Apertium will not be surprised if I voice my opposition to
> changing the formats used between different modules
> of the pipeline. In any case, this affects the core functionality of
> Apertium in many ways and its need should be justified in an uncontestable
> way so that the PMC makes a decision to have a new version of Apertium
> which should inevitably have paths to backward compatibility so that legacy
> languages and language pairs work identically and without any loss of
> performance. I believe we are far from "uncontestability", but that is just
> my personal opinion.
>
Either I'm missing something, or there's something about the proposal that
is not clear enough.

I don't think anyone has suggested "changing" the format used between
modules, only extending it in a way that is fully backwards compatible. One
of the main points of the "implementation" part of the proposal is
precisely that: avoiding any kind of regression in existing modules
and language pairs.

Language data used with Apertium 3.X (whatever X we currently have) will
keep working the same way it works today. And language data used with
Apertium 3.Y (whatever Y is when this feature gets merged) will also work
the same way. I agree this is a must, but where does the proposal say that
won't be the case? It actually says the opposite: it will ensure no
regressions.


> Currently, modes are linear pipelines. Any functionality requiring
> information that is currently ingested by one module and not passed ahead
> could be sent to later modules by teeing (
> https://en.wikipedia.org/wiki/Tee_(command)), named pipes, etc. We would
> have a directed acyclic graph, much as in tools like make, snakemake, dgsh (
> https://github.com/dspinellis/dgsh/wiki) etc.
>
One of the (huge) benefits of Apertium's modular architecture is the
ability to place modules at any point of the pipeline to improve the
overall translation, without having to modify the existing modules. Several
modules have been added to different pipelines: the move from 1-level to
3-level transfer (with chunking), new morphological analyzers based on
HFST, CG disambiguation, sliding-window and perceptron implementations for
taggers, lexical selection, anaphora resolution, recursive transfer...
even a new DOCX processor a few days ago!

But this extensibility is currently limited by the information that is
actually passed on to the next step. Why isn't the PoS tagger passing the
surface form to the "next" module? Because at the time it was written, the
"next" module didn't need it. And that definitely constrains the ability
to add a module after a certain point if it needs information that has
already been discarded.
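
To make that concrete, here is a rough sketch in Python of what gets lost
at the tagger step. It assumes the usual `^surface/analysis1/analysis2$`
shape of an analysed lexical unit and ignores escaping; `parse_analysed`
and `tag` are hypothetical helpers written for illustration, not functions
from any Apertium module, and the "tagger" simply keeps the first analysis.

```python
import re

# One lexical unit: everything between '^' and '$' (escaping is ignored here).
LU = re.compile(r"\^(.*?)\$")

def parse_analysed(unit):
    """Split '^surface/analysis1/analysis2$' into (surface, [analyses])."""
    body = LU.fullmatch(unit).group(1)
    surface, *analyses = body.split("/")
    return surface, analyses

def tag(unit, choose=lambda analyses: analyses[0]):
    """Toy stand-in for a tagger: keep one analysis, drop the surface form."""
    _surface, analyses = parse_analysed(unit)
    return "^" + choose(analyses) + "$"

analysed = "^books/book<n><pl>/book<vblex><pri><p3><sg>$"
tagged = tag(analysed)

print(analysed)  # the surface form "books" is still in the stream
print(tagged)    # only the chosen analysis is left for later modules
```

Any module placed after this step only sees the chosen analysis, so it has
no way of recovering "books" from the stream itself.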

On the other hand, teeing would actually have a much bigger performance
impact: file systems are much slower than memory. Also, when thousands of
translations are done every minute (apertium.org, softcatala.org...),
adding that level of *stress* to the IO of those servers seems a very bad
implementation decision from my point of view.

> Any justification should first prove that the functionality is robust and
> needed by working around the current format and modules, and be presented
> in a level of formality which is comparable to that used currently in our
> documentation.
>
Completely agree. Documentation is something that is already behind the
current implementation, and these changes should be properly documented.

> Having said that, no one can oppose people forking and testing. If the
> new thing works, Apertium could bless the fork and merge it (depending on
> how the fork handles provisions for legacy Apertium workflows). But, as I
> said, this seems premature to me. But I am usually very conservative.
>
I don't think anyone would suggest merging this before actually ensuring it
works properly in a backwards-compatible mode. The good thing about how the
Apertium code base is structured on GitHub (and in Git in general) is that
it makes fork-and-merge straightforward *inside* Apertium itself (git
branches, pull requests, etc).

And the reason I don't see these discussions as premature is that it'll be
much easier to eventually merge it back to master if there's already an
agreement on the plan. It would be really painful if, once this work is
done and it meets the prerequisites that we all agree on (backwards
compatibility, documentation...), it weren't merged because of
disagreements about, for example, how the extension has been done.

So I'd really ask those who have more expertise (and, honestly, a much
better-formed point of view than I have) on why Apertium is the way it is
today (Mikel & Fran, but also Felipe, Sergio...) to please keep challenging
approaches like this, but by pointing out the weak points of the proposal,
so they can be reconsidered and improved, rather than by rejecting the
proposal as a whole.

-- 
< Xavi Ivars >
< http://xavi.ivars.me >

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff