Hey guys,
Here's a draft proposal for this project. Any comments will be
appreciated :)
Thanks,
Tanmai
On Sun, Mar 29, 2020 at 12:52 PM Tanmai Khanna
<khanna.tan...@gmail.com> wrote:
Hi Hèctor,
A fundamental motivation for this proposal is the possibility
of giving the power to each program to use and propagate as
much information as it needs in the pipeline. In our
discussion on the IRC, Tino Didriksen said:
You should see how much secondary information VISL's
streams have. Noun semantics, verb frames, dependency,
markup tags, etc. Being able to carry any information
along makes many things possible, often things you can't
imagine because of current limitations.
The example in my original email was, well, just an example, but the
idea is that you can add as much information as you want,
in the language modules or even the translation modules.
It's also not just about English words, but about all
languages. This is not to say that we have to add case
information to every English word. It is optional
information which can be added if needed for the translation
task.
With this proposal we're trying to prepare the apertium
stream for the future. Today we realised that we need the
surface form in the stream, and tomorrow we might need
semantic tags, sentiment tags, etc. *If we don't do this now,
we will have to modify all the parsers in the pipeline each
time we need more information in the pipe.* This is why it's
a good idea to modify the parsers so that they can handle an
arbitrary amount of information.
Lastly, one point we should discuss is the idea that
any secondary information I add in the monodix would be
exposed to everyone who uses that monodix. There
are several things to say about this:
* As long as the information is correct, I don't really see
why redundant secondary information should bother anyone.
It will be available for anyone who wishes to use it for
their task, and if you don't want to use it the programs
will ignore it.
* Another idea is that secondary information could be put
in a separate dix; however, this would lead to an
unnecessary increase in complexity.
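To illustrate the first point, here is a minimal sketch of how a program that has no use for secondary information could simply drop it and see the stream as it is today. It assumes the syntax proposed in the original email (a secondary tag is any tag containing a ":"), it ignores escaped characters, and the name `strip_secondary` is made up for the example.

```python
import re

# Assumption: a secondary tag is any <...> tag that contains a ":",
# as in the syntax proposed in the original email. Escaped characters
# in the stream are not handled in this sketch.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]*:[^<>]*>")

def strip_secondary(stream):
    """Remove every secondary tag, leaving only the primary tags."""
    return SECONDARY_TAG_RE.sub("", stream)
```

A program that only cares about primary tags could run its input through something like this and keep its existing parser unchanged.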
Unless the developers of Apertium feel that redundant
information in the stream will be a huge problem, this will
allow each program to access a lot more information and open
up possibilities that we haven't even thought of yet. At the
very least, it will help us to eliminate trimming.
Thanks and Regards,
Tanmai Khanna
On Sun, Mar 29, 2020 at 10:39 AM Hèctor Alòs i Font
<hectora...@gmail.com> wrote:
Hi Tanmai,
I am surprised by this proposal. It involves some very
important changes that should be better justified. I
don't quite understand when one should define the
"optional secondary information" in addition to the
current morphological fields. Will it be in the language
module (apertium-xxx) or in each of the translation
modules (apertium-xxx-yyy)? Part of the problem may be in
the example. I can't imagine why information on case
should be added to every English word (any more than,
say, information about possession, which is common in
Turkic languages). Will this kind of information,
unnecessary for everybody or almost everybody, be
found in every language pair using, say, English if
someone wants to add it for his or her specific
purposes? As far as I understand, the given project
needs the surface form of the word to be added. This seems
quite logical. Moreover, this information may be useful
for e.g. lexical selection and structural transfer. But
anything more than that seems too obscure to me.
Best,
Hèctor
Message from Tanmai Khanna <khanna.tan...@gmail.com>
on Sat., 28 March 2020 at 23:51:
Hey guys,
As part of the project to eliminate trimming, I had
to come up with a way to include the surface form in
the lexical unit, and hence to modify the Apertium
stream format. To do this I would have to modify the
parsers of every program in the pipeline, and if that
has to happen, we discussed on the IRC that *it might
be a good idea to modify the stream in such a way
that we can include an arbitrary amount of
information in a lexical unit, and each program can
use whatever information it needs.*
The current information in the lexical unit would be
primary information, and then we would have optional
secondary information which could contain the surface
form, but also literally anything you can think of
(case, sentiment, pragmatic info, etc.). This would
open up a lot of possibilities for each program, and
it would strengthen the apertium stream format
considerably.
We discussed several possible syntaxes for this new
stream format, and the one that seems best is
something like this:
^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$
This doesn't mess with the current stream format too
much. The number of tags is already arbitrary so that
helps. The secondary tags contain a ":" that would
help distinguish them from primary tags.
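As a rough sketch of what a parser would have to do with this syntax, the snippet below splits a lexical unit into its readings and separates primary tags from secondary ones using the ":" rule. It is only an illustration under stated assumptions: the function names are hypothetical, everything before the first ":" is treated as the prefix, and escaped characters (for example a "/" or "<" inside a surface form) are not handled.

```python
import re

# Matches the content of each <...> tag in a reading.
TAG_RE = re.compile(r"<([^<>]+)>")

def parse_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into
    (lemma, primary_tags, secondary_tags)."""
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Assumed convention: "prefix:value" marks a secondary tag.
            prefix, value = tag.split(":", 1)
            secondary[prefix] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary

def parse_lexical_unit(lu):
    """Parse '^reading/reading/...$' into a list of parsed readings."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    return [parse_reading(r) for r in lu[1:-1].split("/")]
```

With the example above, the first reading would come out as the lemma "potato", the primary tags n and pl, and the secondary entries case, sf and other-prefix; a program that only needs the surface form would look up "sf" and ignore the rest.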
To implement this, all the parsers would still need
to be modified, but the benefits far
outweigh the amount of work needed to pull this off.
Since this would be a major fundamental change to
Apertium, I request you all to contribute with your
views, any pros, cons, suggestions - to the idea, to
the syntax, anything.
Thanks and Regards,
Tanmai Khanna
--
*Khanna, Tanmai*
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
<mailto:Apertium-stuff@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/apertium-stuff