Hey guys,
As part of the project to eliminate trimming, I had to come up with a way
to include the surface form in the lexical unit, which means modifying the
Apertium stream format. To do this I would have to modify the parsers of
every program in the pipeline, and if that has to happen anyway, we discussed
on IRC that *it might be a good idea to modify the stream in such a way
that we can include an arbitrary amount of information in a lexical unit,
and each program can use whatever information it needs.*

The current information in the lexical unit would be primary information,
and then we would have optional secondary information which could contain
the surface form, but also literally anything you can think of (case,
sentiment, pragmatic info, etc.). This would open up a lot of possibilities
for each program, and it would strengthen the Apertium stream format
considerably.

We discussed several possible syntaxes for this new stream format, and the
one that seems best is something like this:
^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

This doesn't mess with the current stream format too much. The number of
tags is already arbitrary so that helps. The secondary tags contain a ":"
that would help distinguish them from primary tags.
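
To make the primary/secondary split concrete, here is a minimal Python
sketch of how a parser might separate the two kinds of tags. This is not
existing Apertium code: the names parse_reading and parse_lexical_unit are
hypothetical, and it assumes a single ^...$ unit with no escaped characters
and no "/", "<" or ">" inside tag values.

```python
import re

# Matches one <...> tag. Assumption: tag contents never contain '<' or '>'.
TAG_RE = re.compile(r"<([^<>]*)>")


def parse_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into
    (lemma, primary_tags, secondary_tags)."""
    lemma = reading.split("<", 1)[0]
    primary = []    # ordinary tags, in order: ['n', 'pl']
    secondary = {}  # prefixed tags: {'sf': 'potatoes'}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # A ':' marks a secondary tag; split on the first ':' only.
            key, value = tag.split(":", 1)
            secondary[key] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(unit):
    """Parse '^reading/reading$' into a list of parsed readings.
    Assumption: '/' only ever separates readings (no escaping handled)."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a lexical unit of the form ^...$")
    return [parse_reading(r) for r in body[1:-1].split("/")]


example = ("^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
           "/patata<n><f><pl><more:other>$")
for lemma, primary, secondary in parse_lexical_unit(example):
    print(lemma, primary, secondary)
```

A program that only cares about primary information could simply ignore
the dictionary of secondary tags, while one that needs the surface form
would look up the "sf" key.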

Implementing this would still require modifying all the parsers, but the
benefits far outweigh the amount of work needed to pull this off.

Since this would be a major, fundamental change to Apertium, I would ask
you all to share your views: any pros, cons, or suggestions about the idea,
the syntax, anything.

Thanks and Regards,
Tanmai Khanna

-- 
*Khanna, Tanmai*