Re: [Apertium-stuff] Modifying the apertium stream format to include arbitrary information

Tanmai Khanna Mon, 20 Apr 2020 11:57:28 -0700

Hey Francis,
I agree that it does seem like a solution searching for a problem if we
look at it in isolation. But it's important to look at this in the context
of eliminating trimming. Chronologically, this project was first, and
still is, about eliminating dictionary trimming. Modification to the stream
is just part of the solution - a solution that will help this problem, but
also potentially several other problems, such as the superblank reordering
problem. I went into detail about this in the proposal but I'll explain it
here.

The monodix of a language is generally larger than the bidix for a language
pair involving that language. It was noticed that if the monodix is used as
is, there are a lot of translation errors (the ones marked with @), where the
output contains just the source-language lemma because a translation isn't
available. To deal with this, dictionary trimming was added: a word is
removed from the monodix if it isn't present in the bidix, so it goes through
the pipeline as an unknown word and the source surface form appears in the
final translation (marked with a *), which is arguably better and more
intelligible than just the source lemma.
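
The trade-off above can be illustrated with a toy sketch. Everything in it is
made up for illustration: the two Spanish entries, the `trim` and `translate`
helpers, and the use of plain Python dicts (real Apertium dictionaries are
compiled transducers, and target-side morphology is ignored here).

```python
# Toy monodix: source surface form -> source lemma (hypothetical entries).
MONODIX = {"gatos": "gato", "bonitos": "bonito"}
# Toy bidix: source lemma -> target lemma. "bonito" is deliberately missing.
BIDIX = {"gato": "cat"}


def trim(monodix, bidix):
    """Keep only monodix entries whose lemma has a bidix translation."""
    return {sf: lemma for sf, lemma in monodix.items() if lemma in bidix}


def translate(surface, monodix, bidix):
    """Translate one word, marking failures with * (unknown) or @ (no translation)."""
    lemma = monodix.get(surface)
    if lemma is None:
        return "*" + surface  # unknown word: source surface form
    target = bidix.get(lemma)
    if target is None:
        return "@" + lemma  # analysed but untranslated: source lemma
    return target


words = ["gatos", "bonitos"]
untrimmed = [translate(w, MONODIX, BIDIX) for w in words]
trimmed = [translate(w, trim(MONODIX, BIDIX), BIDIX) for w in words]
print(untrimmed)  # ['cat', '@bonito']
print(trimmed)    # ['cat', '*bonitos']
```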

However, trimming meant giving up certain benefits. Let's look at these
benefits in greater detail:

   - *Lexical Selection:* By discarding the analysis of a word in the
   source language, we lose the ability to use it as context to disambiguate
   neighbouring words. Assume a [Noun Adjective] sequence in which we don't
   know the translation of the Adjective, i.e. it isn't in the bidix. With
   trimming we would discard its analysis, and hence if the Noun is ambiguous
   we have no way to disambiguate it, since we've discarded the analysis of
   the Adjective (which included the fact that it's an adjective).
   - *Transfer:* In the same example, assume that in the target language,
   [Noun Adj] is to be rearranged into [Adj Noun]. With trimming, this can't
   be done as we've discarded the analysis of the Adjective, treating it as an
   unknown word.
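
The transfer case can be sketched with a toy reordering rule. This is only an
illustration under assumed names: `reorder_noun_adj` and the (form, part of
speech) tuples are hypothetical, not Apertium's transfer rule format. The
point is that the rule matches on the part-of-speech tag, which a trimmed
word no longer has.

```python
def reorder_noun_adj(words):
    """Swap each adjacent (noun, adjective) pair; words are (form, pos) tuples."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "n" and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return out


# Analysis kept: the rule sees an adjective and reorders.
with_analysis = reorder_noun_adj([("gato", "n"), ("bonito", "adj")])
# Analysis trimmed: the adjective is an unknown word with no tag, so no match.
after_trimming = reorder_noun_adj([("gato", "n"), ("*bonitos", None)])

print(with_analysis)   # [('bonito', 'adj'), ('gato', 'n')]
print(after_trimming)  # [('gato', 'n'), ('*bonitos', None)]
```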

Now, if we don't discard the analysis and don't trim, we would again fall
into the earlier problem of untranslated lemmas.

This project is a way to have our cake and eat it too. We don't discard
the analysis even if we don't know the translation, but we don't just
output the lemma either - we output the source surface form. For a solution
like this, it is *essential that we propagate the surface form till at
least transfer or even till the generator*, so that we can use the benefits
of the source analysis and then, before producing the translation, discard
the analysis and use the source surface form.
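
A minimal sketch of this idea, assuming a hypothetical `LexicalUnit` record
that carries the surface form alongside the analysis (the actual stream
syntax is not specified in this message, and `transfer` and `render` are
illustrative names only):

```python
from dataclasses import dataclass


@dataclass
class LexicalUnit:
    surface: str  # source surface form, propagated through the pipeline
    lemma: str    # source lemma from the analysis
    pos: str      # part-of-speech tag from the analysis


# Toy bidix: "bonito" has no translation.
BIDIX = {"gato": "cat"}


def transfer(units):
    """Swap adjacent noun-adjective pairs using the analysis (toy rule)."""
    out = list(units)
    for i in range(len(out) - 1):
        if out[i].pos == "n" and out[i + 1].pos == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def render(unit):
    """Use the translation if there is one, else fall back to the surface form."""
    target = BIDIX.get(unit.lemma)
    return target if target is not None else "*" + unit.surface


sentence = [
    LexicalUnit("gatos", "gato", "n"),
    LexicalUnit("bonitos", "bonito", "adj"),
]
output = [render(u) for u in transfer(sentence)]
print(output)  # ['*bonitos', 'cat']
```

The untranslated adjective is still reordered, because its analysis survived
until transfer, and it is still shown as its surface form rather than as a
bare lemma.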

Currently the source surface form is discarded at the tagger. This is where
the stream modification comes in. It's a robust way to propagate the
surface form through the stream with the least disruption to the current
modules.
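
To make the loss concrete: in the existing stream format the analyser emits
`^surface/analysis$` and the tagger emits `^analysis$`. The sketch below
imitates only that change of shape. `tagger_output` is a hypothetical helper
that picks the first analysis instead of disambiguating, and it does not
handle escaped characters.

```python
def tagger_output(lexical_unit):
    """Keep the first analysis and drop the surface form (no real disambiguation)."""
    body = lexical_unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a single ^...$ lexical unit")
    # The first field is the surface form; the rest are candidate analyses.
    _surface, *analyses = body[1:-1].split("/")
    if not analyses:
        raise ValueError("lexical unit has no analyses")
    return "^" + analyses[0] + "$"


tagged = tagger_output("^gatos/gato<n><m><pl>$")
print(tagged)  # ^gato<n><m><pl>$
```

After this step nothing downstream can recover "gatos", which is why the
surface form has to travel inside the stream itself.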

Then there are other possible benefits of secondary information, such as
markup tags. Hope this makes sense.

Tanmai

On Tue, Apr 21, 2020 at 12:02 AM Francis Tyers <fty...@prompsit.com> wrote:

> On 2020-04-20 19:21, Daniel Swanson wrote:
> >> Another way of putting this is that it looks like a technical
> >> solution in search of a problem, rather than a problem description
> >> in search of a solution.
> >
> > To me the most obvious thing to do with it is to put markup
> > information in secondary tags as a way of solving the superblank
> > reordering problem.
> >
>
> Didn't we have a solution for this that was worked on over a couple
> of GSOC projects ?
>
> Fran
>
>
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>

-- 
*Khanna, Tanmai*

