Hey guys!
The markup handling project reached all of its goals this week. While it
will continue to be improved, it is now ready to be tested with real-world
data.

*We have updated https://beta.apertium.org so that if you translate a
document/webpage on it, it will use the new system.* https://apertium.org
still uses the old method, so you can compare the translation outputs. The
best way to check the impact of the project is to translate webpages with
lots of links, markup, etc.

If you use Apertium from source,

apertium -f html/odt/docx/pptx -d . eng-spa

(with one of the listed formats passed to -f) also uses the updated system
with wordbound blanks (wblanks). You will also need Transfuse
<https://github.com/TinoDidriksen/Transfuse>. Make sure your checkouts of
apertium, lttoolbox, etc., are all at the latest commit.
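
The slash-separated list in the command above is shorthand for "pick one
format". As a rough sketch, here is a small Python helper that spells out
one command line per format. The helper name, the SUPPORTED tuple and the
file names are made up for illustration; only the flags and the language
pair come from the command above, and the optional input/output file
arguments assume the usual "apertium [options] pair [infile [outfile]]"
calling convention.

```python
import shlex

# Formats named in this message; apertium itself may accept others.
SUPPORTED = ("html", "odt", "docx", "pptx")


def apertium_command(fmt, pair="eng-spa", data_dir=".", infile=None, outfile=None):
    """Build an argument list for one format (hypothetical helper)."""
    if fmt not in SUPPORTED:
        raise ValueError(f"unexpected format: {fmt}")
    if outfile and not infile:
        raise ValueError("an output file needs an input file before it")
    cmd = ["apertium", "-f", fmt, "-d", data_dir, pair]
    if infile:
        cmd.append(infile)
    if outfile:
        cmd.append(outfile)
    return cmd


if __name__ == "__main__":
    # Print one shell-ready command line per format.
    for fmt in SUPPORTED:
        print(shlex.join(apertium_command(fmt, infile=f"in.{fmt}", outfile=f"out.{fmt}")))
```

The list form can be handed to subprocess.run directly if you want to
script a comparison over several test documents.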

Once this system has undergone real-world testing, we can update the main
apertium.org as well.

I also want to explain very briefly what was done. Markup handling has been
a problem in Apertium for a long time. Until now it was done using
superblanks, which encapsulate markup information during the translation
process. This works well to protect the formatting of the document. However,
languages represent information differently, and during translation
words/phrases move around, get deleted, split, merge, etc. *The markup
information on the words needs to stick with the words, otherwise we end up
with erroneous markup in the translation*, which is what happened:

Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The white</i> <b>dog</b>

As part of this project, a new kind of blank was proposed: a wordbound
blank. It contains any information that needs to stay attached to a
word/phrase during the entire translation process. After modifying most
modules in the pipeline to work with these wblanks, writing new
de/reformatters (Transfuse), and putting the markup in wblanks, the
translation we now get is:

Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The</i> <b>white</b> <i>dog</i>
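
To make the difference concrete, here is a toy Python sketch of the two
behaviours. It is not the Apertium stream format and not how the pipeline
is implemented: the tiny LEXICON, the hard-coded noun/adjective swap, and
all function names are assumptions made up for this example. It only shows
why markup that stays at a position goes wrong after reordering, while
markup bound to a word travels with it.

```python
from itertools import groupby

# Each source word carries the tag it was wrapped in: <i>El perro</i> <b>blanco</b>
SOURCE = [("El", "i"), ("perro", "i"), ("blanco", "b")]

# Hypothetical three-word lexicon, just enough for the example sentence.
LEXICON = {"El": "The", "perro": "dog", "blanco": "white"}


def render(words):
    """Wrap each run of adjacent words sharing a tag in that tag."""
    parts = []
    for tag, run in groupby(words, key=lambda pair: pair[1]):
        text = " ".join(word for word, _ in run)
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return " ".join(parts)


def translate_positional(words):
    """Markup stays where it was; only the words are translated and moved."""
    tags = [tag for _, tag in words]
    out = [LEXICON[word] for word, _ in words]
    out[1], out[2] = out[2], out[1]  # noun-adjective becomes adjective-noun
    return list(zip(out, tags))


def translate_wordbound(words):
    """Markup is attached to each word and moves along with it."""
    out = [(LEXICON[word], tag) for word, tag in words]
    out[1], out[2] = out[2], out[1]  # same reordering, tags travel too
    return out


if __name__ == "__main__":
    print(render(translate_positional(SOURCE)))  # <i>The white</i> <b>dog</b>
    print(render(translate_wordbound(SOURCE)))   # <i>The</i> <b>white</b> <i>dog</i>
```

The second output has three spans instead of two because the italic run
gets split by the bold word, which is exactly the result shown above.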

It should prove immensely useful for users of the Apertium MT system who
translate HTML or other formatted documents such as odt, docx, and pptx.

For more details about the project, about wordbound blanks, and about the
new way of doing markup handling, check out:
Project Report
<https://wiki.apertium.org/wiki/User:Khannatanmai/GSoC2020_Final_Report>
Development of wordbound blanks
<https://wiki.apertium.org/wiki/User:Khannatanmai/Wordbound_blanks>

I’d like to thank Tino Didriksen not only for being an active mentor, but
also for participating in the project. A major chunk of this project
(Transfuse, deformatters, reformatters, getting all of it integrated with
Apertium, and lots more) was done by him.

Hope this proves to be useful :))

Some more links to understand the problem:
https://wiki.apertium.org/wiki/Reordering_superblanks
https://wiki.apertium.org/wiki/Format_handling
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Automatic_blank_handling

Thanks and Regards,
*तन्मय खन्ना *
*Tanmai Khanna*
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff