[
https://issues.apache.org/jira/browse/JOSHUA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794381#comment-16794381
]
Thamme Gowda commented on JOSHUA-341:
-------------------------------------
Here is another handy tool to consider.
[https://github.com/isi-nlp/uroman]
It uses Unicode tables and rules to transliterate words in non-Roman scripts
into Roman script (no training needed).
(Sorry, yet another Perl script, but *sometimes/most of the time* this is all
we need.)
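
To make the table-driven idea concrete, here is a minimal sketch in Python. It
is not uroman's code or data: the mapping is a tiny hypothetical table covering
a few Cyrillic letters, and the names TOY_TABLE, romanize and romanize_oov are
made up for illustration. It only shows how a character table, with no trained
model, can romanize the out-of-vocabulary tokens the issue describes.

```python
# Hypothetical, deliberately tiny table (a few Cyrillic letters only).
# This is NOT uroman's data; a real tool needs far larger Unicode-derived
# tables plus context-sensitive rules.
TOY_TABLE = {
    "а": "a", "в": "v", "д": "d", "и": "i", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "р": "r", "с": "s", "т": "t",
}


def romanize(word, table=TOY_TABLE):
    """Map each character through the table; keep unknown characters as-is."""
    out = []
    for ch in word:
        rom = table.get(ch.lower())
        if rom is None:
            out.append(ch)  # not covered by the table: pass through unchanged
        elif ch.isupper():
            out.append(rom.capitalize())  # preserve initial capitalization
        else:
            out.append(rom)
    return "".join(out)


def romanize_oov(tokens, vocabulary, table=TOY_TABLE):
    """Romanize only the tokens that are missing from the known vocabulary."""
    return [tok if tok in vocabulary else romanize(tok, table) for tok in tokens]
```

With this toy table, romanize("Москва") gives "Moskva", and
romanize_oov(["the", "Лондон", "office"], {"the", "office"}) leaves the known
words alone and turns the unknown token into "London". Characters outside the
table are passed through untouched, so the sketch degrades gracefully rather
than failing.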
> Integrated Transliteration
> --------------------------
>
> Key: JOSHUA-341
> URL: https://issues.apache.org/jira/browse/JOSHUA-341
> Project: Joshua
> Issue Type: Task
> Components: core, language packs
> Reporter: Tommaso Teofili
> Priority: Major
> Labels: gsoc2019
>
> Many of the released language packs translate from languages with non-Latin
> scripts. Words that cannot be translated are therefore passed through
> unchanged to the translation and cannot even be read by someone who doesn't
> know that script.
> At the same time, many untranslatable words are simply transliterated words.
> For example, an Arabic word might be an English word (like a name or
> technical term) that has simply been written in Arabic. These words can be
> transliterated. It would be good to add built-in transliteration models that
> can be applied to all out-of-vocabulary words and enabled for certain
> languages. Transliteration models can be built over the same bitext using
> techniques like those of Sajjad, Fraser, and Schmid (2012) [1].
> [1] : http://www.anthology.aclweb.org/P/P12/P12-1049.pdf
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)