[
https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kishani Kandasamy updated JOSHUA-338:
-------------------------------------
Comment: was deleted
(was: Hi Tommaso Teofili, Thank you for your reply. I'm particularly interested
in completing this issue as my GSoC 2021 project. Currently, I'm reading about
the language models used within Joshua in order to understand the project scope
thoroughly. Thank you.
On Fri, Nov 20, 2020 at 11:19 PM Tommaso Teofili (Jira) <[email protected]>
)
> Generate smaller models for LPs
> -------------------------------
>
> Key: JOSHUA-338
> URL: https://issues.apache.org/jira/browse/JOSHUA-338
> Project: Joshua
> Issue Type: Task
> Components: core
> Reporter: Tommaso Teofili
> Priority: Major
> Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel
> data, which makes it hard to distribute them in Language Packs. A quick way
> to reduce model size is to reduce the amount of parallel data used to build
> models, by sampling a subset of it. This is the very naive approach used in
> the construction of the original language packs (November 2016), but there
> are much better ways. One relatively simple one is the Vocabulary Saturation
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in
> paper [1]. It would be wonderful to implement this and use it to do a better
> job selecting which sentences to include for our general-purpose language
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
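
For anyone picking this up, here is a rough sketch of the VSF idea as I
understand it from [1]: walk the parallel data in order, keep a running count
of the n-grams seen so far, and retain a sentence pair only if it still
contains at least one n-gram seen fewer than a threshold number of times.
This is not Joshua code. The function names, the whitespace tokenization,
filtering on the source side only, counting each n-gram once per sentence,
and the default threshold and n-gram order are all my assumptions, so please
check the details against the paper.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], max_order: int) -> Iterator[Tuple[str, ...]]:
    """Yield every n-gram of length 1..max_order from a token list."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]],
    threshold: int = 2,   # hypothetical default, not taken from the paper
    max_order: int = 1,   # 1 = unigrams only (assumption)
) -> Iterator[Tuple[str, str]]:
    """Keep a sentence pair only while it adds under-represented n-grams.

    Assumption: only the source side is inspected, and each distinct n-gram
    counts once per retained sentence.
    """
    counts: Counter = Counter()
    for source, target in pairs:
        grams = set(ngrams(source.split(), max_order))
        # Retain the pair if any of its n-grams is not yet "saturated".
        if any(counts[g] < threshold for g in grams):
            counts.update(grams)
            yield source, target


if __name__ == "__main__":
    corpus = [("a b", "x y"), ("a b", "x y"), ("a b", "x y"), ("a c", "x z")]
    for src, tgt in vocabulary_saturation_filter(corpus, threshold=2):
        print(src, "|||", tgt)
```

Because the result depends on the order in which sentences are visited, the
corpus would probably need to be sorted or shuffled deliberately before
filtering. Since it is a single linear pass with one counter, it should also
port to Java fairly directly.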
--
This message was sent by Atlassian Jira
(v8.3.4#803005)