[jira] [Issue Comment Deleted] (JOSHUA-338) Generate smaller models for LPs

Kishani Kandasamy (Jira) Tue, 23 Feb 2021 20:04:07 -0800


     [ 
https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kishani Kandasamy updated JOSHUA-338:
-------------------------------------
    Comment: was deleted

(was: Hi Tommaso Teofili, thank you for your reply. I'm particularly interested
in completing this issue as my GSoC 2021 project. Currently, I'm reading about
the language models used within Joshua in order to understand the project scope
thoroughly. Thank you.

)

> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel 
> data, which makes it hard to distribute them in Language Packs. A quick way 
> to reduce model size is to reduce the amount of parallel data used to build 
> models, by sampling a subset of it. This is the very naive approach used in 
> the construction of the original language packs (November 2016), but there 
> are much better ways. One relatively simple one is the Vocabulary Saturation 
> Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in 
> paper [1]. It would be wonderful to implement this and use it to do a better 
> job selecting which sentences to include for our general-purpose language 
> packs.
> It would be ideal to implement this in Java, but Python or Scala would also 
> fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
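
To make the idea behind the description above concrete, here is a minimal sketch of a vocabulary-saturation-style filter in Python. It is a simplified reading of the approach, not the method as specified in [1]: it assumes whitespace tokenization, counts unigrams only, checks both the source and target side, and counts every token occurrence. The function name `vocabulary_saturation_filter` and the `threshold` parameter are hypothetical and are not part of Joshua.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def vocabulary_saturation_filter(
    pairs: Iterable[Tuple[str, str]], threshold: int = 1
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs that still contain under-represented words.

    A pair is kept if at least one source or target token has been counted
    fewer than `threshold` times in the pairs kept so far. Otherwise the
    pair is considered saturated and is skipped.
    """
    source_counts: Counter = Counter()
    target_counts: Counter = Counter()
    for source, target in pairs:
        # Assumption: plain whitespace tokenization, unigrams only.
        source_tokens = source.split()
        target_tokens = target.split()
        unsaturated = any(
            source_counts[tok] < threshold for tok in source_tokens
        ) or any(target_counts[tok] < threshold for tok in target_tokens)
        if unsaturated:
            # Only kept pairs contribute to the counts (every occurrence counts).
            source_counts.update(source_tokens)
            target_counts.update(target_tokens)
            yield source, target


# Toy parallel corpus (made-up tokens, for illustration only).
sample = [("a b", "x y"), ("a b", "x y"), ("a c", "x z"), ("b c", "y z")]
kept = list(vocabulary_saturation_filter(sample, threshold=1))
```

With `threshold=1`, the second and fourth toy pairs are dropped because every token in them has already been seen once. A higher threshold keeps more data, so under these assumptions it acts as the knob that trades model size against coverage.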



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
