Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Matt Post
It wouldn't be hard to add some TMX-like features, no. There are some technical 
challenges, though — for example, the current demo lets you add phrases, but 
that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run 
John's fast_align implementation (with a saved model) to break down that new 
sentence, and do proper incremental updating.
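
A minimal sketch of that idea, assuming the aligner's output is available as Pharaoh-style "i-j" links (the format fast_align emits). The function names, the example sentence pair, and the alignment string are hypothetical, and the table here is only a word-level count, not Joshua's actual rule format or API.

```python
from collections import defaultdict


def parse_alignment(line):
    """Parse Pharaoh-format links such as '0-0 1-2' into (src, tgt) index pairs."""
    return [tuple(int(i) for i in link.split("-")) for link in line.split()]


def word_pairs(source, target, alignment):
    """Return the (source word, target word) pairs named by the alignment links."""
    src = source.split()
    tgt = target.split()
    return [(src[i], tgt[j]) for i, j in parse_alignment(alignment)]


def update_table(table, source, target, alignment):
    """Incrementally add one aligned sentence pair to a word-level count table."""
    for s, t in word_pairs(source, target, alignment):
        table[s][t] += 1
    return table


# Hypothetical table: source word -> target word -> count.
table = defaultdict(lambda: defaultdict(int))

# Made-up sentence pair and alignment, standing in for real aligner output.
update_table(table, "das haus ist klein", "the house is small", "0-0 1-1 2-2 3-3")
```

Proper incremental updating would also need multi-word phrase extraction and feature weights for the new rules, which this sketch leaves out.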

How do you imagine Lucene fitting into this?

matt


> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili wrote:
> 
> Matt,
> 
> really nice list of very useful features, thanks for this!
> One comment only on the translation memories one: to someone who had
> never heard of them, it sounds not too complicated to implement on top of
> current Joshua (with an IR library like Apache Lucene). Is my understanding
> correct?
> 
> My 2 cents,
> Tommaso



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Tommaso Teofili
Matt,

really nice list of very useful features, thanks for this!
One comment only on the translation memories one: to someone who had
never heard of them, it sounds not too complicated to implement on top of
current Joshua (with an IR library like Apache Lucene). Is my understanding
correct?

My 2 cents,
Tommaso


On Tue, Nov 29, 2016 at 04:08, Matt Post wrote:

> One project I think could be interesting for Joshua's future is sketched
> here.
>
> - Dynamic phrase tables. Joshua currently lets people add custom phrases
> to the existing models that then get used. There is a research topic here
> for how to make it better (particularly, how to set the weights of rules
> that are added at runtime instead of learned from bitext), but it works
> really well for adding words that are OOV (since it's always cheaper to use
> the OOV). Here's a demo of how this works (this feature is included in the
> language packs).
>
>
> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>
> - Translation memories. There is a large commercial market (billions) for
> tools called "translation memories", where translators are translating
> documents, and the sentences get queried against their past translations
> and matched in a fuzzy fashion. The big tool on the market for this is SDL
> Trados <
> http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
> I'm not talking about selling a product, but in a space that big, there
> have got to be a lot of people who'd rather just run their own system than
> shell out for an expensive (and ugly) tool. So there is a big niche for an
> open source tool, and currently nothing really filling it. The "dynamic
> phrase table" feature above provides the beginnings of offering a TM
> competitor, but one that is "seeded" with a regular statistical machine
> translation model.
>
> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
> could sit on top of a large tuning set across diverse domains (e.g., formal
> news, informal web logs, spoken dialogue, etc.). You could then add new
> phrases in sentences as above, which would get automatically aligned, and
> then everything could be retuned at the user's request (or perhaps at
> night). This way, when people added new data to their models, Joshua would
> automatically find the best weights, either immediately or on some
> schedule. There'd be less worry about bit rot.
>
> - Data collection and sharing. Another cool idea would be to allow people
> to easily send us data. If we get to a place where people are building
> custom dynamic phrase tables, a cool ability would be to make it easy for
> people to upload the data they have added to their private systems, which
> we could then collect and further distribute. So Joshua could become an
> easy means for people to crowdsource data used for translation systems.
> This is obviously just a high-level idea that would require a lot of
> details to be figured out, but it would be super cool.
>
> matt


★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post
One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados 
<http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g., formal news, informal 
web logs, spoken dialogue, etc.). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.
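
A minimal sketch of the fuzzy lookup described in the translation-memories item, assuming the memory is a small in-process list of (source, translation) pairs scored with Python's standard difflib. The function names, the 0.7 threshold, and the example sentences are hypothetical; a real system would more likely retrieve candidates from an index (such as the Lucene setup suggested elsewhere in this thread) before scoring them.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Token-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def fuzzy_lookup(memory, query, threshold=0.7):
    """Return (score, source, translation) hits at or above threshold, best first."""
    scored = [(similarity(query, src), src, tgt) for src, tgt in memory]
    hits = [hit for hit in scored if hit[0] >= threshold]
    return sorted(hits, reverse=True)


# Made-up translation memory of past (source, translation) pairs.
memory = [
    ("the house is small", "das Haus ist klein"),
    ("the car is fast", "das Auto ist schnell"),
]

# A new sentence that nearly matches one stored source sentence.
hits = fuzzy_lookup(memory, "the house is very small")
for score, src, tgt in hits:
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

Sentences with no hit above the threshold would fall through to the regular statistical model, which is the "seeded" behaviour the item describes.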

matt