Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Matt Post
It wouldn't be hard to add some TMX-like features, no. There are some technical 
challenges, though — for example, the current demo lets you add phrases, but 
that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run 
John's fast_align implementation (with a saved model) to break down that new 
sentence, and do proper incremental updating.
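
A rough sketch of the "break down that new sentence" step, assuming the aligner emits word links in the `i-j` (source-target) text format that fast_align produces. The function names and the `max_len` limit are hypothetical, and this simplified extractor only emits the tightest target span for each source span; it is not Joshua's actual grammar extraction.

```python
def parse_alignment(line):
    """Parse links such as '0-0 1-2' into a set of (src, tgt) index pairs."""
    return {tuple(int(i) for i in link.split("-")) for link in line.split()}


def extract_phrase_pairs(src, tgt, links, max_len=4):
    """Return (source phrase, target phrase) pairs consistent with the links."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            tgt_pos = [t for (s, t) in links if s1 <= s <= s2]
            if not tgt_pos:
                continue  # fully unaligned source span: nothing to extract
            t1, t2 = min(tgt_pos), max(tgt_pos)
            if t2 - t1 >= max_len:
                continue  # target side would be too long
            # Consistency check: no word inside the target span may be
            # linked to a source word outside [s1, s2].
            if any(t1 <= t <= t2 and not (s1 <= s <= s2) for (s, t) in links):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs
```

The extracted pairs would still need feature weights before they could be added as rules, which is the open research question mentioned in the roadmap post.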

How do you imagine Lucene fitting into this? 

matt


> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili wrote:
> 
> Matt,
> 
> really nice list of very useful features, thanks for this!
> One comment only on the translation memories one: as someone who had
> never heard of them before, it doesn't sound too complicated to implement
> on top of current Joshua (with an IR library like Apache Lucene); is my
> understanding correct?
> 
> My 2 cents,
> Tommaso
> 
> 
> On Tue, Nov 29, 2016 at 04:08, Matt Post wrote:
> 
>> One project I think could be interesting for Joshua's future is sketched
>> here.
>> 
>> - Dynamic phrase tables. Joshua currently lets people add custom phrases
>> to the existing models that then get used. There is a research topic here
>> for how to make it better (particularly, how to set the weights of rules
>> that are added at runtime instead of learned from bitext), but it works
>> really well for adding words that are OOV (since it's always cheaper to use
>> the OOV). Here's a demo of how this works (this feature is included in the
>> language packs).
>> 
>> 
>> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>> 
>> - Translation memories. There is a large commercial market (billions) for
>> tools called "translation memories", where translators are translating
>> documents, and the sentences get queried against their past translations
>> and matched in a fuzzy fashion. The big tool on the market for this is SDL
>> Trados
>> <http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
>> I'm not talking about selling a product, but in a space that big, there
>> have got to be a lot of people who'd rather just run their own system, than
>> shell out for an expensive (and ugly) tool. So there is a big niche for an
>> open source tool, and currently nothing really filling it. The "dynamic
>> phrase table" feature above provides the beginnings of offering a TM
>> competitor, but one that is "seeded" with a regular statistical machine
>> translation model.
>> 
>> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
>> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
>> could sit on top of a large tuning set across diverse domains (e.g., formal
>> news, informal web logs, spoken dialogue, etc.). You could then add new
>> phrases in sentences as above, which would get automatically aligned, and
>> then everything could be retuned at the user's request (or perhaps at
>> night). This way, when people added new data to their models, Joshua would
>> automatically find the best weights, either immediately or on some
>> schedule. There'd be less worry about bit rot.
>> 
>> - Data collection and sharing. Another cool idea would be to allow people
>> to easily send us data. If we get to a place where people are building
>> custom dynamic phrase tables, a cool ability would be to make it easy for
>> people to upload the data they have added to their private systems, which
>> we could then collect and further distribute. So Joshua could become an
>> easy means for people to crowdsource data used for translation systems.
>> This is obviously just a high-level idea that would require a lot of
>> details to be figured out, but it would be super cool.
>> 
>> matt



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Tommaso Teofili
Matt,

really nice list of very useful features, thanks for this!
One comment only on the translation memories one: as someone who had
never heard of them before, it doesn't sound too complicated to implement
on top of current Joshua (with an IR library like Apache Lucene); is my
understanding correct?

My 2 cents,
Tommaso


On Tue, Nov 29, 2016 at 04:08, Matt Post wrote:

> One project I think could be interesting for Joshua's future is sketched
> here.
>
> - Dynamic phrase tables. Joshua currently lets people add custom phrases
> to the existing models that then get used. There is a research topic here
> for how to make it better (particularly, how to set the weights of rules
> that are added at runtime instead of learned from bitext), but it works
> really well for adding words that are OOV (since it's always cheaper to use
> the OOV). Here's a demo of how this works (this feature is included in the
> language packs).
>
>
> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>
> - Translation memories. There is a large commercial market (billions) for
> tools called "translation memories", where translators are translating
> documents, and the sentences get queried against their past translations
> and matched in a fuzzy fashion. The big tool on the market for this is SDL
> Trados
> <http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
> I'm not talking about selling a product, but in a space that big, there
> have got to be a lot of people who'd rather just run their own system, than
> shell out for an expensive (and ugly) tool. So there is a big niche for an
> open source tool, and currently nothing really filling it. The "dynamic
> phrase table" feature above provides the beginnings of offering a TM
> competitor, but one that is "seeded" with a regular statistical machine
> translation model.
>
> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
> could sit on top of a large tuning set across diverse domains (e.g., formal
> news, informal web logs, spoken dialogue, etc.). You could then add new
> phrases in sentences as above, which would get automatically aligned, and
> then everything could be retuned at the user's request (or perhaps at
> night). This way, when people added new data to their models, Joshua would
> automatically find the best weights, either immediately or on some
> schedule. There'd be less worry about bit rot.
>
> - Data collection and sharing. Another cool idea would be to allow people
> to easily send us data. If we get to a place where people are building
> custom dynamic phrase tables, a cool ability would be to make it easy for
> people to upload the data they have added to their private systems, which
> we could then collect and further distribute. So Joshua could become an
> easy means for people to crowdsource data used for translation systems.
> This is obviously just a high-level idea that would require a lot of
> details to be figured out, but it would be super cool.
>
> matt


★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post
One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados 
<http://www.sdl.com/solution/language/translation-productivity/trados-studio/>. 
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system, than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g., formal news, informal 
web logs, spoken dialogue, etc.). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.
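
To make the "matched in a fuzzy fashion" part of the translation memory item concrete, here is a minimal sketch. It uses Python's standard `difflib` as a stand-in scorer and a linear scan over an in-memory list; the `fuzzy_lookup` name, the `memory` layout, and the 0.7 threshold are assumptions for illustration. A real system would more likely retrieve candidates with an index and then rescore them.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(query, memory, threshold=0.7):
    """Return (score, source, target) hits at or above threshold, best first.

    `memory` is a list of (source segment, stored translation) pairs.
    Similarity is computed over lowercased whitespace tokens.
    """
    q = query.lower().split()
    hits = []
    for source, target in memory:
        score = SequenceMatcher(None, q, source.lower().split()).ratio()
        if score >= threshold:
            hits.append((score, source, target))
    # Highest-scoring past translation first.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Segments that fall below the threshold are where a decoder seeded with a statistical model could take over, which is the combination the item describes.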

matt

Re: roadmap

2016-09-30 Thread Tommaso Teofili
very nice Matt, all sounds good to me, thanks!

Also looking forward to being able to play with additional language packs.

Regards,
Tommaso

On Fri, Sep 30, 2016 at 15:09, Matt Post wrote:

> Hi folks,
>
> Just a status update, since I / we are a bit behind: I'm in the process of
> putting together the first language pack, along with a script that will
> bundle it with the jar, a README describing its use and assembly, a CREDITS
> file describing the data used to build the model, and a BENCHMARK file
> listing the performance on test sets. All of these are being more-or-less
> automatically assembled, and I think they're important to include in the
> language packs.
>
> Once I have a version of that put together, I'll post it for your review
> and testing. I hope to do this first thing next week. We can then move to
> do our first release. There are a number of small things we need to do
> (updating the CHANGELOG, site documentation, etc), but I think we're mostly
> ready.
>
> A colleague here is also putting together a large number of language packs
> for lots of different languages. I'll have a list soon.
>
> matt
>
>


roadmap

2016-09-30 Thread Matt Post
Hi folks,

Just a status update, since I / we are a bit behind: I'm in the process of 
putting together the first language pack, along with a script that will bundle 
it with the jar, a README describing its use and assembly, a CREDITS file 
describing the data used to build the model, and a BENCHMARK file listing the 
performance on test sets. All of these are being more-or-less automatically 
assembled, and I think they're important to include in the language packs.

Once I have a version of that put together, I'll post it for your review and 
testing. I hope to do this first thing next week. We can then move to do our 
first release. There are a number of small things we need to do (updating the 
CHANGELOG, site documentation, etc), but I think we're mostly ready.

A colleague here is also putting together a large number of language packs for 
lots of different languages. I'll have a list soon.

matt



Re: [IMPORTANT] Roadmap for 6.1 Release

2016-07-11 Thread kellen sunderland
Thanks for organizing Lewis, sorry for the late replies.  Looking at the
frequency of our updates I'd suggest quarterly, or bi-annual releases.  If
we can keep the master branch stable (which should really be a goal of
ours) then hopefully it's not too much work to create the releases. I do
appreciate that there's probably some effort required to create release
notes + documentation.  Hopefully JIRA will be able to help us create some
of this documentation.

I'd agree that we should shoot for a 6.1 release fairly soon.  I'll review
the PRs that came from our side early after the Apache switch.  They should
probably have JIRA tickets tracking the changes with fix version assigned
as 6.1.

-Kellen



On Thu, Jun 23, 2016 at 11:01 PM, Tom Barber <t...@analytical-labs.com>
wrote:

> Hey Matt
>
> Over on OODT our releases are few and far between, although that said,
> I've been trying to increase the frequency even if they are very minor. The
> main reason being, if someone commits some code, they don't want to wait 12
> months for it to hit a stable release! So you might say yearly major
> releases and patch releases at sporadic points in between to include patches
> people have submitted. This also keeps drive-by committers interested,
> because if they get some stuff into the codebase they then may commit more,
> rather than say "well I submitted a fix for issue x ages ago and it's got
> nowhere". Releases don't need to be set in stone, but I would try and
> keep them ticking over.
>
> Just my own 2 cents.
>
> Tom
>
> --
>
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
>
> (Thanks to the Saiku community we reached our Kickstart
> <
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> >
> goal, but you can always help by sponsoring the project
> <http://www.meteorite.bi/products/saiku/sponsorship>)
>
> On 23 June 2016 at 21:56, Matt Post <p...@cs.jhu.edu> wrote:
>
> > Hi Lewis,
> >
> > Sorry for taking some time to get back to you. I think the roadmap looks
> > great. One thing, though, is that the Amazon folks and I have discussed
> > making a number of backwards-incompatible changes in an effort to modernize
> > some pieces of the code. This would have to do with things like the config
> > file format, a totally new pipeline based on duct tape, and some other
> > ideas. We think those changes would be suitable for a 7.0 release (major
> > version number change signals backwards incompatibility).
> >
> > I think we've been doing some good work on improving Joshua, but at the
> > same time, I think the release cycle is still a little too accelerated for
> > me. I would like to push back to semi-yearly or even yearly releases, with
> > bug fixes in between. However, I'm also curious how this might affect our
> > ability to move out of incubation. Do you have any thoughts on this?
> >
> > The major downside to releases is documentation. It's just hard to find
> > the time to do it.
> >
> > My own thoughts for what I'd like to do:
> >
> > - Maybe a 6.1 release (soon, to get it out of the way? or otherwise this
> > fall?), where we formalize the Apache move and maybe formalize the release
> > of a handful of language packs, without a lot of other changes
> >
> > - Write a linux.com article advertising this, hopefully attracting some
> > attention
> >
> > - Shoot for a 7.0 release with many of the changes we've discussed (some
> > offline). If we get a good showing at MT Marathon in Prague this year, that
> > could be a good time to get all of that in order.
> >
> > - Start getting to work on a version of Joshua that swaps out the core
> > decoder for a neural approach
> >
> > matt
> >
> >
> >
> >
> > > On Jun 23, 2016, at 4:13 PM, Tom Barber <t...@analytical-labs.com> wrote:
> > >
> > > I would volunteer some cycles for multi-model support in the server and an
> > > improved REST interface and basic UI for end-user interaction if you fancy
> > > it.
> > >
> > > --
> > >
> > > Director Meteorite.bi - Saiku Analytics Founder
> > > Tel: +44(0)5603641316
> > >
> > > (Thanks to the Saiku community we reached our Kickstart
> > > <
> >
> http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/
> > >
> > > goal, but you can always help by sponsoring the project
> > > <http://www.meteorite.bi/products/saiku/sponsorship>)
>

Re: [IMPORTANT] Roadmap for 6.1 Release

2016-06-23 Thread Matt Post
Hi Lewis,

Sorry for taking some time to get back to you. I think the roadmap looks great. 
One thing, though, is that the Amazon folks and I have discussed making a 
number of backwards-incompatible changes in an effort to modernize some pieces 
of the code. This would have to do with things like the config file format, a 
totally new pipeline based on duct tape, and some other ideas. We think those 
changes would be suitable for a 7.0 release (major version number change 
signals backwards incompatibility).

I think we've been doing some good work on improving Joshua, but at the same 
time, I think the release cycle is still a little too accelerated for me. I would 
like to push back to semi-yearly or even yearly releases, with bug fixes in 
between. However, I'm also curious how this might affect our ability to move 
out of incubation. Do you have any thoughts on this?

The major downside to releases is documentation. It's just hard to find the 
time to do it. 

My own thoughts for what I'd like to do:

- Maybe a 6.1 release (soon, to get it out of the way? or otherwise this 
fall?), where we formalize the Apache move and maybe formalize the release of a 
handful of language packs, without a lot of other changes

- Write a linux.com article advertising this, hopefully attracting some 
attention

- Shoot for a 7.0 release with many of the changes we've discussed (some 
offline). If we get a good showing at MT Marathon in Prague this year, that 
could be a good time to get all of that in order.

- Start getting to work on a version of Joshua that swaps out the core decoder 
for a neural approach

matt




> On Jun 23, 2016, at 4:13 PM, Tom Barber <t...@analytical-labs.com> wrote:
> 
> I would volunteer some cycles for multi-model support in the server and an
> improved REST interface and basic UI for end-user interaction if you fancy
> it.
> 
> --
> 
> Director Meteorite.bi - Saiku Analytics Founder
> Tel: +44(0)5603641316
> 
> (Thanks to the Saiku community we reached our Kickstart
> <http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/>
> goal, but you can always help by sponsoring the project
> <http://www.meteorite.bi/products/saiku/sponsorship>)
> 
> On 23 June 2016 at 21:10, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> wrote:
> 
>> Hi Folks,
>> Anyone have any comments on this?
>> Seeing that the Maven multimodule project seems to be taking flight, it
>> would be nice to see where the roadmap is going?
>> Any comments would be great. Also, I'm kinda lost as to what is happening
>> with Jira but it looks like it is not really being used for much.
>> Thanks
>> 
>> On Mon, Jun 20, 2016 at 11:34 AM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>> 
>>> Hi Folks,
>>> I've just smartened up Jira a bit with our Roadmap being defined as follows
>>> 
>>> https://issues.apache.org/jira/browse/joshua/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
>>> 
>>> Right now only 14/14 issues are marked RESOLVED for 6.1. This is misleading,
>>> as I know that many more issues have been addressed; however, I don't think
>>> that Jira tickets have been created for all changes to the source code.
>>> Maybe moving forward we could open Jira issues and link them to the GitHub
>>> tickets via commit messages?
>>> 
>>> Additionally, everything that was currently UNRESOLVED has merely been
>>> pushed to 6.2. If this is not what is required then please reassign the fix
>>> version for any ticket(s) to 6.1 and we can fix.
>>> 
>>> Finally, are there any mitigating factors that would prevent a 6.1 release
>>> candidate from being prepared right now?
>>> Thanks
>>> Lewis
>>> 
>>> --
>>> *Lewis*
>>> 
>> 
>> 
>> 
>> --
>> *Lewis*
>> 



Re: [IMPORTANT] Roadmap for 6.1 Release

2016-06-23 Thread Tom Barber
I would volunteer some cycles for multi-model support in the server and an
improved REST interface and basic UI for end-user interaction if you fancy
it.

--

Director Meteorite.bi - Saiku Analytics Founder
Tel: +44(0)5603641316

(Thanks to the Saiku community we reached our Kickstart
<http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/>
goal, but you can always help by sponsoring the project
<http://www.meteorite.bi/products/saiku/sponsorship>)

On 23 June 2016 at 21:10, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
wrote:

> Hi Folks,
> Anyone have any comments on this?
> Seeing that the Maven multimodule project seems to be taking flight, it
> would be nice to see where the roadmap is going?
> Any comments would be great. Also, I'm kinda lost as to what is happening
> with Jira but it looks like it is not really being used for much.
> Thanks
>
> On Mon, Jun 20, 2016 at 11:34 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
> > Hi Folks,
> > I've just smartened up Jira a bit with our Roadmap being defined as follows
> >
> > https://issues.apache.org/jira/browse/joshua/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
> >
> > Right now only 14/14 issues are marked RESOLVED for 6.1. This is misleading,
> > as I know that many more issues have been addressed; however, I don't think
> > that Jira tickets have been created for all changes to the source code.
> > Maybe moving forward we could open Jira issues and link them to the GitHub
> > tickets via commit messages?
> >
> > Additionally, everything that was currently UNRESOLVED has merely been
> > pushed to 6.2. If this is not what is required then please reassign the fix
> > version for any ticket(s) to 6.1 and we can fix.
> >
> > Finally, are there any mitigating factors that would prevent a 6.1 release
> > candidate from being prepared right now?
> > Thanks
> > Lewis
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *Lewis*
>


Re: [IMPORTANT] Roadmap for 6.1 Release

2016-06-20 Thread Mattmann, Chris A (3980)
Thanks for doing the yeoman’s work Lewis

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++










On 6/20/16, 11:34 AM, "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com> wrote:

>Hi Folks,
>I've just smartened up Jira a bit with our Roadmap being defined as follows
>
>https://issues.apache.org/jira/browse/joshua/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
>
>Right now only 14/14 issues are marked RESOLVED for 6.1. This is misleading,
>as I know that many more issues have been addressed; however, I don't think
>that Jira tickets have been created for all changes to the source code. Maybe
>moving forward we could open Jira issues and link them to the GitHub
>tickets via commit messages?
>
>Additionally, everything that was currently UNRESOLVED has merely been
>pushed to 6.2. If this is not what is required then please reassign the fix
>version for any ticket(s) to 6.1 and we can fix.
>
>Finally, are there any mitigating factors that would prevent a 6.1 release
>candidate from being prepared right now?
>Thanks
>Lewis
>
>-- 
>*Lewis*


[IMPORTANT] Roadmap for 6.1 Release

2016-06-20 Thread Lewis John Mcgibbney
Hi Folks,
I've just smartened up Jira a bit with our Roadmap being defined as follows

https://issues.apache.org/jira/browse/joshua/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel

Right now only 14/14 issues are marked RESOLVED for 6.1. This is misleading,
as I know that many more issues have been addressed; however, I don't think
that Jira tickets have been created for all changes to the source code. Maybe
moving forward we could open Jira issues and link them to the GitHub
tickets via commit messages?
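
One possible way to check the commit-message convention suggested above is to scan commit messages for issue keys. This is only a sketch: the `JOSHUA-<number>` key pattern is an assumption based on the project name in the Jira URL, and both function names are hypothetical.

```python
import re

# Assumed key pattern: project prefix "JOSHUA" followed by an issue number.
ISSUE_KEY = re.compile(r"\bJOSHUA-(\d+)\b")


def issue_keys(commit_message):
    """Return the distinct issue keys mentioned in a message, in order."""
    seen = []
    for match in ISSUE_KEY.finditer(commit_message):
        key = match.group(0)
        if key not in seen:
            seen.append(key)
    return seen


def untracked(commit_messages):
    """Return the commit messages that mention no issue key at all."""
    return [m for m in commit_messages if not issue_keys(m)]
```

Running `untracked` over recent history would give a rough list of changes that still need a ticket before the fix version counts can be trusted.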

Additionally, everything that was currently UNRESOLVED has merely been
pushed to 6.2. If this is not what is required then please reassign the fix
version for any ticket(s) to 6.1 and we can fix.

Finally, are there any mitigating factors that would prevent a 6.1 release
candidate from being prepared right now?
Thanks
Lewis

-- 
*Lewis*