Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

Tommaso Teofili Mon, 18 Jan 2016 23:59:41 -0800

Hi Chris, all,

that's a very interesting proposal; if you wish, I can help (with my
OpenNLP/Lucene hat on).


Regards,
Tommaso

2016-01-13 7:56 GMT+01:00 Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> Hi Everyone,
>
> Please find attached for your viewing pleasure a proposed new project,
> Apache Joshua, a statistical machine translation toolkit. The proposal
> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>
> Proposal text is copied below. I’ll leave the discussion open for a week
> and we are interested in folks who would like to be initial committers
> and mentors. Please discuss here on the thread.
>
> Thanks!
>
> Cheers,
> Chris (Champion)
>
> ———
>
> = Joshua Proposal =
>
> == Abstract ==
> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
> translation toolkit. It includes a Java-based decoder for translating with
> phrase-based, hierarchical, and syntax-based translation models, a
> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
> scripts for training and evaluating new models from parallel text.
>
> == Proposal ==
> Joshua is a state-of-the-art statistical machine translation system that
> provides a number of features:
>
>  * Support for the two main paradigms in statistical machine translation:
> phrase-based and hierarchical / syntactic.
>  * A sparse feature API that makes it easy to add new feature templates
> supporting millions of features
>  * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>  * Support for lattice decoding, allowing upstream NLP tools to expose
> their hypothesis space to the MT system
>  * An efficient representation for models, allowing for quick loading of
> multi-gigabyte model files
>  * Fast decoding speed (on par with Moses and mtplz)
>  * Language packs — precompiled models that allow the decoder to be run as
> a black box
>  * Thrax, a Hadoop-based tool for learning translation models from
> parallel text
>  * A suite of tools for constructing new models for any language pair for
> which sufficient training data exists
>
> == Background and Rationale ==
> A number of factors make this a good time for an Apache project focused on
> machine translation (MT): the quality of MT output (for many language
> pairs); the average computing resources available on computers, relative
> to the needs of MT systems; and the availability of a number of
> high-quality toolkits, together with a large base of researchers working
> on them.
>
> Over the past decade, machine translation (MT; the automatic translation
> of one human language to another) has become a reality. The research into
> statistical approaches to translation that began in the early nineties,
> together with the availability of large amounts of training data, and
> better computing infrastructure, have all come together to produce
> translation results that are “good enough” for a large set of language
> pairs and use cases. Free services like
> [[https://www.bing.com/translator|Bing Translator]] and
> [[https://translate.google.com|Google Translate]] have made machine
> translation available to the average person through direct interfaces and
> through tools like browser plugins, and sites across the world with higher
> translation needs use them to translate their pages automatically.
>
> MT does not require the infrastructure of large corporations in order to
> produce feasible output. Machine translation can be resource-intensive,
> but need not be prohibitively so. Disk and memory usage are mostly a
> matter of model size, which for most language pairs is a few gigabytes at
> most, at which size models can provide coverage on the order of tens or
> even hundreds of thousands of words in the input and output languages. The
> computational complexity of the algorithms used to search for translations
> of new sentences is typically linear in the number of words in the input
> sentence, making it possible to run a translation engine on a personal
> computer.
>
> The research community has produced many different open source translation
> projects for a range of programming languages and under a variety of
> licenses. These projects include the core “decoder”, which takes a model
> and uses it to translate new sentences between the language pair the model
> was defined for. They also typically include a large set of tools that
> enable new models to be built from large sets of example translations
> (“parallel data”) and monolingual texts. These toolkits are usually built
> to support the agendas of the (largely) academic researchers that build
> them: the repeated cycle of building new models, tuning model parameters
> against development data, and evaluating them against held-out test data,
> using standard metrics for testing the quality of MT output.
>
> Together, these three factors—the quality of machine translation output,
> the feasibility of translating on standard computers, and the availability
> of tools to build models—make it reasonable for the end users to use MT as
> a black-box service, and to run it on their personal machine.
>
> These factors make it a good time for an organization with the status of
> the Apache Software Foundation to host a machine translation project.
>
> == Current Status ==
> Joshua was originally ported from David Chiang’s Python implementation of
> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
> University. The current version is maintained by Matt Post at Johns
> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
> many releases with a list of over 20 source code tags. The last release of
> Joshua was 6.0.5 on November 5th, 2015.
>
> == Meritocracy ==
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want to
> encourage this style of development for the project.
>
> == Community ==
> Joshua is used widely across the world. Perhaps its biggest (known)
> research / industrial user is the Amazon research group in Berlin. Another
> user is the US Army Research Lab. No formal census has been undertaken,
> but posts to the Joshua technical support mailing list, along with the
> occasional contributions, suggest small research and academic communities
> spread across the world, many of them in India.
>
> During incubation, we will explicitly seek to increase our usage across
> the board, including academic research, industry, and other end users
> interested in statistical machine translation.
>
> == Core Developers ==
> The current set of core developers is fairly small, having fallen with the
> graduation from Johns Hopkins of some core student participants. However,
> Joshua is used fairly widely, as mentioned above, and there remains a
> commitment from the principal researcher at Johns Hopkins to continue to
> use and develop it. A number of new community members have recently become
> interested in Joshua because of its potential use in a number of ongoing
> DARPA projects such as XDATA and Memex.
>
> == Alignment ==
> Joshua is currently Copyright (c) 2015, Johns Hopkins University, All
> rights reserved, and licensed under the BSD 2-clause license. It would of
> course be the intention to relicense this code under AL2.0, which would
> permit expanded and increased use of the software within Apache projects.
> There is currently an ongoing effort within the Apache Tika community to
> utilize Joshua within Tika’s Translate API, see
> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>
> == Known Risks ==
>
> === Orphaned products ===
> At the moment, regular contributions are made by a single contributor, the
> lead maintainer. He (Matt Post) plans to continue development for the next
> few years, but this is still a single point of failure, since the graduate
> students who worked on the project have moved on to jobs, mostly in
> industry. However, our goal is to mitigate this risk by growing the
> community at Apache, at least with users and participants from NASA JPL.
>
> === Inexperience with Open Source ===
> The teams at both Johns Hopkins and NASA JPL have experience with many OSS
> projects at Apache and elsewhere. We understand "how it works" here at the
> foundation.
>
>
> == Relationships with Other Apache Products ==
> Joshua includes dependencies on Hadoop, and is also included as a plugin in
> Apache Tika. We are also interested in coordinating with other projects
> including Spark, and other projects needing MT services for language
> translation.
>
> == Developers ==
> Joshua has only one regular developer, who is employed by Johns Hopkins
> University. NASA JPL (Mattmann and McGibbney) has recently been
> contributing to the project, including a Brew formula and other
> contributions, through the DARPA XDATA and Memex programs.
>
> == Documentation ==
> Documentation and publications related to Joshua can be found at
> joshua-decoder.org. The source for the Joshua documentation is currently
> hosted on GitHub at
> https://github.com/joshua-decoder/joshua-decoder.github.com
>
> == Initial Source ==
> Current source resides at GitHub: github.com/joshua-decoder/joshua (the
> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
> extraction tool).
>
> == External Dependencies ==
> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
> needed for translating sentences with pre-built models). The rest are
> dependencies for the build system and pipeline, used for constructing and
> training new models from parallel text.
>
> Apache projects:
>  * Ant
>  * Hadoop
>  * Commons
>  * Maven
>  * Ivy
>
> There are also a number of other open-source projects with various
> licenses that the project depends on both dynamically (runtime), and
> statically.
>
> === GNU GPL 2 ===
>  * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>
> === LGPL 2.1 ===
>  * KenLM: github.com/kpu/kenlm
>
> === Apache 2.0 ===
>  * BerkeleyLM: https://code.google.com/p/berkeleylm/
>
> === GNU GPL ===
>  * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>
> == Required Resources ==
>  * Mailing Lists
>    * priv...@joshua.incubator.apache.org
>    * d...@joshua.incubator.apache.org
>    * comm...@joshua.incubator.apache.org
>
>  * Git Repos
>    * https://git-wip-us.apache.org/repos/asf/joshua.git
>
>  * Issue Tracking
>    * JIRA Joshua (JOSHUA)
>
>  * Continuous Integration
>    * Jenkins builds on https://builds.apache.org/
>
>  * Web
>    * http://joshua.incubator.apache.org/
>    * wiki at http://cwiki.apache.org
>
> == Initial Committers ==
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository on GitHub).
>
>  * Matt Post (p...@cs.jhu.edu)
>  * Lewis John McGibbney (lewi...@apache.org)
>  * Chris Mattmann (mattm...@apache.org)
>
> == Affiliations ==
>
>  * Johns Hopkins University
>    * Matt Post
>
>  * NASA JPL
>    * Chris Mattmann
>    * Lewis John McGibbney
>
>
> == Sponsors ==
> === Champion ===
>  * Chris Mattmann (NASA/JPL)
>
> === Nominated Mentors ===
>  * Paul Ramirez
>  * Lewis John McGibbney
>  * Chris Mattmann
>
> == Sponsoring Entity ==
> The Apache Incubator
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
