Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

Mattmann, Chris A (3980) Mon, 18 Jan 2016 21:59:26 -0800

Great, Hen, we’d love to have you on board as a mentor! Please
add yourself to the proposal on the wiki.


Anyone else have interest in Machine Translation? Any OpenNLP folks,
Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility;
please feel free to reply to general@i.a.o.

I’ll leave the DISCUSS thread open for a few more days.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Henri Yandell <bay...@apache.org>
Reply-To: "gene...@incubator.apache.org" <gene...@incubator.apache.org>
Date: Monday, January 18, 2016 at 7:57 PM
To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
"gene...@incubator.apache.org" <gene...@incubator.apache.org>
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
Translation Toolkit

>Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>mentoring.
>
>
>> -----Original Message-----
>> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
>> Reply-To: "gene...@incubator.apache.org" <gene...@incubator.apache.org>
>> Date: Tuesday, January 12, 2016 at 10:56 PM
>> To: "gene...@incubator.apache.org" <gene...@incubator.apache.org>
>> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>Translation
>> Toolkit
>>
>> >Hi Everyone,
>> >
>> >Please find attached for your viewing pleasure a proposed new project,
>> >Apache Joshua, a statistical machine translation toolkit. The proposal
>> >is in wiki draft form at:
>> https://wiki.apache.org/incubator/JoshuaProposal
>> >
>> >Proposal text is copied below. I’ll leave the discussion open for a
>>week
>> >and we are interested in folks who would like to be initial committers
>> >and mentors. Please discuss here on the thread.
>> >
>> >Thanks!
>> >
>> >Cheers,
>> >Chris (Champion)
>> >
>> >———
>> >
>> >= Joshua Proposal =
>> >
>> >== Abstract ==
>> >[[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> >translation toolkit. It includes a Java-based decoder for translating
>>with
>> >phrase-based, hierarchical, and syntax-based translation models, a
>> >Hadoop-based grammar extractor (Thrax), and an extensive set of tools
>>and
>> >scripts for training and evaluating new models from parallel text.
>> >
>> >== Proposal ==
>> >Joshua is a state of the art statistical machine translation system
>>that
>> >provides a number of features:
>> >
>> > * Support for the two main paradigms in statistical machine
>>translation:
>> >phrase-based and hierarchical / syntactic.
>> > * A sparse feature API that makes it easy to add new feature templates
>> >supporting millions of features
>> > * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> > * Support for lattice decoding, allowing upstream NLP tools to expose
>> >their hypothesis space to the MT system
>> > * An efficient representation for models, allowing for quick loading
>>of
>> >multi-gigabyte model files
>> > * Fast decoding speed (on par with Moses and mtplz)
>> > * Language packs — precompiled models that allow the decoder to be
>>run as
>> >a black box
>> > * Thrax, a Hadoop-based tool for learning translation models from
>> >parallel text
>> > * A suite of tools for constructing new models for any language pair
>>for
>> >which sufficient training data exists
>> >
>> >== Background and Rationale ==
>> >A number of factors make this a good time for an Apache project
>>focused on
>> >machine translation (MT): the quality of MT output (for many language
>> >pairs); the average computing resources available on computers,
>>relative
>> >to the needs of MT systems; and the availability of a number of
>> >high-quality toolkits, together with a large base of researchers
>>working
>> >on them.
>> >
>> >Over the past decade, machine translation (MT; the automatic
>>translation
>> >of one human language to another) has become a reality. The research
>>into
>> >statistical approaches to translation that began in the early nineties,
>> >together with the availability of large amounts of training data, and
>> >better computing infrastructure, have all come together to produce
>> >translation results that are “good enough” for a large set of language
>> >pairs and use cases. Free services like
>> >[[https://www.bing.com/translator|Bing Translator]] and
>> >[[https://translate.google.com|Google Translate]] have made these
>> services
>> >available to the average person through direct interfaces and through
>> >tools like browser plugins, and sites across the world with higher
>> >translation needs use them to translate their pages automatically.
>> >
>> >MT does not require the infrastructure of large corporations in order
>>to
>> >produce feasible output. Machine translation can be resource-intensive,
>> >but need not be prohibitively so. Disk and memory usage are mostly a
>> >matter of model size, which for most language pairs is a few gigabytes
>>at
>> >most, at which size models can provide coverage on the order of tens or
>> >even hundreds of thousands of words in the input and output languages.
>>The
>> >computational complexity of the algorithms used to search for
>>translations
>> >of new sentences is typically linear in the number of words in the
>>input
>> >sentence, making it possible to run a translation engine on a personal
>> >computer.
>> >
>> >The research community has produced many different open source
>>translation
>> >projects for a range of programming languages and under a variety of
>> >licenses. These projects include the core “decoder”, which takes a
>>model
>> >and uses it to translate new sentences between the language pair the
>>model
>> >was defined for. They also typically include a large set of tools that
>> >enable new models to be built from large sets of example translations
>> >(“parallel data”) and monolingual texts. These toolkits are usually
>>built
>> >to support the agendas of the (largely) academic researchers that build
>> >them: the repeated cycle of building new models, tuning model
>>parameters
>> >against development data, and evaluating them against held-out test
>>data,
>> >using standard metrics for testing the quality of MT output.
>> >
>> >Together, these three factors—the quality of machine translation
>>output,
>> >the feasibility of translating on standard computers, and the
>>availability
>> >of tools to build models—make it reasonable for the end users to use
>>MT as
>> >a black-box service, and to run it on their personal machine.
>> >
>> >These factors make it a good time for an organization with the status
>>of
>> >the Apache Software Foundation to host a machine translation project.
>> >
>> >== Current Status ==
>> >Joshua was originally ported from David Chiang’s Python implementation
>>of
>> >Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>> >University. The current version is maintained by Matt Post at Johns
>> >Hopkins’ Human Language Technology Center of Excellence. Joshua has
>>made
>> >many releases with a list of over 20 source code tags. The last
>>release of
>> >Joshua was 6.0.5 on November 5th, 2015.
>> >
>> >== Meritocracy ==
>> >The current developers are familiar with meritocratic open source
>> >development at Apache. Apache was chosen specifically because we want
>>to
>> >encourage this style of development for the project.
>> >
>> >== Community ==
>> >Joshua is used widely across the world. Perhaps its biggest (known)
>> >research / industrial user is the Amazon research group in Berlin.
>>Another
>> >user is the US Army Research Lab. No formal census has been undertaken,
>> >but posts to the Joshua technical support mailing list, along with the
>> >occasional contributions, suggest small research and academic
>>communities
>> >spread across the world, many of them in India.
>> >
>> >During incubation, we will explicitly seek to increase our usage across
>> >the board, including academic research, industry, and other end users
>> >interested in statistical machine translation.
>> >
>> >== Core Developers ==
>> >The current set of core developers is fairly small, having fallen with
>>the
>> >graduation from Johns Hopkins of some core student participants.
>>However,
>> >Joshua is used fairly widely, as mentioned above, and there remains a
>> >commitment from the principal researcher at Johns Hopkins to continue
>>to
>> >use and develop it. Joshua has seen a number of new community members
>> >become interested recently due to a potential for its projected use in
>>a
>> >number of ongoing DARPA projects such as XDATA and Memex.
>> >
>> >== Alignment ==
>> >Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>> >rights reserved, and licensed under the BSD 2-clause license. It would of
>> >course be the intention to relicense this code under AL2.0 which would
>> >permit expanded and increased use of the software within Apache
>>projects.
>> >There is currently an ongoing effort within the Apache Tika community
>>to
>> >utilize Joshua within Tika’s Translate API, see
>> >[[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>> >
>> >== Known Risks ==
>> >
>> >=== Orphaned products ===
>> >At the moment, regular contributions are made by a single contributor,
>>the
>> >lead maintainer. He (Matt Post) plans to continue development for the
>>next
>> >few years, but it is still a single point of failure, since the
>>graduate
>> >students who worked on the project have moved on to jobs, mostly in
>> >industry. However, our goal is to reduce that risk by growing the
>> >community in Apache, at the very least with users and participants
>> >from NASA JPL.
>> >
>> >=== Inexperience with Open Source ===
>> >The team both at Johns Hopkins and NASA JPL have experience with many
>>OSS
>> >software projects at Apache and elsewhere. We understand "how it works"
>> >here at the foundation.
>> >
>> >
>> >== Relationships with Other Apache Products ==
>> >Joshua includes dependencies on Hadoop, and also is included as a
>>plugin in
>> >Apache Tika. We are also interested in coordinating with other projects
>> >including Spark, and other projects needing MT services for language
>> >translation.
>> >
>> >== Developers ==
>> >Joshua only has one regular developer who is employed by Johns Hopkins
>> >University. NASA JPL (Mattmann and McGibbney) have recently contributed
>> >a Brew formula and other changes to the project
>> >through the DARPA XDATA and Memex programs.
>> >
>> >== Documentation ==
>> >Documentation and publications related to Joshua can be found at
>> >joshua-decoder.org. The source for the Joshua documentation is
>>currently
>> >hosted on Github at
>> >https://github.com/joshua-decoder/joshua-decoder.github.com
>> >
>> >== Initial Source ==
>> >Current source resides at Github: github.com/joshua-decoder/joshua (the
>> >main decoder and toolkit) and github.com/joshua-decoder/thrax (the
>> grammar
>> >extraction tool).
>> >
>> >== External Dependencies ==
>> >Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>>2.0)
>> >and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>> >needed for translating sentences with pre-built models). The rest are
>> >dependencies for the build system and pipeline, used for constructing
>>and
>> >training new models from parallel text.
>> >
>> >Apache projects:
>> > * Ant
>> > * Hadoop
>> > * Commons
>> > * Maven
>> > * Ivy
>> >
>> >There are also a number of other open-source projects with various
>> >licenses that the project depends on both dynamically (runtime), and
>> >statically.
>> >
>> >=== GNU GPL 2 ===
>> > * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>> >
>> >=== LGPL 2.1 ===
>> > * KenLM: github.com/kpu/kenlm
>> >
>> >=== Apache 2.0 ===
>> > * BerkeleyLM: https://code.google.com/p/berkeleylm/
>> >
>> >=== GNU GPL ===
>> > * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>> >
>> >== Required Resources ==
>> > * Mailing Lists
>> >   * priv...@joshua.incubator.apache.org
>> >   * d...@joshua.incubator.apache.org
>> >   * comm...@joshua.incubator.apache.org
>> >
>> > * Git Repos
>> >   * https://git-wip-us.apache.org/repos/asf/joshua.git
>> >
>> > * Issue Tracking
>> >   * JIRA Joshua (JOSHUA)
>> >
>> > * Continuous Integration
>> >   * Jenkins builds on https://builds.apache.org/
>> >
>> > * Web
>> >   * http://joshua.incubator.apache.org/
>> >   * wiki at http://cwiki.apache.org
>> >
>> >== Initial Committers ==
>> >The following is a list of the planned initial Apache committers (the
>> >active subset of the committers for the current repository on Github).
>> >
>> > * Matt Post (p...@cs.jhu.edu)
>> > * Lewis John McGibbney (lewi...@apache.org)
>> > * Chris Mattmann (mattm...@apache.org)
>> >
>> >== Affiliations ==
>> >
>> > * Johns Hopkins University
>> >   * Matt Post
>> >
>> > * NASA JPL
>> >   * Chris Mattmann
>> >   * Lewis John McGibbney
>> >
>> >
>> >== Sponsors ==
>> >=== Champion ===
>> > * Chris Mattmann (NASA/JPL)
>> >
>> >=== Nominated Mentors ===
>> > * Paul Ramirez
>> > * Lewis John McGibbney
>> > * Chris Mattmann
>> >
>> >== Sponsoring Entity ==
>> >The Apache Incubator
>> >
>> >
>> >
>> >
>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >Chris Mattmann, Ph.D.
>> >Chief Architect
>> >Instrument Software and Science Data Systems Section (398)
>> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >Office: 168-519, Mailstop: 168-527
>> >Email: chris.a.mattm...@nasa.gov
>> >WWW:  http://sunset.usc.edu/~mattmann/
>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >Adjunct Associate Professor, Computer Science Department
>> >University of Southern California, Los Angeles, CA 90089 USA
>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >
>> >
>> >
