Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

Matt Post Wed, 20 Jan 2016 07:19:22 -0800

The dependencies can be split into two kinds: ones required for building new 
models, and ones needed by the decoder to translate new sentences with a 
pre-built model (i.e., black-box translation with the language packs).


1. For building new models, you need a way to align the words between sentences 
in parallel text. Both aligners used by Joshua (GIZA++ and the Berkeley 
aligner) are under some form of the GPL. They can be kept as external 
dependencies, or replaced with another aligner such as fast_align 
(https://github.com/clab/fast_align), which is Apache-licensed. There are many 
other options, in fact, so this should not be a worry.
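
To make the "external dependency" option concrete, here is a minimal Java 
sketch (not Joshua's actual code). It only checks whether an aligner 
executable is available on the PATH, so the toolkit itself ships no GPL code 
and merely calls whatever the user has installed. The class name 
ExternalToolLocator, the method findOnPath, and the default tool name in main 
are illustrative assumptions.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class ExternalToolLocator {

    /**
     * Look for an executable called toolName in each directory listed in
     * pathValue (a PATH-style string). Returns the first match, if any.
     */
    public static Optional<Path> findOnPath(String toolName, String pathValue) {
        if (toolName == null || pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, toolName);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The tool name is an assumption; pass a different one as the first argument.
        String tool = args.length > 0 ? args[0] : "fast_align";
        Optional<Path> found = findOnPath(tool, System.getenv("PATH"));
        if (found.isPresent()) {
            System.out.println("Using external aligner at " + found.get());
        } else {
            System.out.println("No '" + tool + "' on PATH; install an aligner to build models.");
        }
    }
}
```

The point of the design is that the aligner stays a user-installed runtime 
tool, much like a C compiler, rather than something bundled with the release.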

2. For doing black-box translation, one needs to represent the language model, 
which is very large. The best tool for this is KenLM (github.com/kpu/kenlm), 
which is LGPL 2.1. There is also BerkeleyLM, which is just as good for 
practical purposes and is Apache-licensed. KenLM is written in C++ and is 
loaded via JNI, whereas BerkeleyLM is written in Java. I have moved to 
including BerkeleyLM in language packs, because I can then include the Joshua 
runtime, and people can translate without having to compile anything.
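
As a rough illustration of why the pure-Java option needs no compilation, the 
sketch below tries to load a native library through JNI and falls back to a 
Java backend when the library is missing. This is an assumption-laden example, 
not how Joshua selects its language model: the class LmBackendChooser, the 
Backend enum, and the library name are all hypothetical.

```java
public class LmBackendChooser {

    /** The two kinds of language-model backend discussed above. */
    public enum Backend { NATIVE_JNI, PURE_JAVA }

    /**
     * Returns true if the JVM can load a native library with this name.
     * System.loadLibrary throws UnsatisfiedLinkError when the library is
     * not found on java.library.path.
     */
    public static boolean nativeLibraryAvailable(String libName) {
        try {
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    /** Prefer the native backend when it loads, otherwise use pure Java. */
    public static Backend choose(String libName) {
        return nativeLibraryAvailable(libName) ? Backend.NATIVE_JNI : Backend.PURE_JAVA;
    }

    public static void main(String[] args) {
        // Hypothetical library name; a real deployment would supply its own.
        String libName = args.length > 0 ? args[0] : "hypothetical_native_lm";
        System.out.println("Selected backend: " + choose(libName));
    }
}
```

On a machine where nothing has been compiled, the native load fails and the 
Java backend is chosen, which is the situation language-pack users are in.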

So in short, there are no hard dependencies on unfavorably-licensed external 
projects.

matt




> On Jan 20, 2016, at 10:08 AM, Mattmann, Chris A (3980) 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Hey Hen,
> 
> Matt Post (who I believe is monitoring this list, and who has
> been one of the key Joshua developers) and I have discussed this,
> and we believe that the GPL/LGPL dependencies can potentially:
> 
> 1. be replaced with category-A or category-B alternatives. Matt
> mentioned one already to me which has slipped my mind.
> 2. be made in such a way that they are external tools and the
> bindings exist in Joshua to call those external tools (aka runtime
> deps akin to depending on a C compiler, etc.)
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Henri Yandell <bay...@apache.org>
> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
> Date: Tuesday, January 19, 2016 at 7:38 PM
> To: "general@incubator.apache.org" <general@incubator.apache.org>
> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
> Translation Toolkit
> 
>> License-wise, any expectation of problems from the GPL and LGPL
>> dependencies?
>> 
>> On Mon, Jan 18, 2016 at 9:58 PM, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>> 
>>> Great Hen, we’d love to have you on board as a mentor! Please
>>> add yourself to the proposal on the wiki.
>>> 
>>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility;
>>> please feel free to reply to general@i.a.o.
>>> 
>>> I’ll leave the DISCUSS thread open for a few more days.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Henri Yandell <bay...@apache.org>
>>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>>> Date: Monday, January 18, 2016 at 7:57 PM
>>> To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
>>> "general@incubator.apache.org" <general@incubator.apache.org>
>>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>> Translation Toolkit
>>> 
>>>> Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>>>> mentoring.
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
>>>>> Reply-To: "general@incubator.apache.org"
>>> <general@incubator.apache.org>
>>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
>>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>>>> Translation
>>>>> Toolkit
>>>>> 
>>>>>> Hi Everyone,
>>>>>> 
>>>>>> Please find attached for your viewing pleasure a proposed new
>>> project,
>>>>>> Apache Joshua, a statistical machine translation toolkit. The
>>> proposal
>>>>>> is in wiki draft form at:
>>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>>> 
>>>>>> Proposal text is copied below. I’ll leave the discussion open for a
>>>>> week
>>>>>> and we are interested in folks who would like to be initial
>>> committers
>>>>>> and mentors. Please discuss here on the thread.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris (Champion)
>>>>>> 
>>>>>> ———
>>>>>> 
>>>>>> = Joshua Proposal =
>>>>>> 
>>>>>> == Abstract ==
>>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>>> translation toolkit. It includes a Java-based decoder for
>>> translating
>>>>> with
>>>>>> phrase-based, hierarchical, and syntax-based translation models, a
>>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of
>>> tools
>>>>> and
>>>>>> scripts for training and evaluating new models from parallel text.
>>>>>> 
>>>>>> == Proposal ==
>>>>>> Joshua is a state-of-the-art statistical machine translation system
>>>>> that
>>>>>> provides a number of features:
>>>>>> 
>>>>>> * Support for the two main paradigms in statistical machine
>>>>> translation:
>>>>>> phrase-based and hierarchical / syntactic.
>>>>>> * A sparse feature API that makes it easy to add new feature
>>> templates
>>>>>> supporting millions of features
>>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and
>>> AdaGrad)
>>>>>> * Support for lattice decoding, allowing upstream NLP tools to
>>> expose
>>>>>> their hypothesis space to the MT system
>>>>>> * An efficient representation for models, allowing for quick
>>> loading
>>>>> of
>>>>>> multi-gigabyte model files
>>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>>> * Language packs — precompiled models that allow the decoder to be
>>>>> run as
>>>>>> a black box
>>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>>> parallel text
>>>>>> * A suite of tools for constructing new models for any language
>>> pair
>>>>> for
>>>>>> which sufficient training data exists
>>>>>> 
>>>>>> == Background and Rationale ==
>>>>>> A number of factors make this a good time for an Apache project
>>>>> focused on
>>>>>> machine translation (MT): the quality of MT output (for many
>>> language
>>>>>> pairs); the average computing resources available on computers,
>>>>> relative
>>>>>> to the needs of MT systems; and the availability of a number of
>>>>>> high-quality toolkits, together with a large base of researchers
>>>>> working
>>>>>> on them.
>>>>>> 
>>>>>> Over the past decade, machine translation (MT; the automatic
>>>>> translation
>>>>>> of one human language to another) has become a reality. The research
>>>>> into
>>>>>> statistical approaches to translation that began in the early
>>> nineties,
>>>>>> together with the availability of large amounts of training data,
>>> and
>>>>>> better computing infrastructure, have all come together to produce
>>>>>> translation results that are “good enough” for a large set of
>>> language
>>>>>> pairs and use cases. Free services like
>>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>>> [[https://translate.google.com|Google Translate]] have made these
>>>>> services
>>>>>> available to the average person through direct interfaces and
>>> through
>>>>>> tools like browser plugins, and sites across the world with higher
>>>>>> translation needs use them to translate their pages
>>>>> automatically.
>>>>>> 
>>>>>> MT does not require the infrastructure of large corporations in
>>> order
>>>>> to
>>>>>> produce feasible output. Machine translation can be
>>> resource-intensive,
>>>>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>>>>> matter of model size, which for most language pairs is a few
>>> gigabytes
>>>>> at
>>>>>> most, at which size models can provide coverage on the order of
>>> tens or
>>>>>> even hundreds of thousands of words in the input and output
>>> languages.
>>>>> The
>>>>>> computational complexity of the algorithms used to search for
>>>>> translations
>>>>>> of new sentences is typically linear in the number of words in the
>>>>> input
>>>>>> sentence, making it possible to run a translation engine on a
>>> personal
>>>>>> computer.
>>>>>> 
>>>>>> The research community has produced many different open source
>>>>> translation
>>>>>> projects for a range of programming languages and under a variety of
>>>>>> licenses. These projects include the core “decoder”, which takes a
>>>>> model
>>>>>> and uses it to translate new sentences between the language pair the
>>>>> model
>>>>>> was defined for. They also typically include a large set of tools
>>> that
>>>>>> enable new models to be built from large sets of example
>>> translations
>>>>>> (“parallel data”) and monolingual texts. These toolkits are usually
>>>>> built
>>>>>> to support the agendas of the (largely) academic researchers that
>>> build
>>>>>> them: the repeated cycle of building new models, tuning model
>>>>> parameters
>>>>>> against development data, and evaluating them against held-out test
>>>>> data,
>>>>>> using standard metrics for testing the quality of MT output.
>>>>>> 
>>>>>> Together, these three factors—the quality of machine translation
>>>>> output,
>>>>>> the feasibility of translating on standard computers, and the
>>>>> availability
>>>>>> of tools to build models—make it reasonable for the end users to use
>>>>> MT as
>>>>>> a black-box service, and to run it on their personal machine.
>>>>>> 
>>>>>> These factors make it a good time for an organization with the
>>> status
>>>>> of
>>>>>> the Apache Foundation to host a machine translation project.
>>>>>> 
>>>>>> == Current Status ==
>>>>>> Joshua was originally ported from David Chiang’s Python
>>> implementation
>>>>> of
>>>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>>> University. The current version is maintained by Matt Post at Johns
>>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has
>>>>> made
>>>>>> many releases with a list of over 20 source code tags. The last
>>>>> release of
>>>>>> Joshua was 6.0.5 on November 5th, 2015.
>>>>>> 
>>>>>> == Meritocracy ==
>>>>>> The current developers are familiar with meritocratic open source
>>>>>> development at Apache. Apache was chosen specifically because we
>>> want
>>>>> to
>>>>>> encourage this style of development for the project.
>>>>>> 
>>>>>> == Community ==
>>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>>> research / industrial user is the Amazon research group in Berlin.
>>>>> Another
>>>>>> user is the US Army Research Lab. No formal census has been
>>> undertaken,
>>>>>> but posts to the Joshua technical support mailing list, along with
>>> the
>>>>>> occasional contributions, suggest small research and academic
>>>>> communities
>>>>>> spread across the world, many of them in India.
>>>>>> 
>>>>>> During incubation, we will explicitly seek to increase our usage
>>> across
>>>>>> the board, including academic research, industry, and other end
>>> users
>>>>>> interested in statistical machine translation.
>>>>>> 
>>>>>> == Core Developers ==
>>>>>> The current set of core developers is fairly small, having fallen
>>> with
>>>>> the
>>>>>> graduation from Johns Hopkins of some core student participants.
>>>>> However,
>>>>>> Joshua is used fairly widely, as mentioned above, and there remains
>>> a
>>>>>> commitment from the principal researcher at Johns Hopkins to
>>> continue
>>>>> to
>>>>>> use and develop it. Joshua has seen a number of new community
>>> members
>>>>>> become interested recently due to a potential for its projected use
>>> in
>>>>> a
>>>>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>>>> 
>>>>>> == Alignment ==
>>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>>>> rights reserved, and licensed under the BSD 2-clause license. It would of
>>>>>> course be the intention to relicense this code under AL2.0 which
>>> would
>>>>>> permit expanded and increased use of the software within Apache
>>>>> projects.
>>>>>> There is currently an ongoing effort within the Apache Tika
>>> community
>>>>> to
>>>>>> utilize Joshua within Tika’s Translate API, see
>>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>>> 
>>>>>> == Known Risks ==
>>>>>> 
>>>>>> === Orphaned products ===
>>>>>> At the moment, regular contributions are made by a single
>>> contributor,
>>>>> the
>>>>>> lead maintainer. He (Matt Post) plans to continue development for
>>> the
>>>>> next
>>>>>> few years, but it is still a single point of failure, since the
>>>>> graduate
>>>>>> students who worked on the project have moved on to jobs, mostly in
>>>>>> industry. However, our goal is to help that process by growing the
>>>>>> community in Apache, and at least in growing the community with
>>> users
>>>>> and
>>>>>> participants from NASA JPL.
>>>>>> 
>>>>>> === Inexperience with Open Source ===
>>>>>> The teams at both Johns Hopkins and NASA JPL have experience with
>>> many
>>>>> OSS
>>>>>> software projects at Apache and elsewhere. We understand "how it
>>> works"
>>>>>> here at the foundation.
>>>>>> 
>>>>>> 
>>>>>> == Relationships with Other Apache Products ==
>>>>>> Joshua includes dependencies on Hadoop, and also is included as a
>>>>> plugin in
>>>>>> Apache Tika. We are also interested in coordinating with other
>>> projects
>>>>>> including Spark, and other projects needing MT services for language
>>>>>> translation.
>>>>>> 
>>>>>> == Developers ==
>>>>>> Joshua only has one regular developer who is employed by Johns
>>> Hopkins
>>>>>> University. NASA JPL (Mattmann and McGibbney) has recently contributed
>>>>>> a Brew formula and other changes to the project through the DARPA
>>>>>> XDATA and Memex programs.
>>>>>> 
>>>>>> == Documentation ==
>>>>>> Documentation and publications related to Joshua can be found at
>>>>>> joshua-decoder.org. The source for the Joshua documentation is
>>>>> currently
>>>>>> hosted on Github at
>>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>>> 
>>>>>> == Initial Source ==
>>>>>> Current source resides at Github: github.com/joshua-decoder/joshua
>>> (the
>>>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the
>>>>> grammar
>>>>>> extraction tool).
>>>>>> 
>>>>>> == External Dependencies ==
>>>>>> Joshua has a number of external dependencies. Only BerkeleyLM
>>> (Apache
>>>>> 2.0)
>>>>>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of
>>> which is
>>>>>> needed for translating sentences with pre-built models). The rest
>>> are
>>>>>> dependencies for the build system and pipeline, used for
>>> constructing
>>>>> and
>>>>>> training new models from parallel text.
>>>>>> 
>>>>>> Apache projects:
>>>>>> * Ant
>>>>>> * Hadoop
>>>>>> * Commons
>>>>>> * Maven
>>>>>> * Ivy
>>>>>> 
>>>>>> There are also a number of other open-source projects with various
>>>>>> licenses that the project depends on both dynamically (runtime), and
>>>>>> statically.
>>>>>> 
>>>>>> === GNU GPL 2 ===
>>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>>> 
>>>>>> === LGPL 2.1 ===
>>>>>> * KenLM: github.com/kpu/kenlm
>>>>>> 
>>>>>> === Apache 2.0 ===
>>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>>> 
>>>>>> === GNU GPL ===
>>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>>> 
>>>>>> == Required Resources ==
>>>>>> * Mailing Lists
>>>>>>  * priv...@joshua.incubator.apache.org
>>>>>>  * d...@joshua.incubator.apache.org
>>>>>>  * comm...@joshua.incubator.apache.org
>>>>>> 
>>>>>> * Git Repos
>>>>>>  * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>>> 
>>>>>> * Issue Tracking
>>>>>>  * JIRA Joshua (JOSHUA)
>>>>>> 
>>>>>> * Continuous Integration
>>>>>>  * Jenkins builds on https://builds.apache.org/
>>>>>> 
>>>>>> * Web
>>>>>>  * http://joshua.incubator.apache.org/
>>>>>>  * wiki at http://cwiki.apache.org
>>>>>> 
>>>>>> == Initial Committers ==
>>>>>> The following is a list of the planned initial Apache committers
>>> (the
>>>>>> active subset of the committers for the current repository on
>>> Github).
>>>>>> 
>>>>>> * Matt Post (p...@cs.jhu.edu)
>>>>>> * Lewis John McGibbney (lewi...@apache.org)
>>>>>> * Chris Mattmann (mattm...@apache.org)
>>>>>> 
>>>>>> == Affiliations ==
>>>>>> 
>>>>>> * Johns Hopkins University
>>>>>>  * Matt Post
>>>>>> 
>>>>>> * NASA JPL
>>>>>>  * Chris Mattmann
>>>>>>  * Lewis John McGibbney
>>>>>> 
>>>>>> 
>>>>>> == Sponsors ==
>>>>>> === Champion ===
>>>>>> * Chris Mattmann (NASA/JPL)
>>>>>> 
>>>>>> === Nominated Mentors ===
>>>>>> * Paul Ramirez
>>>>>> * Lewis John McGibbney
>>>>>> * Chris Mattmann
>>>>>> 
>>>>>> == Sponsoring Entity ==
>>>>>> The Apache Incubator
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>> 
> 
> 


