The dependencies can be split into two kinds: ones required for building new models, and ones needed by the decoder to translate new sentences with a pre-built model (i.e., black-box translation with the language packs).
1. For building new models, you need a way to align the words between sentences in parallel text. Both aligners used by Joshua (GIZA++ and the Berkeley aligner) are under some form of the GPL. They can be treated as external dependencies, or replaced with another aligner such as fast_align (https://github.com/clab/fast_align), which is Apache-licensed. There are many other options as well, so this should not be a worry.

2. For doing black-box translation, one needs to represent the language model, which is very large. The best tool for this is KenLM (github.com/kpu/kenlm), which is LGPL 2.1. There is also BerkeleyLM, which is just as good for practical purposes and is Apache-licensed. KenLM is written in C++ and is loaded via JNI, whereas BerkeleyLM is written in Java. I have moved to including BerkeleyLM in language packs, because I can then include the Joshua runtime, and people can translate without having to compile anything.

So in short, there are no hard dependencies on unfavorably-licensed external projects.

matt

> On Jan 20, 2016, at 10:08 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Hey Hen,
>
> Matt Post, who I believe is monitoring this list and who has
> been one of the key Joshua developers, and I have discussed this,
> and we believe that GPL/LGPL dependencies can potentially:
>
> 1. be replaced with category-A or category-B alternatives. Matt
>    mentioned one already to me which has slipped my mind.
> 2. be made in such a way that they are external tools and the
>    bindings exist in Joshua to call those external tools (aka runtime
>    deps akin to depending on a C compiler, etc.)
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Henri Yandell <bay...@apache.org>
> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
> Date: Tuesday, January 19, 2016 at 7:38 PM
> To: "general@incubator.apache.org" <general@incubator.apache.org>
> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>
>> License-wise, any expectation of problems from the GPL and LGPL
>> dependencies?
>>
>> On Mon, Jan 18, 2016 at 9:58 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>>> Great Hen, we’d love to have you on board as a mentor! Please
>>> add yourself to the proposal on the wiki.
>>>
>>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility
>>> please feel free to reply to general@i.a.o.
>>>
>>> I’ll leave the DISCUSS thread open for a few more days.
>>>
>>> Cheers,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Henri Yandell <bay...@apache.org>
>>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>>> Date: Monday, January 18, 2016 at 7:57 PM
>>> To: jpluser <chris.a.mattm...@jpl.nasa.gov>, "general@incubator.apache.org" <general@incubator.apache.org>
>>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>
>>>> Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>>>> mentoring.
>>>>
>>>>> -----Original Message-----
>>>>> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
>>>>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
>>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> Please find attached for your viewing pleasure a proposed new project,
>>>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>>>> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>>>>>>
>>>>>> Proposal text is copied below. I’ll leave the discussion open for a week
>>>>>> and we are interested in folks who would like to be initial committers
>>>>>> and mentors.
>>>>>> Please discuss here on the thread.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Cheers,
>>>>>> Chris (Champion)
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> = Joshua Proposal =
>>>>>>
>>>>>> == Abstract ==
>>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>>> translation toolkit. It includes a Java-based decoder for translating with
>>>>>> phrase-based, hierarchical, and syntax-based translation models, a
>>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>>>>>> scripts for training and evaluating new models from parallel text.
>>>>>>
>>>>>> == Proposal ==
>>>>>> Joshua is a state-of-the-art statistical machine translation system that
>>>>>> provides a number of features:
>>>>>>
>>>>>> * Support for the two main paradigms in statistical machine translation:
>>>>>>   phrase-based and hierarchical / syntactic.
>>>>>> * A sparse feature API that makes it easy to add new feature templates
>>>>>>   supporting millions of features
>>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>>>   their hypothesis space to the MT system
>>>>>> * An efficient representation for models, allowing for quick loading of
>>>>>>   multi-gigabyte model files
>>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>>> * Language packs: precompiled models that allow the decoder to be run as
>>>>>>   a black box
>>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>>>   parallel text
>>>>>> * A suite of tools for constructing new models for any language pair for
>>>>>>   which sufficient training data exists
>>>>>>
>>>>>> == Background and Rationale ==
>>>>>> A number of factors make this a good time for an Apache project focused on
>>>>>> machine translation (MT): the quality of MT output (for many language
>>>>>> pairs); the average computing resources
>>>>>> available on computers, relative
>>>>>> to the needs of MT systems; and the availability of a number of
>>>>>> high-quality toolkits, together with a large base of researchers working
>>>>>> on them.
>>>>>>
>>>>>> Over the past decade, machine translation (MT; the automatic translation
>>>>>> of one human language to another) has become a reality. The research into
>>>>>> statistical approaches to translation that began in the early nineties,
>>>>>> together with the availability of large amounts of training data, and
>>>>>> better computing infrastructure, have all come together to produce
>>>>>> translation results that are “good enough” for a large set of language
>>>>>> pairs and use cases. Free services like
>>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>>> [[https://translate.google.com|Google Translate]] have made these services
>>>>>> available to the average person through direct interfaces and through
>>>>>> tools like browser plugins, and sites across the world with higher
>>>>>> translation needs use them to translate their pages automatically.
>>>>>>
>>>>>> MT does not require the infrastructure of large corporations in order to
>>>>>> produce feasible output. Machine translation can be resource-intensive,
>>>>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>>>>> matter of model size, which for most language pairs is a few gigabytes at
>>>>>> most, at which size models can provide coverage on the order of tens or
>>>>>> even hundreds of thousands of words in the input and output languages. The
>>>>>> computational complexity of the algorithms used to search for translations
>>>>>> of new sentences is typically linear in the number of words in the input
>>>>>> sentence, making it possible to run a translation engine on a personal
>>>>>> computer.
>>>>>>
>>>>>> The research community has produced many different open source translation
>>>>>> projects for a range of programming languages and under a variety of
>>>>>> licenses. These projects include the core “decoder”, which takes a model
>>>>>> and uses it to translate new sentences between the language pair the model
>>>>>> was defined for. They also typically include a large set of tools that
>>>>>> enable new models to be built from large sets of example translations
>>>>>> (“parallel data”) and monolingual texts. These toolkits are usually built
>>>>>> to support the agendas of the (largely) academic researchers that build
>>>>>> them: the repeated cycle of building new models, tuning model parameters
>>>>>> against development data, and evaluating them against held-out test data,
>>>>>> using standard metrics for testing the quality of MT output.
>>>>>>
>>>>>> Together, these three factors (the quality of machine translation output,
>>>>>> the feasibility of translating on standard computers, and the availability
>>>>>> of tools to build models) make it reasonable for the end users to use MT as
>>>>>> a black-box service, and to run it on their personal machine.
>>>>>>
>>>>>> These factors make it a good time for an organization with the status of
>>>>>> the Apache Foundation to host a machine translation project.
>>>>>>
>>>>>> == Current Status ==
>>>>>> Joshua was originally ported from David Chiang’s Python implementation of
>>>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>>> University. The current version is maintained by Matt Post at Johns
>>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>>>>>> many releases with a list of over 20 source code tags. The last release of
>>>>>> Joshua was 6.0.5 on November 5th, 2015.
>>>>>>
>>>>>> == Meritocracy ==
>>>>>> The current developers are familiar with meritocratic open source
>>>>>> development at Apache. Apache was chosen specifically because we want to
>>>>>> encourage this style of development for the project.
>>>>>>
>>>>>> == Community ==
>>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>>> research / industrial user is the Amazon research group in Berlin. Another
>>>>>> user is the US Army Research Lab. No formal census has been undertaken,
>>>>>> but posts to the Joshua technical support mailing list, along with the
>>>>>> occasional contributions, suggest small research and academic communities
>>>>>> spread across the world, many of them in India.
>>>>>>
>>>>>> During incubation, we will explicitly seek to increase our usage across
>>>>>> the board, including academic research, industry, and other end users
>>>>>> interested in statistical machine translation.
>>>>>>
>>>>>> == Core Developers ==
>>>>>> The current set of core developers is fairly small, having fallen with the
>>>>>> graduation from Johns Hopkins of some core student participants. However,
>>>>>> Joshua is used fairly widely, as mentioned above, and there remains a
>>>>>> commitment from the principal researcher at Johns Hopkins to continue to
>>>>>> use and develop it. Joshua has seen a number of new community members
>>>>>> become interested recently due to a potential for its projected use in a
>>>>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>>>>
>>>>>> == Alignment ==
>>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>>>> rights reserved, and licensed under the BSD 2-clause license. It would of
>>>>>> course be the intention to relicense this code under AL2.0 which would
>>>>>> permit expanded and increased use of the software within Apache projects.
>>>>>> There is currently an ongoing effort within the Apache Tika community to
>>>>>> utilize Joshua within Tika’s Translate API, see
>>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>>>
>>>>>> == Known Risks ==
>>>>>>
>>>>>> === Orphaned products ===
>>>>>> At the moment, regular contributions are made by a single contributor, the
>>>>>> lead maintainer. He (Matt Post) plans to continue development for the next
>>>>>> few years, but it is still a single point of failure, since the graduate
>>>>>> students who worked on the project have moved on to jobs, mostly in
>>>>>> industry. However, our goal is to help that process by growing the
>>>>>> community in Apache, and at least in growing the community with users and
>>>>>> participants from NASA JPL.
>>>>>>
>>>>>> === Inexperience with Open Source ===
>>>>>> The team both at Johns Hopkins and NASA JPL have experience with many OSS
>>>>>> software projects at Apache and elsewhere. We understand "how it works"
>>>>>> here at the foundation.
>>>>>>
>>>>>> == Relationships with Other Apache Products ==
>>>>>> Joshua includes dependencies on Hadoop, and also is included as a plugin in
>>>>>> Apache Tika. We are also interested in coordinating with other projects
>>>>>> including Spark, and other projects needing MT services for language
>>>>>> translation.
>>>>>>
>>>>>> == Developers ==
>>>>>> Joshua only has one regular developer who is employed by Johns Hopkins
>>>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>>>>> lately including a Brew formula and other contributions to the project
>>>>>> through the DARPA XDATA and Memex programs.
>>>>>>
>>>>>> == Documentation ==
>>>>>> Documentation and publications related to Joshua can be found at
>>>>>> joshua-decoder.org.
>>>>>> The source for the Joshua documentation is currently
>>>>>> hosted on Github at
>>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>>>
>>>>>> == Initial Source ==
>>>>>> Current source resides at Github: github.com/joshua-decoder/joshua (the
>>>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>>>>>> extraction tool).
>>>>>>
>>>>>> == External Dependencies ==
>>>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>>>>>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>>>>>> needed for translating sentences with pre-built models). The rest are
>>>>>> dependencies for the build system and pipeline, used for constructing and
>>>>>> training new models from parallel text.
>>>>>>
>>>>>> Apache projects:
>>>>>> * Ant
>>>>>> * Hadoop
>>>>>> * Commons
>>>>>> * Maven
>>>>>> * Ivy
>>>>>>
>>>>>> There are also a number of other open-source projects with various
>>>>>> licenses that the project depends on both dynamically (runtime), and
>>>>>> statically.
>>>>>>
>>>>>> === GNU GPL 2 ===
>>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>>>
>>>>>> === LGPL 2.1 ===
>>>>>> * KenLM: github.com/kpu/kenlm
>>>>>>
>>>>>> === Apache 2.0 ===
>>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>>>
>>>>>> === GNU GPL ===
>>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>>>
>>>>>> == Required Resources ==
>>>>>> * Mailing Lists
>>>>>>   * priv...@joshua.incubator.apache.org
>>>>>>   * d...@joshua.incubator.apache.org
>>>>>>   * comm...@joshua.incubator.apache.org
>>>>>>
>>>>>> * Git Repos
>>>>>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>>>
>>>>>> * Issue Tracking
>>>>>>   * JIRA Joshua (JOSHUA)
>>>>>>
>>>>>> * Continuous Integration
>>>>>>   * Jenkins builds on https://builds.apache.org/
>>>>>>
>>>>>> * Web
>>>>>>   * http://joshua.incubator.apache.org/
>>>>>>   * wiki at http://cwiki.apache.org
>>>>>>
>>>>>> == Initial Committers ==
>>>>>> The following is a list of the planned initial Apache committers (the
>>>>>> active subset of the committers for the current repository on Github).
>>>>>>
>>>>>> * Matt Post (p...@cs.jhu.edu)
>>>>>> * Lewis John McGibbney (lewi...@apache.org)
>>>>>> * Chris Mattmann (mattm...@apache.org)
>>>>>>
>>>>>> == Affiliations ==
>>>>>>
>>>>>> * Johns Hopkins University
>>>>>>   * Matt Post
>>>>>>
>>>>>> * NASA JPL
>>>>>>   * Chris Mattmann
>>>>>>   * Lewis John McGibbney
>>>>>>
>>>>>> == Sponsors ==
>>>>>> === Champion ===
>>>>>> * Chris Mattmann (NASA/JPL)
>>>>>>
>>>>>> === Nominated Mentors ===
>>>>>> * Paul Ramirez
>>>>>> * Lewis John McGibbney
>>>>>> * Chris Mattmann
>>>>>>
>>>>>> == Sponsoring Entity ==
>>>>>> The Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org