Thanks JB - if you are interested in mentoring, we would appreciate the help. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message----- From: Jean-Baptiste Onofré <j...@nanthrax.net> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org> Date: Monday, January 18, 2016 at 11:01 PM To: "general@incubator.apache.org" <general@incubator.apache.org> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit >Hi Chris, > >It looks interesting. I'm looking forward to the vote. > >Regards >JB > >On 01/13/2016 07:56 AM, Mattmann, Chris A (3980) wrote: >> Hi Everyone, >> >> Please find attached for your viewing pleasure a proposed new project, >> Apache Joshua, a statistical machine translation toolkit. The proposal >> is in wiki draft form at: >> https://wiki.apache.org/incubator/JoshuaProposal >> >> Proposal text is copied below. I’ll leave the discussion open for a week, >> and we are interested in folks who would like to be initial committers >> and mentors. Please discuss here on the thread. >> >> Thanks! >> >> Cheers, >> Chris (Champion) >> >> ——— >> >> = Joshua Proposal = >> >> == Abstract == >> [[joshua-decoder.org|Joshua]] is an open-source statistical machine >> translation toolkit. It includes a Java-based decoder for translating >> with phrase-based, hierarchical, and syntax-based translation models, a >> Hadoop-based grammar extractor (Thrax), and an extensive set of tools >> and scripts for training and evaluating new models from parallel text. >> >> == Proposal == >> Joshua is a state-of-the-art statistical machine translation system that >> provides a number of features: >> >> * Support for the two main paradigms in statistical machine >> translation: phrase-based and hierarchical / syntactic. 
>> * A sparse feature API that makes it easy to add new feature templates >> supporting millions of features >> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad) >> * Support for lattice decoding, allowing upstream NLP tools to expose >> their hypothesis space to the MT system >> * An efficient representation for models, allowing for quick loading >> of multi-gigabyte model files >> * Fast decoding speed (on par with Moses and mtplz) >> * Language packs — precompiled models that allow the decoder to be >> run as a black box >> * Thrax, a Hadoop-based tool for learning translation models from >> parallel text >> * A suite of tools for constructing new models for any language pair >> for which sufficient training data exists >> >> == Background and Rationale == >> A number of factors make this a good time for an Apache project focused >> on machine translation (MT): the quality of MT output (for many language >> pairs); the average computing resources available on computers, relative >> to the needs of MT systems; and the availability of a number of >> high-quality toolkits, together with a large base of researchers working >> on them. >> >> Over the past decade, machine translation (MT; the automatic translation >> of one human language to another) has become a reality. The research >> into statistical approaches to translation that began in the early nineties, >> the availability of large amounts of training data, and better >> computing infrastructure have all come together to produce >> translation results that are “good enough” for a large set of language >> pairs and use cases. 
Free services like >> [[https://www.bing.com/translator|Bing Translator]] and >> [[https://translate.google.com|Google Translate]] have made these >> services available to the average person through direct interfaces and through >> tools like browser plugins, and sites across the world with higher >> translation needs use them to translate their pages >> automatically. >> >> MT does not require the infrastructure of large corporations in order to >> produce feasible output. Machine translation can be resource-intensive, >> but need not be prohibitively so. Disk and memory usage are mostly a >> matter of model size, which for most language pairs is a few gigabytes >> at most, at which size models can provide coverage on the order of tens or >> even hundreds of thousands of words in the input and output languages. >> The computational complexity of the algorithms used to search for >> translations of new sentences is typically linear in the number of words in the >> input sentence, making it possible to run a translation engine on a personal >> computer. >> >> The research community has produced many different open source >> translation projects for a range of programming languages and under a variety of >> licenses. These projects include the core “decoder”, which takes a model >> and uses it to translate new sentences between the language pair the >> model was defined for. They also typically include a large set of tools that >> enable new models to be built from large sets of example translations >> (“parallel data”) and monolingual texts. These toolkits are usually >> built to support the agendas of the (largely) academic researchers who build >> them: the repeated cycle of building new models, tuning model parameters >> against development data, and evaluating them against held-out test >> data, using standard metrics for testing the quality of MT output. 
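[Editorial aside: the build/tune/evaluate cycle above optimizes the weights of the model's features; the tuners named in the proposal (MERT, MIRA, PRO, AdaGrad) all fit this mold. As a rough illustration of the "sparse feature" idea — each candidate translation fires only a handful of potentially millions of named features, and its model score is the dot product of those features with the tuned weight vector. This is a minimal sketch of the underlying arithmetic, not Joshua's actual API; all names and numbers are invented.]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: scoring a candidate translation under a linear model with
// sparse, string-named features. Hypothetical names/values throughout.
public class SparseScoreSketch {

    // Model score = dot product of the tuned weight vector with the
    // sparse feature vector fired by one candidate translation.
    static double score(Map<String, Double> weights, Map<String, Double> features) {
        double total = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            // Features with no learned weight contribute nothing, so the
            // weight table only needs entries for features seen in tuning.
            total += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("lm", 0.5);              // language-model log-probability weight
        weights.put("phrase_penalty", -0.2); // per-phrase penalty weight

        Map<String, Double> features = new HashMap<>();
        features.put("lm", -4.0);
        features.put("phrase_penalty", 3.0);
        features.put("rule:X->la/the", 1.0); // a sparse lexical feature, unweighted here

        System.out.println(score(weights, features)); // combined model score
    }
}
```

Tuning, in this picture, is just the search for the weight table that makes the highest-scoring candidates match the reference translations as measured by an automatic metric.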
>> >> Together, these three factors—the quality of machine translation output, >> the feasibility of translating on standard computers, and the >> availability of tools to build models—make it reasonable for end users to use MT >> as a black-box service, and to run it on their personal machines. >> >> These factors make it a good time for an organization with the status of >> the Apache Foundation to host a machine translation project. >> >> == Current Status == >> Joshua was originally ported from David Chiang’s Python implementation >> of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins >> University. The current version is maintained by Matt Post at Johns >> Hopkins’ Human Language Technology Center of Excellence. Joshua has made >> many releases, with over 20 source code tags. The most recent release >> was 6.0.5, on November 5, 2015. >> >> == Meritocracy == >> The current developers are familiar with meritocratic open source >> development at Apache. Apache was chosen specifically because we want to >> encourage this style of development for the project. >> >> == Community == >> Joshua is used widely across the world. Perhaps its biggest (known) >> research / industrial user is the Amazon research group in Berlin. >> Another user is the US Army Research Lab. No formal census has been undertaken, >> but posts to the Joshua technical support mailing list, along with >> occasional contributions, suggest small research and academic >> communities spread across the world, many of them in India. >> >> During incubation, we will explicitly seek to increase our usage across >> the board, including academic research, industry, and other end users >> interested in statistical machine translation. >> >> == Core Developers == >> The current set of core developers is fairly small, having shrunk with >> the graduation from Johns Hopkins of some core student participants. 
>> However, >> Joshua is used fairly widely, as mentioned above, and there remains a >> commitment from the principal researcher at Johns Hopkins to continue to >> use and develop it. Joshua has also seen a number of new community members >> become interested recently due to its projected use in a >> number of ongoing DARPA projects such as XDATA and Memex. >> >> == Alignment == >> Joshua is currently Copyright (c) 2015 Johns Hopkins University, all >> rights reserved, and is licensed under the BSD 2-clause license. We intend, of >> course, to relicense this code under the Apache License 2.0, which would >> permit expanded and increased use of the software within Apache >> projects. >> There is currently an ongoing effort within the Apache Tika community to >> utilize Joshua within Tika’s Translate API; see >> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]]. >> >> == Known Risks == >> >> === Orphaned products === >> At the moment, regular contributions are made by a single contributor, >> the lead maintainer (Matt Post). He plans to continue development for the >> next few years, but he remains a single point of failure, since the graduate >> students who worked on the project have moved on to jobs, mostly in >> industry. However, our goal is to mitigate that risk by growing the >> community at Apache, starting with users >> and participants from NASA JPL. >> >> === Inexperience with Open Source === >> The teams at both Johns Hopkins and NASA JPL have experience with many >> OSS projects at Apache and elsewhere. We understand "how it works" >> here at the foundation. >> >> >> == Relationships with Other Apache Products == >> Joshua depends on Hadoop and is also included as a plugin >> in Apache Tika. We are also interested in coordinating with other projects, >> including Spark, and other projects needing MT services for language >> translation. 
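[Editorial aside: for readers curious what the Tika integration mentioned under "Alignment" (TIKA-1343) amounts to in practice — Tika exposes translation behind a small Translator interface, and a Joshua-backed implementation would plug in behind it. The sketch below defines a local stand-in interface (modeled loosely on Tika's `org.apache.tika.language.translate.Translator`, but not identical to it) with a stub backend, so it is self-contained; nothing here is Joshua's or Tika's real code.]

```java
import java.util.Locale;

// Sketch of how a Joshua-backed translator could sit behind a
// Tika-style Translate API. Both the interface and the backend below
// are illustrative stand-ins, not real Tika or Joshua classes.
public class JoshuaTranslatorSketch {

    // Local stand-in for a Translator plugin contract.
    interface Translator {
        // Translate text from sourceLanguage to targetLanguage (language codes).
        String translate(String text, String sourceLanguage, String targetLanguage);

        // Whether the underlying engine (e.g. a loaded language pack) is usable.
        boolean isAvailable();
    }

    // A stub in place of invoking the actual decoder; a real implementation
    // would load a precompiled language pack and decode the input.
    static class StubJoshuaTranslator implements Translator {
        @Override
        public String translate(String text, String sourceLanguage, String targetLanguage) {
            if (!isAvailable()) {
                throw new IllegalStateException("no language pack loaded");
            }
            // Placeholder "translation" so the wiring is exercisable end to end.
            return String.format(Locale.ROOT, "[%s->%s] %s", sourceLanguage, targetLanguage, text);
        }

        @Override
        public boolean isAvailable() {
            return true; // a real check would verify the model files exist on disk
        }
    }

    public static void main(String[] args) {
        Translator t = new StubJoshuaTranslator();
        System.out.println(t.translate("hola mundo", "es", "en"));
    }
}
```

The design point of such an interface is that callers (e.g. a content-extraction pipeline) stay agnostic about whether translation happens via a local decoder, a language pack, or a remote service.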
>> >> == Developers == >> Joshua has only one regular developer, who is employed by Johns Hopkins >> University. NASA JPL (Mattmann and McGibbney) have been contributing >> lately, including a Homebrew formula and other contributions to the project >> through the DARPA XDATA and Memex programs. >> >> == Documentation == >> Documentation and publications related to Joshua can be found at >> joshua-decoder.org. The source for the Joshua documentation is currently >> hosted on Github at >> https://github.com/joshua-decoder/joshua-decoder.github.com >> >> == Initial Source == >> Current source resides at Github: github.com/joshua-decoder/joshua (the >> main decoder and toolkit) and github.com/joshua-decoder/thrax (the >> grammar extraction tool). >> >> == External Dependencies == >> Joshua has a number of external dependencies. Only BerkeleyLM (Apache >> 2.0) >> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is >> needed for translating sentences with pre-built models). The rest are >> dependencies for the build system and pipeline, used for constructing >> and training new models from parallel text. >> >> Apache projects: >> * Ant >> * Hadoop >> * Commons >> * Maven >> * Ivy >> >> There are also a number of other open-source projects, with various >> licenses, that the project depends on both dynamically (runtime) and >> statically. 
>> >> === GNU GPL 2 === >> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/ >> >> === LGPL 2.1 === >> * KenLM: github.com/kpu/kenlm >> >> === Apache 2.0 === >> * BerkeleyLM: https://code.google.com/p/berkeleylm/ >> >> === GNU GPL === >> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html >> >> == Required Resources == >> * Mailing Lists >> * priv...@joshua.incubator.apache.org >> * d...@joshua.incubator.apache.org >> * comm...@joshua.incubator.apache.org >> >> * Git Repos >> * https://git-wip-us.apache.org/repos/asf/joshua.git >> >> * Issue Tracking >> * JIRA Joshua (JOSHUA) >> >> * Continuous Integration >> * Jenkins builds on https://builds.apache.org/ >> >> * Web >> * http://joshua.incubator.apache.org/ >> * wiki at http://cwiki.apache.org >> >> == Initial Committers == >> The following is a list of the planned initial Apache committers (the >> active subset of the committers for the current repository on Github). >> >> * Matt Post (p...@cs.jhu.edu) >> * Lewis John McGibbney (lewi...@apache.org) >> * Chris Mattmann (mattm...@apache.org) >> >> == Affiliations == >> >> * Johns Hopkins University >> * Matt Post >> >> * NASA JPL >> * Chris Mattmann >> * Lewis John McGibbney >> >> == Sponsors == >> === Champion === >> * Chris Mattmann (NASA/JPL) >> >> === Nominated Mentors === >> * Paul Ramirez >> * Lewis John McGibbney >> * Chris Mattmann >> >> == Sponsoring Entity == >> The Apache Incubator >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >> For additional commands, e-mail: general-h...@incubator.apache.org > >-- >Jean-Baptiste Onofré >jbono...@apache.org >http://blog.nanthrax.net >Talend - http://www.talend.com