GitHub user KellenSunderland opened a pull request: https://github.com/apache/incubator-joshua/pull/6
More work on structuring translation output

These commits focus on two areas:

* One is structured translation output, and this should be our last PR for this topic for the time being.
* We have also done some static code analysis and fixed various issues flagged by our tool.

There are also some new unit tests included.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KellenSunderland/incubator-joshua master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-joshua/pull/6.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6

----
commit 53de11905134e32191d11cafd07d5a033c16e411
Author: Felix Hieber <fhie...@amazon.com>
Date:   2015-11-25T14:11:27Z

    Reworked most of the hypergraph traversals for Viterbi and n-best extractions.

    Most importantly: translation string extraction is now int-based instead of doing regex matching and string operations. This should be a lot faster. However, this only works for Hiero models for now. Phrase-based decoding still uses the String-based extractions.

    Previously, there were two ways to traverse the hypergraph: (1) regular tailNode order (used for Viterbi and WordAlignment) and (2) tailNode order according to target-side nonterminal indices (used for KBestExtraction). This caused quite a few inconsistencies in how to write general extractors (output string, input string, feature vector, word alignments, tree, etc.) that support both. The main issue was that some extractors (the String-based HypothesisExtractor) rely on traversal order (2) to simply always merge children strings into the first nonterminal symbol on the target side. However, this breaks the very same class when the input string is requested (which is a supported feature in Joshua).

    This change gets rid of these inconsistencies for Hiero and simplifies a lot of the code. For phrase-based decoding, we still rely on the HypothesisExtractor, but this is not a use case for Saar at the moment, and phrase-based decoding is a hack in Joshua anyway.

    JoshuaConfiguration now throws an exception if you want to have the 'align_index' in output strings (Moses style) for Hiero models. This is not supported by int[]-based extraction and is useless anyway.

commit c72faea7c241395b8827cf6ab41aec67c7fdc54c
Author: Pavel Danchenko <danch...@amazon.com>
Date:   2015-12-22T11:49:22Z

    LanguageModelFF.estimateFutureCost refactorings and test

commit d1c3caac1da8c3c1175059b20c411a3ebd965465
Author: Kellen Sunderland <kell...@amazon.com>
Date:   2016-04-27T22:12:00Z

    Updated license files for tests

commit 9c3f2e6e60e68c9a55733d872d15c5c39c937ab0
Author: Felix Hieber <fhie...@amazon.com>
Date:   2015-12-31T10:57:18Z

    Modified the KenLM JNI to support querying the LM using strings, not only ids. Also added a method to check whether a word or id is known to the LM.

    Made output of regression tests more concise

commit 8d86ff2b989c2b1db8aff7eaaa479cae38c73357
Author: Felix Hieber <fhie...@amazon.com>
Date:   2016-02-03T14:17:30Z

    Mostly a refactor for improved readability

----
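
For readers unfamiliar with the int-based extraction described in commit 53de119, here is a rough sketch of the general idea. It is not code from this pull request: the `Edge` class, the convention that a negative int `-k` on the target side refers to the k-th tail node, and the toy word ids are all assumptions made for illustration. The point is only that the output can be assembled by walking int arrays, with tail nodes kept in one fixed order and the target side deciding where each child is substituted, so no regex matching or string merging is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; none of these names come from the Joshua code base. */
final class IntExtractionSketch {

  /** Hypothetical hyperedge: an int-encoded target side plus its tail nodes. */
  static final class Edge {
    // Assumption: positive ints are word ids, a negative int -k means
    // "substitute the output of the k-th tail node here" (1-based).
    final int[] target;
    // Tail nodes are always stored in the same (source-side) order.
    final List<Edge> tails;

    Edge(int[] target, Edge... tails) {
      this.target = target;
      this.tails = Arrays.asList(tails);
    }
  }

  /** Recursively appends the word ids of the derivation rooted at edge. */
  static void extract(Edge edge, List<Integer> out) {
    for (int symbol : edge.target) {
      if (symbol < 0) {
        // Nonterminal: the index encoded on the target side selects the child,
        // so the traversal order of the tail nodes never has to change.
        extract(edge.tails.get(-symbol - 1), out);
      } else {
        out.add(symbol);
      }
    }
  }

  /** Convenience wrapper returning the extracted word ids as an array. */
  static int[] extract(Edge root) {
    List<Integer> out = new ArrayList<>();
    extract(root, out);
    return out.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    // Toy ids (assumed): 1 = "the", 2 = "house", 3 = "green".
    Edge house = new Edge(new int[] {2});
    Edge green = new Edge(new int[] {3});
    // The target side swaps the two children: "the" X2 X1.
    Edge root = new Edge(new int[] {1, -2, -1}, house, green);
    System.out.println(Arrays.toString(extract(root))); // prints [1, 3, 2]
  }
}
```

Mapping the resulting ids back to words would happen once at the end via the vocabulary, which is where the expected speedup over per-node string operations would come from.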