[Moses-support] z-mert
Hi,

I am using z-mert for the first time, since I had to implement my own score for tuning. But when I try to run it, I get the following error while it parses the param.txt file:

    Exception in thread "main" java.util.InputMismatchException
        at java.util.Scanner.throwFor(Scanner.java:864)
        at java.util.Scanner.next(Scanner.java:1485)
        at java.util.Scanner.nextDouble(Scanner.java:2413)
        at MertCore.processParamFile(MertCore.java:1537)
        at MertCore.initialize(MertCore.java:310)
        at MertCore.<init>(MertCore.java:239)
        at ZMERT.main(ZMERT.java:44)

My param.txt looks like this:

    lm_0 ||| 1.0 Opt 0.5 1.5 0.5 1.5
    d_0 ||| 1.0 Opt 0.5 1.5 0.5 1.5
    tm_0 ||| 0.3 Opt 0.25 0.75 0.25 0.75
    tm_1 ||| 0.2 Opt 0.25 0.75 0.25 0.75
    tm_2 ||| 0.2 Opt 0.25 0.75 0.25 0.75
    tm_3 ||| 0.3 Opt 0.25 0.75 0.25 0.75
    w_0 ||| 0.0 Opt -0.5 0.5 -0.5 0.5
    normalization = none

I was wondering whether a type cast to double is missing in the code, but before changing the z-mert code I wanted to make sure I didn't get anything else wrong.

Does anybody have experience with that?

Cheers,
Sarah
Re: [Moses-support] z-mert
Hi Sarah,

Try running the command with

    LC_ALL=C java -jar ...

I think the problem is that Java assumes a German locale and expects floating point numbers with a comma instead of a dot. It took me some time to figure that out myself while using ZMERT.

Best,
Marcin

On 18.12.2015 11:09, Sarah Schulz wrote:
> Hi,
>
> I am using z-mert for the first time since I had to implement my own
> score for tuning. But when I try to run it, I get the following error
> while it parses the param.txt file: [...]
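The failure is easy to reproduce outside z-mert. Below is a minimal, self-contained Java sketch (hypothetical demo code, not taken from the z-mert sources) showing that java.util.Scanner parses doubles according to the default locale, plus the in-code alternative of pinning the Scanner to a fixed locale:

    import java.util.InputMismatchException;
    import java.util.Locale;
    import java.util.Scanner;

    public class LocaleDemo {
        public static void main(String[] args) {
            String weight = "1.0";  // a weight as it appears in param.txt

            // Under a German locale, Scanner expects "1,0" and rejects "1.0",
            // which is exactly the InputMismatchException seen above:
            Scanner de = new Scanner(weight).useLocale(Locale.GERMAN);
            try {
                de.nextDouble();
            } catch (InputMismatchException e) {
                System.out.println("German locale failed on: " + weight);
            }

            // Pinning the locale makes parsing independent of the environment:
            Scanner us = new Scanner(weight).useLocale(Locale.US);
            System.out.println(us.nextDouble());  // prints 1.0
        }
    }

Since the POSIX "C" locale uses a dot as the decimal separator, prefixing the command with LC_ALL=C fixes the same mismatch without touching the z-mert sources.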
[Moses-support] 1st CfP: 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016) at LREC 2016
(apologies for cross-posting)

2ND WORKSHOP ON NATURAL LANGUAGE PROCESSING FOR TRANSLATION MEMORIES (NLP4TM 2016)
http://rgcl.wlv.ac.uk/nlp4tm2016/ [1]

to be held at LREC 2016 (Portorož, Slovenia), May 28, 2016

Submission deadline: February 10, 2016

1. CALL FOR PAPERS

Translation Memories (TM) are amongst the most used tools by professional translators, if not the most used. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations by being able to retrieve how a similar sentence was translated before. Moreover, the usage of TMs aims at guaranteeing that new translations follow the client's specified style and terminology.

Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories has focused on improving the user experience by allowing processing of a variety of document formats, intuitive user interfaces, etc.

The term "second generation translation memories" has been around for more than ten years, and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve tasks such as matching of subsentential chunks, edit distance operations between syntactic trees, and incorporation of semantic and discourse information in the matching process. This workshop invites papers presenting second generation translation memories and related initiatives.

Terminologies, glossaries and ontologies are also very useful for translation memories, by facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction. Researchers are encouraged to submit papers to the workshop which show how these methods are being successfully applied to Translation Memories. In addition, papers discussing the integration of Machine Translation and Translation Memories, or studies about automatic building of translation memories from corpora, are also welcomed.

2. TOPICS OF INTEREST

This workshop invites original papers which show how language processing can help translation memories. Topics of interest include but are not limited to:

* Improving matching and retrieval of segments by using morphological, syntactic, semantic and discourse information
* Automatic extraction of terminologies and ontologies for translation memories
* Integration of named entity recognition and terminologies in matching and retrieval
* Using natural language processing for automatic construction of translation memories
* Extracting and aligning TM segments from a parallel or comparable corpus
* Construction of translation memories using the Internet
* Corpus based studies about the usefulness of TM for specific domains
* Development of hybrid TM and MT translation systems
* Study of NLP techniques used by TM tools available in the market
* Automatic methods for TM cleaning and maintenance

3. SHARED TASK

A shared task on cleaning translation memories will be organised. A training set will be distributed to be used to develop and train the participants' systems. The testing will be done on 500 segments distributed during the testing phase.
* TASK: Automatically clean translation memories
* TRAINING SET: 1,500 TM segments annotated with information on whether they are a valid translation of each other
* TEST SET: 500 TM segments
* LANGUAGE PAIRS:
  * English-Italian
  * English-German
  * English-Spanish
* RELEASE OF THE TRAINING DATA: end of January 2016

Participants are encouraged to submit working notes of their systems to be presented during the workshop. More details, including the shared task schedule, will be announced soon in a dedicated Call for Participation.

4. SUBMISSION INFORMATION

We invite contributions of either long papers (8 pages + 2 pages of references) which present unpublished original research, or short papers/demos of systems (4 pages + 2 pages of references) which present work in progress or working systems. The submissions do not need to be anonymised. All papers have to be submitted in PDF format via the START system by following this link: https://www.softconf.com/lrec2016/NLP4TM/ [2]

5. Identify, Describe and Share your LRs

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely Identify LRs
[Moses-support] Special Issue of the Machine Translation journal: Natural Language Processing for Translation Memories
Apologies for any duplicates.

--

Special Issue of the Machine Translation journal: Natural Language Processing for Translation Memories
http://www.springer.com/computer/artificial/journal/10590

Guest editors:
Constantin Orasan (University of Wolverhampton, UK)
Marcello Federico (FBK, Italy)

Submission deadline: May 15, 2016

1. Call For Papers

Translation Memories (TM) are amongst the most widely used tools by professional translators. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations by being able to retrieve the way in which a similar sentence was translated before. Moreover, the usage of TMs aims to guarantee that new translations follow the client's specified style and terminology.

Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories focused on improving the user experience by allowing processing of a variety of document formats, intuitive user interfaces, etc.

The term "second generation translation memories" has been around for more than ten years and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve tasks such as the matching of subsentential chunks, editing distance operations between syntactic trees, and the incorporation of semantic and discourse information in the matching process.

Terminologies, glossaries and ontologies are also very useful for translation memories, by facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction. Other ways of enhancing Translation Memories with information from NLP components are to integrate Machine Translation and Translation Memories, and automatically build and clean translation memories from corpora and from the web.

This special issue builds on the success of the NLP4TM workshop organised in conjunction with RANLP 2015 and the forthcoming second edition of this workshop at LREC 2016, which will include a shared task on the cleaning of translation memories. Authors of papers accepted at these workshops are encouraged to submit extended versions for the special issue. However, having a paper accepted at the workshop does not constitute a precondition for submitting a paper for the special issue.

2. Topics of interest

This special issue invites original papers which show how language processing can help translation memories.
Topics of interest include but are not limited to:

- improving matching and retrieval of segments by using morphological, syntactic, semantic and discourse information
- automatic extraction of terminologies and ontologies for translation memories
- integration of named entity recognition and terminologies in matching and retrieval
- using natural language processing for automatic construction of translation memories
- extracting and aligning TM segments from a parallel or comparable corpus
- construction of translation memories using the Internet
- corpus based studies about the usefulness of TM for specific domains
- development of hybrid TM and MT translation systems
- study of NLP techniques used by TM tools available in the market
- automatic methods for TM cleaning and maintenance

Note: extended versions of papers previously published at conferences and workshops are likely to be eligible. Please consult us if you have any doubts.

4. Submission guidelines

Authors should follow the "Instructions for Authors" available on the MT Journal website: http://www.springer.com/computer/artificial/journal/10590

Submissions must be limited to 15 pages (including references).

Papers should be submitted online directly on the MT journal's submission website: http://www.editorialmanager.com/coat/default.asp, indicating this special issue in ‘article type’.

5. Important dates

Submission deadline: 15th May 2016
First round of reviews: 15th July 2016
Resubmission of improved versions: 22nd August 2016
Final decisions to authors: 19th Sep 2016
Camera ready papers: 8th Oct 2016
Publication in Issue 3 of the Machine Translation journal 2016

--
Thanks & Regards,
Rohit Gupta
Marie Curie Early Stage Researcher, EXPERT Project
Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton
http://pers-www.wlv.ac.uk/~in4089/
Re: [Moses-support] Slides or paper walking through SearchNormal::ProcessOneHypothesis ?
Thanks, Wilker. That does look promising.

I love this little footnote from the paper: "We do not know if WLd is documented anywhere, but from inspection it is used in Moses (Koehn et al., 2007). This was confirmed by Philipp Koehn and Hieu Hoang (p.c.)."

On Fri, Dec 18, 2015 at 10:12 AM, Wilker Aziz wrote:
> Hi,
>
> I hope it is not too late to add to this discussion.
>
> If you are comfortable with weighted deduction, Adam Lopez's 2009 EACL
> paper is a very good reference for phrase-based reordering spaces. If I
> remember correctly, the implementation in Moses does exactly what he
> shows with the logic program WLd.
>
> http://alopez.github.io/papers/eacl2009-lopez.pdf
>
> Cheers,
>
> Wilker
>
> [...]
Re: [Moses-support] Compiling Moses with -fPIC
FYI, I added the following to Jamroot and then called bjam with --with-pic:

    if [ option.get "with-pic" : : "yes" ] { requirements += -fPIC ; }

Hints to compile the dependencies with PIC as well:

    cmph:  ./configure --with-pic
    boost: ./b2 cxxflags=-fPIC

Sent: Tuesday, 15 December 2015 at 18:26
From: "Hieu Hoang"
To: "Miriam Käshammer", Moses-support@mit.edu
Subject: Re: [Moses-support] Compiling Moses with -fPIC

> you may have to compile boost, cmph and irstlm with -fPIC too.
>
> On 14/12/15 16:06, "Miriam Käshammer" wrote:
>> Dear Moses community,
>>
>> My goal is to link Moses (the decoder) as a static library into some
>> other shared library. As far as I understand the compiler/linker output
>> of this other library, I need to compile the Moses library with the
>> parameter -fPIC (position independent code). Could you help me in
>> achieving this?
>>
>> I already tried to add "cxxflags=-fPIC" to the bjam command like this:
>>
>>     ./bjam -j8 -d2 -a --with-boost="${PREFIX}" --with-xmlrpc-c="${PREFIX}"
>>         --with-cmph="${PREFIX}" --with-irstlm="${PREFIX}"
>>         --install-scripts="${PREFIX}"/scripts link=static cxxflags=-fPIC
>>
>> However, the build process just seems to get stuck before it actually
>> starts, see attached log. Any help/comment is appreciated. Thanks!
>>
>> Miriam
>
> --
> Hieu Hoang
> http://www.hoang.co.uk/hieu
Re: [Moses-support] Chinese & Arabic Tokenizers
Hi Tom,

As far as I know, the following are widely-used and open-source Chinese tokenizers:

* https://github.com/fxsjy/jieba
* http://sourceforge.net/projects/zpar/
* https://github.com/NLPchina/ansj_seg

And this proprietary one:

* http://ictclas.nlpir.org/

(Disclaimer: I am one of the developers of jieba, and I personally use it.)

--
Dingyuan Wang

On 19 December 2015 at 00:51, "Tom Hoar" wrote:
> I'm looking for Chinese and Arabic tokenizers. We've been using
> Stanford's for a while but it has downfalls. [...]
Re: [Moses-support] Chinese & Arabic Tokenizers
Hi Tom,

There used to be a freely available Chinese word segmenter provided by the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm

For Arabic, I think that many academic research groups used to work with MADA. But it seems like you'll need a special license for commercial use.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
https://secure.nouvant.com/columbia/technology/cu14012/license/492

Or you can try MorphTagger/Segmenter, a segmentation tool for Arabic SMT.
http://www.hltpr.rwth-aachen.de/~mansour/MorphSegmenter/

It may not be maintained any more. You can contact Saab Mansour to ask about it. Saab has published a couple of papers about this, some of which report comparisons of different Arabic segmentation strategies for SMT.
http://www.hltpr.rwth-aachen.de/publications/download/687/Mansour-IWSLT-2010.pdf
http://www.hltpr.rwth-aachen.de/publications/download/808/Mansour-LREC-2012.pdf
http://link.springer.com/article/10.1007%2Fs10590-011-9102-0

Cheers,
Matthias

On Sat, 2015-12-19 at 01:19 +0800, Dingyuan Wang wrote:
> Hi Tom,
>
> As far as I know, the following are widely-used and open-source Chinese
> tokenizers: [...]

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
[Moses-support] Chinese & Arabic Tokenizers
I'm looking for Chinese and Arabic tokenizers. We've been using Stanford's for a while, but it has downfalls. The Chinese mode loads its statistical models very slowly. The Arabic mode stems the resulting tokens. The coup de grace is that their latest jar update (9 days ago) was compiled to run only with Java 1.8.

So, with the exception of Stanford, what choices are available for Chinese and Arabic that you're finding worthwhile?

Thanks!
Tom
Re: [Moses-support] Slides or paper walking through SearchNormal::ProcessOneHypothesis ?
Hi,

I hope it is not too late to add to this discussion.

If you are comfortable with weighted deduction, Adam Lopez's 2009 EACL paper is a very good reference for phrase-based reordering spaces. If I remember correctly, the implementation in Moses does exactly what he shows with the logic program WLd.

http://alopez.github.io/papers/eacl2009-lopez.pdf

Cheers,

Wilker

On 16 December 2015 at 00:56, Matthias Huck wrote:
> Hi Lane,
>
> Well, you can find excellent descriptions of phrase-based decoding
> algorithms in the literature, though possibly not all details of this
> specific implementation.
>
> I like this description:
>
> R. Zens and H. Ney. Improvements in Dynamic Programming Beam Search for
> Phrase-based Statistical Machine Translation. In International Workshop
> on Spoken Language Translation (IWSLT), pages 195-205, Honolulu, HI,
> USA, October 2008.
> http://www.hltpr.rwth-aachen.de/publications/download/618/Zens-IWSLT-2008.pdf
>
> It's what's implemented in Jane, RWTH's open source statistical machine
> translation toolkit.
>
> J. Wuebker, M. Huck, S. Peitz, M. Nuhn, M. Freitag, J. Peter, S.
> Mansour, and H. Ney. Jane 2: Open Source Phrase-based and Hierarchical
> Statistical Machine Translation. In International Conference on
> Computational Linguistics (COLING), pages 483-491, Mumbai, India,
> December 2012.
> http://www.hltpr.rwth-aachen.de/publications/download/830/Wuebker-COLING-2012.pdf
>
> However, I believe that the distinction of coverage hypotheses and
> lexical hypotheses is a unique property of the RWTH systems.
>
> The formalization in the Zens & Ney paper is very nicely done. With hard
> distortion limits or coverage-based reordering constraints, you may need
> a few more steps in the algorithm. E.g., if you have a hard distortion
> limit, you will probably want to avoid leaving a gap and then extending
> your sequence in a way that puts your current position further away from
> the gap than your maximum jump width. Other people should know more
> about how exactly Moses' phrase-based decoder is dealing with this.
>
> I can recommend Richard Zens' PhD thesis as well.
> http://www.hltpr.rwth-aachen.de/publications/download/562/Zens--2008.pdf
>
> I also remember that the following publication from Microsoft Research
> is pretty helpful:
>
> Robert C. Moore and Chris Quirk. Faster Beam-Search Decoding for Phrasal
> Statistical Machine Translation. In Proceedings of MT Summit XI,
> European Association for Machine Translation, September 2007.
> http://research.microsoft.com/pubs/68097/mtsummit2007_beamsearch.pdf
>
> Cheers,
> Matthias
>
> On Tue, 2015-12-15 at 22:33, Hieu Hoang wrote:
> > I've been looking at this and it is surprisingly complicated. I think
> > the code is designed to predetermine if extending a hypothesis will
> > lead it down a path that won't ever be completed.
> >
> > Don't know of any slides that explain the reasoning. Philipp Koehn
> > explained it to me once and it seems pretty reasonable.
> >
> > I wouldn't mind seeing this code cleaned up a bit and abstracted and
> > formalised. I've made a start with the cleanup in my new decoder:
> > https://github.com/moses-smt/mosesdecoder/blob/perf_moses2/contrib/other-builds/moses2/Search/Search.cpp#L36
> > Search::CanExtend()
> >
> > There was an Aachen paper from years ago comparing different
> > distortion limit heuristics - can't remember the authors or title.
> > Maybe someone knows more.
> >
> > Hieu Hoang
> > http://www.hoang.co.uk/hieu
> >
> > On 15 December 2015 at 20:59, Lane Schwartz wrote:
> > > Hey all,
> > >
> > > So the SearchNormal::ProcessOneHypothesis() method in
> > > SearchNormal.cpp is responsible for taking an existing
> > > hypothesis, creating all legal new extension hypotheses, and
> > > adding those new hypotheses to the appropriate decoder stacks.
> > >
> > > First off, the method is actually reasonably well commented,
> > > so kudos to whoever did that. :)
> > >
> > > That said, does anyone happen to have any slides that actually
> > > walk through this process, specifically slides that take into
> > > account the interaction with the distortion limit? That
> > > interaction is where most of the complexity of this method
> > > comes from. I don't know about others, but even having a
> > > pretty good notion of what's going on here, the discussion of
> > > "the closest thing to the left" is still a bit opaque.
> > >
> > > Anyway, if anyone knows of a good set of slides, or even a
> > > good description in a paper, of what's going on here, I'd
> > > appreciate any pointers.
> > >
> > > Thanks,
> > > Lane
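For what it's worth, the gap-based heuristic Matthias describes above can be written down very compactly. The following Java sketch is hypothetical illustration code, not Moses' SearchNormal or the Search::CanExtend() linked above: before extending a hypothesis with a phrase covering the source span [start, end], it checks that the jump back from the end of that span to the leftmost uncovered word would still be within the distortion limit.

    import java.util.BitSet;

    public class DistortionCheck {

        // Can we extend a hypothesis with source coverage 'coverage' by a
        // phrase covering words [start, end] without stranding the leftmost
        // gap beyond the distortion limit dLimit?
        static boolean canExtend(BitSet coverage, int start, int end, int dLimit) {
            if (dLimit < 0) return true;              // no limit configured
            int firstGap = coverage.nextClearBit(0);  // leftmost uncovered word
            if (start <= firstGap) return true;       // phrase starts at or before the gap
            // Otherwise we must eventually jump from position end+1 back to
            // firstGap, and that jump must not exceed the limit.
            return (end + 1 - firstGap) <= dLimit;
        }

        public static void main(String[] args) {
            BitSet coverage = new BitSet();
            coverage.set(0, 2);  // words 0-1 covered, leftmost gap at word 2
            System.out.println(canExtend(coverage, 3, 4, 6));  // true:  back jump = 3
            System.out.println(canExtend(coverage, 8, 9, 6));  // false: back jump = 8
        }
    }

This is only the feasibility test; the actual decoder additionally interleaves this with stack pruning and the "closest thing to the left" bookkeeping that Lane asks about.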
Re: [Moses-support] Doubts on Multiple Decoding Paths
Hi,

that sounds right.

The "union" option is fairly new, developed by Michael Denkowski. I am not aware of any empirical study of the different methods, so I'd be curious to see what you find.

-phi

On Fri, Dec 18, 2015 at 1:35 AM, Anoop (അനൂപ്) wrote:
> Hi,
>
> I am trying to understand the multiple decoding paths feature in Moses.
>
> The documentation (http://www.statmt.org/moses/?n=Advanced.Models#ntoc7)
> describes 3 methods: both, either and union.
>
> The following is my understanding of the options. Please let me know if
> it is correct:
>
> - With the *both* option, the constituent phrases of the target
>   hypothesis come from both tables (since they are shared) and are
>   scored with both tables.
> - With the *either* option, all the constituent phrases of a target
>   hypothesis come from a single table, but different hypotheses can use
>   different tables. Each hypothesis is scored using one table only. I
>   did not understand the "additional options are collected from the
>   other tables" bit in the documentation.
> - With the *union* option, the constituent phrases of a target
>   hypothesis come from different tables and are scored using scores
>   from all the tables, with 0 used if the option doesn't appear in some
>   table, unless the *default-average-others=true* option is used.
>
> Regards,
> Anoop.
>
> --
> I claim to be a simple individual liable to err like any other fellow
> mortal. I own, however, that I have humility enough to confess my errors
> and to retrace my steps.
>
> http://flightsofthought.blogspot.com
[Moses-support] 1st CFP: LREC 9th Workshop and Shared Task on Building and Using Comparable Corpora
Call for Papers

9th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA

Special Topic: Continuous Vector Space Models and Comparable Corpora
Shared Task: Identifying Parallel Sentences in Comparable Corpora

https://comparable.limsi.fr/bucc2016/

Monday, May 23, 2016
Co-located with LREC 2016, Portorož, Slovenia

DEADLINE FOR PAPERS: February 10, 2016

MOTIVATION

In the language engineering and the linguistics communities, research on comparable corpora has been motivated by two main reasons. In language engineering, on the one hand, it is chiefly motivated by the need to use comparable corpora as training data for statistical Natural Language Processing applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves by making possible inter-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form in various degrees and dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.

SHARED TASK

There will be a shared task on "Identifying Parallel Sentences in Comparable Corpora" whose details will be described on the workshop website (URL see above).

TOPICS

Beyond this year's special topic "Continuous Vector Space Models and Comparable Corpora" and the shared task on "Identifying Parallel Sentences in Comparable Corpora", we solicit contributions including but not limited to the following topics:

Building comparable corpora:
* Human translations
* Automatic and semi-automatic methods
* Methods to mine parallel and non-parallel corpora from the Web
* Tools and criteria to evaluate the comparability of corpora
* Parallel vs non-parallel corpora, monolingual corpora
* Rare and minority languages, across language families
* Multi-media/multi-modal comparable corpora

Applications of comparable corpora:
* Human translations
* Language learning
* Cross-language information retrieval & document categorization
* Bilingual projections
* Machine translation
* Writing assistance

Mining from comparable corpora:
* Cross-language distributional semantics
* Extraction of parallel segments or paraphrases from comparable corpora
* Extraction of translations of single words and multi-word expressions, proper names, named entities, etc.

IMPORTANT DATES

February 10, 2016    Deadline for submission of full papers
March 10, 2016       Notification of acceptance
March 25, 2016       Camera-ready papers due
May 23, 2016         Workshop date

SUBMISSION INFORMATION

Papers should follow the LREC main conference formatting details (to be announced on the conference website http://lrec2016.lrec-conf.org/en/) and should be submitted as a PDF file via the START workshop manager at https://www.softconf.com/lrec2016/BUCC2016/

Contributions can be short or long papers. Short paper submissions must describe original and unpublished work without exceeding six (6) pages. Characteristics of short papers include: a small, focused contribution; work in progress; a negative result; an opinion piece; an interesting application nugget. Long paper submissions must describe substantial, original, completed and unpublished work without exceeding ten (10) pages.
Reviewing will be double blind, so the papers should not reveal the authors' identity. Accepted papers will be published in the workshop proceedings.

Double submission policy: Parallel submission to other meetings or publications is possible, but the workshop organizers must be notified immediately.

Please also observe the following two paragraphs, which are applicable to all LREC workshops as well as to the main conference:

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely Identify LRs through
[Moses-support] PhraseDictionaryCompact is not registered
I'm following the baseline system page step-by-step as it says. I've binarized the phrase table and reordering table using processPhraseTableMin and processLexicalTableMin, and edited moses.ini as written, but upon executing, it gives an exception with a "PhraseDictionaryCompact is not registered" message.

I've done some googling, and tried running processLexicalTable (without "Min") to no avail, and also tried editing the entry to PhraseDictionaryBinary or PhraseDictionaryOnDisk, which succeeded in running the task but aborts upon entering the input sentence.

Is there any other workaround/fix for this?
Re: [Moses-support] PhraseDictionaryCompact is not registered
Hi,

I'd say you didn't install cmph, or didn't compile Moses against it. Look again at:
http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3

On 18.12.2015 15:15, Andrew wrote:
> I'm following the baseline system page step-by-step as it says.
> I've binarized the phrase table and reordering table using
> processPhraseTableMin and processLexicalTableMin, [...]
Re: [Moses-support] PhraseDictionaryCompact is not registered
    make -f contrib/Makefiles/install-dependencies.gmake cmph
    ./bjam --with-cmph=$(pwd)/opt

On Fri, Dec 18, 2015 at 2:15 PM, Andrew wrote:
> I'm following the baseline system page step-by-step as it says.
> I've binarized the phrase table and reordering table using
> processPhraseTableMin and processLexicalTableMin, [...]

--
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh
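For completeness: PhraseDictionaryCompact is only compiled in when Moses is built with cmph support, which is exactly why the feature registry reports it as "not registered" otherwise. Once Moses has been rebuilt as above, the compact tables should load with moses.ini entries along the following lines. This is a hedged sketch: names, paths and feature counts are illustrative, and for the compact reordering model the path points at the table stem (Moses should find the .minlexr file itself).

    [feature]
    PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/path/to/phrase-table.minphr input-factor=0 output-factor=0
    LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/path/to/reordering-table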
Re: [Moses-support] Doubts on Multiple Decoding Paths
Hi Anoop,

Confirming that your reading of "union" is in fact how it works.

If you want each phrase to be scored by all tables without having to worry about making sure every phrase is in every table, you can use PhraseDictionaryGroup with default-average-others=true. This multiplies the size of the phrase feature set by the number of models, so I recommend running mert-moses.pl with --batch-mira.

Best,
Michael

On Fri, Dec 18, 2015 at 1:08 PM, Philipp Koehn wrote:
> Hi,
>
> that sounds right.
>
> The "union" option is fairly new, developed by Michael Denkowski.
> I am not aware of any empirical study of the different methods,
> so I'd be curious to see what you find.
>
> -phi
>
> [...]
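To make that concrete, here is a minimal, hypothetical moses.ini fragment for the PhraseDictionaryGroup setup Michael describes. The names, paths and feature counts are illustrative, and the exact options may differ across Moses versions; check the Advanced Models documentation before copying it.

    [feature]
    PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/path/to/pt0 input-factor=0 output-factor=0
    PhraseDictionaryMemory name=TranslationModel1 num-features=4 path=/path/to/pt1 input-factor=0 output-factor=0
    PhraseDictionaryGroup name=PDGroup num-features=8 members=TranslationModel0,TranslationModel1 default-average-others=true

    [mapping]
    0 T 2

Assuming phrase tables are numbered in order of declaration, the mapping points at the group (table 2) rather than at its members. The group exposes the concatenation of its members' features (here 2 x 4 = 8), which is why tuning sees a larger feature set and why --batch-mira is the safer choice for mert-moses.pl.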