I believe the reference file needs to be in the M2 format that the M2 scorer expects. Did you use an .m2 file as the reference?
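In case it helps, here is a minimal sketch of what I mean by M2 format, based on how the CoNLL-2014 .m2 files are laid out as far as I know: an "S " line with the tokenized source sentence, zero or more "A " lines with token offsets and "|||"-separated fields, and a blank line between sentences. The sample sentence and the helper names parse_m2 and apply_edits are made up for illustration; they are not part of Moses or the official scorer.

```python
import re

# Hypothetical two-sentence reference, laid out like a CoNLL-2014 .m2 file.
SAMPLE_M2 = """\
S The cat sat at mat .
A 3 4|||Prep|||on|||REQUIRED|||-NONE-|||0
A 4 4|||ArtOrDet|||the|||REQUIRED|||-NONE-|||0

S This sentence is fine .
"""


def parse_m2(text):
    """Return a list of (source_tokens, edits).

    Each edit is a tuple (start, end, correction, annotator_id).
    Raises ValueError if a block does not look like M2.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("S "):
            raise ValueError("not M2: block does not start with an 'S ' line")
        tokens = lines[0][2:].split()
        edits = []
        for ln in lines[1:]:
            if not ln.startswith("A "):
                raise ValueError("not M2: expected an 'A ' line, got: " + ln)
            fields = ln[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append((start, end, fields[2], int(fields[-1])))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits to the source tokens."""
    out = list(tokens)
    # Work from right to left so earlier token offsets stay valid.
    for start, end, correction, ann in sorted(edits, reverse=True):
        if ann != annotator or start < 0:  # offsets of -1 mark "no edit"
            continue
        out[start:end] = correction.split()
    return out


if __name__ == "__main__":
    for tokens, edits in parse_m2(SAMPLE_M2):
        print(" ".join(tokens), "->", " ".join(apply_edits(tokens, edits)))
```

If parse_m2 raises on your reference file, the file is most likely plain tokenized text (one corrected sentence per line) rather than M2.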
You can check the CoNLL-2014 test data files with the .m2 extension to see how they should look. You can download the CoNLL-2014 test data here: http://www.comp.nus.edu.sg/~nlp/conll14st.html

Regards,
Shamil

On Tue, Jan 16, 2018 at 1:27 AM, Kelly Marchisio <kellymarchi...@gmail.com> wrote:

> Sure - it's on my computer locally, but you can access the Google Drive
> folder here:
> https://drive.google.com/drive/folders/1K9Cq9fprMmMo3KmySJw4ApuxBeK4qgNK
>
> You can see the results of 3 attempts in folders tmp.1-3. Tokenized and
> lowercased input and reference files are there, along with all other files
> automatically created by Moses. Please let me know if this is enough
> information or if you'd like more.
>
> On Mon, Jan 15, 2018 at 4:52 PM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>
>> Hm, not really. Any chance you could give me access to the tuning folder?
>> I could try to run the scorer manually and see if I can reproduce the
>> error. This looks like some debugging is needed.
>>
>> From: Kelly Marchisio <kellymarchi...@gmail.com>
>> Sent: Monday, January 15, 2018 6:57 AM
>> To: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
>> Cc: moses-support <moses-support@mit.edu>
>> Subject: Re: [Moses-support] M2 Scorer in EMS for Grammatical Error Correction
>>
>> I've sentence-split all my data, but tuning with M2SCORER and mert fails
>> again. In extract.err, I see:
>>
>> Binary write mode is NOT selected
>> Scorer type: M2SCORER
>> name: case value: true
>> Data::m_score_type M2Scorer
>> Data::Scorer type from Scorer: M2Scorer
>> loading nbest from run1.best100.out.gz
>> Levenshtein distance is greater than source size.
>> Exception: vector
>>
>> On a previous run, I saw the std::bad_alloc error again.
>>
>> mert seems to get through one round, then dies right after finishing
>> translating the document once. I know this because run1.out contains the
>> entire translated document, then it crashes.
>>
>> To simplify things, my training data has 4998 sentence-split lines, and
>> tuning has 500. (Training originally had 5000 - the tokenizer/cleaner must
>> have removed 2 lines). Sentences appear well-aligned looking at
>> training/giza.1
>>
>> Any ideas on this one?
>>
>> On Sun, Jan 14, 2018 at 4:43 AM, Kelly Marchisio <kellymarchi...@gmail.com> wrote:
>>
>> Yes, these errors happened during tuning with data like that. By the
>> original Python implementation, do you mean the one from the CoNLL 2014
>> shared task? (http://www.comp.nus.edu.sg/~nlp/conll14st.html, under
>> "Official Scorer")
>>
>> Thanks so much for the advice! I'll fix up my data tomorrow and give
>> this another go. Thank you :)
>>
>> On Sat, Jan 13, 2018 at 11:32 PM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> Are you tuning and testing on data like that? If yes, this could be part
>> of the problem. The M2 scorer in Moses is not really tested and probably
>> not well suited for heavy duty (the original Python implementation is even
>> worse). So it would definitely be better to make sure that not too much
>> weird stuff is going on in the data.
>>
>> From: Kelly Marchisio <kellymarchi...@gmail.com>
>> Sent: Saturday, January 13, 2018 8:28 PM
>> To: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>
>> Cc: moses-support <moses-support@mit.edu>
>> Subject: Re: [Moses-support] M2 Scorer in EMS for Grammatical Error Correction
>>
>> Ah, good to know that the scorer was called successfully and that I can
>> ignore the Levenshtein distance errors.
>>
>> As for allocating a huge piece of memory -- I realized that though my
>> parallel corpus is aligned, I actually split the original corpus by
>> *paragraph* instead of sentence. They're mostly short paragraphs (each max.
>> ~4 sentences, probably max 80-100 tokens or so), but there are some
>> outliers (the largest being ~250-300 tokens). Most paragraphs would only
>> need a few edits, but the largest might need 10-15+. Could this be causing
>> this problem?
>>
>> On Sat, Jan 13, 2018 at 11:08 PM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> There seem to be multiple issues here.
>>
>> As I said, I have no experience with EMS, so maybe someone else can
>> help with that.
>>
>> The message in extract.err seems to actually mean that you were
>> successful in calling the M2 scorer in EMS, the only problem is it dies
>> 😊 The Levenshtein message is part of a failsafe that is meant to avoid
>> exponentially long searches. It does not calculate the M2 metric for a
>> sentence pair where there would be excessively many edits (these are
>> usually wrong). These messages by themselves should not be a reason for
>> worrying.
>>
>> The std::bad_alloc on the other hand is not good. It seems the scorer
>> tries to allocate some huge piece of memory, probably some negative index
>> somewhere, and then dies. I have not seen this before. Is it possible that
>> your system is creating a lot of superfluous edits and the graph algorithm
>> in M2 is going crazy due to that?
>>
>> From: Kelly Marchisio <kellymarchi...@gmail.com>
>> Sent: Saturday, January 13, 2018 7:46 PM
>> To: Marcin Junczys-Dowmunt <junc...@amu.edu.pl>; moses-support <moses-support@mit.edu>
>> Subject: Re: [Moses-support] M2 Scorer in EMS for Grammatical Error Correction
>>
>> looping back in mailing-list and copying message :)
>>
>> Thanks so much for the response, Marcin!
>>
>> I did see your original repo, thanks for sending along. I'd love to get
>> this going with EMS because it looks like I can just pass in the M2 scorer
>> with:
>>
>> tuning-settings = "-mertdir $moses-bin-dir -mertargs='--sctype M2SCORER' -threads $cores"
>>
>> However it fails with:
>>
>> ERROR: Failed to run '/Users/kellymarchisio/L101Final/experiments/tuning/tmp.1/extractor.sh'. at /Users/kellymarchisio/L101Final/programs/mosesdecoder/scripts/training/mert-moses.pl line 1775.
>> cp: /Users/kellymarchisio/L101Final/experiments/tuning/tmp.1/moses.ini: No such file or directory
>>
>> There may be an error with the mert-moses script itself used with M2,
>> because moses.ini was never created within tmp.1
>>
>> Additionally, in extract.err, I see:
>>
>> Binary write mode is NOT selected
>> Scorer type: M2SCORER
>> name: case value: true
>> Data::m_score_type M2Scorer
>> Data::Scorer type from Scorer: M2Scorer
>> loading nbest from run1.best100.out.gz
>> Levenshtein distance is greater than source size.
>> Levenshtein distance is greater than source size.
>> extractor(67381,0x7fffde7dd3c0) malloc: *** mach_vm_map(size=3368542481395712) failed (error code=3)
>> *** error: can't allocate region
>> *** set a breakpoint in malloc_error_break to debug
>> Exception: std::bad_alloc
>>
>> I'm curious if you've come across these issues (I'm interested why I'm
>> seeing "Levenshtein distance is greater than source size.") and if you have
>> any pointers for how I can get mert-moses.pl to work for me with
>> M2Scorer.
>>
>> Best,
>> Kelly
>>
>> On Fri, Jan 12, 2018 at 9:53 PM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> Hi,
>>
>> We never really used it with EMS, so I do not think anyone can help you
>> here. Did you have a look at the original repo:
>> https://github.com/grammatical/baselines-emnlp2016 ? Otherwise we can
>> probably take this off-list and try to help you personally 😊
>>
>> From: Kelly Marchisio <kellymarchi...@gmail.com>
>> Sent: Friday, January 12, 2018 6:20 PM
>> To: moses-support <moses-support@mit.edu>
>> Subject: [Moses-support] M2 Scorer in EMS for Grammatical Error Correction
>>
>> Does anyone have experience using the M2 scorer for grammatical error
>> correction with EMS for tuning and evaluation? Junczys-Dowmunt &
>> Grundkiewicz (2016) use M2
>> (https://github.com/grammatical/baselines-emnlp2016/tree/c4fbcc09b45a46c7c46bdda2ba10484fa16e8f82),
>> but I see no examples of using it with EMS.
>>
>> Does anyone have experience or advice on how I can use the M2 scorer for
>> GEC in my project? I'm having trouble figuring out how to incorporate it
>> without an example. (for instance, how best to set up experiment.meta & the
>> config file to incorporate it)
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support