Re: [Moses-support] Estimating probabilities with KenLM

Prasanth K Tue, 26 Nov 2013 07:01:01 -0800

Ok. I have managed to re-create this error (no reason why it shouldn't come
back, I knew exactly what I told moses to do). So, the exact command run to
create the language model from the logs is as follows:


scripts/generic/trainlm-lmplz.perl -lmplz bin/lmplz -order 5 \
  -T europarl.en-sv/phrase-based-dup/tmp -S 10G \
  -text europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 \
  -lm europarl.en-sv/phrase-based-dup/lm/europarl.lm.1

Of course, all paths in the command above were absolute (I just shortened
them for readability). When this is run, my EMS log file
LM_europarl_train.id.STDERR contains the following:

EXECUTING bin/lmplz --order 5 -T europarl.en-sv/phrase-based-dup/tmp -S 10G
< europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 >
europarl.en-sv/phrase-based-dup/lm/europarl.lm.1

=== 1/5 Counting and sorting n-grams ===

Reading stdin

----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

****************************************************************************************************

Function not implemented

This does not make the language model step crash; instead it creates an
empty language model (0 lines). Below is the log file
LM_europarl_binarize.id.STDERR:

Reading europarl.en-sv/phrase-based-dup/lm/europarl.lm.1

----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

End of file Byte: 0 File: europarl.en-sv/phrase-based-dup/lm/europarl.lm.1

ERROR
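
Because the failure only shows up at the binarize step, a check on the lmplz output before binarizing would surface it earlier. A minimal sketch, assuming a hypothetical helper named IsMissingOrEmpty (it is not part of KenLM or Moses):

```cpp
#include <fstream>
#include <string>

// Hypothetical helper, not part of KenLM or Moses.
// Returns true if the file cannot be opened or contains zero bytes,
// which is the state the ARPA file was left in by the failed lmplz run.
bool IsMissingOrEmpty(const std::string &path) {
  std::ifstream in(path.c_str(), std::ios::binary);
  if (!in) return true;
  // peek() yields EOF immediately when there is nothing to read.
  return in.peek() == std::ifstream::traits_type::eof();
}
```

A wrapper around the training step could call this on europarl.lm.1 and stop with an error instead of handing an empty file to the binarizer.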

Clearly, something is wrong with my installation of KenLM (decoding with
KenLM works just fine; I have confirmed that now), which makes the
estimation go wrong. The question is: where do I start to fix this?
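
One possible starting point: "Function not implemented" is the usual strerror text for errno ENOSYS, so some call made during estimation is being reported as unavailable. Kenneth's message quoted below says lmplz uses pread and pwrite, so a small probe of those two calls may help narrow things down. This is a sketch under assumptions: the function name ProbePositionalIO and the /tmp scratch path are made up here, and lmplz makes other calls that this does not test.

```cpp
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <string>
#include <unistd.h>

// Hypothetical probe, not part of KenLM or Moses.  Returns an empty string
// when pwrite and pread both work on a scratch file, otherwise a message
// naming the call that failed.
std::string ProbePositionalIO() {
  char name[] = "/tmp/lmplz_probe_XXXXXX";  // assumed scratch location
  const int fd = mkstemp(name);
  if (fd == -1) return std::string("mkstemp: ") + std::strerror(errno);
  unlink(name);  // The file goes away once fd is closed.

  const char out[] = "probe";
  char in[sizeof(out)] = {};
  std::string problem;
  const ssize_t wrote = pwrite(fd, out, sizeof(out), 0);
  if (wrote == -1) {
    problem = std::string("pwrite: ") + std::strerror(errno);
  } else if (wrote != static_cast<ssize_t>(sizeof(out))) {
    problem = "pwrite: short write";
  } else {
    const ssize_t got = pread(fd, in, sizeof(in), 0);
    if (got == -1) {
      problem = std::string("pread: ") + std::strerror(errno);
    } else if (got != wrote || std::memcmp(in, out, sizeof(out)) != 0) {
      problem = "pread: data does not match what pwrite stored";
    }
  }
  close(fd);
  return problem;
}
```

An empty result only clears these two calls; a non-empty one names the call to look at first.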

Thanks.

- Regards,

Prasanth


On Tue, Nov 26, 2013 at 1:56 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

>  OK, I can't reproduce your error
>   Function not implemented
> You should find out exactly how lmplz is being run; it may be that you
> have a slightly older version that doesn't know all the arguments you've
> given it.
>
>
> On 26/11/2013 06:47, Prasanth K wrote:
>
> Hello Hieu,
>
>  My first attempt was to specify the absolute amount of memory (10G) but
> that gave an error saying function not implemented. Later, when I tried
> specifying the relative size (80%), I got a similar parse error to what you
> have given above. Strange that it should
>
>  @Kenneth, thanks for the code to estimate physical memory. I am going to
> give it a shot and let you know how it goes.
>
>  - Regards,
> Prasanth
>
>
> On Mon, Nov 25, 2013 at 9:20 PM, Hieu Hoang <hieuho...@gmail.com> wrote:
>
>> Prasanth - what is the exact lmplz command that was run by the EMS?
>>
>>
>> This works:
>>      .../lmplz --order 5 --text lm/europarl.lowercased.1 \
>>          --arpa lm/europarl.lmplz -T /tmp -S 1G
>> This doesn't:
>>      .../lmplz --order 5 --text lm/europarl.lowercased.1 \
>>          --arpa lm/europarl.lmplz -T /tmp -S 80%
>> It gives the error:
>>    util/usage.cc:220 in uint64_t util::<anonymous
>> namespace>::ParseNum(const std::string &) [Num = double] threw
>> SizeParseError because `!mem'.
>> Failed to parse 80% into a memory size because % was specified but the
>> physical memory size could not be determined.
>>
>>  However, it worked even with the source code from 4 days ago.
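
The error above follows from how a percentage has to be resolved: "80%" only means something once the physical memory size is known, while "1G" or "10G" does not need it. A minimal sketch of that logic, assuming a hypothetical function ParseMemorySize that accepts only the K, M, G and % suffixes (this is not the code in util/usage.cc):

```cpp
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Illustrative only; not KenLM's parser.
// physical_bytes == 0 means "could not be determined".
uint64_t ParseMemorySize(const std::string &arg, uint64_t physical_bytes) {
  char *end = nullptr;
  const double number = std::strtod(arg.c_str(), &end);
  if (end == arg.c_str() || number < 0) {
    throw std::invalid_argument("bad number in " + arg);
  }
  const std::string unit(end);
  if (unit == "%") {
    // A percentage cannot be resolved without knowing the total.
    if (physical_bytes == 0) {
      throw std::runtime_error("Failed to parse " + arg +
          ": % was specified but the physical memory size is unknown.");
    }
    return static_cast<uint64_t>(
        number * static_cast<double>(physical_bytes) / 100.0);
  }
  uint64_t scale;
  if (unit == "K") scale = 1ULL << 10;
  else if (unit == "M") scale = 1ULL << 20;
  else if (unit == "G") scale = 1ULL << 30;
  else throw std::invalid_argument("unknown unit in " + arg);
  return static_cast<uint64_t>(number * static_cast<double>(scale));
}
```

With physical_bytes set to 0, an absolute size still parses while a percentage throws, which matches the behaviour reported for -S 1G versus -S 80%.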
>>
>>
>> On 25/11/2013 19:07, Kenneth Heafield wrote:
>> > Hi,
>> >
>> >       I've taken a shot in the dark based on physmem.c to support
>> > physical memory estimation on BSD and OS X.  Please clone
>> >
>> > github.com/kpu/kenlm
>> >
>> > and compile with
>> >
>> > ./bjam
>> >
>> > If that fails, please let Hieu and me know (maybe Hieu can help since he
>> > has OS X).  If it doesn't fail, run
>> >
>> > bin/lmplz
>> >
>> > with no argument.  The help message will include a line e.g.
>> >
>> > "This machine has 135224176640 bytes of memory."
>> >
>> > or
>> >
>> > "Unable to determine the amount of memory on this machine."
>> >
>> > If it works, then I'll push to Moses.  Trying to not break Moses master
>> > for OS X.
>> >
>> > Kenneth
>> >
>> > On 11/24/13 22:40, Prasanth K wrote:
>> >> Hi Kenneth,
>> >>
>> >> Thanks for the clarification w.r.t. calculating the memory size. But I
>> >> am running these on a Mac (10.9 Mavericks). Do you think I should still
>> >> port the lmplz code to Mac for the estimation of probabilities?
>> >>
>> >> One thing though, I did change the default clang compiler that comes
>> >> with this new Mac to a gcc-4.8 (not sure that changes anything in this
>> >> context).
>> >>
>> >> - Prasanth
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Nov 22, 2013 at 6:50 PM, Kenneth Heafield <mo...@kheafield.com> wrote:
>> >>
>> >>      Hi,
>> >>
>> >>      What OS are you on?  Cygwin?  Apparently every OS reports memory
>> >>      size in a different way:
>> >>
>> >>
>> >>      http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/physmem.c;h=2629936146e3042f927523322f18aca76996cd7f;hb=HEAD
>> >>
>> >>      The good news is that the above code is LGPLv2:
>> >>
>> >>
>> >>      http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=modules/physmem;h=9644522e0493a85a9fb4ae7c4449741c2c1500ea;hb=HEAD
>> >>
>> >>      But currently I'm just using this short function that will fail
>> >>      on some platforms:
>> >>
>> >>      #include <stdint.h>
>> >>      #if !defined(_WIN32) && !defined(_WIN64)
>> >>      #include <unistd.h>  // sysconf
>> >>      #endif
>> >>
>> >>      // Returns 0 when the amount of physical memory is unknown.
>> >>      uint64_t GuessPhysicalMemory() {
>> >>      #if defined(_WIN32) || defined(_WIN64)
>> >>        return 0;
>> >>      #elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
>> >>        long pages = sysconf(_SC_PHYS_PAGES);
>> >>        if (pages == -1) return 0;
>> >>        long page_size = sysconf(_SC_PAGESIZE);
>> >>        if (page_size == -1) return 0;
>> >>        return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
>> >>      #else
>> >>        return 0;
>> >>      #endif
>> >>      }
>> >>
>> >>      If it fails, I just don't let users specify memory as a percentage.
>> >>      So one thing to fix is putting physmem.{h,c} in util then changing
>> >>      calls to GuessPhysicalMemory.  But I'm also not a fan of the way the
>> >>      GNU code gives up and makes up a number at the end.
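
For the OS X case raised later in the thread, one option is a sysctl fallback next to the sysconf path. The sketch below is an assumption about how that could look, not the change that went into KenLM: the name GuessPhysicalMemorySketch is made up, and only the Apple HW_MEMSIZE query is added. Like the original, it returns 0 rather than inventing a number.

```cpp
#include <cstddef>
#include <cstdint>
#include <unistd.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Sketch only; not KenLM's implementation.  Returns 0 if unknown.
uint64_t GuessPhysicalMemorySketch() {
#if defined(__APPLE__)
  // On OS X, the HW_MEMSIZE sysctl reports physical memory as 64 bits.
  int mib[2] = {CTL_HW, HW_MEMSIZE};
  uint64_t bytes = 0;
  size_t len = sizeof(bytes);
  if (sysctl(mib, 2, &bytes, &len, nullptr, 0) == 0 && len == sizeof(bytes)) {
    return bytes;
  }
  return 0;
#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
  const long pages = sysconf(_SC_PHYS_PAGES);
  if (pages == -1) return 0;
  const long page_size = sysconf(_SC_PAGESIZE);
  if (page_size == -1) return 0;
  return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
#else
  return 0;
#endif
}
```

Callers would keep treating 0 as "unknown" and refuse percentage sizes in that case.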
>> >>
>> >>      The second porting issue is that lmplz makes parallel use of pread,
>> >>      pwrite, and write.  Windows is unsafe in this regard (POSIX requires
>> >>      that pread/pwrite not change the file pointer; Windows has no way to
>> >>      implement that atomically).  To fix this, we'll always specify the
>> >>      file offset in cases that happen concurrently.  Extend
>> >>      util/stream/io.* with a PWrite class based on PWriteOrThrow then
>> >>      change FileBuffer to use PWrite.  Then I guess one should rename
>> >>      PReadOrThrow/PWriteOrThrow to something that indicates they're
>> >>      not-quite-POSIX on windows.  Also, the macros in these functions
>> >>      should detect cygwin, bypassing cygwin's "Function not implemented"
>> >>      and calling Windows APIs directly (they're already there for _WIN32).
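
The idea of always passing the file offset can be sketched as a writer that tracks its own position and never depends on the descriptor's shared file pointer. The names PWriteAll and PWriter are hypothetical and this is a POSIX-only illustration, not the PWrite class proposed for util/stream/io.*:

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical sketch.  Writes all of [data, data + size) at the given
// offset, retrying after short writes and EINTR.
void PWriteAll(int fd, const void *data, std::size_t size, uint64_t offset) {
  const char *cur = static_cast<const char *>(data);
  while (size) {
    const ssize_t ret = pwrite(fd, cur, size, static_cast<off_t>(offset));
    if (ret == -1) {
      if (errno == EINTR) continue;
      throw std::runtime_error(std::string("pwrite: ") + std::strerror(errno));
    }
    if (ret == 0) throw std::runtime_error("pwrite made no progress");
    cur += ret;
    size -= static_cast<std::size_t>(ret);
    offset += static_cast<uint64_t>(ret);
  }
}

// Keeps a private offset, so concurrent writers on the same descriptor
// do not interfere through the file pointer.
class PWriter {
 public:
  PWriter(int fd, uint64_t start) : fd_(fd), offset_(start) {}
  void Write(const void *data, std::size_t size) {
    PWriteAll(fd_, data, size, offset_);
    offset_ += size;
  }
  uint64_t Offset() const { return offset_; }
 private:
  int fd_;
  uint64_t offset_;
};
```

On a platform without an atomic pwrite, the body of PWriteAll is the part that would need a platform-specific replacement.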
>> >>
>> >>      I don't have a windows box so I can say what should be changed at a
>> >>      high level, but need an actual user to ensure it compiles and runs
>> >>      correctly.
>> >>
>> >>      Kenneth
>> >>
>> >>      On 11/22/13 06:49, Prasanth K wrote:
>> >>      > Hi,
>> >>      >
>> >>      > I am trying to use KenLM for building a language model on the
>> >>      > Europarl corpus. Following the instructions in
>> >>      > (http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc19),
>> >>      > I added the few lines for getting KenLM to estimate the LM
>> >>      > probabilities (order/n=5) to my config file to the EMS. The language
>> >>      > model dies during training, saying "Function not implemented" at
>> >>      > the counting and sorting n-grams stage (the first stage itself).
>> >>      > Does this mean there is something wrong with my installation? Or is
>> >>      > it just insufficient memory?
>> >>      >
>> >>      > Incidentally, when I started giving the amount of memory in terms
>> >>      > of % (80%) there was an error "Failed to parse .. into memory size
>> >>      > because physical memory size could not be determined". I am also
>> >>      > curious why this happens?
>> >>      >
>> >>      > Kenneth, can you shed some light on this? Thanks.
>> >>      >
>> >>      > - Regards,
>> >>      > Prasanth
>> >>      >
>> >>      >
>> >>      >
>> >>      > --
>> >>      > "Theories have four stages of acceptance. i) this is worthless
>> >>      > nonsense; ii) this is an interesting, but perverse, point of view,
>> >>      > iii) this is true, but quite unimportant; iv) I always said so."
>> >>      >
>> >>      >   --- J.B.S. Haldane
>> >>      >
>> >>      >
>> >>      > _______________________________________________
>> >>      > Moses-support mailing list
>> >>      > Moses-support@mit.edu
>> >>      > http://mailman.mit.edu/mailman/listinfo/moses-support
>> >>      >


