Re: language pack #1

kellen sunderland Fri, 07 Oct 2016 08:50:19 -0700

I was actually planning to try building KenLM into a Maven package that can
be easily distributed.  I haven't had much time to work on it, but I don't
think it should be too hard.


On Thu, Oct 6, 2016 at 4:16 PM, Matt Post <[email protected]> wrote:

> Okay, I've fixed the nonbreaking_prefixes path issue.
>
> The installation should now ignore your value of $JOSHUA entirely,
> preferring instead the bundled jar and scripts (maybe test this by
> unsetting $JOSHUA).
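
One way a bundled script can ignore $JOSHUA is to derive the bundle
directory from its own location. The sketch below only illustrates that
idea and is not the code shipped in the language pack; the function name
bundle_root is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the language pack): print the absolute
# directory containing the script path given as $1, so that later lookups
# can be made relative to the bundle instead of relying on $JOSHUA.
bundle_root() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$(dirname "$1")" && pwd )
}
```

A script could call bundle_root "$0" once at startup. To check that nothing
still depends on the variable, the pack can be run from a subshell with it
removed: ( unset JOSHUA; ./prepare.sh ).
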
>
> New version:
>
>         http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz
>
> Please note: my tests show that using BerkeleyLM results in a notable drop
> in performance (1–2 BLEU points across many test sets). I am worried that
> we have introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that
> users don't have to compile KenLM, but we're probably going to need to
> provide the option to "upgrade" for those willing to try to compile it. Or
> we'll need a solution for distributing pre-built KenLM shared libraries...
>
> matt
>
>
>
> > On Oct 5, 2016, at 11:43 PM, John Hewitt <[email protected]> wrote:
> >
> > Quick further note -- I already had $JOSHUA set to a different directory,
> > so initially all the lookups were failing.
> >
> > It's possible that current Joshua users will hit the same problem when
> > they download new language packs. This should be an obvious and quick fix
> > for the user, but I don't know if there's something we could do in the
> > name of making it even clearer. (Potentially checking whether $JOSHUA is
> > the same as $PWD after the directory change in prepare.sh, and printing a
> > warning if it's not?)
> >
> > -John
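
The check suggested above could be sketched roughly as follows. This is an
illustration only, not what prepare.sh does: the function name
warn_if_joshua_mismatch, the warning text, and the return codes are all
assumptions. The directory to compare against is passed as an argument;
prepare.sh would pass "$PWD" after its directory change.

```shell
#!/bin/sh
# Hypothetical helper (not part of the shipped prepare.sh): warn when $JOSHUA
# is set but points somewhere other than the language pack directory ($1).
# Returns 0 when there is nothing to report and 1 after printing a warning.
warn_if_joshua_mismatch() {
    pack_dir="$1"
    # An unset or empty $JOSHUA cannot conflict with the pack.
    if [ -z "${JOSHUA:-}" ]; then
        return 0
    fi
    # Plain string comparison; symlinks and trailing slashes are not resolved.
    if [ "$JOSHUA" != "$pack_dir" ]; then
        echo "WARNING: \$JOSHUA ($JOSHUA) is not the language pack directory ($pack_dir)" >&2
        return 1
    fi
    return 0
}
```

Inside a script that uses "set -e", the call would need to be written as
warn_if_joshua_mismatch "$PWD" || true so that the warning does not abort
the run.
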
> >
> > On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt <[email protected]> wrote:
> >
> >> Thanks, Matt!
> >>
> >> Some notes:
> >>
> >> When piping input into prepare.sh, I get the following output:
> >>
> >> WARNING: No known abbreviations for language 'es', attempting fall-back to English version...
> >> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
> >>
> >> Seems that line 12 of tokenize.pl:
> >> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
> >> should be:
> >> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
> >>
> >> When I make this modification, it works just fine for me.
> >> Also, tried in server mode -- seems to work without issue.
> >>
> >> (For reference -- executed on an openSUSE cluster)
> >>
> >> -John
> >>
> >>
> >>
> >> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post <[email protected]> wrote:
> >>
> >>> Hi folks,
> >>>
> >>> I have managed to assemble an actual working language pack. Consider this
> >>> a (near-final, I hope) draft of what we're rolling out for lots of
> >>> languages. Please download it, check out the README and associated files,
> >>> test it, and let me know what's missing or what needs to change.
> >>>
> >>>        http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
> >>>
> >>> Suggested use:
> >>>
> >>>        tar xzvf apache-joshua-es-en-2016-10-05.tgz
> >>>        echo "\"Yo quiero Taco Bell,\", él dijo." \
> >>>                | ./apache-joshua-es-en-2016-10-05/prepare.sh \
> >>>                | ./apache-joshua-es-en-2016-10-05/joshua
> >>>
> >>> matt
> >>
> >>
> >>
>
>
