I was actually going to try to build KenLM into a Maven package that can be easily distributed. I haven't had much time to work on it, but I don't think it should be too hard.
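
On John's suggestion quoted below (warn when $JOSHUA points somewhere other than the language pack directory), a minimal sketch of such a check might look like the following. The function name warn_if_joshua_mismatch is hypothetical and this is not what prepare.sh actually does; since the 2016-10-06 pack is meant to ignore $JOSHUA entirely, the check may no longer be needed.

```shell
#!/bin/sh
# Hypothetical helper, not part of the real prepare.sh.
# Prints a warning to stderr when $JOSHUA is set and does not resolve
# to the language pack directory passed as the first argument.
warn_if_joshua_mismatch() {
    pack_dir=$1

    # Nothing to check if JOSHUA is unset or empty.
    if [ -z "${JOSHUA:-}" ]; then
        return 0
    fi

    # Resolve both paths so trailing slashes and symlinks do not
    # trigger a false warning.
    pack_real=$(cd "$pack_dir" 2>/dev/null && pwd -P) || return 0
    joshua_real=$(cd "$JOSHUA" 2>/dev/null && pwd -P) || joshua_real=$JOSHUA

    if [ "$joshua_real" != "$pack_real" ]; then
        echo "WARNING: \$JOSHUA is '$JOSHUA', not the language pack directory '$pack_real'" >&2
    fi
    return 0
}
```

Assuming prepare.sh has already changed into the pack directory, it could call warn_if_joshua_mismatch "$PWD" before running the tokenizer.
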
On Thu, Oct 6, 2016 at 4:16 PM, Matt Post <[email protected]> wrote:

> Okay, I've fixed the nonbreaking_prefixes path issue.
>
> The installation should now ignore your value of $JOSHUA entirely,
> preferring instead the bundled jar and scripts (maybe test this by
> unsetting $JOSHUA).
>
> New version:
>
> http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz
>
> Please note: my tests show that using BerkeleyLM results in a notable drop
> in performance (1–2 BLEU points across many test sets). I am worried that
> we have introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that
> users don't have to compile KenLM, but we're probably going to need to
> provide the option to "upgrade" for those willing to try to compile it. Or
> we'll need a solution for distributing pre-built KenLM shared libraries...
>
> matt
>
> > On Oct 5, 2016, at 11:43 PM, John Hewitt <[email protected]> wrote:
> >
> > Quick further note -- I already had $JOSHUA set to a different directory,
> > so initially all the lookups were failing.
> >
> > It's possible current users of JOSHUA will as well when they download new
> > language packs. This should be an obvious and quick fix for the user, but I
> > don't know if there's something we could do in the name of making it even
> > clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
> > the directory change in prepare.sh, and printing a warning if it's not?)
> >
> > -John
> >
> > On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt <[email protected]> wrote:
> >
> >> Thanks, Matt!
> >>
> >> Some notes:
> >>
> >> When piping input into prepare.sh, I get the following output:
> >>
> >> WARNING: No known abbreviations for language 'es', attempting fall-back to English version...
> >> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
> >>
> >> Seems that line 12 of tokenize.pl:
> >>   my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
> >> should be:
> >>   my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
> >>
> >> When I make this modification, it works just fine for me.
> >> Also, tried in server mode -- seems to work without issue.
> >>
> >> (For reference -- executed on an openSUSE cluster)
> >>
> >> -John
> >>
> >> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post <[email protected]> wrote:
> >>
> >>> Hi folks,
> >>>
> >>> I have managed to assemble an actual working language pack. Consider this
> >>> a (near-final, I hope) draft of what we're rolling out for lots of
> >>> languages. Please download it, check out the README and associated files,
> >>> test it, and let me know what's missing or what needs to change.
> >>>
> >>> http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
> >>>
> >>> Suggested use:
> >>>
> >>> tar xzvf apache-joshua-es-en-2016-10-05.tgz
> >>> echo "\"Yo quiero Taco Bell,\", él dijo." \
> >>>   | ./apache-joshua-es-en-2016-10-05/prepare.sh \
> >>>   | ./apache-joshua-es-en-2016-10-05/joshua
> >>>
> >>> matt
