Re: language pack #1

2016-10-25 Thread Matt Post
Hi Lewis,

I have parameters to set the default amount of memory when building the 
language pack. The comment therein is just boilerplate that didn't get 
parameterized. I'll add that to the script. In general, memory usage can be 
heuristically set to the size of the model files that are loaded (the grammar 
and the language model).
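
As a rough sketch of that heuristic (this is not what the packaging script does today): sum the on-disk sizes of the model files and round up to whole gigabytes. The function name and the 20% headroom are assumptions for illustration, and it only handles plain files; a packed grammar directory would need its files summed first.

```shell
#!/bin/sh
# Hypothetical helper: estimate a JVM heap size (in whole GB) from the
# on-disk size of the model files given as arguments.
# The 20% headroom is an assumption, not a measured value.
estimate_heap_gb() {
    total=0
    for f in "$@"; do
        # wc -c reports the file size in bytes; fail if the file is unreadable
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    gb=1073741824
    # Add 20% headroom, then round up to the next whole gigabyte.
    echo $(( (total * 12 / 10 + gb - 1) / gb ))
}
```

It would be called with the grammar and language model files as arguments, and the result passed to whatever sets the memory default.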

Great to hear that things are working well for you!

matt


> On Oct 24, 2016, at 11:48 PM, lewis john mcgibbney <lewi...@apache.org> wrote:
> 
> Hi Matt,
> I got around to testing out the language pack you posted and have a few
> suggestions.
> 
>   -  The Joshua bash script states in a number of places that ..."# The
>   default amount of memory is 4gb". This is not true as it is set to a
>   different (higher) number by default.
>   - When starting the Joshua server, I monitored memory usage (JProfiler)
>   and it seems to somewhat stabilize and linger at around 5 1/2 GB. Is this
>   normal based on the size of the Berkeley LM?
>   - Translations are working pretty damn well. I've run a large amount of
>   current Spanish text relating to current news stories and the output looks
>   pretty comprehensive.
> 
> It would be great if we could update the Joshua Homebrew recipe with this
> language pack and also link to the pack from the Wiki.
> 
> Lewis
> 
> On Mon, Oct 10, 2016 at 2:48 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
> 
>> 
>> From: Matt Post <p...@cs.jhu.edu>
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Date: Fri, 7 Oct 2016 11:51:41 -0400
>> Subject: Re: language pack #1
>> That would be awesome.
>> 
>> 



Re: language pack #1

2016-10-24 Thread lewis john mcgibbney
Hi Matt,
I got around to testing out the language pack you posted and have a few
suggestions.

   -  The Joshua bash script states in a number of places that ..."# The
   default amount of memory is 4gb". This is not true as it is set to a
   different (higher) number by default.
   - When starting the Joshua server, I monitored memory usage (JProfiler)
   and it seems to somewhat stabilize and linger at around 5 1/2 GB. Is this
   normal based on the size of the Berkeley LM?
   - Translations are working pretty damn well. I've run a large amount of
   current Spanish text relating to current news stories and the output looks
   pretty comprehensive.

It would be great if we could update the Joshua Homebrew recipe with this
language pack and also link to the pack from the Wiki.

Lewis

On Mon, Oct 10, 2016 at 2:48 AM, <
dev-digest-h...@joshua.incubator.apache.org> wrote:

>
> From: Matt Post <p...@cs.jhu.edu>
> To: dev@joshua.incubator.apache.org
> Cc:
> Date: Fri, 7 Oct 2016 11:51:41 -0400
> Subject: Re: language pack #1
> That would be awesome.
>
>


Re: language pack #1

2016-10-07 Thread Matt Post
That would be awesome.

matt


> On Oct 7, 2016, at 11:49 AM, kellen sunderland wrote:
> 
> I was actually going to try and build KenLM into a maven package that can
> be easily distributed.  I haven't had time to work on it too much but I
> think it shouldn't be too hard.
> 
> On Thu, Oct 6, 2016 at 4:16 PM, Matt Post  wrote:
> 
>> Okay, I've fixed the nonbreaking_prefixes path issue.
>> 
>> The installation should now ignore your value of $JOSHUA entirely,
>> preferring instead the bundled jar and scripts (maybe test this by
>> unsetting $JOSHUA).
>> 
>> New version:
>> 
>> http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz
>> 
>> Please note: my tests show that using BerkeleyLM results in a notable drop
>> in performance (1–2 BLEU points across many test sets). I am worried that
>> we have introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that
>> users don't have to compile KenLM, but we're probably going to need to
>> provide the option to "upgrade" for those willing to try to compile it. Or
>> we'll need a solution for distributing pre-built KenLM shared libraries...
>> 
>> matt
>> 
>> 
>> 
>>> On Oct 5, 2016, at 11:43 PM, John Hewitt  wrote:
>>> 
>>> Quick further note -- I already had $JOSHUA set to a different directory,
>>> so initially all the lookups were failing.
>>> 
>>> It's possible current users of JOSHUA will as well when they download new
>>> language packs. This should be an obvious and quick fix for the user, but I
>>> don't know if there's something we could do in the name of making it even
>>> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
>>> the directory change in prepare.sh, and printing a warning if it's not?)
>>> 
>>> -John
>>> 
>>> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt wrote:
>>> 
>>>> Thanks, Matt!
>>>>
>>>> Some notes:
>>>>
>>>> When piping input into prepare.sh, I get the following output:
>>>>
>>>> WARNING: No known abbreviations for language 'es', attempting fall-back to
>>>> English version...
>>>> ERROR: No abbreviations files found in
>>>> /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>>>>
>>>> Seems that line 12 of tokenize.pl:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>>>> should be:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>>>>
>>>> When I make this modification, it works just fine for me.
>>>> Also, tried in server mode -- seems to work without issue.
>>>>
>>>> (For reference -- executed on an openSUSE cluster)
>>>>
>>>> -John
>>>>
>>>>
>>>>
>>>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post  wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I have managed to assemble an actual working language pack. Consider this
>>>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>>>> languages. Please download it, check out the README and associated files,
>>>>> test it, and let me know what's missing or what needs to change.
>>>>>
>>>>>   http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>>>>
>>>>> Suggested use:
>>>>>
>>>>>   tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>>>   echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>>>   | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>>>   | ./apache-joshua-es-en-2016-10-05/joshua
>>>>>
>>>>> matt
>>>>
>>>>
>>>>
>> 
>> 



Re: language pack #1

2016-10-06 Thread Matt Post
Okay, I've fixed the nonbreaking_prefixes path issue.

The installation should now ignore your value of $JOSHUA entirely, preferring 
instead the bundled jar and scripts (maybe test this by unsetting $JOSHUA).
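
One way to run that check without touching your login environment is to clear the variable only for the test command. A minimal sketch; the helper name is made up, and the commented usage assumes the pack was unpacked in the current directory:

```shell
#!/bin/sh
# Hypothetical helper: run a command in a subshell with $JOSHUA unset,
# leaving the calling shell's environment untouched.
without_joshua() {
    (
        unset JOSHUA
        "$@"
    )
}

# Example (paths assume the 2016-10-06 pack was unpacked here):
#   echo "Yo quiero Taco Bell" \
#     | without_joshua ./apache-joshua-es-en-2016-10-06/prepare.sh \
#     | without_joshua ./apache-joshua-es-en-2016-10-06/joshua
```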

New version:

http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz 


Please note: my tests show that using BerkeleyLM results in a notable drop in 
performance (1–2 BLEU points across many test sets). I am worried that we have 
introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that users don't 
have to compile KenLM, but we're probably going to need to provide the option 
to "upgrade" for those willing to try to compile it. Or we'll need a solution 
for distributing pre-built KenLM shared libraries...

matt



> On Oct 5, 2016, at 11:43 PM, John Hewitt  wrote:
> 
> Quick further note -- I already had $JOSHUA set to a different directory,
> so initially all the lookups were failing.
> 
> It's possible current users of JOSHUA will as well when they download new
> language packs. This should be an obvious and quick fix for the user, but I
> don't know if there's something we could do in the name of making it even
> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
> the directory change in prepare.sh, and printing a warning if it's not?)
> 
> -John
> 
> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt  wrote:
> 
>> Thanks, Matt!
>> 
>> Some notes:
>> 
>> When piping input into prepare.sh, I get the following output:
>> 
>> WARNING: No known abbreviations for language 'es', attempting fall-back to
>> English version...
>> ERROR: No abbreviations files found in
>> /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>> 
>> Seems that line 12 of tokenize.pl:
>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>> should be:
>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>> 
>> When I make this modification, it works just fine for me.
>> Also, tried in server mode -- seems to work without issue.
>> 
>> (For reference -- executed on an openSUSE cluster)
>> 
>> -John
>> 
>> 
>> 
>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post  wrote:
>> 
>>> Hi folks,
>>> 
>>> I have managed to assemble an actual working language pack. Consider this
>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>> languages. Please download it, check out the README and associated files,
>>> test it, and let me know what's missing or what needs to change.
>>> 
>>>   http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>> 
>>> Suggested use:
>>> 
>>>   tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>   echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>   | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>   | ./apache-joshua-es-en-2016-10-05/joshua
>>> 
>>> matt
>> 
>> 
>> 



Re: language pack #1

2016-10-05 Thread John Hewitt
Quick further note -- I already had $JOSHUA set to a different directory,
so initially all the lookups were failing.

It's possible current users of JOSHUA will as well when they download new
language packs. This should be an obvious and quick fix for the user, but I
don't know if there's something we could do in the name of making it even
clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
the directory change in prepare.sh, and printing a warning if it's not?)
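
Something along these lines is what I have in mind. It is only a sketch: the function name and the warning text are made up, and it assumes prepare.sh has already changed into the pack directory before calling it.

```shell
#!/bin/sh
# Hypothetical check for prepare.sh: after changing into the pack directory,
# warn on stderr if a pre-existing $JOSHUA points somewhere else.
warn_if_joshua_differs() {
    pack_dir=$(pwd -P)
    if [ -n "${JOSHUA:-}" ]; then
        # Resolve symlinks so two spellings of one directory compare equal;
        # fall back to the raw value if $JOSHUA is not a usable directory.
        joshua_dir=$(cd "$JOSHUA" >/dev/null 2>&1 && pwd -P) || joshua_dir=$JOSHUA
        if [ "$joshua_dir" != "$pack_dir" ]; then
            echo "WARNING: \$JOSHUA is set to '$JOSHUA', not this language pack ($pack_dir)" >&2
        fi
    fi
    return 0
}
```

It only warns and always returns 0, so it would not abort the pipeline even if the script runs with `set -e`.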

-John

On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt  wrote:

> Thanks, Matt!
>
> Some notes:
>
> When piping input into prepare.sh, I get the following output:
>
> WARNING: No known abbreviations for language 'es', attempting fall-back to
> English version...
> ERROR: No abbreviations files found in
> /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>
> Seems that line 12 of tokenize.pl:
> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
> should be:
> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>
> When I make this modification, it works just fine for me.
> Also, tried in server mode -- seems to work without issue.
>
> (For reference -- executed on an openSUSE cluster)
>
> -John
>
>
>
> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post  wrote:
>
>> Hi folks,
>>
>> I have managed to assemble an actual working language pack. Consider this
>> a (near-final, I hope) draft of what we're rolling out for lots of
>> languages. Please download it, check out the README and associated files,
>> test it, and let me know what's missing or what needs to change.
>>
>> http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>
>> Suggested use:
>>
>> tar xzvf apache-joshua-es-en-2016-10-05.tgz
>> echo "\"Yo quiero Taco Bell,\", él dijo." \
>> | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>> | ./apache-joshua-es-en-2016-10-05/joshua
>>
>> matt
>
>
>