[jira] [Commented] (JOSHUA-288) Port fast_align to java

2016-10-07 Thread Matt Post (JIRA)

[ https://issues.apache.org/jira/browse/JOSHUA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556244#comment-15556244 ]

Matt Post commented on JOSHUA-288:
--

For AER, the French data is almost certainly the Hansards: 
http://www.isi.edu/natural-language/download/hansard/. I'm not sure about the 
Chinese data. Table 2 doesn't seem to be described in the text. I think you 
could benchmark against just the Hansards, or manually compare against 
whatever fast_align produces.
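
For reference, a minimal sketch of how AER could be computed when benchmarking a port against gold alignments. It assumes the usual definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links, and P the possible gold links. The function names and the toy data are illustrative only; the parser assumes alignments in the "i-j" text format that fast_align writes, and the gold Hansards files would need their own reader.

```python
def parse_links(line):
    """Parse one line of 'i-j' alignment links, e.g. '0-0 1-2', into a set of pairs."""
    return {tuple(int(x) for x in link.split("-")) for link in line.split()}


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / denom


# Toy example (made-up links, not real Hansards data).
predicted = parse_links("0-0 1-1 2-2")
sure = parse_links("0-0 1-1")
possible = parse_links("2-1")
aer = alignment_error_rate(predicted, sure, possible)
```

In practice the three sets would be accumulated over the whole test set before dividing, rather than averaging per-sentence scores.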

> Port fast_align to java
> ---
>
> Key: JOSHUA-288
> URL: https://issues.apache.org/jira/browse/JOSHUA-288
> Project: Joshua
> Issue Type: New Feature
> Reporter: Matt Post
> Assignee: John Hewitt
> Priority: Minor
> Fix For: 6.2
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> It would be great to have a Java port of fast_align, so that we don't have to 
> worry about compiling it, and could distribute it via Maven.
> https://github.com/clab/fast_align





thrax problem

2016-10-07 Thread Matt Post
Hi folks,

I thought I'd let you know about a problem I discovered with Thrax. Can you 
spot it?

$ ls -lh grammar.gz
-rw-r--r-- 1 mpost staff 2.2G Oct  6 13:55 grammar.gz
$ gzip -cd 9/grammar.gz | cut -d\| -f4 | uniq -c | sort -n | tail
   8448  las 
   8643  a 
   9440  que 
   9595  se 
   9696  , 
  10617  los 
  10885  el 
  11687  en 
  11932  de 
  12738  la 

As you can see, many source sides have a huge number of target options. The 
first time any rule is used, all the target sides are scored with 
estimateRule() in order to sort them (including a call to the LM), and then all 
but the top 20 (configurable with -num_translation_options) are discarded. This 
is a big waste: the useless rules are stored on disk, and while the 
compute-time waste is a one-time cost per source side, it does make a 
difference in "warming up" the decoder and, of course, in memory usage.

The problem is that Thrax takes all target sides it finds during training. It 
would be good to add an option to Thrax that only keeps the top X translation 
options for each source side (where X is maybe 100).
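
A rough sketch of what such a pruning pass could look like, written as a standalone filter rather than as actual Thrax code. It assumes rules in the "LHS ||| source ||| target ||| features" layout, that rules sharing a source side are adjacent (as the uniq -c pipeline above also assumes), and a caller-supplied score function. The names prune_grammar and toy_score and the numeric features in the example are hypothetical; a real option would have to decide how to rank target sides from Thrax's own features.

```python
import heapq
from itertools import groupby


def source_side(rule):
    """Return the source side of a rule: LHS ||| source ||| target ||| features."""
    return rule.split(" ||| ")[1]


def prune_grammar(rules, score, top_x=100):
    """Yield at most top_x highest-scoring rules for each source side.

    Assumes rules with the same source side are adjacent in the input.
    """
    for _, group in groupby(rules, key=source_side):
        yield from heapq.nlargest(top_x, group, key=score)


def toy_score(rule):
    """Hypothetical scorer: treat the first feature as a plain number, higher is better."""
    return float(rule.split(" ||| ")[3].split()[0])


# Toy grammar (made-up rules and scores).
rules = [
    "[X] ||| la ||| the ||| 0.9",
    "[X] ||| la ||| it ||| 0.1",
    "[X] ||| la ||| her ||| 0.5",
    "[X] ||| de ||| of ||| 0.8",
]
pruned = list(prune_grammar(rules, toy_score, top_x=2))
```

Because groupby streams one source side at a time, only the rules for the current source side are held in memory, which matters for a multi-gigabyte grammar.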

matt




Re: language pack #1

2016-10-07 Thread Matt Post
That would be awesome.

matt


> On Oct 7, 2016, at 11:49 AM, kellen sunderland wrote:
> 
> I was actually going to try and build KenLM into a maven package that can
> be easily distributed.  I haven't had time to work on it too much but I
> think it shouldn't be too hard.
> 
> On Thu, Oct 6, 2016 at 4:16 PM, Matt Post wrote:
> 
>> Okay, I've fixed the nonbreaking_prefixes path issue.
>> 
>> The installation should now ignore your value of $JOSHUA entirely,
>> preferring instead the bundled jar and scripts (maybe test this by
>> unsetting $JOSHUA).
>> 
>> New version:
>> 
>>    http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz
>> 
>> Please note: my tests show that using BerkeleyLM results in a notable drop
>> in performance (1–2 BLEU points across many test sets). I am worried that
>> we have introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that
>> users don't have to compile KenLM, but we're probably going to need to
>> provide the option to "upgrade" for those willing to try to compile it. Or
>> we'll need a solution for distributing pre-built KenLM shared libraries...
>> 
>> matt
>> 
>> 
>> 
>>> On Oct 5, 2016, at 11:43 PM, John Hewitt wrote:
>>> 
>>> Quick further note -- I already had $JOSHUA set to a different directory,
>>> so initially all the lookups were failing.
>>> 
>>> It's possible current users of JOSHUA will as well when they download new
>>> language packs. This should be an obvious and quick fix for the user, but I
>>> don't know if there's something we could do in the name of making it even
>>> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
>>> the directory change in prepare.sh, and printing a warning if it's not?)
>>> 
>>> -John
>>> 
>>> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt wrote:
>>> 
>>>> Thanks, Matt!
>>>>
>>>> Some notes:
>>>>
>>>> When piping input into prepare.sh, I get the following output:
>>>>
>>>> WARNING: No known abbreviations for language 'es', attempting fall-back to English version...
>>>> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>>>>
>>>> Seems that line 12 of tokenize.pl:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>>>> should be:
>>>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>>>>
>>>> When I make this modification, it works just fine for me.
>>>> Also, tried in server mode -- seems to work without issue.
>>>>
>>>> (For reference -- executed on an openSUSE cluster)
>>>>
>>>> -John
>>>>
>>>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I have managed to assemble an actual working language pack. Consider this
>>>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>>>> languages. Please download it, check out the README and associated files,
>>>>> test it, and let me know what's missing or what needs to change.
>>>>>
>>>>>   http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>>>>
>>>>> Suggested use:
>>>>>
>>>>>   tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>>>   echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>>>   | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>>>   | ./apache-joshua-es-en-2016-10-05/joshua
>>>>>
>>>>> matt