Re: moses2 vs. joshua

2016-10-06 Thread Mattmann, Chris A (3980)
Hear, hear! Great job, and thanks for hosting.

++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 10/6/16, 12:49 AM, "kellen sunderland"  wrote:

Will do, but it might be a few days before I get the time to do a proper
test.  Thanks for hosting Matt.

On Thu, Oct 6, 2016 at 2:19 AM, Matt Post  wrote:

> Hi folks,
>
> Sorry this took so long, long story. But the four models that Hieu shared
> with me are ready. You can download them here; they're each about 15–20 GB.
>
>   http://cs.jhu.edu/~post/files/joshua-hiero-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-phrase-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>
> It'd be great if someone could test them on a machine with lots of cores,
> to see how things scale.
>
> matt
>
> On Sep 22, 2016, at 9:09 AM, Matt Post  wrote:
>
> Hi folks,
>
> I have finished the comparison. Here you can find graphs for ar-en and
> ru-en. The ground-up rewrite of Moses is
> about 2x–3x faster than Joshua.
>
> http://imgur.com/a/FcIbW
>
> One implication (untested) is that we are likely as fast as or faster than
> Moses.
>
> We could brainstorm things to do to close this gap. I'd be much happier
> with 2x or even 1.5x than with 3x, and I bet we could narrow this down. But
> I'd like to get the 6.1 release out of the way, first, so I'm pushing this
> off to next month. Sound cool?
>
> matt
>
>
> On Sep 19, 2016, at 6:26 AM, Matt Post  wrote:
>
> I can't believe I did this, but I mis-colored one of the hiero lines, and
> the Numbers legend doesn't show the line type. If you reload the dropbox
> file, it's fixed now. The difference is about 3x for both. Here's the table.
>
> Threads  Joshua  Moses2  Joshua (hiero)  Moses2 (hiero)  Phrase rate  Hiero rate
>       1     178      65            2116            1137         2.74        1.86
>       2     109      42            1014             389         2.60        2.61
>       4      78      29             596             213         2.69        2.80
>       6      72      25             473             154         2.88        3.07
>
> I'll put the models together and share them later today. This was on a
> 6-core machine and I agree it'd be nice to test with something much higher.
>
> matt
>
>
> On Sep 19, 2016, at 5:33 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>
> Do we just want to store these models somewhere temporarily?  I've got a
> OneDrive account and could share the models from there (as long as they're
> below 500GBs or so).
>
> On Mon, Sep 19, 2016 at 11:32 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> Very nice results.  I think getting to within 25% of an optimized C++
> decoder from a Java decoder is impressive.  Great that Hieu has put in the
> work to make moses2 so fast as well; that gives organizations two quite
> nice decoding engines to choose from, both with reasonable performance.
>
> Matt: I had a question about the x axis here.  Is that the number of threads?
> We should be scaling more or less linearly with the number of threads; is
> that the case here?  If you post the models somewhere I can also do a quick
> benchmark on a machine with a few more cores.
>
> -Kellen
>
>
> On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> On Sat, Sep 17, 2016 at 3:23 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> I'll ask Hieu; I don't anticipate any problems. One potential problem is
> that the models occupy about 15--20 GB; do you think Jenkins would host
> this?
>
>
> I'm not sure; can such models be downloaded and pruned at runtime, or do
> they need to exist on the Jenkins machine?

Re: language pack #1

2016-10-06 Thread Matt Post
Okay, I've fixed the nonbreaking_prefixes path issue.

The installation should now ignore your value of $JOSHUA entirely, preferring 
instead the bundled jar and scripts (maybe test this by unsetting $JOSHUA).

New version:

http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-06.tgz 


Please note: my tests show that using BerkeleyLM results in a notable drop in 
performance (1–2 BLEU points across many test sets). I am worried that we have 
introduced a bug in LanguageModelFF.java. We use BerkeleyLM so that users don't 
have to compile KenLM, but we're probably going to need to provide the option 
to "upgrade" for those willing to try to compile it. Or we'll need a solution 
for distributing pre-built KenLM shared libraries...

matt



> On Oct 5, 2016, at 11:43 PM, John Hewitt  wrote:
> 
> Quick further note -- I already had $JOSHUA set to a different directory,
> so initially all the lookups were failing.
> 
> It's possible current users of JOSHUA will as well when they download new
> language packs. This should be an obvious and quick fix for the user, but I
> don't know if there's something we could do in the name of making it even
> clearer. (Potentially checking whether $JOSHUA is the same as $PWD after
> the directory change in prepare.sh, and printing a warning if it's not?)
> 
> -John
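
The check John suggests could look roughly like the sketch below. It is written in Python purely to illustrate the logic; prepare.sh itself is a shell script, and the function name `joshua_env_warning` and the wording of the warning are assumptions, not anything taken from the language pack.

```python
import os


def joshua_env_warning(env, pack_dir):
    """Return a warning string if $JOSHUA points outside pack_dir, else None.

    Hypothetical helper: `env` is a mapping such as os.environ and
    `pack_dir` is the directory the language pack was unpacked into.
    """
    joshua = env.get("JOSHUA")
    if not joshua:
        # Nothing is set, so there is nothing to conflict with.
        return None
    # Compare resolved paths so trailing slashes and symlinks do not matter.
    if os.path.realpath(joshua) == os.path.realpath(pack_dir):
        return None
    return (
        f"WARNING: $JOSHUA is set to {joshua}, which differs from the "
        f"language pack directory {pack_dir}; lookups may fail."
    )


if __name__ == "__main__":
    message = joshua_env_warning(os.environ, os.getcwd())
    if message:
        print(message)
```

Returning the message instead of printing it keeps the comparison easy to test; a caller can decide whether a mismatch should only warn or should stop the run.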
> 
> On Wed, Oct 5, 2016 at 11:32 PM, John Hewitt  wrote:
> 
>> Thanks, Matt!
>> 
>> Some notes:
>> 
>> When piping input into prepare.sh, I get the following output:
>> 
>> WARNING: No known abbreviations for language 'es', attempting fall-back to English version...
>> ERROR: No abbreviations files found in /nlp/users/johnhew/apache-joshua-es-en-2016-10-05/scripts/preparation/nonbreaking_prefixes
>> 
>> Seems that line 12 of tokenize.pl:
>> my $mydir = "$ENV{JOSHUA}/scripts/preparation/nonbreaking_prefixes";
>> should be:
>> my $mydir = "$ENV{JOSHUA}/scripts/nonbreaking_prefixes";
>> 
>> When I make this modification, it works just fine for me.
>> Also, tried in server mode -- seems to work without issue.
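
One way to make a lookup like this tolerant of either layout is to try both locations and take the first that exists. The sketch below shows that idea in Python; it is not what tokenize.pl does, and `find_prefix_dir` and `CANDIDATES` are hypothetical names. The two relative paths are the ones quoted in this message.

```python
import os
import tempfile

# Candidate locations relative to the Joshua root, taken from the two
# tokenize.pl variants quoted in this message.
CANDIDATES = (
    "scripts/preparation/nonbreaking_prefixes",
    "scripts/nonbreaking_prefixes",
)


def find_prefix_dir(root, candidates=CANDIDATES):
    """Return the first candidate directory that exists under root, or None."""
    for rel in candidates:
        path = os.path.join(root, rel)
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    # Demonstrate with a throwaway directory that only has the second layout.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "scripts", "nonbreaking_prefixes"))
        print(find_prefix_dir(root))
```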
>> 
>> (For reference -- executed on an openSUSE cluster)
>> 
>> -John
>> 
>> 
>> 
>> On Wed, Oct 5, 2016 at 10:36 PM, Matt Post  wrote:
>> 
>>> Hi folks,
>>> 
>>> I have managed to assemble an actual working language pack. Consider this
>>> a (near-final, I hope) draft of what we're rolling out for lots of
>>> languages. Please download it, check out the README and associated files,
>>> test it, and let me know what's missing or what needs to change.
>>> 
>>>   http://cs.jhu.edu/~post/files/apache-joshua-es-en-2016-10-05.tgz (2.1 GB)
>>> 
>>> Suggested use:
>>> 
>>>   tar xzvf apache-joshua-es-en-2016-10-05.tgz
>>>   echo "\"Yo quiero Taco Bell,\", él dijo." \
>>>     | ./apache-joshua-es-en-2016-10-05/prepare.sh \
>>>     | ./apache-joshua-es-en-2016-10-05/joshua
>>> 
>>> matt
>> 
>> 
>> 



[jira] [Commented] (JOSHUA-290) Provide Joshua artifact as a bundle

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551971#comment-15551971
 ] 

ASF GitHub Bot commented on JOSHUA-290:
---

Github user mjpost commented on a diff in the pull request:

https://github.com/apache/incubator-joshua/pull/69#discussion_r82191791
  
--- Diff: src/main/java/org/apache/joshua/decoder/package-info.java ---
@@ -23,4 +23,7 @@
  * of any actual decoding algorithm. Rather, such code is in 
  * child packages of this package.
  */
-package org.apache.joshua.decoder;
\ No newline at end of file
+@Version("0.1.0")
+package org.apache.joshua.decoder;
+
+import org.osgi.annotation.versioning.Version;
--- End diff --

Yes, 7 maven-multi-module is merged into 7, so that's the best place for it.


> Provide Joshua artifact as a bundle
> ---
>
> Key: JOSHUA-290
> URL: https://issues.apache.org/jira/browse/JOSHUA-290
> Project: Joshua
> Issue Type: Task
> Components: build
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
>
> I think it'd be good if we could make the Joshua artifact an OSGi _bundle_.
> This would have no impact on plain Java applications but would give the
> following benefits:
> - make it possible to install it in OSGi environments
> - optionally introduce semantic versioning (together with the baseline
> plugin), which would help track, e.g., whether changes in APIs break backward
> compatibility



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (JOSHUA-290) Provide Joshua artifact as a bundle

2016-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JOSHUA-290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551238#comment-15551238
 ] 

ASF GitHub Bot commented on JOSHUA-290:
---

Github user KellenSunderland commented on a diff in the pull request:

https://github.com/apache/incubator-joshua/pull/69#discussion_r8215
  
--- Diff: src/main/java/org/apache/joshua/decoder/package-info.java ---
@@ -23,4 +23,7 @@
  * of any actual decoding algorithm. Rather, such code is in 
  * child packages of this package.
  */
-package org.apache.joshua.decoder;
\ No newline at end of file
+@Version("0.1.0")
+package org.apache.joshua.decoder;
+
+import org.osgi.annotation.versioning.Version;
--- End diff --

I would vote for putting it in the 7 branch actually.  @mjpost what do you 
think?




[GitHub] incubator-joshua pull request #69: JOSHUA-290 - provide Joshua as a bundle

2016-10-06 Thread KellenSunderland
Github user KellenSunderland commented on a diff in the pull request:

https://github.com/apache/incubator-joshua/pull/69#discussion_r8215
  
--- Diff: src/main/java/org/apache/joshua/decoder/package-info.java ---
@@ -23,4 +23,7 @@
  * of any actual decoding algorithm. Rather, such code is in 
  * child packages of this package.
  */
-package org.apache.joshua.decoder;
\ No newline at end of file
+@Version("0.1.0")
+package org.apache.joshua.decoder;
+
+import org.osgi.annotation.versioning.Version;
--- End diff --

I would vote for putting it in the 7 branch actually.  @mjpost what do you 
think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: moses2 vs. joshua

2016-10-06 Thread kellen sunderland
Will do, but it might be a few days before I get the time to do a proper
test.  Thanks for hosting Matt.

On Thu, Oct 6, 2016 at 2:19 AM, Matt Post  wrote:

> Hi folks,
>
> Sorry this took so long, long story. But the four models that Hieu shared
> with me are ready. You can download them here; they're each about 15–20 GB.
>
>   http://cs.jhu.edu/~post/files/joshua-hiero-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-phrase-ar-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>   http://cs.jhu.edu/~post/files/joshua-hiero-ru-en.tbz
>
> It'd be great if someone could test them on a machine with lots of cores,
> to see how things scale.
>
> matt
>
> On Sep 22, 2016, at 9:09 AM, Matt Post  wrote:
>
> Hi folks,
>
> I have finished the comparison. Here you can find graphs for ar-en and
> ru-en. The ground-up rewrite of Moses is
> about 2x–3x faster than Joshua.
>
> http://imgur.com/a/FcIbW
>
> One implication (untested) is that we are likely as fast as or faster than
> Moses.
>
> We could brainstorm things to do to close this gap. I'd be much happier
> with 2x or even 1.5x than with 3x, and I bet we could narrow this down. But
> I'd like to get the 6.1 release out of the way, first, so I'm pushing this
> off to next month. Sound cool?
>
> matt
>
>
> On Sep 19, 2016, at 6:26 AM, Matt Post  wrote:
>
> I can't believe I did this, but I mis-colored one of the hiero lines, and
> the Numbers legend doesn't show the line type. If you reload the dropbox
> file, it's fixed now. The difference is about 3x for both. Here's the table.
>
> Threads  Joshua  Moses2  Joshua (hiero)  Moses2 (hiero)  Phrase rate  Hiero rate
>       1     178      65            2116            1137         2.74        1.86
>       2     109      42            1014             389         2.60        2.61
>       4      78      29             596             213         2.69        2.80
>       6      72      25             473             154         2.88        3.07
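
The two "rate" columns can be reproduced from the other four, and the same figures give a rough answer to the question about linear scaling. The sketch below assumes the Joshua and Moses2 columns are running times (lower is better), which fits Moses2 being described as the faster decoder; the thread does not state the unit. The names `TIMES`, `ratios` and `speedups` are illustrative only.

```python
# Figures copied from the table in this message. The unit is not stated in
# the thread, so the values are treated as plain numbers.
TIMES = {
    # threads: (joshua_phrase, moses2_phrase, joshua_hiero, moses2_hiero)
    1: (178, 65, 2116, 1137),
    2: (109, 42, 1014, 389),
    4: (78, 29, 596, 213),
    6: (72, 25, 473, 154),
}


def ratios(times):
    """Joshua / Moses2 ratio per thread count as (phrase, hiero)."""
    return {
        n: (round(jp / mp, 2), round(jh / mh, 2))
        for n, (jp, mp, jh, mh) in times.items()
    }


def speedups(times, column):
    """Speedup of one column relative to its single-thread figure."""
    base = times[1][column]
    return {n: round(base / row[column], 2) for n, row in times.items()}


if __name__ == "__main__":
    for n, (phrase, hiero) in ratios(TIMES).items():
        print(f"{n} threads: phrase {phrase:.2f}x, hiero {hiero:.2f}x")
    print("Joshua phrase speedup vs. 1 thread:", speedups(TIMES, 0))
```

Under that reading, Joshua's phrase-based figure falls from 178 at one thread to 72 at six, a speedup of about 2.5x rather than 6x, so neither decoder scales linearly on this 6-core machine.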
>
> I'll put the models together and share them later today. This was on a
> 6-core machine and I agree it'd be nice to test with something much higher.
>
> matt
>
>
> On Sep 19, 2016, at 5:33 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>
> Do we just want to store these models somewhere temporarily?  I've got a
> OneDrive account and could share the models from there (as long as they're
> below 500GBs or so).
>
> On Mon, Sep 19, 2016 at 11:32 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> Very nice results.  I think getting to within 25% of an optimized C++
> decoder from a Java decoder is impressive.  Great that Hieu has put in the
> work to make moses2 so fast as well; that gives organizations two quite
> nice decoding engines to choose from, both with reasonable performance.
>
> Matt: I had a question about the x axis here.  Is that the number of threads?
> We should be scaling more or less linearly with the number of threads; is
> that the case here?  If you post the models somewhere I can also do a quick
> benchmark on a machine with a few more cores.
>
> -Kellen
>
>
> On Mon, Sep 19, 2016 at 10:53 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> On Sat, Sep 17, 2016 at 3:23 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> I'll ask Hieu; I don't anticipate any problems. One potential problem is
> that the models occupy about 15--20 GB; do you think Jenkins would host
> this?
>
>
> I'm not sure; can such models be downloaded and pruned at runtime, or do
> they need to exist on the Jenkins machine?
>
>
>
> (ru-en grammars still packing, results will probably not be in until much
> later today)
>
> matt
>
>
> On Sep 17, 2016, at 3:19 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>
>
> Hi Matt,
>
> I think it'd be really valuable if we could repeat the same tests in the
> future (given that the parallel corpus is available). Any chance you can
> share the scripts / code to do that? We may even consider adding a Jenkins
> job dedicated to continuously monitoring performance as we work on the
> Joshua master branch.
>
> WDYT?
>
> Anyway thanks for sharing the very interesting comparisons.
> Regards,
> Tommaso
>
> On Sat, Sep 17, 2016 at 12:29 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> Ugh, I think the mailing list deleted the attachment. Here is an attempt
> around our censors:
>
> https://www.dropbox.com/s/80up63reu4q809y/ar-en-joshua-moses2.png?dl=0
>
>
> On Sep 17, 2016, at 12:21 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> Hi everyone,
>
> One thing we did this week at MT Marathon was a speed comparison of
>
> Joshua 6.1