from:"Matt Post"

Re: Dockerhub hosted images

2016-11-23 Thread Matt Post

Okay, I have this with

docker run -it kellens/apache-joshua-es-en-2016-10-05 bash

It seems we are missing Perl (./prepare.sh fails), and we should replace the 
LanguageModel line with a KenLM instance and build that. I bet we'll need 
Python, too.




> On Nov 23, 2016, at 8:15 AM, Matt Post  wrote:
> 
> Kellen, can I bother you to post a few first steps? I've successfully pulled 
> this down to my Mac but now do not know how to find it, edit it, or run it. 
> I'm poring over the documentation and will find it eventually, but this 
> would save me a bit of time.
> 
> 
>> On Nov 23, 2016, at 8:07 AM, kellen sunderland  
>> wrote:
>> 
>> Yes my next step was going to be getting it hosted officially.
>> 
>> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
>> Apache account until I've done a little more testing though.
>> 
>> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney"  wrote:
>> 
>>> Hi Kellen,
>>> Nice :)
>>> Another option is for us to host these via the Apache account.
>>> https://hub.docker.com/r/apache/
>>> We could then add a badge to our README which points to the Dockerfile(s).
>>> Do you want to open a ticket over on the INFRA Jira for this?
>>> 
>>> On Tue, Nov 22, 2016 at 1:57 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>>> From: kellen sunderland 
>>>> To: "dev@joshua.incubator.apache.org" 
>>>> Cc:
>>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>>> Subject: Re: Dockerhub hosted images
>>>> Ok, the first image should be properly uploaded now.
>>>> 
>>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>>> 
>>>> -Kellen
>>>> 
>>>> 
>>> 
>

Re: Any symal experts?

2016-11-23 Thread Matt Post

John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed 
replaced them with "atools"; how much work would it be to port that?


> On Nov 23, 2016, at 12:11 PM, John Hewitt  wrote:
> 
> Hey everyone,
> 
> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> the pipeline.
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetricizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
> However, symal uses the .bal format, which it fails to describe.
> It gets away with this because files from GIZA++ are piped through
> giza2bal.pl, which itself is not well documented.
> I'm attempting to write, say, fastalign2bal.py.
> With a bit of tinkering, I got at the .bal format:
> 
> 1
> 
> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> 
> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> 
> A template for which would be
> 
> 1
> 
> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
> alignment2 ... alignmentN]
> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
> alignment2 ... alignmentN]
> 
> 
> However, I'm hitting some pretty nasty errors with symal when I pipe in
> some fastalign2bal.py output.
> A few hours with gdb made some progress (for as far as I can tell, the
> formats are identical) but if anyone has experience with symal, I would
> greatly appreciate some consultation.
> 
> -John
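
The .bal template John describes above can be sketched in Python as follows. This is only a rough sketch under stated assumptions, not a known-good fastalign2bal.py: the function names are hypothetical, the links are assumed to arrive as 0-based "i-j" pairs in source-target order, each token is assumed to have at most one link in the relevant direction, and 0 is assumed to mark an unaligned token. Since John reports that symal still rejects output in this shape, the template itself may be missing a detail.

```python
def parse_links(pharaoh_line):
    """Parse 'i-j' pairs (assumed 0-based, source-target order) into tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in pharaoh_line.split()]


def bal_record(tgt_tokens, src_tokens, forward_links, reverse_links):
    """Build one three-line record following the template quoted above.

    forward_links give, for each target token, the 1-based index of its
    source token; reverse_links give, for each source token, the 1-based
    index of its target token. 0 marks an unaligned token (an assumption).
    """
    tgt_align = [0] * len(tgt_tokens)
    for i, j in forward_links:
        tgt_align[j] = i + 1
    src_align = [0] * len(src_tokens)
    for i, j in reverse_links:
        src_align[i] = j + 1

    def line(tokens, align):
        # "<count> <tokens> # <one alignment index per token>"
        return "{} {} # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align)
        )

    return "\n".join(["1", line(tgt_tokens, tgt_align), line(src_tokens, src_align)])
```

Feeding it the English and Czech sentences from John's example, with links chosen to match his numbers, gives the same three lines up to spacing before the "#".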

Re: Any symal experts?

2016-11-23 Thread Matt Post

I think it will be much less of a headache. The GIZA++ code is notorious for 
being unreadable, and the Perl piece of that pipeline only hurts (even though 
Philipp's Perl is unusually clear). I think adding atools to your port is the 
way to go, and that it's written in C++ should facilitate that.




> On Nov 23, 2016, at 12:25 PM, John Hewitt  wrote:
> 
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
> 
> I'll keep the symal use on the backburner and start putting together an
> atools port.
> 
> -John
> 
> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:
> 
>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>> indeed replaced them with "atools"; how much work would it be to port that?
>> 
>> 
>>> On Nov 23, 2016, at 12:11 PM, John Hewitt 
>> wrote:
>>> 
>>> Hey everyone,
>>> 
>>> I'm packaging up a Java port Fast Align for Joshua and integrating it
>> into
>>> the pipeline.
>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>> that I haven't ported to Java.
>>> We package symal (which symmetricizes alignments) with Joshua right now
>> for
>>> GIZA++, so I'm attempting to re-use that.
>>> However, symal uses the .bal format, which it fails to describe.
>>> It gets away with this because files from GIZA++ are piped through
>>> giza2bal.pl, which itself is not well documented.
>>> I'm attempting to write, say, fastalign2bal.py.
>>> With a bit of tinkering, I got at the .bal format:
>>> 
>>> 1
>>> 
>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>> 
>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>> 
>>> A template for which would be
>>> 
>>> 1
>>> 
>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>> alignment2 ... alignmentN]
>>> 
>>> 
>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>> some fastalign2bal.py output.
>>> A few hours with gdb made some progress (for as far as I can tell, the
>>> formats are identical) but if anyone has experience with symal, I would
>>> greatly appreciate some consultation.
>>> 
>>> -John
>> 
>>

Re: Downloading of non ASF licensed code

2016-11-28 Thread Matt Post

This would be easy to do. Maybe just a simple prompt that alerts the user? 
Something like

echo "Warning: this script downloads many tools used in building and running"
echo "Joshua. Not all of them are Apache Licensed. If you wish to continue,"
echo "hit Enter."
read -r j
if [[ -n $j ]]; then
    # Anything other than a bare Enter aborts the download.
    echo "Quitting."
    exit 1
fi



> On Nov 25, 2016, at 10:41 AM, Tom Barber  wrote:
> 
> This may have come up before in the whole licensing chat so apologies if
> I'm just going over old ground.
> 
> The download-deps.sh file obviously downloads and builds stuff with non ASF
> licenses, I realise this is for model training purposes only, and 99.9%
> wont care, but should we consider putting a prompt into that script warning
> people. I ask because a company might add in the training modules blindly
> assuming because the script is distributed by the ASF the modules are also
> ASL2.0.
> 
> Just a thought.
> 
> Tom
> 
> -- 
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689

★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-11-28 Thread Matt Post

One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the 
existing models that then get used. There is a research topic here for how to 
make it better (particularly, how to set the weights of rules that are added at 
runtime instead of learned from bitext), but it works really well for adding 
words that are OOV (since it's always cheaper to use the OOV). Here's a demo of 
how this works (this feature is included in the language packs). 


https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools 
called "translation memories", where translators are translating documents, and 
the sentences get queried against their past translations and matched in a 
fuzzy fashion. The big tool on the market for this is SDL Trados 
<http://www.sdl.com/solution/language/translation-productivity/trados-studio/>. 
I'm not talking about selling a product, but in a space that big, there have 
got to be a lot of people who'd rather just run their own system, than shell 
out for an expensive (and ugly) tool. So there is a big niche for an open 
source tool, and currently nothing really filling it. The "dynamic phrase 
table" feature above provides the beginnings of offering a TM competitor, but 
one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning 
infrastructure in Joshua. The use-case I imagine is that Joshua could sit on 
top of a large tuning set across diverse domains (e.g., formal news, informal 
web logs, spoken dialogue, etc). You could then add new phrases in sentences as 
above, which would get automatically aligned, and then everything could be 
retuned at the user's request (or perhaps at night). This way, when people 
added new data to their models, Joshua would automatically find the best 
weights, either immediately or on some schedule. There'd be less worry about 
bit rot.

- Data collection and sharing. Another cool idea would be to allow people to 
easily send us data. If we get to a place where people are building custom 
dynamic phrase tables, a cool ability would be to make it easy for people to 
upload the data they have added to their private systems, which we could then 
collect and further distribute. So Joshua could become an easy means for people 
to crowdsource data used for translation systems. This is obviously just a 
high-level idea that would require a lot of details to be figured out, but it 
would be super cool.
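
The fuzzy lookup that a translation memory performs can be sketched roughly as below. This is not how Trados or Joshua does it: the function name, the dictionary-based memory, and the 0.7 threshold are all assumptions for illustration, and the similarity score is simply the token-level ratio from Python's standard difflib module.

```python
import difflib


def best_fuzzy_match(query, memory, threshold=0.7):
    """Return (score, source, translation) for the closest stored source
    sentence, or None if nothing reaches the threshold.

    memory maps previously translated source sentences to their translations.
    """
    best = None
    query_tokens = query.split()
    for source, translation in memory.items():
        # Compare token sequences rather than characters.
        score = difflib.SequenceMatcher(None, query_tokens, source.split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, translation)
    return best


memory = {
    "the cat sat on the mat": "el gato se sentó en la alfombra",
    "good morning": "buenos días",
}
print(best_fuzzy_match("the cat sat on a mat", memory))
```

A real system would index the memory instead of scanning it linearly, which is presumably where an IR library would come in.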

matt

Re: Signing off a Joshua Release

2016-11-29 Thread Matt Post

Same here, thanks, Tom.


> On Nov 27, 2016, at 3:38 AM, kellen sunderland  
> wrote:
> 
> Definitely guilty of this.  I'll check the release checklist in the
> future.  Thanks for the reminder Tom.
> 
> On Nov 26, 2016 1:27 PM, "Tom Barber"  wrote:
> 
> Hello folks,
> 
> I see plenty of +1's going through the release vote,  which is great to see
> people taking an active role in getting the release shipped.
> 
> For those of you who are new to the ASF there are a bunch of requirements
> to sign off for a release which you can find here:
> 
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> 
> My current concern is that people who are new to the incubator are +1'ing
> software for release without checking all or part of the release cycle. Whilst
> not mandatory, when you +1 a release please can you try to indicate what
> you've checked. The reason for this is, the tag Lewis has built off isn't
> the tip of master, so if you're basing your +1 on your day to day
> development and knowledge of the code base, that's not always what's
> shipped. Also in the branching process, it's possible merges or alterations
> were accidentally made that Lewis has missed (this is very unlikely I know
> but you know, code changes). Also people build software on different OS's,
> versions of OS's etc so just because it builds on Lewis's laptop doesn't
> mean it builds on mine, for example.
> 
> Also regarding licenses, disclaimers etc, people notice different things or
> interpret stuff differently. It's always possible that someone might miss a
> library etc, so it's important multiple eyes run over the same stuff.
> 
> Cheers,
> 
> Tom
> 
> --
> Tom Barber
> CTO Spicule LTD
> t...@spicule.co.uk
> 
> http://spicule.co.uk
> 
> GB: +44(0)5603641316
> US: +18448141689

modernmt

2016-12-01 Thread Matt Post

Just came across this, and it's really cool:

https://github.com/ModernMT/MMT

See the README for some great use cases. I'm surprised I'd never heard of this 
before as it's EU funded and associated with U Edinburgh.

Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

2016-12-01 Thread Matt Post

It wouldn't be hard to add some TMX-like features, no. There are some technical 
challenges, though — for example, the current demo lets you add phrases, but 
that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run 
John's fast_align implementation (with a saved model) to break down that new 
sentence, and do proper incremental updating.

How do you imagine Lucene fitting into this? 

matt


> On Dec 1, 2016, at 9:22 AM, Tommaso Teofili  wrote:
> 
> Matt,
> 
> really nice list of very useful features, thanks for this!
> One comment only on the translation memories one: as seen by one that had
> never heard about it, it sounds not too complicated to implement on top of
> current Joshua (with IR library like Apache Lucene), is my understanding
> correct ?
> 
> My 2 cents,
> Tommaso
> 
> 
> On Tue, Nov 29, 2016 at 04:08, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> One project I think could be interesting for Joshua's future is sketched
>> here.
>> 
>> - Dynamic phrase tables. Joshua currently lets people add custom phrases
>> to the existing models that then get used. There is a research topic here
>> for how to make it better (particularly, how to set the weights of rules
>> that are added at runtime instead of learned from bitext), but it works
>> really well for adding words that are OOV (since it's always cheaper to use
>> the OOV). Here's a demo of how this works (this feature is included in the
>> language packs).
>> 
>> 
>> https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables
>> 
>> - Translation memories. There is a large commercial market (billions) for
>> tools called "translation memories", where translators are translating
>> documents, and the sentences get queried against their past translations
>> and matched in a fuzzy fashion. The big tool on the market for this is SDL
>> Trados
>> <http://www.sdl.com/solution/language/translation-productivity/trados-studio/>.
>> I'm not talking about selling a product, but in a space that big, there
>> have got to be a lot of people who'd rather just run their own system, than
>> shell out for an expensive (and ugly) tool. So there is a big niche for an
>> open source tool, and currently nothing really filling it. The "dynamic
>> phrase table" feature above provides the beginnings of offering a TM
>> competitor, but one that is "seeded" with a regular statistical machine
>> translation model.
>> 
>> - Dynamic re-tuning. One thing that'd be *really* cool is to revamp the
>> tuning infrastructure in Joshua. The use-case I imagine is that Joshua
>> could sit on top of a large tuning set across diverse domains (e.g, formal
>> news, informal web logs, spoken dialogue, etc). You could then add new
>> phrases in sentences as above, which would get automatically aligned, and
>> then everything could be retuned at the user's request (or perhaps at
>> night). This way, when people added new data to their models, Joshua would
>> automatically find the best weights, either immediately or on some
>> schedule. There'd be less worry about bit rot.
>> 
>> - Data collection and sharing. Another cool idea would be to allow people
>> to easily send us data. If we get to a place where people are building
>> custom dynamic phrase tables, a cool ability would be to make it easy for
>> people to upload the data they have added to their private systems, which
>> we could then collect and further distribute. So Joshua could become an
>> easy means for people to crowdsource data used for translation systems.
>> This is obviously just a high-level idea that would require a lot of
>> details to be figured out, but it would be super cool.
>> 
>> matt

Re: modernmt

2016-12-01 Thread Matt Post

John,

Thanks for sharing, this is really helpful. I didn't realize that Marcello was 
involved.

I think we can identify with the NMT danger. I still think there is a big niche 
that deep learning approaches won't reach for a few years, until GPUs become 
super prevalent. Which is why I like ModernMT's approaches, which overlap with 
many of the things I've been thinking about. One thing I really like is their 
automatic context-switching approach. This is a great way to build 
general-purpose models, and I'd like to mimic it. I have some general ideas 
about how this should be implemented but am also looking into the literature 
here.

matt


> On Dec 1, 2016, at 1:46 PM, John Hewitt  wrote:
> 
> I had a few good conversations over dinner with this team at AMTA in Austin
> in October.
> They seem to be in the interesting position where their work is good, but
> is in danger of being superseded by neural MT as they come out of the gate.
> Clearly, it has benefits over NMT, and is easier to adopt, but may not be
> the winner over the long run.
> 
> Here's the link
> <https://amtaweb.org/wp-content/uploads/2016/11/MMT_Tutorial_FedericoTrombetti_wide-cover.pdf>
> to their AMTA tutorial.
> 
> -John
> 
> On Thu, Dec 1, 2016 at 10:17 AM, Mattmann, Chris A (3010) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
>> Wow seems like this kind of overlaps with BigTranslate as well.. thanks
>> for passing
>> along Matt
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, Open Source Projects Formulation and Development Office (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>> 
>> 
>> On 12/1/16, 4:47 AM, "Matt Post"  wrote:
>> 
>>Just came across this, and it's really cool:
>> 
>>https://github.com/ModernMT/MMT
>> 
>>See the README for some great use cases. I'm surprised I'd never heard
>> of this before as it's EU funded and associated with U Edinburgh.
>> 
>>

Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-01 Thread Matt Post

Hi folks,

What's the status of this? Can we check off items from the list below that have 
been completed?

matt


> On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> We have a number of issues to fix which were picked up over on general@. In
> particular, we received excellent feedback from my good friend Justin [12]
> [13]. As the general@ VOTE has not had 72 hours to stew I am not going to
> close it, however we should take this time to fix the issues with master
> before we spin an RC#3. These can be summarized as follows.
> I've opened a Jira issue to track all of this.
> https://issues.apache.org/jira/browse/JOSHUA-324
> Let's track the progress on the Jira ticket.
> 
> ==
> - You're missing "incubating" in the release artifact's name. [1]
> - There are a number of binary files in the source release that look to be
> compiled source code.
> 
> I checked:
> - name doesn’t include incubating
> - signatures and hashes correct
> - DISCLAIMER exists
> - LICENSE is missing a few things (see below)
> - a source file is missing an Apache header [7]
> - Several unexpected binary files are contained in the source release
> [8][9][10][11]
> - Can compile from source
> 
> License is missing:
> - MIT licensed normalize.css v3.0.3 bundled in [5]
> - glyph icon fonts [6]
> 
> Not an issue but it's a little odd to have LICENSE and NOTICE.txt - usually
> both are bare or both have .txt extension.
> 
> Also while looking at your site I noticed that the download links of your
> incubating site [2] point to GitHub; please change them to point to the official
> release area.
> Also the 6.1 release has already been tagged and is available for public
> download on GitHub [4] before this vote is finished. This is IMO against
> Apache release policy [3]; please remove it.
> 
> I also notice you recently released the language packs (18th Nov) but there
> doesn’t seem to have been a vote for that? Any reason for this?
> ===
> 
> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
> [2]
> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
> [3] http://www.apache.org/dev/release.html#what
> [4] https://github.com/apache/incubator-joshua/releases
> [5] ./demo/bootstrap/css/bootstrap.min.css
> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
> [8] ./bin/GIZA++
> [9] ./bin/mkcls
> [10] ./bin/snt2cooc.out
> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
> [12]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
> [13]
> http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
> 
> 
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
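
Two of the checks in the list quoted above (artifact naming and hashes) are easy to automate. The following is only a sketch: the function names are hypothetical, SHA-512 is an assumed digest type rather than something stated in this thread, and signature verification, LICENSE review, and the other items still need to be done by hand.

```python
import hashlib
import os


def sha512_of(path, chunk_size=1 << 20):
    """Stream the file so large release archives need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifact(path, expected_sha512):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if "incubating" not in os.path.basename(path):
        problems.append("artifact name does not include 'incubating'")
    if sha512_of(path) != expected_sha512.strip().lower():
        problems.append("SHA-512 digest does not match")
    return problems
```

Reporting the returned list alongside a +1 would also cover Tom's request to say what was actually checked.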

apache parent POM?

2016-12-09 Thread Matt Post


Hi,

I notice that our POM depends on version 10 of the parent 
org.apache:apache POM:



  
  <parent>
    <groupId>org.apache</groupId>
    <artifactId>apache</artifactId>
    <version>10</version>
  </parent>
  ...

That POM is about five years old. The current version however is 18. Is 
there a reason to use 10? I can build fine with 18.


Matt

Re: Issues to Fix with Apache Joshua 6.1 RC#2

2016-12-12 Thread Matt Post

Lewis, do you have time to pick this up again? It'd be great to get this out 
before Christmas.

Or is there something you need from me?

matt


> On Dec 2, 2016, at 5:09 AM, kellen sunderland  
> wrote:
> 
> [7] has been fixed.
> 
> Tom's comments lead me to think that [8][9][10] can be removed from the
> release.
> 
> I'm not totally clear on what we need to do to resolve the licensing issues
> [5] and [6].  Do we simply need to give attribution to these projects in
> our LICENSE.txt file?
> 
> 
> 
> On Thu, Dec 1, 2016 at 10:44 PM, Matt Post  wrote:
> 
>> Hi folks,
>> 
>> What's the status of this? Can we check off items from the list below that
>> have been completed?
>> 
>> matt
>> 
>> 
>>> On Nov 29, 2016, at 4:24 PM, lewis john mcgibbney 
>> wrote:
>>> 
>>> Hi Folks,
>>> We have a number of issues to fix which were picked up over on general@.
>> In
>>> particular, we received excellent feedback from my good friend Justin
>> [12]
>>> [13]. As the general@ VOTE has not had 72 hours to stew I am not going
>> to
>>> close it, however we should take this time to fix the issues with master
>>> before we spin an RC#3. These can be summarized as follows.
>>> I've opened a Jira issue to track all of this.
>>> https://issues.apache.org/jira/browse/JOSHUA-324
>>> Lets track the progress on the Jira ticket.
>>> 
>>> ==
>>> - Your missing incubating in the release artifacts name. [1]
>>> - There are a number of binary files in the source release that look to
>> be
>>> compiled source code.
>>> 
>>> I checked:
>>> - name doesn’t include incubating
>>> - signatures and hashes correct
>>> - DISCLAIMER exists
>>> - LICENSE is missing a few things (see below)
>>> - a source file is missing an Apache header [7]
>>> - Several unexpected binary files are contained in the source release
>>> [8][9][10][11]
>>> - Can compile from source
>>> 
>>> License is missing:
>>> - MIT licensed normalize.css v3.0.3 bundled in [5]
>>> - glyph icon fonts [6]
>>> 
>>> Not an issue but it's a little odd to have LICENSE and NOTICE.txt -
>> usually
>>> both are bare or both have .txt extension.
>>> 
>>> Also while looking at your site I noticed that the download links of you
>>> incubating site [2] points to github, please change to point to the
>> offical
>>> release area.
>>> Also the 6.1 release has already been tagged and it available for public
>>> download on github [4]  before this vote is finished. This is IMO against
>>> Apache release policy [3] please remove.
>>> 
>>> I also notice you recently released the language packs (18th Nov) but
>> there
>>> doesn’t seem to have been a vote for that? Any reason for this?
>>> ===
>>> 
>>> [1] http://incubator.apache.org/incubation/Incubation_Policy.html#Releases
>>> [2]
>>> https://cwiki.apache.org/confluence/display/JOSHUA/Apache+Joshua+%28Incubating%29+Home
>>> [3] http://www.apache.org/dev/release.html#what
>>> [4] https://github.com/apache/incubator-joshua/releases
>>> [5] ./demo/bootstrap/css/bootstrap.min.css
>>> [6] apache-joshua-6.1/demo/bootstrap/fonts/*
>>> [7] ./src/test/java/org/apache/joshua/decoder/ff/tm/OwnerMapTest.java
>>> [8] ./bin/GIZA++
>>> [9] ./bin/mkcls
>>> [10] ./bin/snt2cooc.out
>>> [11] ./src/test/resources/berkeley_lm/lm.berkeleylm.gz
>>> [12]
>>> http://www.mail-archive.com/general%40incubator.apache.org/msg57543.html
>>> [13]
>>> http://www.mail-archive.com/general%40incubator.apache.org/msg57551.html
>>> 
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>> 
>>

Re: Apache Joshua Project

2016-12-13 Thread Matt Post


> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak  wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html
See this document on running the pipeline:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
> 3) Are there ways of increasing translation quality without changing 
> (extending) language model?  
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html
Yes, but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
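
Pivoting just means chaining two systems through a shared intermediate language. A toy sketch follows; the dictionary "models" are stand-ins invented for illustration and have nothing to do with how Joshua is actually invoked.

```python
def make_toy_model(lexicon):
    """Stand-in for a real translation system: word-by-word dictionary lookup."""
    def translate(sentence):
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate


def pivot_translate(sentence, source_to_pivot, pivot_to_target):
    """Translate via an intermediate language by chaining two models."""
    return pivot_to_target(source_to_pivot(sentence))


german_to_english = make_toy_model({"guten": "good", "morgen": "morning"})
english_to_french = make_toy_model({"good": "bon", "morning": "matin"})
print(pivot_translate("guten morgen", german_to_english, english_to_french))
```

With real models, each direction is a separately trained system, and errors from the first stage carry into the second, which is why direct parallel data is preferable when it exists.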
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>> 
>> Hi Matt,
>> 
>> You were right about increasing memory. Spanish works fine now but needs about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously? Right now I use one instance for each pair and it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM and 60 
>> running instances. But maybe it's possible to process all language 
>> translation with one instance and use, for example, 32 GB?
>> 
>> Also I found that every language pair archive has 2 language models 
>> (Berkeley and KenLM). Do I need both at once? Or does Joshua select one of 
>> them depending on some parameters?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post <p...@cs.jhu.edu>:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check the Czech-English language pack? It has a broken link. 
>>> The Spanish-English pair does not work; it throws exceptions.
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 <alru...@gmail.com>:
>>> Hi Matt, how much time (and what total price) would it take to record a video on 
>>> how to turn the German-to-English pair into an English-to-German pair?
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak <alru...@gmail.com> wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> 
>>>>> 
>>>>> We need to prepare all infrastructure now so you can make changes in 
>>>>> future. Preparation will take time. Right now I have several questions 
>>>>>

Re: Apache Joshua Project

2016-12-13 Thread Matt Post


> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak  <mailto:alru...@gmail.com>> wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html 
> <http://incubator.apache.org/projects/joshua.html>
This document on running the pipeline:


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630 
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630>3) 
Are there ways of increasing translation quality without changing (extending) 
language model?  
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html 
> <http://joshua.incubator.apache.org/6.0/faq.html>
Yes but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post  <mailto:p...@cs.jhu.edu>>:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak > <mailto:alru...@gmail.com>> wrote:
>> 
>> Hi Matt,
>> 
>> You was right about increasing memory. Spanish works fine now but need about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously ? Right now I use one instance for each pair at it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM memory and 60 
>> running instances. But may be it's possible to process all language 
>> translation with one instance and use for example 32 GB ?
>> 
>> Also I found that every language pair archive has 2 language models ( 
>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one of 
>> them depending on some parameters ?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check Czech-English language pack, it has broken link. 
>>> Spanish-English pair not works, throws exceptions
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 Aliaksei Rudak:
>>> Hi Matt, what time (total price ) will be to record video of how to make 
>>> translation vice-versa (from german to english)  to english to german pair
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> 
>>>>> 
>>>>> We need to prepare all infrastructure now so you can make changes in 

Re: Apache Joshua Project

2016-12-14 Thread Matt Post

1. The LM cannot be used with Moses. We have the BerkeleyLM format; you need 
KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user 
to compile C++ code, which can be tricky. 

2/3. Please see the README in each language pack. You need to pass input text 
through "prepare.sh", which does tokenization. 

matt (from my phone)

> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
> 
> Hi Matt, 
> Thanks for answers.
> 
> 1) Can language models inside Joshua language packs work with Moses MT ? If 
> yes - can you give me the link how to run them on it ? 
> 
> 2) I installed several instances (German, Spanish, Russian) and all of them 
> have the same strange issue. Trying to translate one sentence. 
> 
> For example from Spanish to English
> "Además podrás encontrar las audiciones de los textos con distintos acentos 
> del español. "
> 
> Translates as
> "Also auditions, you'll find texts with different accents of español"
> 
> It means that one word in sentence (español) is not translated correct. But 
> it's ok if you translating single word ( español )
> 
> Same for other languages (German, Russian). All words (except one or 
> sometimes 2 words) are not translated. Do you know how to fix this ?
> 
> 3) How to translate sentences with punctuation marks (comma, exclamation, 
> question marks etc) ?
> 
> Translating from Spanish to English gives error
> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
> pregunta."
> 
> If you try to translate words separated with commas it not translates these 
> words
> "inglés, francés, alemán y portugués"
> 
> output
> "Inglés, francés, german and portuguese"
> 
> Regards,
> Alexei
> 
> 
> 
> 
> 
> 2016-12-13 17:44 GMT+03:00 Matt Post :
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak  wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>>> 
>> 
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?  
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incubator.apache.org/6.0/faq.html
>>> 
>> 
>> Yes but it's complicated. The best way is to add data, but there are lots of 
>> other models and parameter variations that could be tried.
>> 
>>> 4) How can I reduce the amount of memory each language pair instance use 
>>> without losing process speed and quality?
>>> 
>> If you can find German–French parallel data, use that. Otherwise, pivot 
>> through another language.
>>> 5) To make translation from German to French do I need to make translation 
>>> via English conversion ? (like German to English first and then English to 
>>> French) 
>>> 
>>> I mean for the case without German-French parallel data.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-12 17:58 GMT+03:00 Matt Post :
>>>> No, each has to be run separately. But not all are equally good, so I 
>>>> suggest starting with a few and building up.
>>>> 
>>>> If you get KenLM working in place of BerkeleyLM, the language models will 
>>>> be shared between them if they are on the same machine. I will post 
>>>> instructions soon.
>>>> 
>>>> Yes, each one has two language models that are interpolated.
>>>> 
>>>> 
>>>> 
>>>>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak  wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>

Re: Apache Joshua Project

2016-12-14 Thread Matt Post

1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
builds KenLM and the KenLM library. If you can do that, that is all you need to 
do.

2. http://opus.lingfil.uu.se is a great place to get parallel data; it's where 
we got all the data we use.

3. Joshua has a Java API (undocumented) but not a C++ one.


> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak  wrote:
> 
> 1) Can you estimate approximate date of releasing language packs with kenlm 
> model ? I have a teammate who know c++ well so If we have more information 
> (or tutorial) of how to do that by ourselves we can share the result with 
> others. So it will be benefit for all.
> 
> 2) Where can I get or buy parallel corpora for other languages ? Where did 
> you get data for current huge language packs? I found several sources but 
> they so small in size.
> 
> 3) Is there any document of how to create offline translation system based on 
> Joshua and make it as c++ library for example ?
> 
> 
> 
> 
> 2016-12-14 14:33 GMT+03:00 Matt Post:
> 1. The LM cannot be used with Moses. We have the BerkeleyLM format; you need 
> KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user 
> to compile C++ code, which can be tricky. 
> 
> 2/3. Please see the README in each language pack. You need to pass input text 
> through "prepare.sh", which does tokenization. 
> 
> matt (from my phone)
> 
> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
> 
>> Hi Matt, 
>> Thanks for answers.
>> 
>> 1) Can language models inside Joshua language packs work with Moses MT ? If 
>> yes - can you give me the link how to run them on it ? 
>> 
>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>> have the same strange issue. Trying to translate one sentence. 
>> 
>> For example from Spanish to English
>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>> del español. "
>> 
>> Translates as
>> "Also auditions, you'll find texts with different accents of español"
>> 
>> It means that one word in sentence (español) is not translated correct. But 
>> it's ok if you translating single word ( español )
>> 
>> Same for other languages (German, Russian). All words (except one or 
>> sometimes 2 words) are not translated. Do you know how to fix this ?
>> 
>> 3) How to translate sentences with punctuation marks (comma, exclamation, 
>> question marks etc) ?
>> 
>> Translating from Spanish to English gives error
>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>> pregunta."
>> 
>> If you try to translate words separated with commas it not translates these 
>> words
>> "inglés, francés, alemán y portugués"
>> 
>> output
>> "Inglés, francés, german and portuguese"
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 
>> 2016-12-13 17:44 GMT+03:00 Matt Post:
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?  
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incubator.apache.org/6.0/faq.html
>> Yes but i

Re: Apache Joshua Project

2016-12-16 Thread Matt Post

There is not enough information for me to answer your question. I don't see any 
problems.

$ echo "i'll give you 10% of the asking price" | ./prepare.sh | ./joshua
I'll give you 10 % of the asking price
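
For calling this from a program rather than a shell, here is a minimal Python sketch of the same pipe. It assumes the working directory is an unpacked language pack containing ./prepare.sh and ./joshua; the function name `translate` and its `commands` parameter are illustrative and not part of Joshua.

```python
import subprocess


def translate(text, commands=("./prepare.sh", "./joshua")):
    """Pipe text through each command in turn, like `echo text | cmd1 | cmd2`.

    The default commands are assumed to exist in the current directory
    (an unpacked language pack); pass other commands to test the plumbing.
    """
    data = text if text.endswith("\n") else text + "\n"
    for command in commands:
        # Each stage reads the previous stage's output on stdin.
        result = subprocess.run(
            command,
            shell=True,
            input=data,
            stdout=subprocess.PIPE,
            encoding="utf-8",
            check=True,  # raise if a stage exits with a non-zero status
        )
        data = result.stdout
    return data.rstrip("\n")
```

Every call starts new processes for both stages, so this is only meant for quick checks of individual sentences.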


> On Dec 16, 2016, at 3:22 AM, Aliaksei Rudak  wrote:
> 
> Also there is a problem with parsing (%) sign in sentences. Do you know how 
> to solve this ?
> 
> 2016-12-15 10:57 GMT+03:00 Aliaksei Rudak:
> Hi Matt,
> 
> English-Russian language pack has broken link
> https://cwiki.apache.org/confluence/home.apache.org/~lewismc/language-pack-en-ru-2016-10-28.tar.gz
> 
> When do you plan to create and upload other languages ?
> 
> Regards,
> Alexei
> 
> 2016-12-14 21:50 GMT+03:00 Matt Post:
> 1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
> builds KenLM and the KenLM library. If you can do that, that is all you need 
> to do.
> 
> 2. http://opus.lingfil.uu.se is a great place to 
> get parallel data; it's where we got all the data we use.
> 
> 3. Joshua has a Java API (undocumented) but not a C++ one.
> 
> 
>> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak wrote:
>> 
>> 1) Can you estimate approximate date of releasing language packs with kenlm 
>> model ? I have a teammate who know c++ well so If we have more information 
>> (or tutorial) of how to do that by ourselves we can share the result with 
>> others. So it will be benefit for all.
>> 
>> 2) Where can I get or buy parallel corpora for other languages ? Where did 
>> you get data for current huge language packs? I found several sources but 
>> they so small in size.
>> 
>> 3) Is there any document of how to create offline translation system based 
>> on Joshua and make it as c++ library for example ?
>> 
>> 
>> 
>> 
>> 2016-12-14 14:33 GMT+03:00 Matt Post:
>> 1. The LM cannot be used with Moses. We have the BerkeleyLM format; you need 
>> KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user 
>> to compile C++ code, which can be tricky. 
>> 
>> 2/3. Please see the README in each language pack. You need to pass input 
>> text through "prepare.sh", which does tokenization. 
>> 
>> matt (from my phone)
>> 
>> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
>> 
>>> Hi Matt, 
>>> Thanks for answers.
>>> 
>>> 1) Can language models inside Joshua language packs work with Moses MT ? If 
>>> yes - can you give me the link how to run them on it ? 
>>> 
>>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>>> have the same strange issue. Trying to translate one sentence. 
>>> 
>>> For example from Spanish to English
>>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>>> del español. "
>>> 
>>> Translates as
>>> "Also auditions, you'll find texts with different accents of español"
>>> 
>>> It means that one word in sentence (español) is not translated correct. But 
>>> it's ok if you translating single word ( español )
>>> 
>>> Same for other languages (German, Russian). All words (except one or 
>>> sometimes 2 words) are not translated. Do you know how to fix this ?
>>> 
>>> 3) How to translate sentences with punctuation marks (comma, exclamation, 
>>> question marks etc) ?
>>> 
>>> Translating from Spanish to English gives error
>>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>>> pregunta."
>>> 
>>> If you try to translate words separated with commas it not translates these 
>>> words
>>> "inglés, francés, alemán y portugués"
>>> 
>>> output
>>> "Inglés, francés, german and portuguese"
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-13 17:44 GMT+03:00 Matt Post:
>>> 
>>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
>>>> 
>>>> 1) If English-German pair will be recompiled t

Re: Problem on web page

2016-12-20 Thread Matt Post

Thanks! Fixed.


> On Dec 19, 2016, at 9:35 AM, Fabrizio Gotti  wrote:
> 
> Hi,
> 
> On the page
> 
> https://cwiki.apache.org/confluence/display/JOSHUA/Notes+on+Language+Pack+Creation
> 
> the link "Here are a number of things that may be useful for you to know in 
> using them. 
> "
>  points to a corpus, not to the page expected.
> 
> Best,
> 
> Fabrizio Gotti
> Université de Montréal

joshua release

2016-12-20 Thread Matt Post

Lewis — any chance you can pick this back up? I think we've covered all of the 
issues?

Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are you 
just throwing out an idea or are you interested in doing this? I think the way 
to go would be to set this up on a branch (off 7), and then I could test it on 
some languages.


> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili  
> wrote:
> 
> Hi all,
> 
> I was talking to Joern (Apache OpenNLP committer) recently and it came up
> the idea that we could use OpenNLP for the data preprocessing phase in
> Joshua as to allow tokenization, sentence detection, etc.
> As I was reading through our doc [1] this is currently done with dedicated
> scripts; we could make that part pluggable (with a default simple Java
> implementation) and allow more fine grained control over it using libraries
> like OpenNLP:
> 
> What would people think?
> 
> Regards,
> Tommaso
> 
> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas

Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

> On Dec 21, 2016, at 10:36 AM, Joern Kottmann  wrote:
> 
> I am happy to support a bit with this, we can also see if things in OpenNLP
> need to be changed to make this work smoothly.

Great!

> One challenge is to train OpenNLP on all the languages you support. Do you
> have training data that could be used to train the tokenizer and sentence
> detector?

For the sentence-splitter, I imagine you could make use of the source side of 
our parallel corpus, which has thousands to millions of sentences, one per line.

For tokenization (and normalization), we don't typically train models but 
instead use a set of manually developed heuristics, which may or may not be 
sentence-specific. See

https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl

How much training data do you generally need for each task?

> 
> Jörn
>

Re: Pluggable preprocessing and OpenNLP

2016-12-21 Thread Matt Post

7 → master is indeed the plan, as soon as we ship 6.1.

matt


> On Dec 21, 2016, at 1:25 PM, Tommaso Teofili  
> wrote:
> 
> On Wed, Dec 21, 2016 at 16:00, Matt Post wrote:
> 
>> Sure, that'd be nice to do. I'd love to get rid of the Perl scripts. Are
>> you just throwing out an idea or are you interested in doing this?
> 
> 
> I'd be happy to do it. If Joern can help out that'd be of course very
> appreciated.
> 
> 
>> I think the way to go would be to set this up on a branch (off 7), and
>> then I could test it on some languages.
>> 
> 
> sure, and hopefully branch 7 becomes our new master soon after the 6.1
> release.
> 
> Regards,
> Tommaso
> 
> 
>> 
>> 
>>> On Dec 21, 2016, at 5:33 AM, Tommaso Teofili 
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I was talking to Joern (Apache OpenNLP committer) recently and it came up
>>> the idea that we could use OpenNLP for the data preprocessing phase in
>>> Joshua as to allow tokenization, sentence detection, etc.
>>> As I was reading through our doc [1] this is currently done with
>> dedicated
>>> scripts; we could make that part pluggable (with a default simple Java
>>> implementation) and allow more fine grained control over it using
>> libraries
>>> like OpenNLP:
>>> 
>>> What would people think?
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Project+Ideas
>> 
>>

Re: Any symal experts?

2017-01-03 Thread Matt Post

John — Any updates on here?


> On Nov 23, 2016, at 12:28 PM, Matt Post  wrote:
> 
> I think it will be much less of a headache. The GIZA++ code is notorious for 
> being unreadable, and the Perl piece of that pipeline only hurts (even though 
> Philipp's Perl is unusually clear). I think adding atools to your port is the 
> way to go, and that it's written in C++ should facilitate that.
> 
> 
> 
> 
>> On Nov 23, 2016, at 12:25 PM, John Hewitt  wrote:
>> 
>> It'll be a headache because it also has no documentation, but to be fair it
>> may be less of a headache / a better long-term solution than trying to move
>> forward with this hackier solution.
>> 
>> I'll keep the symal use on the backburner and start putting together an
>> atools port.
>> 
>> -John
>> 
>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:
>> 
>>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>>> indeed replaced them with "atools"; how much work would it be to port that?
>>> 
>>> 
>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt 
>>> wrote:
>>>> 
>>>> Hey everyone,
>>>> 
>>>> I'm packaging up a Java port Fast Align for Joshua and integrating it
>>> into
>>>> the pipeline.
>>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>>> that I haven't ported to Java.
>>>> We package symal (which symmetricizes alignments) with Joshua right now
>>> for
>>>> GIZA++, so I'm attempting to re-use that.
>>>> However, symal uses the .bal format, which it fails to describe.
>>>> It gets away with this because files from GIZA++ are piped through
>>>> giza2bal.pl, which itself is not well documented.
>>>> I'm attempting to write, say, fastalign2bal.py.
>>>> With a bit of tinkering, I got at the .bal format:
>>>> 
>>>> 1
>>>> 
>>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>>> 
>>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>>> 
>>>> A template for which would be
>>>> 
>>>> 1
>>>> 
>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>>> alignment2 ... alignmentN]
>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>>> alignment2 ... alignmentN]
>>>> 
>>>> 
>>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>>> some fastalign2bal.py output.
>>>> A few hours with gdb made some progress (for as far as I can tell, the
>>>> formats are identical) but if anyone has experience with symal, I would
>>>> greatly appreciate some consultation.
>>>> 
>>>> -John
>>> 
>>> 
>
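
The .bal template quoted above can be written down as a small formatter. This is a minimal Python sketch based only on the example in the thread: the function name `format_bal_record` is hypothetical, the two spaces before "#" are copied from the quoted example, and both the absence of blank lines inside a record and the use of 0 for an unaligned token are assumptions that the thread does not confirm.

```python
def format_bal_record(tgt_tokens, tgt_align, src_tokens, src_align):
    """Render one sentence pair in the .bal layout described in the thread.

    tgt_align[i] is the 1-based position in src_tokens that tgt_tokens[i]
    is aligned to; src_align mirrors this for the source side.
    Assumption: 0 would mark an unaligned token.
    """
    if len(tgt_tokens) != len(tgt_align) or len(src_tokens) != len(src_align):
        raise ValueError("each token needs exactly one alignment value")

    def side(tokens, align):
        # "<count> <tokens>  # <alignments>", spacing as in the quoted example
        return "{} {}  # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align)
        )

    lines = ["1", side(tgt_tokens, tgt_align), side(src_tokens, src_align)]
    return "\n".join(lines) + "\n"
```

Feeding it the token lists and alignment values from the example in the thread reproduces the two data lines shown there, which makes it usable as a reference when comparing against giza2bal.pl output.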

Re: Any symal experts?

2017-01-10 Thread Matt Post

No worries, just curious. Thanks for the update.


> On Jan 9, 2017, at 10:35 PM, John Hewitt  wrote:
> 
> I have to admit, no. Projects in graduate courses got the best of my time
> at the end of last semester, and I took the winter break to stay away from
> work and recover a bit.
> 
> Back from break now; will give an update soon.
> 
> -John
> 
> On Tue, Jan 3, 2017 at 12:03 PM, Matt Post  wrote:
> 
>> John — Any updates on here?
>> 
>> 
>>> On Nov 23, 2016, at 12:28 PM, Matt Post  wrote:
>>> 
>>> I think it will be much less of a headache. The GIZA++ code is notorious
>> for being unreadable, and the Perl piece of that pipeline only hurts (even
>> though Philipp's Perl is unusually clear). I think adding atools to your
>> port is the way to go, and that it's written in C++ should facilitate that.
>>> 
>>> 
>>> 
>>> 
>>>> On Nov 23, 2016, at 12:25 PM, John Hewitt 
>> wrote:
>>>> 
>>>> It'll be a headache because it also has no documentation, but to be
>> fair it
>>>> may be less of a headache / a better long-term solution than trying to
>> move
>>>> forward with this hackier solution.
>>>> 
>>>> I'll keep the symal use on the backburner and start putting together an
>>>> atools port.
>>>> 
>>>> -John
>>>> 
>>>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:
>>>> 
>>>>> John — I suggest trying to ditch those GIZA++ tools entirely.
>> fast_align
>>>>> indeed replaced them with "atools"; how much work would it be to port
>> that?
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt 
>>>>> wrote:
>>>>>> 
>>>>>> Hey everyone,
>>>>>> 
>>>>>> I'm packaging up a Java port Fast Align for Joshua and integrating it
>>>>> into
>>>>>> the pipeline.
>>>>>> Fast Align does not produce symmetrical alignments -- it relies on a
>> tool
>>>>>> that I haven't ported to Java.
>>>>>> We package symal (which symmetricizes alignments) with Joshua right
>> now
>>>>> for
>>>>>> GIZA++, so I'm attempting to re-use that.
>>>>>> However, symal uses the .bal format, which it fails to describe.
>>>>>> It gets away with this because files from GIZA++ are piped through
>>>>>> giza2bal.pl, which itself is not well documented.
>>>>>> I'm attempting to write, say, fastalign2bal.py.
>>>>>> With a bit of tinkering, I got at the .bal format:
>>>>>> 
>>>>>> 1
>>>>>> 
>>>>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>>>>> 
>>>>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>>>>> 
>>>>>> A template for which would be
>>>>>> 
>>>>>> 1
>>>>>> 
>>>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1
>>>>>> alignment2 ... alignmentN]
>>>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1
>>>>>> alignment2 ... alignmentN]
>>>>>> 
>>>>>> 
>>>>>> However, I'm hitting some pretty nasty errors with symal when I pipe
>> in
>>>>>> some fastalign2bal.py output.
>>>>>> A few hours with gdb made some progress (for as far as I can tell, the
>>>>>> formats are identical) but if anyone has experience with symal, I
>> would
>>>>>> greatly appreciate some consultation.
>>>>>> 
>>>>>> -John
>>>>> 
>>>>> 
>>> 
>> 
>>

Re: Pluggable preprocessing and OpenNLP

2017-01-13 Thread Matt Post

Hi Jörn,

[Sent again without the picture since Apache rejects those, unfortunately...]

You just need monolingual text, so I suggest downloading either the tokenized 
or untokenized versions. Unfortunately, Opus doesn't make it easy to provide 
direct links to individual languages. But do this:

1. Go to http://opus.lingfil.uu.se

2. Choose de → en (or some other language pair)

3. In the "mono" or "raw" columns (depending on whether you want tokenized or 
untokenized text), click the language file for the dataset you want.

matt


> On Jan 12, 2017, at 6:07 AM, Joern Kottmann wrote:
> 
> Do you have a pointer to an actual file? Or download package?
> 
> Jörn
> 
> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili wrote:
> 
>> I think the parallel corpuses are taken from [1], so we could start with
>> training sentdetect for language packs at [2].
>> 
>> Regards,
>> Tommaso
>> 
>> [1] : http://opus.lingfil.uu.se/
>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> On Mon, Jan 9, 2017 at 11:39, Joern Kottmann wrote:
>> 
>>> Sorry, for late reply, can you point me to a link for the parallel
>> corpus?
>>> We might just want to add formats support for it to OpenNLP.
>>> 
>>> Do you use tokenize.pl for all languages or do you have language
>> specific
>>> heuristics?
>>> It would be great to have an additional more capable rule based tokenizer
>>> in OpenNLP.
>>> 
>>> The sentence splitter can be trained on a few thousand sentences or so, I
>>> think that will work out nicely.
>>> 
>>> Jörn
>>> 
>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post wrote:
>>> 
>>>> 
>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann wrote:
>>>>> 
>>>>> I am happy to support a bit with this, we can also see if things in
>>>> OpenNLP
>>>>> need to be changed to make this work smoothly.
>>>> 
>>>> Great!
>>>> 
>>>> 
>>>>> One challenge is to train OpenNLP on all the languages you support.
>> Do
>>>> you
>>>>> have training data that could be used to train the tokenizer and
>>> sentence
>>>>> detector?
>>>> 
>>>> For the sentence-splitter, I imagine you could make use of the source
>>> side
>>>> of our parallel corpus, which has thousands to millions of sentences,
>> one
>>>> per line.
>>>> 
>>>> For tokenization (and normalization), we don't typically train models
>> but
>>>> instead use a set of manually developed heuristics, which may or may
>> not
>>> be
>>>> sentence-specific. See
>>>> 
>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>> 
>>>> How much training data do you generally need for each task?
>>>> 
>>>> 
>>>>> 
>>>>> Jörn
>>>>> 
>>>> 
>>>> 
>>> 
>>

Re: Rebase on Relese

2017-01-13 Thread Matt Post

Hi Lewis,

Welcome back!

I think we have checked off all the things on your list, and are ready any time 
for the release. Do you have the time to double-check, and then to head up this 
effort?

matt

> On Jan 13, 2017, at 11:59 AM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> Where are we with the release? I need to apologize for disappearing. Phone
> off and Laptop off for close to 3 weeks.
> Can someone bring me up-to-date with where we are?
> Thanks
> Lewis
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney

Re: Plugging self-hosted Joshua into mailman?

2017-01-17 Thread Matt Post

Hello,

Joshua would be suitable for this. We have models built for FR→EN and ES→EN. I 
want to improve these because certain data was left out. I could also 
build ones for the other direction.

One question — What do you mean about 3rd party services being "untrustworthy"?

matt


> On Jan 16, 2017, at 12:27 PM, Karel Novotný  wrote:
> 
> Hello developers,
> 
> I am new to this list, so missing a lot of background. Apologies
> beforehand for eventually dumb questions...
> 
> We would like to build a self-hosted machine translation system that
> could be plugged into our mailman installs. The objective is that the
> members of our multicultural network would be able to send email in
> their mother language and it would be delivered to the list
> machine-translated (and vise versa). The translation pairs we care about
> most are EN<->FR and EN<->ES
> 
> Our dream scenario is:
> 
> 1. A translator machine is installed on our server, so the messages
> don't need to be run through untrustworthy 3rd party services (googletrans)
> 2. Mailman (or similar) is connected to such a translator
> 3. Mailing list users can opt to receive messages sent to the mailing
> list in following format:
> 
> 
> Message body
> --
> Message body translated
> -
> 
> 4. Similarly, the system can be configured so that when receiving
> messages from specific senders the messages get translated from FR or ES
> into EN
> 
> Our default language used on lists is EN
> 
> Is Joshua relevant for this? Any previous experience with similar setup?
> I suppose that a lot of configuration would be needed, but at this point
> I want to know if I am not completely mistaken when considering your
> Joshua for this.
> 
> Thanks
> 
> karel
> 
> ---
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get&search=0x7FDEF502377E4FCA
> 
>
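
The "message body, separator, translated body" layout requested above could be assembled along these lines. This is a minimal Python sketch, not a Mailman plugin: `with_translation` is a hypothetical helper, and `translate` stands for whatever callable wraps the self-hosted translator; neither is an existing Mailman or Joshua API.

```python
def with_translation(body, translate, separator="--"):
    """Return the original body followed by its machine translation.

    `translate` is any callable that maps source text to translated text;
    how it reaches the translation server is left open here.
    """
    translated = translate(body)
    return "{}\n{}\n{}\n".format(
        body.rstrip("\n"), separator, translated.rstrip("\n")
    )
```

Keeping the translator behind a plain callable means the list-processing side does not need to know which language pair or server is in use.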

Re: Pluggable preprocessing and OpenNLP

2017-01-18 Thread Matt Post

Hi,

Sorry, what file format are you talking about? Can you point me to an example 
of the Moses file format? Is this just plain text, one sentence per line?

In general the Moses format is the standard, to the extent that there are any 
standards in MT (they are all mostly informal).

matt

PS. Are you on dev@joshua, or do I need to keep CC'ing you at your address?


> On Jan 16, 2017, at 5:42 PM, Joern Kottmann  wrote:
> 
> Hello,
> 
> we came to the conclusion that it would make sense to add direct
> formats support for letsmt and moses files.
> 
> Here our two issues:
> https://issues.apache.org/jira/browse/OPENNLP-938
> https://issues.apache.org/jira/browse/OPENNLP-939
> 
> Does it make sense for you if we support those formats?
> Did we miss an important format?
> 
> The training works quite fine, but it will take me a bit more time to
> get the evaluation to return something useful. The OpenNLP Sentence
> Detector can only split on end-of-sentence (eos) chars. And if there is
> a sentence without an eos chars it gets treated as a mistake by the
> evaluation.
> 
> Do you have a specific language which would be good for testing for
> you?
> 
> The tokenizer can probably trained as well, I saw a couple of tokenized
> data sets. Maybe that makes sense for you too.
> 
> Jörn
> 
> 
> 
> On Fri, 2017-01-13 at 09:48 -0500, Matt Post wrote:
>> Hi Jörn,
>> 
>> [Sent again without the picture since Apache rejects those,
>> unfortunately...]
>> 
>> You just need monolingual text, so I suggest downloading either the
>> tokenized or untokenized versions. Unfortunately, Opus doesn't make
>> it easy to provide directly links to individual languages. But do
>> this:
>> 
>> 1. Go to http://opus.lingfil.uu.se
>> 
>> 2. Choose de → en (or some other language pair)
>> 
>> 3. In the "mono" or "raw" columns (depending on whether you want
>> tokenized or untokenized text), click the language file for the
>> dataset you want.
>> 
>> matt
>> 
>> 
>>> On Jan 12, 2017, at 6:07 AM, Joern Kottmann 
>>> wrote:
>>> 
>>> Do you have a pointer to an actual file? Or download package?
>>> 
>>> Jörn
>>> 
>>> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili wrote:
>>>> I think the parallel corpuses are taken from [1], so we could
>>>> start with
>>>> training sentdetect for language packs at [2].
>>>> 
>>>> Regards,
>>>> Tommaso
>>>> 
>>>> [1] : http://opus.lingfil.uu.se/
>>>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>>> 
>>>> On Mon, Jan 9, 2017 at 11:39, Joern Kottmann wrote:
>>>> 
>>>>> Sorry for the late reply. Can you point me to a link for the parallel
>>>>> corpus? We might just want to add format support for it to OpenNLP.
>>>>> 
>>>>> Do you use tokenize.pl for all languages, or do you have
>>>>> language-specific heuristics? It would be great to have an additional,
>>>>> more capable rule-based tokenizer in OpenNLP.
>>>>> 
>>>>> The sentence splitter can be trained on a few thousand sentences or so;
>>>>> I think that will work out nicely.
>>>>> 
>>>>> Jörn
>>>>> 
>>>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post 
>>>>> wrote:
>>>>> 
>>>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann wrote:
>>>>>>> I am happy to support a bit with this, we can also see if things in
>>>>>>> OpenNLP need to be changed to make this work smoothly.
>>>>>> 
>>>>>> Great!
>>>>>> 
>>>>>> 
>>>>>>> One challenge is to train OpenNLP on all the languages you support.
>>>>>>> Do you have training data that could be used to train the tokenizer
>>>>>>> and sentence detector?
>>>>>> 
>>>>>> For the sentence-splitter, I imagine you could make use of the source
>>>>>> side of our parallel corpus, which has thousands to millions of
>>>>>> sentences, one per line.
>>>>>> 
>>>>>> For tokenization (and normalization), we don't typically train models
>>>>>> but instead use a set of manually developed heuristics, which may or
>>>>>> may not be sentence-specific. See
>>>>>> 
>>>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>>>> 
>>>>>> How much training data do you generally need for each task?
>>>>>> 
>>>>>> 
>>>>>>> Jörn
>>>>>>> 
>> 
>>

Re: mvn assembly issues

2017-01-19 Thread Matt Post

I have never seen this error before! It seems like this must have something to 
do with the build environment where this is being done? Maybe there are tar 
options to not store the userid or to set it to something?
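
For context on the number in the error quoted below: 2097151 is the largest value that fits in the seven octal digits a classic ustar header reserves for a user id. A minimal sketch using Python's standard tarfile module (not the Maven assembly plugin itself; the member name is made up for illustration) reproduces the same overflow and shows that the POSIX pax format does not have it:

```python
import tarfile

# A ustar header stores uid/gid in an 8-byte field:
# seven octal digits plus a terminator.
USTAR_MAX_ID = 8 ** 7 - 1


def header_fits(uid, fmt):
    """Return True if a tar header with this uid can be built in the given format."""
    info = tarfile.TarInfo(name="example.txt")  # hypothetical member name
    info.uid = uid
    try:
        info.tobuf(format=fmt)
    except ValueError:
        # tarfile raises ValueError when a numeric field overflows.
        return False
    return True


if __name__ == "__main__":
    big_uid = 498339010  # the uid from the quoted Maven error
    print("ustar limit:", USTAR_MAX_ID)
    print("fits in ustar:", header_fits(big_uid, tarfile.USTAR_FORMAT))
    print("fits in pax:  ", header_fits(big_uid, tarfile.PAX_FORMAT))
```

On the Maven side, I believe the corresponding assembly plugin setting is tarLongFileMode set to posix, but that should be checked against the plugin documentation.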


> On Jan 18, 2017, at 9:08 PM, David Meikle  wrote:
> 
> Hey Lewis,
> 
>> On 18 Jan 2017, at 22:02, lewis john mcgibbney  wrote:
>> 
>> Hi Folks,
>> Anyone know how to work through this issue? The code in question can be
>> found at
>> https://github.com/apache/incubator-joshua/blob/master/pom.xml#L287-L309
>> Lewis
>> 
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 16.222 s
>> [INFO] Finished at: 2017-01-18T13:59:41-08:00
>> [INFO] Final Memory: 37M/639M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>> (source-release-assembly) on project joshua-incubating: Execution
>> source-release-assembly of goal
>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user id
>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
>> 
>> -- 
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
> 
> 
> Normally switching tar to posix mode does the trick when I have had this 
> before, usually when logged into an AD domain on my Mac.  What is the full 
> log with -X saying?
> 
> Cheers,
> Dave
>

Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post

Karel — On this point, I don't think you should have to use the tutorials, 
which tell you how to identify training data and build new translation models 
yourself. I imagine that you would be more interested in downloading pre-built 
models that don't really require you to be an expert in MT. See this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

matt


> On Jan 17, 2017, at 12:07 PM, lewis john mcgibbney  wrote:
> 
> Hi Karel,
> The short answer is yes.
> I would advise you to start at the Tutorial
> https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started
> If you find anything which causes you problems then please write back here.
> Once you have skipped through the tutorial then you will have a much better
> feel for the workflow required.
> I can see the Apache Tika language identification and translate APIs being
> of particular use here when considered in a runtime context. We have a
> Joshua implementation over in Tika which can aid you in this task however
> try the Joshua tutorial first.
> Lewis
> 
> On Mon, Jan 16, 2017 at 7:41 AM, Chris Mattmann  wrote:
> 
>> Hi Karel,
>> 
>> I would recommend moving this thread to dev@joshua.incubator.apache.org
>> instead of the private list. I’ve moved private to BCC.
>> 
>> Thank you.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 1/16/17, 6:58 AM, wrote:
>> 
>>Hello,
>> 
>>We would like to build a self-hosted machine translation system that
>>could be plugged into our mailman installs. The objective is that the
>>members of our multicultural network would be able to send email in
>>their mother language and it would be delivered to the list
>>machine-translated (and vice versa).
>> 
>>Are we on the right track with Joshua? I suppose that a lot of
>>configuration would be needed, but at this point I want to know if I am
>>not completely mistaken when considering your sw for this.
>> 
>>Thanks
>> 
>>karel
>> 
>> 
>>--
>>~~~
>>Karel Novotny
>>Knowledge Sharing & Network Development Coordinator
>>APC - The Association for Progressive Communications
>>https://www.apc.org
>>GSM: +420 605 243 246 (GMT +1)
>>jabber: ka...@riseup.net
>>Working/online: Monday - Thursday
>>~~~
>>My public OpenPGP key: https://pgp.mit.edu/pks/lookup?op=get&search=0x7FDEF502377E4FCA
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney

Re: Plugging self-hosted Joshua into mailman?

2017-01-19 Thread Matt Post


> On Jan 17, 2017, at 11:55 AM, Karel Novotný  wrote:
> 
> Hello Matt,
> 
> Thanks for responding...
> 
> On 17.1.2017 17:31, Matt Post wrote:
>> Hello,
>> 
>> Joshua would be suitable for this. We have models built for FR→EN and ES→EN. 
>> I want to improve these because certain data was left out. I could also 
>> build ones for the other direction.
> That's excellent news. Can you please tell me a bit more about what you
> mean by having models for FR→EN and ES→EN ? Does this mean that the tool
> is ready to be used by other applications (e.g. mailman) to auto-translate?
> 
> Have you had any previous experience with similar implementation as I
> described?

This just means we have pre-built models (which we call "language packs") that 
you can just download and immediately use to translate from French to English 
and from Spanish to English. For the complete list of language packs, along 
with instructions for how to use them, see this page:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

You can just download any of these, unpack them, and start translating. The 
quality will vary, but for these two languages should be reasonable.

To translate, the data you send to Joshua has to have already been 
sentence-split, because Joshua expects to receive input one sentence at a time. 
Joshua provides an API that you can make use of. Do you have any kind of 
expectations about your volume requirements? How many sentences will you be 
translating per day?
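
As a rough illustration of the one-sentence-per-line input format, here is a naive Python sketch that splits text on end-of-sentence punctuation. The regular expression and function names are assumptions for illustration only; a trained sentence detector (such as the OpenNLP one discussed elsewhere on this list) would handle abbreviations and missing punctuation far better.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_EOS = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split a block of text into sentences, one string per sentence."""
    return [s.strip() for s in _EOS.split(text) if s.strip()]


def to_decoder_input(text):
    """Format text as one sentence per line, newline-terminated."""
    return "\n".join(split_sentences(text)) + "\n"


if __name__ == "__main__":
    message = "Hola a todos. ¿Cómo están? Nos vemos mañana."
    print(to_decoder_input(message), end="")
```

Each resulting line would then be sent to the decoder as a separate input sentence.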

matt


>> 
>> One question — What do you mean about 3rd party services being 
>> "untrustworthy"?
> 
> We wish to auto-translate lists with private conversations, so we can
> not run those by systems where we don't know (don't have control of)
> what happens with the data. That's all, I didn't want to accuse anyone.

Oh, that makes perfect sense. For some reason I assumed you were translating 
public mailing lists, but if you're doing private ones, it is reasonable to 
want to keep the data entirely in-house.


> thanks
> 
> karel
> 
>> 
>> matt
>> 
>> 
>>> On Jan 16, 2017, at 12:27 PM, Karel Novotný  wrote:
>>> 
>>> Hello developers,
>>> 
>>> I am new to this list, so missing a lot of background. Apologies
>>> beforehand for eventually dumb questions...
>>> 
>>> We would like to build a self-hosted machine translation system that
>>> could be plugged into our mailman installs. The objective is that the
>>> members of our multicultural network would be able to send email in
>>> their mother language and it would be delivered to the list
>>> machine-translated (and vice versa). The translation pairs we care about
>>> most are EN<->FR and EN<->ES
>>> 
>>> Our dream scenario is:
>>> 
>>> 1. A translator machine is installed on our server, so the messages
>>> don't need to be run through untrustworthy 3rd party services (googletrans)
>>> 2. Mailman (or similar) is connected to such a translator
>>> 3. Mailing list users can opt to receive messages sent to the mailing
>>> list in following format:
>>> 
>>> 
>>> Message body
>>> --
>>> Message body translated
>>> -
>>> 
>>> 4. Similarly, the system can be configured so that when receiving
>>> messages from specific senders the messages get translated from FR or ES
>>> into EN
>>> 
>>> Our default language used on lists is EN
>>> 
>>> Is Joshua relevant for this? Any previous experience with similar setup?
>>> I suppose that a lot of configuration would be needed, but at this point
>>> I want to know if I am not completely mistaken when considering your
>>> Joshua for this.
>>> 
>>> Thanks
>>> 
>>> karel
>>> 
>>> ---
>>> 
>>> -- 
>>> ~~~
>>> Karel Novotny 
>>> Knowledge Sharing & Network Development Coordinator
>>> APC - The Association for Progressive Communications 
>>> https://www.apc.org
>>> GSM: +420 605 243 246 (GMT +1)
>>> jabber: ka...@riseup.net
>>> Working/online: Monday - Thursday
>>> ~~~
>>> My public OpenPGP key: 
>>> https://pgp.mit.edu/pks/lookup?op=get&search=0x7FDEF502377E4FCA
>>> 
>>> 
>> 
> 
> -- 
> ~~~
> Karel Novotny 
> Knowledge Sharing & Network Development Coordinator
> APC - The Association for Progressive Communications 
> https://www.apc.org
> GSM: +420 605 243 246 (GMT +1)
> jabber: ka...@riseup.net
> Working/online: Monday - Thursday
> ~~~
> My public OpenPGP key: 
> https://pgp.mit.edu/pks/lookup?op=get&search=0x7FDEF502377E4FCA

Re: Podling Report Reminder - February 2017

2017-01-30 Thread Matt Post

Folks — I'll take care of this next week, after February 6.

matt

> On Jan 30, 2017, at 10:18 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC

Re: Podling Report Reminder - February 2017

2017-02-01 Thread Matt Post

Folks,

I added the Joshua report.

https://wiki.apache.org/incubator/February2017 


It is due today. Feel free to make comments or initiate discussion here but 
otherwise what's there is what will be sent.

matt


> On Jan 25, 2017, at 7:21 PM, johndam...@apache.org wrote:
> 
> Dear podling,
> 
> This email was sent by an automated system on behalf of the Apache
> Incubator PMC. It is an initial reminder to give you plenty of time to
> prepare your quarterly board report.
> 
> The board meeting is scheduled for Wed, 22 February 2017, 10:30 am PDT.
> The report for your podling will form a part of the Incubator PMC
> report. The Incubator PMC requires your report to be submitted 2 weeks
> before the board meeting, to allow sufficient time for review and
> submission (Wed, February 08).
> 
> Please submit your report with sufficient time to allow the Incubator
> PMC, and subsequently board members to review and digest. Again, the
> very latest you should submit your report is 2 weeks prior to the board
> meeting.
> 
> Thanks,
> 
> The Apache Incubator PMC
> 
> Submitting your Report
> 
> --
> 
> Your report should contain the following:
> 
> *   Your project name
> *   A brief description of your project, which assumes no knowledge of
>the project or necessarily of its field
> *   A list of the three most important issues to address in the move
>towards graduation.
> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>aware of
> *   How has the community developed since the last report
> *   How has the project developed since the last report.
> 
> This should be appended to the Incubator Wiki page at:
> 
> https://wiki.apache.org/incubator/February2017
> 
> Note: This is manually populated. You may need to wait a little before
> this page is created from a template.
> 
> Mentors
> ---
> 
> Mentors should review reports for their project(s) and sign them off on
> the Incubator wiki page. Signing off reports shows that you are
> following the project - projects that are not signed may raise alarms
> for the Incubator PMC.
> 
> Incubator PMC

problems with BerkeleyLM

2017-02-01 Thread Matt Post

Hi folks,

I've found some problems with BerkeleyLM. I haven't diagnosed it yet, and am 
not going to have time for a week or two at least, but thought I'd bring it to 
everyone's attention because this affects our no-external-dependency releases.

As for the solution, in addition to trying to track down this problem, I've 
been working on a docker solution for helping people easily add KenLM to the 
language packs.

The problem can be seen in the following. I trained an English--German model 
using the state-minimizing KenLM (KenLM/Full). You can see the BLEU scores on a 
number of test sets below. If I then swap out the StateMinimizingLanguageModel 
for a regular LanguageModel, still using KenLM to represent it (KenLM/LM), I get a 
drop as expected. If I then swap out KenLM for BerkeleyLM, I get a further huge 
drop.

I wouldn't expect this large of a drop in either situation, but the BerkeleyLM 
one is especially troubling.

Anyway, troubleshooting is forthcoming, but I am sharing this in case anyone is 
using BerkeleyLM somewhere.

matt

---
news-test2008
KenLM/Full:   => BLEU = 0.1464
KenLM/LM: => BLEU = 0.1168
BerkeleyLM:   => BLEU = 0.0800

newstest2008-14.de-en
KenLM/Full:   => BLEU = 0.1524
KenLM/LM: => BLEU = 0.1235
BerkeleyLM:   => BLEU = 0.0839

newstest2009
KenLM/Full:   => BLEU = 0.1372
KenLM/LM: => BLEU = 0.1113
BerkeleyLM:   => BLEU = 0.0793

newstest2010
KenLM/Full:   => BLEU = 0.1487
KenLM/LM: => BLEU = 0.1213
BerkeleyLM:   => BLEU = 0.0847

newstest2011
KenLM/Full:   => BLEU = 0.1473
KenLM/LM: => BLEU = 0.1192
BerkeleyLM:   => BLEU = 0.0826

newstest2012
KenLM/Full:   => BLEU = 0.1488
KenLM/LM: => BLEU = 0.1205
BerkeleyLM:   => BLEU = 0.0797

newstest2013
KenLM/Full:   => BLEU = 0.1692
KenLM/LM: => BLEU = 0.1391
BerkeleyLM:   => BLEU = 0.0923

newstest2014.de-en
KenLM/Full:   => BLEU = 0.1669
KenLM/LM: => BLEU = 0.1351
BerkeleyLM:   => BLEU = 0.0881

newstest2016.de-en
KenLM/Full:   => BLEU = 0.2177
KenLM/LM: => BLEU = 0.1724
BerkeleyLM:   => BLEU = 0.1117
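
To put the drops in relative terms, here is a small Python sketch that recomputes them from the scores listed above. The numbers are copied from the table; the function names are just for illustration.

```python
# BLEU scores copied from the table above: (KenLM/Full, KenLM/LM, BerkeleyLM).
SCORES = {
    "news-test2008": (0.1464, 0.1168, 0.0800),
    "newstest2008-14.de-en": (0.1524, 0.1235, 0.0839),
    "newstest2009": (0.1372, 0.1113, 0.0793),
    "newstest2010": (0.1487, 0.1213, 0.0847),
    "newstest2011": (0.1473, 0.1192, 0.0826),
    "newstest2012": (0.1488, 0.1205, 0.0797),
    "newstest2013": (0.1692, 0.1391, 0.0923),
    "newstest2014.de-en": (0.1669, 0.1351, 0.0881),
    "newstest2016.de-en": (0.2177, 0.1724, 0.1117),
}


def relative_drop(before, after):
    """Fraction of the 'before' score that is lost."""
    return (before - after) / before


def summarize(scores):
    """Return (test set, Full->LM drop, LM->BerkeleyLM drop) per test set."""
    return [
        (name, relative_drop(full, lm), relative_drop(lm, berkeley))
        for name, (full, lm, berkeley) in scores.items()
    ]


if __name__ == "__main__":
    for name, d1, d2 in summarize(SCORES):
        print(f"{name:24s} Full->LM {d1:6.1%}   LM->BerkeleyLM {d2:6.1%}")
```

On these numbers, the first swap costs roughly a fifth of the BLEU score on every test set, and the second costs roughly another third of what remains.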

Re: Cutting RC3

2017-02-23 Thread Matt Post

Thank you for heading this up, Tommaso! I'll be able to catch up on this after 
today.

matt


> On Feb 23, 2017, at 3:06 AM, Tommaso Teofili  
> wrote:
> 
> probably because of the mentioned network issues the artifacts ended up in
> two separate staging repositories in Nexus, which is undesired.
> I'll drop those repos, rollback the changes on the pom, delete the current
> tag in git and perform again mvn release:prepare / perform today.
> 
> Regards,
> Tommaso
> 
> On Wed, Feb 22, 2017 at 16:39, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> Maven is in the extremely slow (because of my bandwidth) process of
>> deploying stuff on Nexus as part of the mvn release:perform phase.
>> In the meantime perhaps it is a good idea not to commit to the master branch,
>> until we get the RC3 voted and hence approved / rejected.
>> 
>> Thanks and regards,
>> Tommaso
>>

Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-02-26 Thread Matt Post

Hi folks,

First, Tommaso, thank you for pulling this together!

I want to remind everyone that there's a checklist to go through before sending 
your +1. Here's from an email from Tom Barber a while back:

> Hello folks,
> 
> I see plenty of +1's going through the release vote,  which is great to see
> people taking an active role in getting the release shipped.
> 
> For those of you who are new to the ASF there are a bunch of requirements
> to sign off for a release which you can find here:
> 
> http://incubator.apache.org/guides/releasemanagement.html#check-list 
> 
> 
> My current concern is that people who are new to the incubator are +1'ing
> software for release without checking all or part of the release cycle. Whilst
> not mandatory, when you +1 a release please can you try to indicate what
> you've checked. The reason for this is, the tag Lewis has built off isn't
> the tip of master, so if you're basing your +1 on your day to day
> development and knowledge of the code base, that's not always what's
> shipped. Also in the branching process, it's possible merges or alterations
> were accidentally made that Lewis has missed (this is very unlikely I know
> but you know, code changes). Also people build software on different OS's,
> versions of OS's etc so just because it builds on Lewis's laptop doesn't
> mean it builds on mine, for example.
> 
> Also regarding licenses, disclaimers etc, people notice different things or
> interpret stuff differently. It's always possible that someone might miss a
> library etc so it's important multiple eyes run over the same stuff.
> 
> Cheers,
> 
> Tom

I'm hoping I'll have time to go through this tomorrow.
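
One checklist item that is easy to script is comparing a downloaded artifact against its published checksum file. A minimal Python sketch, assuming the checksum file holds a hex digest optionally followed by a file name (the artifact here is a stand-in byte string, and the function names are hypothetical):

```python
import hashlib
import io


def hex_digest(stream, algorithm, chunk_size=1 << 20):
    """Hash a binary stream in chunks and return the lowercase hex digest."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def matches_published(stream, algorithm, published):
    """Compare a stream's digest with the text of a published checksum file.

    Only the first whitespace-separated token is used, and case is ignored,
    since tools differ in whether they print upper- or lowercase hex.
    """
    expected = published.strip().split()[0].lower()
    return hex_digest(stream, algorithm) == expected


if __name__ == "__main__":
    artifact = io.BytesIO(b"hello")  # stand-in for an opened release tarball
    print(matches_published(artifact, "md5", "5D41402ABC4B2A76B9719D911017C592\n"))
```

For a real check you would open the tarball with open(path, "rb") and pass in the contents of the matching .md5 or .sha file. The signature still needs a separate gpg --verify.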

matt



> On Feb 25, 2017, at 2:41 AM, Tommaso Teofili  
> wrote:
> 
> Hi Folks,
> Please VOTE on the Apache Joshua 6.1 Release Candidate #3.
> 
> We solved 36 issues:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319720&version=12335049
> 
> Git source tag (3447715b3aa0a48ed79465d80618bd5a2f7a7558):
> https://s.apache.org/XIxJ
> 
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachejoshua-1004
> 
> Source Release Artifacts:
> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
> 
> PGP release keys (signed using 891768A5):
> https://git1-us-west.apache.org/repos/asf?p=incubator-joshua.git;a=blob_plain;f=KEYS;h=aa18365bf5c8c8fb17b084f783a75c3a2460a98d;hb=HEAD
> 
> Vote will be open for 72 hours.
> Thank you to everyone that is able to VOTE as well as everyone that
> contributed to Apache Joshua 6.1.
> 
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider to fix few issues before...
> [ ] -1, nope, because... (and please explain why)
> 
> Regards,
> Tommaso

Re: Question before preparing RC3

2017-02-28 Thread Matt Post

Thanks, Tommaso. I'm not actually sure what the right process is, here.

matt


> On Feb 22, 2017, at 8:04 AM, Tommaso Teofili  
> wrote:
> 
> I've realized it should be fine, therefore I'm proceeding with cutting the
> new RC.
> 
> Regards,
> Tommaso
> 
> On Mon, Feb 20, 2017 at 16:21, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> over the weekend I've picked up JOSHUA-324 in order to try to push for our
>> first Apache release.
>> Before proceeding I have the following concern: when performing the dryRun
>> of the Maven release:prepare phase, as taken from the Wiki [1], I notice
>> that the produced pom.xml's version is 6.1-SNAPSHOT, is that expected for a
>> dryRun or is there any issue in the command ?
>> I would have expected the src.zip package in the target directory to
>> contain a pom.xml whose version is 6.1.
>> 
>> Any clarification would be appreciated.
>> Regards,
>> Tommaso
>> 
>> [1] :
>> https://cwiki.apache.org/confluence/display/JOSHUA/Joshua+Release+Management+Procedure#JoshuaReleaseManagementProcedure-Preparingareleasecandidate%28RC%29forcommunityVOTE%27ing
>>

Re: Chinese Language Pack

2017-03-02 Thread Matt Post

Yes, this is on the docket, but there is currently no timeline. Perhaps sometime 
this month?

matt


> On Mar 1, 2017, at 8:18 PM, emu...@cock.li wrote:
> 
> Hello
> 
> Are there any plans to release a Chinese to English language pack in the near 
> future and approximate timeline?
> 
> 
> Thanks!
> Emu

Re: Dockerhub hosted images

2017-03-03 Thread Matt Post

Folks,

I've updated the code with a few changes that will support Dockerized language 
packs. The nice thing is that this makes it easy to include KenLM.

Here are some changes that were made:

- Joshua now notes what directory the config file was found in and loads 
relative paths found in the config file relative to that directory 
automatically. This means you don't have to "cd" to the LP (language pack) 
directory before running Joshua.

- I fixed the HTTP server to take multiple "q=" lines, just like the Google 
translate API. Before, it only took one "q=" line. This should mean (I'll 
test later today) that the HTTP server can handle throughput essentially at the 
rates of the TCP server.

- I added (but haven't pushed yet) the KenLM model files to the language packs. 
In addition, I added a file "joshua.config.kenlm". These are not used except by 
Docker.

- I fixed the docker setup. See the new file:


https://github.com/apache/incubator-joshua/blob/master/distribution/docker/kenlm/Dockerfile

This docker container builds KenLM. It then expects to be run with docker 
mounting an existing language pack to /model. It then runs the 
joshua.config.kenlm file, running it as a server in HTTP mode. See the README 
file for information:


https://github.com/apache/incubator-joshua/tree/master/distribution/docker/kenlm

If anyone wants to test this out, please do. You can grab an updated language 
pack (version 3) here:


http://cs.jhu.edu/~post/language-packs/apache-joshua-es-en-2017-03-03.tgz

(Warning: 9 GB)
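
To illustrate the first change listed above (loading relative paths relative to the config file's directory), here is a minimal Python sketch of the idea. It is not Joshua's actual Java code, and the file names are hypothetical.

```python
from pathlib import Path


def resolve_model_path(config_file, value):
    """Resolve a path found in a config file against the config's directory.

    Absolute paths are returned unchanged; relative ones are interpreted
    relative to the directory containing the config file, not the current
    working directory.
    """
    path = Path(value)
    if path.is_absolute():
        return path
    return Path(config_file).parent / path


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    print(resolve_model_path("/opt/language-pack/joshua.config", "lm.kenlm"))
    print(resolve_model_path("/opt/language-pack/joshua.config", "/data/lm.kenlm"))
```

With this kind of resolution the decoder can be started from any working directory, which is what removes the need to "cd" into the language pack first.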

matt


> On Nov 23, 2016, at 10:14 AM, kellen sunderland  
> wrote:
> 
> Yeah it should just be 'docker pull kellens/apache-joshua-es-en-2016-10-05'
> then 'docker run -it kellens/apache-joshua-es-en-2016-10-05 /bin/bash' or
> something similar.  I think the default command should eventually be to run
> the http server, so ideally we'd just do 'docker run -p 5674
> kellens/apache-joshua-es-en-2016-10-05' and that would start up the http
> server on port 5674.
> 
> Good point on Perl + Python, I can add them.
> 
> -Kellen
> 
> On Wed, Nov 23, 2016 at 3:22 PM, Matt Post  wrote:
> 
>> Okay, I have this with
>> 
>>docker run -it kellens/apache-joshua-es-en-2016-10-05 bash
>> 
>> It seems we are missing Perl (./prepare.sh fails), and we should replace
>> the LanguageModel line with a KenLM instance and build that. I bet we'll
>> need Python, too.
>> 
>> 
>> 
>> 
>>> On Nov 23, 2016, at 8:15 AM, Matt Post  wrote:
>>> 
>>> Kellen, can I bother you to post a few first steps? I've successfully
>>> pulled this down to my mac but now do not know how to find it, edit it, or
>>> run it. I'm poring through the documentation and will find it eventually
>>> but this would save me a bit of time.
>>> 
>>> 
>>>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>> 
>>>> Yes my next step was going to be getting it hosted officially.
>>>> 
>>>> I'll go ahead and open a ticket.  I think I'll hold off on pushing to the
>>>> Apache account until I've done a little more testing though.
>>>> 
>>>> On Nov 23, 2016 5:22 AM, "lewis john mcgibbney" wrote:
>>>> 
>>>>> Hi Kellen,
>>>>> Nice :)
>>>>> Another option is for us to host these via the Apache account.
>>>>> https://hub.docker.com/r/apache/
>>>>> We could then add a badge to our README which points to the Dockerfile(s).
>>>>> Do you want to open a ticket over on the INFRA Jira for this?
>>>>> 
>>>>> On Tue, Nov 22, 2016 at 1:57 PM, <
>>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>>> 
>>>>>> From: kellen sunderland 
>>>>>> To: "dev@joshua.incubator.apache.org"
>>>>>> Cc:
>>>>>> Date: Tue, 22 Nov 2016 22:56:56 +0100
>>>>>> Subject: Re: Dockerhub hosted images
>>>>>> Ok, the first image should be properly uploaded now.
>>>>>> 
>>>>>> https://hub.docker.com/r/kellens/apache-joshua-es-en-2016-10-05/
>>>>>> 
>>>>>> -Kellen
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 
>>

Re: [VOTE] Release Apache Joshua 6.1 (Incubating)

2017-03-04 Thread Matt Post

Tommaso,

What's your timeline for fixing this? I just pushed in some changes that add 
docker support and provide multithreading for the HTTP server. It would be nice 
to include those, BUT if it's a lot of extra work, we can just add them later 
(or you could point me to the doc you followed, and I'll do it on Monday)

matt


> On Mar 1, 2017, at 1:09 PM, John Hewitt  wrote:
> 
> Tommaso, thanks for the RC.
> Kellen, thanks for checking for the -1.
> 
> -John
> 
> On Wed, Mar 1, 2017 at 1:03 PM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> 
>> For a short term fix for the unit test we can delete lines 48 and 50 from
>> LMGrammarBerkeleyTest.java.
>> 
>> A bit of a longer term solution would be that we could have a @BeforeClass
>> setup method that simply zips the uncompressed files.
>> 
>> Thanks again for putting this together Tommaso.
>> 
>> 
>> On Wed, Mar 1, 2017 at 6:43 PM, Tommaso Teofili wrote:
>> 
>>> thanks Kellen,
>>> 
>>> I get the very same issues.
>>> It's probably my fault having copied .md5 and .sha files from the staging
>>> repo as I didn't have them within my target directory.
>>> I also get the same test failure.
>>> 
>>> Hence -1 from me too.
>>> I'll roll it back, fix the issues and create RC4.
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> 
>>> 
>>> On Wed, Mar 1, 2017 at 17:54, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>> 
>>>> I have to -1 this release for the time being.  For me the signatures and
>>>> hashes don't seem to match the binaries downloaded.  Could you double
>>>> check that they match for you Tommaso?  I'm also getting a unit test that
>>>> fails when I run 'mvn clean package'.  I'm digging a little more into this
>>>> one, but suspect a missing file.
>>>> 
>>>> 
>>>> 
>>>> Here's what I've checked so far:
>>>> 
>>>> Release artifacts must include incubating in the final file name - YES
>>>> Release artifacts must include a disclaimer within the release
>>>> artifact(s) as noted - YES
>>>> Every ASF release MUST contain one or more source packages, which MUST be
>>>> sufficient for a user to build and test the release provided they have
>>>> access to the appropriate platform and tools. - NO
>>>>     - Not building due to failing test (BerkeleyLM failure).  I'm digging
>>>>       a bit more into this.
>>>> 
>>>> Every artifact distributed to the public through Apache channels MUST be
>>>> accompanied by one file containing an OpenPGP compatible ASCII armored
>>>> detached signature and another file containing an MD5 checksum.
>>>>     - .asc - NO
>>>>       I get warning:
>>>>       "gpg --verify joshua-incubating-6.1-src.tar.gz.asc joshua-incubating-6.1-src.tar.gz
>>>>       gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID 891768A5
>>>>       gpg: Good signature from "Tommaso Teofili " [unknown]
>>>>       gpg: WARNING: This key is not certified with a trusted signature!
>>>>       gpg:          There is no indication that the signature belongs to the owner."
>>>>     - .md5 - NO
>>>>       My md5 of joshua-incubating-6.1-src.tar.gz is
>>>>       504976876b01294811293aa45b5400f5, the
>>>>       joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
>>>>       22b738eeae45757715080702a5bd2789
>>>>     - .sha - NO
>>>>       My sha of joshua-incubating-6.1-src.tar.gz is
>>>>       4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
>>>>       joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
>>>>       2a55b6d341dddc5369b22a4802a86ec40accd0a1
>>>>     - KEYS - YES
>>>> 
>>>> On Mon, Feb 27, 2017 at 3:55 AM, Matt Post  wrote:
>>>> 
>>>>> Hi folks,
>>>>> 
>>>>> First, Tommaso, thank you for pulling this together!
>>>>> 
>>>>> I want to remind everyone that there's a checklist to go through before
>>>>> sending your +1. Here's from an email from Tom Barber a while back:
>>>>> 
>>>

Re: Dockerhub hosted images

2017-03-07 Thread Matt Post

FYI, I stress-tested the Joshua server with the following protocol: for both 
the TCP and HTTP servers, I started a six-thread server, and then sent five 
simultaneous 16k documents at each. The translation times were as follows:

TCP: (times: 8:07 8:06 8:06)

for x in 1 2 3 4; do for num in $(seq 1 5); do cat corpus.es | nc localhost 5674 > t.tcp.$num & done; time wait; done

HTTP: (times: 7:25 7:34 7:20)

for x in 1 2 3 4; do for num in $(seq 1 5); do /home/hltcoe/mpost/code/joshua/scripts/support/query_http.py -s localhost -p 5674 corpus.es > t.out.$num & done; time wait; done

The HTTP query takes 100 lines of the test set at a time, constructs the 
RESTful query string (with 100 url-encoded "q=..." lines), and sends it to the 
server.
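
The batching described above can be sketched in a few lines of Python. This is not the actual query_http.py script, only an illustration using the standard library; the request path and any other parameters the server expects are left out, and the function names are made up.

```python
from urllib.parse import parse_qs, urlencode


def batches(lines, size=100):
    """Yield consecutive groups of at most `size` lines."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def build_query(sentences):
    """URL-encode sentences as repeated q= parameters (q=...&q=...)."""
    return urlencode([("q", s) for s in sentences])


if __name__ == "__main__":
    corpus = ["¿Dónde está la biblioteca?", "Gracias & adiós."]
    for batch in batches(corpus, size=100):
        query = build_query(batch)
        print(query)
        # Round-trip check: decoding gives back the sentences in order.
        print(parse_qs(query)["q"] == batch)
```

Passing a list of pairs to urlencode, rather than a dict, is what allows the same "q" key to be repeated once per sentence.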

So the bottom line is that the HTTP server both has an extended 
Google-translate API (which also supports other things like adding rules) and 
is a bit faster.
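
To quantify "a bit faster", here is a short Python sketch that averages the timings reported above, assuming they are minutes:seconds of wall-clock time per round:

```python
def to_seconds(stamp):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def mean_seconds(stamps):
    """Average a list of 'M:SS' timings, in seconds."""
    values = [to_seconds(s) for s in stamps]
    return sum(values) / len(values)


# Timings copied from the message above.
TCP_RUNS = ["8:07", "8:06", "8:06"]
HTTP_RUNS = ["7:25", "7:34", "7:20"]

if __name__ == "__main__":
    tcp = mean_seconds(TCP_RUNS)
    http = mean_seconds(HTTP_RUNS)
    print(f"TCP mean:  {tcp:.1f} s")
    print(f"HTTP mean: {http:.1f} s")
    print(f"HTTP is {(tcp - http) / tcp:.1%} faster on average")
```

Under that assumption the HTTP runs come out about 8 percent faster on average, though three runs each is a small sample.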

I'm documenting the RESTful API here: 
https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API

matt


> On Mar 3, 2017, at 11:24 AM, Matt Post  wrote:
> 
> Folks,
> 
> I've updated the code with a few changes that will support Dockerized 
> language packs. The nice thing is that this makes it easy to include KenLM.
> 
> Here are some changes that were made:
> 
> - Joshua now notes what directory the config file was found in and loads 
> relative paths found in the config file relative to that directory 
> automatically. This means you don't have to "cd" to the LP (language pack) 
> directory before running Joshua.
> 
> - I fixed the HTTP server to take multiple "q=" lines, just like the Google 
> Translate API. Before, it took only one "q=" line. This should mean (I'll 
> test later today) that the HTTP server can handle essentially the same 
> throughput as the TCP server.
> 
> - I added (but haven't pushed yet) the KenLM model files to the language 
> packs. In addition, I added a file "joshua.config.kenlm". These are not used 
> except by Docker.
> 
> - I fixed the docker setup. See the new file:
> 
>   https://github.com/apache/incubator-joshua/blob/master/distribution/docker/kenlm/Dockerfile
> 
> This Docker container builds KenLM. It expects to be run with Docker 
> mounting an existing language pack at /model. It then starts Joshua with the 
> joshua.config.kenlm file, running it as a server in HTTP mode. See the README 
> file for more information:
> 
>   https://github.com/apache/incubator-joshua/tree/master/distribution/docker/kenlm
> 
> If anyone wants to test this out, please do. You can grab an updated language 
> pack (version 3) here:
> 
>   http://cs.jhu.edu/~post/language-packs/apache-joshua-es-en-2017-03-03.tgz
> 
> (Warning: 9 GB)
> 
> matt
> 
> 
>> On Nov 23, 2016, at 10:14 AM, kellen sunderland 
>>  wrote:
>> 
>> Yeah it should just be 'docker pull kellens/apache-joshua-es-en-2016-10-05'
>> then 'docker run -it kellens/apache-joshua-es-en-2016-10-05 /bin/bash' or
>> something similar.  I think the default command should eventually be to run
>> the http server, so ideally we'd just do 'docker run -p 5674
>> kellens/apache-joshua-es-en-2016-10-05' and that would start up the http
>> server on port 5674.
>> 
>> Good point on Perl + Python, I can add them.
>> 
>> -Kellen
>> 
>> On Wed, Nov 23, 2016 at 3:22 PM, Matt Post  wrote:
>> 
>>> Okay, I have this with
>>> 
>>>   docker run -it kellens/apache-joshua-es-en-2016-10-05 bash
>>> 
>>> It seems we are missing Perl (./prepare.sh fails), and we should replace
>>> the LanguageModel line with a KenLM instance and build that. I bet we'll
>>> need Python, too.
>>> 
>>> 
>>> 
>>> 
>>>> On Nov 23, 2016, at 8:15 AM, Matt Post  wrote:
>>>> 
>>>> Kellen, can I bother you to post a few first steps? I've successfully
>>> pulled this down to my mac but now do not know how to find it, edit it, or
>>> run it. I'm poring through the documentation and will find it eventually
>>> but this would save me a bit of time.
>>>> 
>>>> 
>>>>> On Nov 23, 2016, at 8:07 AM, kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>>>

Re: [jira] [Commented] (JOSHUA-331) Address Apache Joshua 6.1 RC#3 Issues

2017-03-08 Thread Matt Post

Hi Tommaso,

I'm afraid I'm not at all familiar with the release process and am not sure 
what to do here. Can you simply retrace these steps and do it again correctly?

matt


> On Mar 7, 2017, at 8:31 AM, Tommaso Teofili (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899440#comment-15899440
>  ] 
> 
> Tommaso Teofili commented on JOSHUA-331:
> 
> 
> now all seems fine to me, one concern I have is that for RC3 I didn't have 
> .MD5 checksums in my target directory after having done _mvn release:prepare_ 
> and _mvn release:perform_ and therefore I took the ones from the staging repo 
> and copied them to _/dist_ assuming that they got generated using my key, 
> which of course was not the case.
> How should we proceed there ?
> 
> 
>> Address Apache Joshua 6.1 RC#3 Issues
>> -
>> 
>>Key: JOSHUA-331
>>URL: https://issues.apache.org/jira/browse/JOSHUA-331
>>Project: Joshua
>> Issue Type: Task
>> Components: release
>>   Reporter: Tommaso Teofili
>>   Assignee: Tommaso Teofili
>>Fix For: 6.1
>> 
>> 
>> Address the following issues:
>> {quote}
>> Every ASF release MUST contain one or more source packages, which MUST be
>> sufficient for a user to build and test the release provided they have
>> access to the appropriate platform and tools. - NO
>>-Not building due to failing test (BerkeleyLM failure).  I'm digging a
>> bit more into this.
>> {quote}
>> {quote}
>> Every artifact distributed to the public through Apache channels MUST be
>> accompanied by one file containing an OpenPGP compatible ASCII armored
>> detached signature and another file containing an MD5 checksum.
>>- .asc - NO
>>I get warning:
>>"gpg --verify joshua-incubating-6.1-src.tar.gz.asc
>> joshua-incubating-6.1-src.tar.gz
>>gpg: Signature made Thu Feb 23 09:15:17 2017 CET using RSA key ID
>> 891768A5
>>gpg: Good signature from "Tommaso Teofili "
>> [unknown]
>>gpg: WARNING: This key is not certified with a trusted signature!
>>gpg:  There is no indication that the signature belongs to the
>> owner."
>>- .md5 - NO
>>My md5 of joshua-incubating-6.1-src.tar.gz is
>> 504976876b01294811293aa45b5400f5, the joshua-incubating-6.1-src.tar.gz.md5
>> indicates it should be 22b738eeae45757715080702a5bd2789
>>- .sha - NO
>>My sha of joshua-incubating-6.1-src.tar.gz is
>> 4AB5BA24301590F36AE6452DACC3F21CBD8B3FEC, the
>> joshua-incubating-6.1-src.tar.gz.md5 indicates it should be
>> 2a55b6d341dddc5369b22a4802a86ec40accd0a1
>>- KEYS - YES
>> {quote}
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.15#6346)

Re: Apache Joshua Question

2017-03-09 Thread Matt Post

Hi Guanzhong,

I suggest you look at the demo that is included with Joshua. That contains 
sample Javascript code that shows how to connect to a RESTful instance of 
Joshua. See the README file in this directory:

https://github.com/apache/incubator-joshua/tree/master/demo

as well as the HTML / JS at

https://github.com/apache/incubator-joshua/blob/master/demo/index.html
https://github.com/apache/incubator-joshua/blob/master/demo/demo.js

The page for the RESTful server should also be of some help (I assume you have 
seen this):


https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API?src=contextnavpagetreemode

Sincerely,
Matt Post


> On Mar 9, 2017, at 9:58 AM, Guanzhong Wang  
> wrote:
> 
> Dear Joshua Development Team
> I am a web developer working in the DC area. In recent years I have become
> very interested in NLP, and I noticed that you just released the
> Apache Joshua REST API doc on the website. Recently I have been trying to
> integrate Joshua into my Java-based web service, but haven't made any
> progress. I have no idea how to pass arguments to Joshua, because I can't find
> any Joshua Java API or library. I see Joshua can run a script to bring up an
> HTTP server, but instead of that, do you think I can integrate it into my own
> HTTP server? Any ideas?
> 
> 
> Thanks.

Re: [VOTE] Release Apache Joshua 6.1 (Incubating) RC4

2017-03-20 Thread Matt Post

Folks — This is still in my queue so let's keep this open.

matt


> On Mar 16, 2017, at 8:56 PM, John Hewitt  wrote:
> 
> Lewis is right about the week. Sorry, everyone. This week had a DARPA
> meeting in Atlanta. I'll get my +/-1 out tomorrow.
> 
> -John
> 
> On Thu, Mar 16, 2017 at 8:53 PM, Michael A. Hedderich <
> m...@michael-hedderich.de> wrote:
> 
>> Hi,
>> 
>> Thanks Tommaso for putting the release together!
>> 
>> I was traveling to the US, sorry for the delay from my side.
>> 
>> Here is my list:
>> - build from tag: passed
>> - build from staging repo (zip and gz): passed
>> - build from source release artifacts (zip and gz): passed
>> - md5, sha1 and asc match within the staging repo
>> - md5 and asc match within the source release artifacts
>> 
>> What does not match for me are the md5 or sha1 of the staging repo with
>> those of the source release artifacts. E.g.
>> https://repository.apache.org/content/repositories/
>> orgapachejoshua-1005/org/apache/joshua/joshua-incubating/6.1/joshua-
>> incubating-6.1-src.tar.gz.md5
>> vs
>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>> joshua-incubating-6.1-src.tar.gz.md5
>> 
>> Is this the expected behavior?
>> 
>> The link to the check-list that Tom Barber had sent around in the past (
>> http://incubator.apache.org/guides/releasemanagement.html#check-list) does
>> not seem to be valid anymore. At least for me the anchor point does not
>> work and I could not find the check-list on this page or one of its
>> subpages. Does anyone know if this list still exists? If not, should we put
>> such a list on the Joshua PPMC Wiki?
>> 
>> Regards,
>> Michael
>> 
>> 
>> 2017-03-16 20:11 GMT-04:00 lewis john mcgibbney :
>> 
>>> Hi Tommaso,
>>> It looks like you caught the PPMC on a bad week... we will get the VOTE
>>> done, don't worry ;)
>>> Thanks for putting the RC together.
>>> Comments inline
>>> 
>>> On Mon, Mar 13, 2017 at 3:58 PM, <
>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>> 
>>> SIGS look good so do tags and staging repos.
>>> 
>>> On primary release src at
>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>>> joshua-incubating-6.1-src.tar.gz,
>>> the compressed archive is called joshua-incubating-6.1-src, when I
>>> decompress it, it is called apache-joshua-6.1-incubating. This is a minor
>>> inconsistency which we may wish to address for next incubating release.
>>> 
>>> When I build (mvn clean install) I get the following... damn laptop. This
>>> is the same issue I got when I tried to spin the original RC2 myself. This
>>> is specific to my environment, so not a blocker.
>>> 
>>> [INFO]
>>> 
>>> [INFO] BUILD FAILURE
>>> [INFO]
>>> 
>>> [INFO] Total time: 29.351 s
>>> [INFO] Finished at: 2017-03-16T17:07:16-07:00
>>> [INFO] Final Memory: 41M/697M
>>> [INFO]
>>> 
>>> [ERROR] Failed to execute goal
>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>>> (source-release-assembly) on project joshua-incubating: Execution
>>> source-release-assembly of goal
>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user id
>>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
>>> switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
>>> 
>>> A mvn clean test results in the following
>>> 
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 18.971 s
>>> [INFO] Finished at: 2017-03-16T17:09:25-07:00
>>> [INFO] Final Memory: 34M/608M
>>> [INFO]
>>> 
>>> 
>>> CHANGES, DISCLAIMER, LICENSE, NOTICE and README all look good. DOAP is
>>> slightly out of date, however it reflects the first RC.
>>> 
>>> 
>>> [X] +1, let's get it released!!!
 
>>> 
>>> Thank you Tommaso
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>>> 
>>

Re: [VOTE] Release Apache Joshua 6.1 (Incubating) RC4

2017-03-31 Thread Matt Post

+1

✓ MD5 sums (tar and zip)
✓ includes DISCLAIMER
✓ build from src distribution (zip and tgz): 168 tests run, 31 skipped
✓ verified both GPG signatures
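
For anyone repeating the checksum step, here is a minimal sketch of the comparison. It assumes the .md5 file begins with the hex digest (possibly followed by a file name); the function names are hypothetical and this is not part of any release tooling.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(artifact_path, md5_path):
    """Compare an artifact against the digest stored in its .md5 file.

    Assumption: the first whitespace-separated token of the .md5 file
    is the expected hex digest.
    """
    with open(md5_path, "r", encoding="utf-8") as handle:
        tokens = handle.read().split()
    expected = tokens[0].lower() if tokens else ""
    return md5_hex(artifact_path) == expected
```

Running matches_md5_file on the same artifact against the .md5 files from two different locations is one way to see whether the locations disagree.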

I agree with Lewis's earlier point: the file name is 
joshua-incubating-6.1-src.tar.gz but it unpacks to 
apache-joshua-incubating-6.1. This discrepancy is okay for now but should be 
fixed in the future.

(at some point when we're in person we should exchange GPG keys)

matt

> On Mar 20, 2017, at 9:53 PM, Matt Post  wrote:
> 
> Folks — This is still in my queue so let's keep this open.
> 
> matt
> 
> 
>> On Mar 16, 2017, at 8:56 PM, John Hewitt  wrote:
>> 
>> Lewis is right about the week. Sorry, everyone. This week had a DARPA
>> meeting in Atlanta. I'll get my +/-1 out tomorrow.
>> 
>> -John
>> 
>> On Thu, Mar 16, 2017 at 8:53 PM, Michael A. Hedderich <
>> m...@michael-hedderich.de> wrote:
>> 
>>> Hi,
>>> 
>>> Thanks Tommaso for putting the release together!
>>> 
>>> I was traveling to the US, sorry for the delay from my side.
>>> 
>>> Here is my list:
>>> - build from tag: passed
>>> - build from staging repo (zip and gz): passed
>>> - build from source release artifacts (zip and gz): passed
>>> - md5, sha1 and asc match within the staging repo
>>> - md5 and asc match within the source release artifacts
>>> 
>>> What does not match for me are the md5 or sha1 of the staging repo with
>>> those of the source release artifacts. E.g.
>>> https://repository.apache.org/content/repositories/
>>> orgapachejoshua-1005/org/apache/joshua/joshua-incubating/6.1/joshua-
>>> incubating-6.1-src.tar.gz.md5
>>> vs
>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>>> joshua-incubating-6.1-src.tar.gz.md5
>>> 
>>> Is this the expected behavior?
>>> 
>>> The link to the check-list that Tom Barber had sent around in the past (
>>> http://incubator.apache.org/guides/releasemanagement.html#check-list) does
>>> not seem to be valid anymore. At least for me the anchor point does not
>>> work and I could not find the check-list on this page or one of its
>>> subpages. Does anyone know if this list still exists? If not, should we put
>>> such a list on the Joshua PPMC Wiki?
>>> 
>>> Regards,
>>> Michael
>>> 
>>> 
>>> 2017-03-16 20:11 GMT-04:00 lewis john mcgibbney :
>>> 
>>>> Hi Tommaso,
>>>> It looks like you caught the PPMC on a bad week... we will get the VOTE
>>>> done, don't worry ;)
>>>> Thanks for putting the RC together.
>>>> Comments inline
>>>> 
>>>> On Mon, Mar 13, 2017 at 3:58 PM, <
>>>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>>>> 
>>>> SIGS look good so do tags and staging repos.
>>>> 
>>>> On primary release src at
>>>> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/
>>>> joshua-incubating-6.1-src.tar.gz,
>>>> the compressed archive is called joshua-incubating-6.1-src, when I
>>>> decompress it, it is called apache-joshua-6.1-incubating. This is a minor
>>>> inconsistency which we may wish to address for next incubating release.
>>>> 
>>>> When I build (mvn clean install) I get the following... damn laptop. This
>>>> is the same issue I got when I tried to spin the original RC2 myself. This
>>>> is specific to my environment, so not a blocker.
>>>> 
>>>> [INFO]
>>>> 
>>>> [INFO] BUILD FAILURE
>>>> [INFO]
>>>> 
>>>> [INFO] Total time: 29.351 s
>>>> [INFO] Finished at: 2017-03-16T17:07:16-07:00
>>>> [INFO] Final Memory: 41M/697M
>>>> [INFO]
>>>> 
>>>> [ERROR] Failed to execute goal
>>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single
>>>> (source-release-assembly) on project joshua-incubating: Execution
>>>> source-release-assembly of goal
>>>> org.apache.maven.plugins:maven-assembly-plugin:3.0.0:single failed: user id
>>>> '498339010' is too big ( > 2097151 ). -> [Help 1]
>>>> [ERROR]
>>>> [ERROR] To see the full stack trace of the errors,

Re: ping on RC4 vote

2017-03-31 Thread Matt Post

Yes, I've verified that those don't match, either.

I can't think of a reason that they *shouldn't* match. Tommaso, do you have any 
idea why they're different? Are these two locations out of sync?



> On Mar 29, 2017, at 12:58 PM, Michael A. Hedderich 
>  wrote:
> 
> Hi,
> 
> from my last mail:
> 
> "What does not match for me are the md5 or sha1 of the staging repo with
> those of the source release artifacts. E.g. https://repository.apache.org/
> content/repositories/orgapachejoshua-1005/org/apache/joshua/joshua
> -incubating/6.1/joshua-incubating-6.1-src.tar.gz.md5 vs
> https://dist.apache.org/repos/dist/dev/incubator/joshua/6.1/joshua
> -incubating-6.1-src.tar.gz.md5  "
> 
> If this is the expected behavior, then its a +1 from me, too.
> 
> Cheers,
> Michael
> 
> 2017-03-29 12:07 GMT-04:00 lewis john mcgibbney :
> 
>> Hi Folks,
>> I would also like to encourage people to take a look and VOTE as soon as
>> possible.
>> I'm in regular contact with some folks over at the Linguistic Data
>> Consortium [0] (as are several of us I'm sure) and they have tentatively
>> agreed to announce our release (should it be done by then) in their next
>> newsletter... which has a wide reader base.
>> 
>> Thank you Tommaso for hanging on here.
>> 
>> To clarify, I'm a +1
>> 
>> [0] https://www.ldc.upenn.edu/
>> 
>> On Wed, Mar 29, 2017 at 8:39 AM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> 
>>> 
>>> From: Tommaso Teofili 
>>> To: "dev@joshua.incubator.apache.org" 
>>> Cc:
>>> Bcc:
>>> Date: Wed, 29 Mar 2017 15:39:18 +
>>> Subject: Re: ping on RC4 vote
>>> ping
>>> 
>>> 
>> 
> 
> 

Re: Plugging self-hosted Joshua into mailman?

2017-04-06 Thread Matt Post

Karel,

I'm way overdue on this email, so perhaps you've dropped this entirely, but I 
thought I'd respond to these points (inline below).

> On Jan 19, 2017, at 7:18 PM, Karel Novotný wrote:
> 
> 
> 
> On 19.1.2017 15:15, Matt Post wrote:
>> Karel — On this point, I don't think you should have to use the tutorials, 
>> which tell you how to identify training data and build new translation 
>> models yourself. I imagine that you would be more interested in downloading 
>> pre-built models that don't really require you to be an expert in MT. See 
>> this page:
>> 
>>  https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> 
> Thanks Matt for the clarifications. I actually did download the language pairs
> yesterday and tried to run them to test the webapp by doing:
> 
> ./joshua -server-port 5674 -server-type http
> and
> firefox "web/index.html?server=localhost&port=5674"
> 
> However, it started consuming more and more memory until it jammed my
> computer completely (dual core 8GB ram). It might have been some bad
> config on my side though, or some other omission.

I don't remember what model you were using, but the memory needed is going to be 
roughly proportional to what "du -sh model/" reports in the language pack directory.
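
If du is not handy, the same estimate can be sketched in a few lines. This is only an approximation under the assumption that summing file sizes is close enough; du counts disk blocks, so the numbers will differ slightly. The function name is hypothetical.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling directory_size_bytes("model") from inside the language pack directory gives a number to compare against the RAM available on the machine.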

There is now a Dockerized Joshua that makes it easy to use KenLM, which reduces 
these requirements quite a bit.

We could also just build smaller models. If nothing else, to start out with, 
and then improve on later.


> Our sysadmin should be able to make use of the API you mentioned.
> 
> If all sentences must be sent separately, then I suppose that there is
> no way that we could automatically re-compose any formatting
> (paragraphs), right? Having the translated text in one big block or as
> separate phrases on separate lines might make translating messages a
> bit challenging.

I would think this could be recomposed rather easily, but yes, it would take 
some bookkeeping. What we really want is a tool that could wrap Joshua and 
manage this for us — take a document, extract the sentences, get the 
translations (as generic annotations, perhaps), substitute them back in, and 
then return the document. Doesn't Tika do this, to an extent?
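
To make the bookkeeping concrete, here is a minimal sketch. It assumes the text already has one sentence per line with blank lines between paragraphs (real sentence splitting is not shown), and translate_batch is a hypothetical callable that takes a list of sentences and returns the translations in the same order, for example a thin wrapper around the HTTP server.

```python
def translate_document(text, translate_batch):
    """Translate sentence by sentence while keeping paragraph breaks.

    Assumptions: one sentence per line, paragraphs separated by a blank
    line; translate_batch is a hypothetical list-in, list-out callable.
    """
    sentences = []
    counts = []  # number of sentences in each paragraph
    for paragraph in text.split("\n\n"):
        lines = [line for line in paragraph.split("\n") if line.strip()]
        counts.append(len(lines))
        sentences.extend(lines)

    translated = translate_batch(sentences)
    if len(translated) != len(sentences):
        raise ValueError("translator returned a different number of sentences")

    # Put the translations back into their original paragraphs.
    paragraphs = []
    position = 0
    for count in counts:
        paragraphs.append("\n".join(translated[position:position + count]))
        position += count
    return "\n\n".join(paragraphs)
```

The only state that has to be carried around is the per-paragraph sentence count, which is why this should be easy to bolt onto a mail filter.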


> As for the volume: while this is difficult to estimate, I've made a
> calculation based on the monthly volume in the list archives in the absolute
> peak month. The average per day is approx. 1000 sentences, so it might be
> around 3000 on peak days.

This is nothing: minutes of computing at most, and there are knobs you can 
turn to change this.


> thanks for your interest in this.
> 
> karel
> 
>> 
>> matt
>> 
>> 
>>> On Jan 17, 2017, at 12:07 PM, lewis john mcgibbney wrote:
>>> 
>>> Hi Karel,
>>> The short answer is yes.
>>> I would advise you to start at the Tutorial
>>> https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started
>>> If you find anything which causes you problems then please write back here.
>>> Once you have skipped through the tutorial then you will have a much better
>>> feel for the workflow required.
>>> I can see the Apache Tika language identification and translate API's being
>>> of particular use here when considered in a runtime context. We have a
>>> Joshua implementation over in Tika which can aid you in this task however
>>> try the Joshua tutorial first.
>>> Lewis
>>> 
>>> On Mon, Jan 16, 2017 at 7:41 AM, Chris Mattmann  wrote:
>>> 
>>>> Hi Karel,
>>>> 
>>>> I would recommend moving this thread to dev@joshua.incubator.apache.org
>>>> instead of the private list. I’ve moved private to BCC.
>>>> 
>>>> Thank you.
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> 
>>>> 
>>>> On 1/16/17, 6:58 AM, wrote:
>>>> 
>>>>   Hello,
>>>> 
>>>>   We would like to build a self-hosted machine translation system that
>>>>   could be plugged into our mailman installs. The objective is that the
>>>>   members of our multicultural network would be able to send email in
>>>>   their mother language and it would be delivered to the list
>>>>   machine-translated (and vice versa).
>>>> 
>>>>   Are we on the right track with Joshua?

Re: Plugging self-hosted Joshua into mailman?

2017-04-25 Thread Matt Post


>> There is now a Dockerized Joshua that makes it easy to use KenLM, which 
>> reduces these requirements quite a bit.
> 
> Ok. I will talk to our sysadmin and see if he can do this. I myself
> don't know what 'dockerized' means in this context. If it is a separate
> pack/module, can you please point me to it (we are interested in
> En<->Fr, En<->Es, and Fr<->Es combinations)?

Docker is a virtual environment tool that makes it easy to share executable 
code. Here, it facilitates compiling KenLM, which some people have trouble with.

We haven't built any models out of English, or any models that don't include 
English, unfortunately.


>> We could also just build smaller models. If nothing else, to start out with, 
>> and then improve on later.
>> 
>> 
>>> Our sysadmin should be able to make use of the API you mentioned.
>>> 
>>> If all sentences must be sent separately, then I suppose that there is
>>> no way that we could automatically re-compose any formatting
>>> (paragraphs), right? Having the translated text in one big block or as
>>> separate phrases on separate lines might make translating messages a
>>> bit challenging.
>> I would think this could be recomposed rather easily, but yes, it would take 
>> some bookkeeping. What we really want is a tool that could wrap Joshua and 
>> manage this for us — take a document, extract the sentences, get the 
>> translations (as generic annotations, perhaps), substitute them back in, and 
>> then return the document. Doesn't Tika do this, to an extent?
> 
> I don't know :-)  But maybe someone else on this list has experience
> with this.
> 
> Thanks Matt.
> 
> karel
> 
>> 
>> 
>>> As for the volume: while this is difficult to estimate, I've made a
>>> calculation based on the monthly volume in the list archives in the absolute
>>> peak month. The average per day is approx. 1000 sentences, so it might be
>>> around 3000 on peak days.
>> This is nothing: minutes of computing at most, and there are knobs you can 
>> turn to change this.
>> 
>> 
>>> thanks for your interest in this.
>>> 
>>> karel
>>> 
 matt
 
 
> On Jan 17, 2017, at 12:07 PM, lewis john mcgibbney wrote:
> 
> Hi Karel,
> The short answer is yes.
> I would advise you to start at the Tutorial
> https://cwiki.apache.org/confluence/display/JOSHUA/Getting+Started
> If you find anything which causes you problems then please write back 
> here.
> Once you have skipped through the tutorial then you will have a much 
> better
> feel for the workflow required.
> I can see the Apache Tika language identification and translate API's 
> being
> of particular use here when considered in a runtime context. We have a
> Joshua implementation over in Tika which can aid you in this task however
> try the Joshua tutorial first.
> Lewis
> 
> On Mon, Jan 16, 2017 at 7:41 AM, Chris Mattmann wrote:
> 
>> Hi Karel,
>> 
>> I would recommend moving this thread to dev@joshua.incubator.apache.org
>> instead of the private list. I’ve moved private to BCC.
>> 
>> Thank you.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 1/16/17, 6:58 AM, wrote:
>> 
>>  Hello,
>> 
>>  We would like to build a self-hosted machine translation system that
>>  could be plugged into our mailman installs. The objective is that the
>>  members of our multicultural network would be able to send email in
>>  their mother language and it would be delivered to the list
>>  machine-translated (and vice versa).
>> 
>>  Are we on the right track with Joshua? I suppose that a lot of
>>  configuration would be needed, but at this point I want to know if I am
>>  not completely mistaken when considering your sw for this.
>> 
>>  Thanks
>> 
>>  karel
>> 
>> 
>>  --
>>  ~~~
>>  Karel Novotny
>>  Knowledge Sharing & Network Development Coordinator
>>  APC - The Association for Progressive Communications
>>  https://www.apc.org
>>  GSM: +420 605 243 246 (GMT +1)
>>  jabber: ka...@riseup.net
>>  Working/online: Monday - Thursday
>>  ~~~
>>  My public OpenPGP key: https://pgp.mit.edu/pks/lookup?op=get&search=
>> 0x7FDEF502377E4FCA
>> 
>> 
>> 
>> 
>> 
>> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>>> -- 
>>> ~~

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-24 Thread Matt Post

Hi,

If you are using a language pack, you must have KenLM installed into the 
language pack directory under lib/ since that is where it looks. Can you copy 
libken.so to that directory and see if it works?
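
In case it helps, here is a minimal sketch of that copy step. The file name libken.so is the Linux name and may differ on other platforms; where your built library lives (libken_source) is an assumption you will need to fill in, and the function name is hypothetical.

```python
import os
import shutil


def ensure_libken(language_pack_dir, libken_source):
    """Copy libken.so into <language pack>/lib unless it is already there.

    Returns True if a copy was made, False if the library was present.
    libken_source is the path to a library you built yourself (assumed).
    """
    lib_dir = os.path.join(language_pack_dir, "lib")
    target = os.path.join(lib_dir, "libken.so")
    if os.path.isfile(target):
        return False
    os.makedirs(lib_dir, exist_ok=True)
    shutil.copy2(libken_source, target)
    return True
```

This only puts the file in place; whether the decoder then finds it still depends on how the launcher script sets java.library.path.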




> On May 23, 2017, at 10:54 PM, Hoàng Đình Long  wrote:
> 
> Hello,
> 
> I have followed the Fisher call home tutorial and I built 3 models based on
> that document.
> 
> Then I use this script to build a language pack based on the model number 3:
> 
> $JOSHUA/scripts/language-pack/build_lp.sh es-en 3/tune/joshua.config.final
> 4g
> 
> The script finished successfully and created a folder named releases.
> 
> In "releases" folder, there is a folder named
> "apache-joshua-es-en-2017-05-24". I cd into that folder.
> 
> I created a file named "example.es" with this sentence inside "común y
> corriente" and ran this command:
> 
> cat example.es | ./prepare.sh | ./joshua > output.en
> 
> It reported the following error:
> 
> Exception in thread "main" java.lang.RuntimeException: Unable to
> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
> -lm_file model/lm.kenlm'!
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
> ... 3 more
> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
> java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:107)
> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
> at org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.initializeLM(StateMinimizingLanguageModel.java:63)
> at org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
> at org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
> ... 8 more
> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
> at java.lang.System.loadLibrary(System.java:1122)
> at org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:103)
> 
> 
> I think I have installed KenLM properly. If I hadn't installed it, I
> wouldn't have been able to follow the tutorial and build the language pack,
> would I? Did I miss something here?
> 
> -- 
> _Long HĐi_

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-25 Thread Matt Post

Is the file model/lm.kenlm in place?


> On May 24, 2017, at 10:15 PM, Hoàng Đình Long  wrote:
> 
> Hello,
> 
> In the $JOSHUA home directory (which I built from source code cloned from
> Github), I found a lib directory which contains only 1 file
> named libken.so. Then I copy the lib folder into the language pack release
> folder (because the language pack folder doesn't have this sub folder).
> 
> The error remains the same when I try to use the language pack:
> 
> Exception in thread "main" java.lang.RuntimeException: Unable to
> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
> -lm_file model/lm.kenlm'!
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
> ... 3 more
> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
> java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:107)
> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
> at org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.initializeLM(StateMinimizingLanguageModel.java:63)
> at org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
> at org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(StateMinimizingLanguageModel.java:47)
> ... 8 more
> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
> at java.lang.System.loadLibrary(System.java:1122)
> at org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.java:103)
> ... 12 more
> 
> 
> On Wed, May 24, 2017 at 9:06 PM, Matt Post  wrote:
> 
>> Hi,
>> 
>> If you are using a language pack, you must have KenLM installed into the
>> language pack directory under lib/ since that is where it looks. Can you
>> copy libken.so to that directory and see if it works?
>> 
>> 
>> 
>> 
>>> On May 23, 2017, at 10:54 PM, Hoàng Đình Long 
>> wrote:
>>> 
>>> Hello,
>>> 
>>> I have followed the Fisher call home tutorial and I built 3 models based
>> on
>>> that document.
>>> 
>>> Then I use this script to build a language pack based on the model
>> number 3:
>>> 
>>> $JOSHUA/scripts/language-pack/build_lp.sh es-en
>> 3/tune/joshua.config.final
>>> 4g
>>> 
>>> The scripts ended well and I created a folder named releases.
>>> 
>>> In "releases" folder, there is a folder named
>>> "apache-joshua-es-en-2017-05-24". I cd into that folder.
>>> 
>>> I created a file named "example.es" with this sentence inside "común y
>>> corriente" and ran this command:
>>> 
>>> cat example.es | ./prepare.sh | ./joshua > output.en
>>> 
>>> It reported the following error:
>>> 
>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>> -lm_file model/lm.kenlm'!
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:642)
>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>> DelegatingConstructorAccessorImpl.java:45)
>>>

Re: Podling Report Reminder - June 2017

2017-05-25 Thread Matt Post

Thank you, Lewis!


> On May 25, 2017, at 3:25 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> I've populated this report. If any mentors are able to look it over, it
> would be appreciated.
> Lewis
> 
> On Tue, May 23, 2017 at 3:00 AM, <
> dev-digest-h...@joshua.incubator.apache.org> wrote:
> 
>> 
>> dev Digest 23 May 2017 10:00:50 - Issue 206
>> 
>> Topics (messages 2179 through 2179)
>> 
>> Podling Report Reminder - June 2017
>>2179 by: johndament.apache.org
>> 
>> 
>> 
>> 
>> -- Forwarded message --
>> From: johndam...@apache.org
>> To: dev@joshua.incubator.apache.org
>> Cc:
>> Bcc:
>> Date: Tue, 23 May 2017 10:00:46 -
>> Subject: Podling Report Reminder - June 2017
>> Dear podling,
>> 
>> This email was sent by an automated system on behalf of the Apache
>> Incubator PMC. It is an initial reminder to give you plenty of time to
>> prepare your quarterly board report.
>> 
>> The board meeting is scheduled for Wed, 21 June 2017, 10:30 am PDT.
>> The report for your podling will form a part of the Incubator PMC
>> report. The Incubator PMC requires your report to be submitted 2 weeks
>> before the board meeting, to allow sufficient time for review and
>> submission (Wed, June 07).
>> 
>> Please submit your report with sufficient time to allow the Incubator
>> PMC, and subsequently board members to review and digest. Again, the
>> very latest you should submit your report is 2 weeks prior to the board
>> meeting.
>> 
>> Thanks,
>> 
>> The Apache Incubator PMC
>> 
>> Submitting your Report
>> 
>> --
>> 
>> Your report should contain the following:
>> 
>> *   Your project name
>> *   A brief description of your project, which assumes no knowledge of
>>the project or necessarily of its field
>> *   A list of the three most important issues to address in the move
>>towards graduation.
>> *   Any issues that the Incubator PMC or ASF Board might wish/need to be
>>aware of
>> *   How has the community developed since the last report
>> *   How has the project developed since the last report.
>> *   How does the podling rate their own maturity.
>> 
>> This should be appended to the Incubator Wiki page at:
>> 
>> https://wiki.apache.org/incubator/June2017
>> 
>> Note: This is manually populated. You may need to wait a little before
>> this page is created from a template.
>> 
>> Mentors
>> ---
>> 
>> Mentors should review reports for their project(s) and sign them off on
>> the Incubator wiki page. Signing off reports shows that you are
>> following the project - projects that are not signed may raise alarms
>> for the Incubator PMC.
>> 
>> Incubator PMC
>> 
>> 
>> 
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-30 Thread Matt Post

It looks like it can't find libken.so, so that is probably not in your 
$LD_LIBRARY_PATH. This is supposed to be set by the "joshua" script, so 
something must be wrong. I'm not sure what but it shouldn't be too difficult to 
track down at the terminal.
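
A minimal sketch of that terminal check, written in Python for convenience. It assumes a Linux or macOS setup where the JVM's java.library.path is seeded from $LD_LIBRARY_PATH; the helper name find_native_library is hypothetical and not part of Joshua.

```python
import os
from pathlib import Path


def find_native_library(lib_name="ken", search_path=None):
    """Return the directories in search_path that contain lib<lib_name>.so or .dylib.

    search_path is a path-separator-delimited string; by default it is read
    from the LD_LIBRARY_PATH environment variable (an assumption about the setup).
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    # File names that System.loadLibrary("ken") would look for on Linux / macOS.
    candidates = (f"lib{lib_name}.so", f"lib{lib_name}.dylib")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        directory = Path(entry)
        if any((directory / name).is_file() for name in candidates):
            hits.append(str(directory))
    return hits
```

An empty list for "ken" would be consistent with the "no ken in java.library.path" error above.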

matt


> On May 25, 2017, at 10:43 PM, Hoàng Đình Long  wrote:
> 
> Yes, in the language pack/model folder, there are lm.kenlm, grammar.glue
> files and a sub folder named grammar.packed
> 
> On Thu, May 25, 2017 at 5:56 PM, Matt Post  wrote:
> 
>> Is the file model/lm.kenlm in place?
>> 
>> 
>>> On May 24, 2017, at 10:15 PM, Hoàng Đình Long 
>> wrote:
>>> 
>>> Hello,
>>> 
>>> In the $JOSHUA home directory (which I built from source code cloned from
>>> Github), I found a lib directory which contains only 1 file
>>> named libken.so. Then I copy the lib folder into the language pack
>> release
>>> folder (because the language pack folder doesn't have this sub folder).
>>> 
>>> The error remains the same when I try to use the language pack:
>>> 
>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>> -lm_file model/lm.kenlm'!
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:642)
>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>> DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>> at
>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>> Decoder.java:638)
>>> ... 3 more
>>> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
>>> java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>> at
>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>> java:107)
>>> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
>>> at
>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.
>> initializeLM(StateMinimizingLanguageModel.java:63)
>>> at
>>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(
>> LanguageModelFF.java:132)
>>> at
>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(
>> StateMinimizingLanguageModel.java:47)
>>> ... 8 more
>>> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>>> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>>> at java.lang.System.loadLibrary(System.java:1122)
>>> at
>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>> java:103)
>>> ... 12 more
>>> 
>>> 
>>> On Wed, May 24, 2017 at 9:06 PM, Matt Post  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> If you are using a language pack, you must have KenLM installed into the
>>>> language pack directory under lib/ since that is where it looks. Can you
>>>> copy libken.so to that directory and see if it works?
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On May 23, 2017, at 10:54 PM, Hoàng Đình Long 
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have followed the Fisher call home tutorial and I built 3 models
>> based
>>>> on
>>>>> that document.
>>>>> 
>>>>> Then I use this script to build a language pack based on the model
>>>> number 3:
>>>>> 
>>>>> $JOSHUA/scripts/language-pack/build_lp.sh es-en
>>>> 3/tune/joshua.config.final
>>>>> 4g
>>>>> 
>>>>> The scripts ended well and I created a folder named releases.
>>>>> 
>>>>> In "releases" folder, there is a folder named
>>>>> "apache-joshua-

Re: java.lang.UnsatisfiedLinkError: no ken in java.library.path

2017-05-31 Thread Matt Post

Yes, LD_LIBRARY_PATH should also include your system library paths. It looks 
like things are good for you!
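
For reference, a small sketch of appending the language pack's lib/ directory to an existing LD_LIBRARY_PATH value without dropping the entries already there. The function name with_library_dir is hypothetical; it only builds the string and does not touch the environment.

```python
import os


def with_library_dir(current, new_dir):
    """Return `current` with `new_dir` appended, keeping existing entries.

    `current` is a path-separator-delimited string such as the value of
    LD_LIBRARY_PATH; empty entries are dropped and duplicates are not added.
    """
    entries = [entry for entry in current.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)
```

The result could then be exported before running the "joshua" script, for example via os.environ["LD_LIBRARY_PATH"] in a wrapper.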

matt


> On May 31, 2017, at 12:32 AM, Hoàng Đình Long  wrote:
> 
> Hi Matt,
> 
> Thank you very much. That's exactly the reason.
> In the tutorial, there is only a step to setup boost. And I made an
> environment variable called LD_LIBRARY_PATH="/usr/include/boost". In
> /usr/include/boost, libken.so file doesn't exist. So I appended the path
> like this:
> 
> LD_LIBRARY_PATH="/usr/include/boost:/home/long/Working/joshua-tutorial/runs/releases/apache-joshua-es-en-2017-05-30/lib"
> 
> Now it works great.
> 
> Is it the way it is supposed to happen?
> 
> Anyway, thank you!
> I will review the overall picture and see what to do next.
> 
> On Tue, May 30, 2017 at 10:44 PM, Matt Post  wrote:
> 
>> It looks like it can't find libken.so, so that is probably not in your
>> $LD_LIBRARY_PATH. This is supposed to be set by the "joshua" script, so
>> something must be wrong. I'm not sure what but it shouldn't be too
>> difficult to track down at the terminal.
>> 
>> matt
>> 
>> 
>>> On May 25, 2017, at 10:43 PM, Hoàng Đình Long 
>> wrote:
>>> 
>>> Yes, in the language pack/model folder, there are lm.kenlm, grammar.glue
>>> files and a sub folder named grammar.packed
>>> 
>>> On Thu, May 25, 2017 at 5:56 PM, Matt Post  wrote:
>>> 
>>>> Is the file model/lm.kenlm in place?
>>>> 
>>>> 
>>>>> On May 24, 2017, at 10:15 PM, Hoàng Đình Long 
>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> In the $JOSHUA home directory (which I built from source code cloned
>> from
>>>>> Github), I found a lib directory which contains only 1 file
>>>>> named libken.so. Then I copy the lib folder into the language pack
>>>> release
>>>>> folder (because the language pack folder doesn't have this sub folder).
>>>>> 
>>>>> The error remains the same when I try to use the language pack:
>>>>> 
>>>>> Exception in thread "main" java.lang.RuntimeException: Unable to
>>>>> instantiate feature function 'StateMinimizingLanguageModel -lm_order 3
>>>>> -lm_file model/lm.kenlm'!
>>>>> at
>>>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>>>> Decoder.java:642)
>>>>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>>>>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>>>>> at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>>>> at
>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(
>>>> NativeConstructorAccessorImpl.java:62)
>>>>> at
>>>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>>>> DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>>> at
>>>>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(
>>>> Decoder.java:638)
>>>>> ... 3 more
>>>>> Caused by: org.apache.joshua.decoder.ff.lm.KenLM$KenLMLoadException:
>>>>> java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.KenLM.initializeSystemLibrary(KenLM.
>>>> java:107)
>>>>> at org.apache.joshua.decoder.ff.lm.KenLM.<init>(KenLM.java:58)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.
>>>> initializeLM(StateMinimizingLanguageModel.java:63)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(
>>>> LanguageModelFF.java:132)
>>>>> at
>>>>> org.apache.joshua.decoder.ff.lm.StateMinimizingLanguageModel.<init>(
>>>> StateMinimizingLanguageModel.java:47)
>>>>> ... 8 more
>>>>> Caused by: java.lang.UnsatisfiedLinkError: no ken in java.library.path
>>>>> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
>>>>> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>>>>> at java.lang.System.loadLibrary(System.java:1122)
>>

Re: Thumbs up from general@ to release Joshua 6.1 (Incubating)

2017-06-12 Thread Matt Post

Thanks for all your work on this, folks!

matt


> On Jun 10, 2017, at 4:00 PM, lewis john mcgibbney  wrote:
> 
> Hi Folks,
> Both Justin and John have provided us with +1's for releasing... which is
> quite frankly great.
> We've been undertaking a good bit of due diligence for this release... it
> has admittedly taken a hellish amount of time to push through. On the
> bright side, we have now nearly made the first official Apache release
> which is a huge milestone for the project and for getting the word out that
> we are alive and kicking in the Incubator.
> Huge thank you to Tommaso who has been acting as release manager and
> community liaison so to speak. It makes a huge difference and is greatly
> appreciated.
> Once Tommaso's RESULT thread hits general@ we can progress with the
> remaining release management items.
> Hopefully there will be a release announcement pretty soon.
> In the meantime, can everyone begin thinking about appropriate avenues and
> communication forums for us to publicize the release announcement? If you
> could, please append them to the release management document on the Joshua
> wiki.
> Best
> Lewis
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney

Re: Merging 7.X into master??? + cleaning up branches

2017-06-28 Thread Matt Post

This is definitely a good idea. Many of these branches are dead and are 
unlikely to contain much that can be merged in, and are therefore probably best 
deleted. The plan for 7 was a big simplification of much of the guts, but with 
the transition to neural approaches in the research community, this is unlikely 
to be done unless it finds a new champion.




> On Jun 28, 2017, at 3:43 AM, Tommaso Teofili  
> wrote:
> 
> +1 for both cleaning up branches *and* merging 7 branch into master.
> 
> Regarding branches and Git let me read through the links and I'll share my
> opinion.
> 
> Regards,
> Tommaso
> 
> On Wed, Jun 28, 2017 at 06:41, Chris Mattmann wrote:
> 
>> Hey Team,
>> 
>> I recommend that Joshua consider adopting the Tika and/or Nutch
>> contribution
>> policy RE: branches and Git:
>> 
>> https://github.com/apache/tika/#contributing-via-github
>> https://github.com/apache/nutch/#contributing
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> On 6/27/17, 9:36 PM, "lewis john mcgibbney"  wrote:
>> 
>>Hi Folks,
>>Two things...
>> 
>>   1. Currently the branches for Joshua are a bit of a mess... it
>> would be
>>   better if they were named after JIRA issues such that the mappings
>> back to
>>   some concrete development were explicit. Does anyone want to clean
>> these up?
>>   2. Now that 6.1-incubating is released and live, Is there any
>> desire to
>>   merge 7.X branch into master and continue development there? I was
>> not
>>   involved with the 7.X development but it looked like a significant
>> step
>>   forward... it would be a shame for that work to stagnate.
>> 
>>Thanks,
>> 
>>lewis
>> 
>>--
>>http://home.apache.org/~lewismc/
>>@hectorMcSpector
>>http://www.linkedin.com/in/lmcgibbney
>> 
>> 
>> 
>>

Re: [ANNOUNCE] - Apache Joshua 6.1 incubating release

2017-06-28 Thread Matt Post

Yes, tighter integration with other Apache projects sounds like a good idea to 
me. Rewriting Thrax to use a more modern tool would also be hugely helpful to 
Joshua in the long term. It is getting harder and harder to find and maintain 
(much less justify) Hadoop clusters that are separate from other research ones.


> On Jun 28, 2017, at 3:42 AM, Tommaso Teofili  
> wrote:
> 
> +1
> 
> Tommaso
> 
> On Wed, Jun 28, 2017 at 07:46, lewis john mcgibbney <
> lewi...@apache.org> wrote:
> 
>> Hi Suneel,
>> I think it's worth opening a JIRA issue and we can possibly mark it for
>> 7.X?
>> lewis
>> 
>> On Tue, Jun 27, 2017 at 9:36 PM, <
>> dev-digest-h...@joshua.incubator.apache.org> wrote:
>> 
>>> 
>>> From: Suneel Marthi 
>>> To: dev@joshua.incubator.apache.org
>>> Cc:
>>> Bcc:
>>> Date: Fri, 23 Jun 2017 01:59:28 -0400
>>> Subject: Re: [ANNOUNCE] - Apache Joshua 6.1 incubating release
>>> Congrats on the release.
>>> 
>>> I have been a silent lurker on this channel since I first heard of Joshua
>>> last September at Amazon, Berlin.
>>> 
>>> Tommaso and myself recently did a talk at Berlin Buzzwords 2017 -
>>> 'Embracing Diversity - searching over multiple languages' [1]
>>> using Apache Joshua for Machine Translation, and Apache OpenNLP for
>>> Language detection.
>>> 
>>> I have been wondering how much of the present VLPS can be replaced by
>>> OpenNLP with Flink/Beam pipelines.
>>> I did a talk last week at Hadoop Summit, San Jose about 'Large Scale Text
>>> processing with Apache OpenNLP and Apache Flink' [2].
>>> 
>>> Also, Thrax, which is presently MapReduce-based, can definitely be
>>> ported over to modern streaming distributed frameworks like Flink/Kafka
>>> Streams/Beam.
>>> 
>>> 
>>> [1]
>>> https://www.youtube.com/watch?v=ZrWxySF-9KY&index=20&t=2s&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt
>>> [2] https://www.slideshare.net/SuneelMarthi/large-scale-text-processing
>>> 
>>> 
>>> 
>>

Re: Merging 7.X into master??? + cleaning up branches

2017-07-04 Thread Matt Post

Whether to integrate neural stuff in Joshua is an interesting question. The 
research direction has been to develop fully neural systems that leave behind 
the phrase-based and hierarchical framework entirely. Doing this in Joshua 
would basically require a ground-up rewrite and is probably not worth the time. 
Moses has neural feature functions; for example, you can use a Nematus model as 
a rescore feature (though it breaks dynamic programming). This might be 
reasonable to implement as a project but it would be quite a bit of work and 
introduce GPU requirements that would raise the question of why you'd use 
Joshua if you had a GPU available. I think that it would be better to focus on 
low-resource scenarios and user-focused applications, instead.


> On Jun 29, 2017, at 12:35 PM, Tommaso Teofili  
> wrote:
> 
> Hi Matt,
> 
> On Thu, Jun 29, 2017 at 05:21, Matt Post wrote:
> 
>> This is definitely a good idea. Many of these branches are dead and are
>> unlikely to contain much that can be merged in, and are therefore probably
>> best deleted. The plan for 7 was a big simplification of much of the guts,
>> but with the transition to neural approaches in the research community,
>> this is unlikely to be done unless it finds a new champion.
>> 
> 
> Do you think we should look at NMT in the Joshua project?
> Or is it more that you are more interested in NMT at the moment?
> Or both? :)
> 
> Other than that, let's merge 7 into master and drop the remaining stuff,
> except for the PR for JOSHUA-290 [1], which should be merged into the 7
> branch.
> 
> Regards,
> Tommaso
> 
> [1] : https://github.com/apache/incubator-joshua/pull/71
> 
> 
>> 
>> 
>> 
>> 
>>> On Jun 28, 2017, at 3:43 AM, Tommaso Teofili 
>> wrote:
>>> 
>>> +1 for both cleaning up branches *and* merging 7 branch into master.
>>> 
>>> Regarding branches and Git let me read through the links and I'll share
>> my
>>> opinion.
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> On Wed, Jun 28, 2017 at 06:41, Chris Mattmann <
>> mattm...@apache.org>
>>> wrote:
>>> 
>>>> Hey Team,
>>>> 
>>>> I recommend that Joshua consider adopting the Tika and/or Nutch
>>>> contribution
>>>> policy RE: branches and Git:
>>>> 
>>>> https://github.com/apache/tika/#contributing-via-github
>>>> https://github.com/apache/nutch/#contributing
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> 
>>>> 
>>>> On 6/27/17, 9:36 PM, "lewis john mcgibbney"  wrote:
>>>> 
>>>>   Hi Folks,
>>>>   Two things...
>>>> 
>>>>  1. Currently the branches for Joshua are a bit of a mess... it
>>>> would be
>>>>  better if they were named after JIRA issues such that the mappings
>>>> back to
>>>>  some concrete development were explicit. Does anyone want to clean
>>>> these up?
>>>>  2. Now that 6.1-incubating is released and live, Is there any
>>>> desire to
>>>>  merge 7.X branch into master and continue development there? I was
>>>> not
>>>>  involved with the 7.X development but it looked like a significant
>>>> step
>>>>  forward... it would be a shame for that work to stagnate.
>>>> 
>>>>   Thanks,
>>>> 
>>>>   lewis
>>>> 
>>>>   --
>>>>   http://home.apache.org/~lewismc/
>>>>   @hectorMcSpector
>>>>   http://www.linkedin.com/in/lmcgibbney
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>>

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-23 Thread Matt Post

What is the file size of lm.kenlm and lm.gz? That will tell you whether they 
built correctly. 

Check that the path to the LM in the Joshua config is valid. The error that 
gets thrown might be misleading. 
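
A rough sketch of both checks in Python. It assumes the LM path appears in the config after a "-lm_file" token, as in the feature-function line quoted in the error message; the function names and the base_dir parameter are hypothetical and not part of Joshua.

```python
from pathlib import Path


def lm_files_from_config(config_text):
    """Collect every path that follows a -lm_file token in the config text."""
    paths = []
    for line in config_text.splitlines():
        tokens = line.split()
        for i, token in enumerate(tokens[:-1]):
            if token == "-lm_file":
                paths.append(tokens[i + 1])
    return paths


def describe_file(path, base_dir="."):
    """Report existence, symlink status, and size of path relative to base_dir."""
    target = Path(base_dir) / path  # an absolute `path` overrides base_dir
    exists = target.exists()        # False for a dangling symlink
    return {
        "path": str(target),
        "exists": exists,
        "is_symlink": target.is_symlink(),
        "size_bytes": target.stat().st_size if exists else None,
    }
```

A missing file, a dangling symlink, or a size of zero bytes would all point at the LM build rather than at the native library.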

matt (from my phone)

> On Aug 23, 2017, at 3:27 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138334#comment-16138334
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> PS. There is a  "runs/1/tune/model/lm.kenlm". It is a soft-link to 
> .../joshua-tutorial/runs/1/lm.kenlm . Perhaps this is not what it is supposed 
> to be?
> 
> 
>> UnsatisfiedLinkError: no ken in java.library.path
>> -
>> 
>>Key: JOSHUA-277
>>URL: https://issues.apache.org/jira/browse/JOSHUA-277
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: Thamme Gowda
>> 
>> I followed this guide 
>> http://joshua.incubator.apache.org/6.0/quick-start.html to test the latest 
>> build.
>> Assuming a few things are broken due to the newer Maven build system, I 
>> tried to fix pipeline.pl to get the quick start guide working.
>> Which files from kenlm build should I add to JNI path? (I am unable to 
>> locate the library file in the kenlm build output)
>> Here is the full log:
>> {code}
>> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
>> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en  
>>--tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
>> [train-copy-and-filter] cached, skipping...
>> [train-vocab-bn] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-vocab-bn] cached, skipping...
>> [tune-vocab-en.0] cached, skipping...
>> [tune-vocab-en.1] cached, skipping...
>> [tune-vocab-en.2] cached, skipping...
>> [tune-vocab-en.3] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-vocab-bn] cached, skipping...
>> [test-vocab-en.0] cached, skipping...
>> [test-vocab-en.1] cached, skipping...
>> [test-vocab-en.2] cached, skipping...
>> [test-vocab-en.3] cached, skipping...
>> [source-numlines] cached, skipping...
>> [source-numlines] retrieved cached result =>20788
>> [berkeley-aligner-chunk-0] cached, skipping...
>> [aligner-combine] cached, skipping...
>> [pack-grammar] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> Error: Could not find or load main class 
>> joshua.util.encoding.EncoderConfiguration
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  [CHANGED]
>>  dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final
>>  [NOT FOUND]
>>  
>> cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en
>>  --tunedir 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune 
>> --tuner mert --decoder 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command
>>  --decoder-config 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  --decoder-output-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest
>>  --decoder-log-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.log
>>  --iterations 10 --metric 'BLEU 4 closest'
>>  JOB FAILED (return code 1)
>> Exception in thread "main" java.lang.RuntimeException: Unable to instantiate 
>> feature function 'StateMinimizingLanguageModel -lm_order 5 -lm_file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/model/lm.kenlm'!
>>at 
>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:761)
>>at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:514)
>>at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:122)
>>at org.apache.joshua.decoder.JoshuaDecoder.main(JoshuaDecoder.java:69)
>> Caused by: java.lang.reflect.InvocationTargetException
>>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>at 
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-23 Thread Matt Post

What's the file size of grammar.gz? Looks like it didn't get extracted.
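
A quick way to check this, sketched in Python. The assumption here is that the NoSuchElementException from LineReader.next means the packer found no first line in grammar.gz; the helper name inspect_gzip is hypothetical.

```python
import gzip
import os


def inspect_gzip(path):
    """Return the compressed size of a .gz file and whether it holds any text."""
    compressed_bytes = os.path.getsize(path)
    # Read only the first line so a large grammar is not loaded into memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline()
    return {
        "compressed_bytes": compressed_bytes,
        "has_content": first_line != "",
        "first_line": first_line.rstrip("\n"),
    }
```

If has_content comes back False, the grammar extraction step produced an empty file and that step is the one to rerun.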


> On Aug 23, 2017, at 8:14 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138814#comment-16138814
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Getting closer:
> 
> I deleted runs/1 and re-ran:
> 
> $JOSHUA/bin/pipeline.pl \
> --rundir 1 \
> --readme "Baseline Hiero run" \
> --source es \
> --target en \
> --type hiero \
> --corpus $FISHER/corpus/asr/fisher_train \
> --tune $FISHER/corpus/asr/fisher_dev \
> --test $FISHER/corpus/asr/fisher_dev2 \
> --maxlen 11 \
> --maxlen-tune 11 \
> --maxlen-test 11 \
> --tuner-iterations 1 \
> --lm-order 3
> 
> The example got farther but ended with the following error. 
> ...
> * Packing grammar at "grammar.gz" to 
> "/data/joshua-tutorial/runs/1/tune/model/grammar.gz.packed"
> * Running the grammar-packer.pl script with the command: 
> /data/joshua/scripts/support/grammar-packer.pl -a -T /tmp -g grammar.gz -o 
> /data/joshua-tutorial/runs/1/tune/model/grammar.gz.packed
> Exception in thread "main" java.util.NoSuchElementException
>at org.apache.joshua.util.io.LineReader.next(LineReader.java:276)
>at 
> org.apache.joshua.tools.GrammarPacker.getGrammarReader(GrammarPacker.java:239)
>at org.apache.joshua.tools.GrammarPacker.pack(GrammarPacker.java:184)
>at 
> org.apache.joshua.tools.GrammarPackerCli.run(GrammarPackerCli.java:120)
>at 
> org.apache.joshua.tools.GrammarPackerCli.main(GrammarPackerCli.java:137)
> * FATAL: Couldn't pack the grammar.
> 
> 
>> UnsatisfiedLinkError: no ken in java.library.path
>> -
>> 
>>Key: JOSHUA-277
>>URL: https://issues.apache.org/jira/browse/JOSHUA-277
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: Thamme Gowda
>> 
>> I followed this guide 
>> http://joshua.incubator.apache.org/6.0/quick-start.html to test the latest 
>> build.
>> Assuming a few things are broken due to the newer Maven build system, I 
>> tried to fix pipeline.pl to get the quick start guide working.
>> Which files from kenlm build should I add to JNI path? (I am unable to 
>> locate the library file in the kenlm build output)
>> Here is the full log:
>> {code}
>> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
>> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en  
>>--tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
>> [train-copy-and-filter] cached, skipping...
>> [train-vocab-bn] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-vocab-bn] cached, skipping...
>> [tune-vocab-en.0] cached, skipping...
>> [tune-vocab-en.1] cached, skipping...
>> [tune-vocab-en.2] cached, skipping...
>> [tune-vocab-en.3] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-vocab-bn] cached, skipping...
>> [test-vocab-en.0] cached, skipping...
>> [test-vocab-en.1] cached, skipping...
>> [test-vocab-en.2] cached, skipping...
>> [test-vocab-en.3] cached, skipping...
>> [source-numlines] cached, skipping...
>> [source-numlines] retrieved cached result =>20788
>> [berkeley-aligner-chunk-0] cached, skipping...
>> [aligner-combine] cached, skipping...
>> [pack-grammar] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> Error: Could not find or load main class 
>> joshua.util.encoding.EncoderConfiguration
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  [CHANGED]
>>  dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final
>>  [NOT FOUND]
>>  
>> cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en
>>  --tunedir 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune 
>> --tuner mert --decoder 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command
>>  --decoder-config 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  --decoder-output-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest
>>  --decoder-log-file 
>> /Users/thammegr/work/projects/a

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-25 Thread Matt Post

You said you're on OS X? This should work, but you might try building in a 
Docker container. There's a Dockerfile in distribution/docker/kenlm


> On Aug 25, 2017, at 1:24 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141502#comment-16141502
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Thanks. I appreciate you getting back to me on this. I only have JDK 8 on 
> this system. I did the above steps with the same issue. Perhaps this is a 
> problem.
> 
> Joshua is here:
> [ec2-user@ip-172-31-4-253 runs]$ echo $JOSHUA
> /data/joshua
> 
> I installed the joshua tutorial files in:
> /data/joshua-tutorial so I am running tutorial from 
> /data/joshua-tutorial/runs
> 
> when I run:
> $JOSHUA/bin/pipeline.pl \
>  --rundir 1 \
>  --readme "Baseline Hiero run" \
>  --source es \
>  --target en \
>  --type hiero \
>  --corpus $FISHER/corpus/asr/fisher_train \
>  --tune $FISHER/corpus/asr/fisher_dev \
>  --test $FISHER/corpus/asr/fisher_dev2 \
>  --maxlen 11 \
>  --maxlen-tune 11 \
>  --maxlen-test 11 \
>  --tuner-iterations 1 \
>  --lm-order 3
> 
> I still get the error I described
> 
> 
> 
>> UnsatisfiedLinkError: no ken in java.library.path
>> -
>> 
>>Key: JOSHUA-277
>>URL: https://issues.apache.org/jira/browse/JOSHUA-277
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: Thamme Gowda
>> 
>> I followed this guide 
>> http://joshua.incubator.apache.org/6.0/quick-start.html to test the latest 
>> build.
>> Assuming a few things are broken due to the newer Maven build system, I 
>> tried to fix pipeline.pl to get the quick start guide working.
>> Which files from kenlm build should I add to JNI path? (I am unable to 
>> locate the library file in the kenlm build output)
>> Here is the full log:
>> {code}
>> $JOSHUA/bin/pipeline.pl --source bn --target en --type hiero 
>> --no-prepare --aligner berkeley --corpus input/bn-en/tok/training.bn-en  
>>--tune input/bn-en/tok/dev.bn-en --test input/bn-en/tok/devtest.bn-en
>> [train-copy-and-filter] cached, skipping...
>> [train-vocab-bn] cached, skipping...
>> [train-vocab-en] cached, skipping...
>> [tune-copy-and-filter] cached, skipping...
>> [tune-vocab-bn] cached, skipping...
>> [tune-vocab-en.0] cached, skipping...
>> [tune-vocab-en.1] cached, skipping...
>> [tune-vocab-en.2] cached, skipping...
>> [tune-vocab-en.3] cached, skipping...
>> [test-copy-and-filter] cached, skipping...
>> [test-vocab-bn] cached, skipping...
>> [test-vocab-en.0] cached, skipping...
>> [test-vocab-en.1] cached, skipping...
>> [test-vocab-en.2] cached, skipping...
>> [test-vocab-en.3] cached, skipping...
>> [source-numlines] cached, skipping...
>> [source-numlines] retrieved cached result =>20788
>> [berkeley-aligner-chunk-0] cached, skipping...
>> [aligner-combine] cached, skipping...
>> [pack-grammar] cached, skipping...
>> [lm-sort-uniq] cached, skipping...
>> [kenlm] cached, skipping...
>> [compile-kenlm] cached, skipping...
>> [glue-tune] cached, skipping...
>> Error: Could not find or load main class 
>> joshua.util.encoding.EncoderConfiguration
>> [tune-bundle] cached, skipping...
>> [mert-1] rebuilding...
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  [CHANGED]
>>  dep=tune/model/grammar.packed/slice_0.source [CHANGED]
>>  
>> dep=/Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config.final
>>  [NOT FOUND]
>>  
>> cmd=/Users/thammegr/work/projects/apache/incubator-joshua/scripts/training/run_tuner.py
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.bn
>>  
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/data/tune/corpus.en
>>  --tunedir 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune 
>> --tuner mert --decoder 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/decoder_command
>>  --decoder-config 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.config
>>  --decoder-output-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/output.nbest
>>  --decoder-log-file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/joshua.log
>>  --iterations 10 --metric 'BLEU 4 closest'
>>  JOB FAILED (return code 1)
>> Exception in thread "main" java.lang.RuntimeException: Unable to instantiate 
>> feature function 'StateMinimizingLanguageModel -lm_order 5 -lm_file 
>> /Users/thammegr/work/projects/apache/incubator-joshua/data/bn-en/tune/model/lm.kenlm'!
>>  at 
>> org.apache.joshua.decoder.Decoder.ini

Re: [jira] [Commented] (JOSHUA-277) UnsatisfiedLinkError: no ken in java.library.path

2017-08-28 Thread Matt Post

Hi,

There is no Joshua manual, unfortunately, just the Confluence pages.

I looked at your run and it seems that Thrax is failing. I don't know what your 
Hadoop configuration is like, but that is likely the problem (see thrax.log in 
these directories). If you set up Hadoop incorrectly, or don't have enough 
space, or set it up on a network share instead of local disks, all of these 
things can cause problems.

matt


> On Aug 28, 2017, at 2:46 PM, Jeffrey Smith (JIRA)  wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143731#comment-16143731
>  ] 
> 
> Jeffrey Smith commented on JOSHUA-277:
> --
> 
> Thank you for your help. Here are the two directories in question. Also, do
> you know of a manual, similar to the moses manual, for joshua? I couldn't
> seem to find one.
> Moses does work on this same computer I am running this on.
> 
> joshua-tutorial.tar.gz
> 
> 
> joshua.tar.gz
> 
> 
> 
> 
> 
> 
> 
> -- 
> *Jeffrey Smith, PhD*
> Chief Systems Engineer and E2 Lead
> Multi Agency Collaboration Environment (MACE)
> Sierra Nevada Corporation
> 3076 Centreville Road, Herndon, VA 20171
> 703-464-6434 (Office)
> 603-566-0124 (Cell)
> jeff.sm...@macefusion.com
> 
> 

Re: About how to use Jousha translator

2017-09-12 Thread Matt Post

Hi,

The mention of Google referred only to the public API. That is, Joshua's server 
mode will answer to RESTful style queries. This is implemented 

There are not any new language packs forthcoming in the near future that I am 
aware of. 

matt (from my phone)

> On Sep 12, 2017, at 2:44 PM, lewis john mcgibbney wrote:
> 
> If I were you I would simply contact dev@joshua with that query then.
> Someone on the list should hopefully see the comment and respond.
> It looks like an update to this documentation is possibly required as I am
> not sure if anyone is actively working on this... I may be wrong however!
> 
>> On Tue, Sep 12, 2017 at 3:07 AM Tehetena Alemu  wrote:
>> 
>> https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> "*Version 3 Language Packs Coming Soon*
>> (March 2017) Version 3 language packs with Kenlm (via Docker) and more
>> complete Google Translate API support
>>  are coming soon.
>> If you have questions, comments, concerns, or wish to help, please post
>> questions to the Joshua mailing list: d...@joshua.apache.org."
>> 
>> Tehetena Alemu
>> 
>> On Tue, Sep 12, 2017 at 1:45 AM, lewis john mcgibbney 
>> wrote:
>> 
>>> Where did you get this information from?
>>> 
>>> On Mon, Sep 11, 2017 at 12:28 PM, Tehetena Alemu 
>>> wrote:
>>> 
>>>> Thank you very much Lewis, it is very kind of you. Your help means a
>>>> lot. By the way, 2 weeks is the time I took on trying different options,
>>>> but not for getting a response.
>>>>
>>>>
>>>> On another note, I just found out Joshua pack 3 will be released soon,
>>>> with Google translation. When will it be released? It will be a very good
>>>> contribution to my paper.
>>>>
>>>> Best,
>>>>
>>>>
>>>> --
>>>> Tehetena Alemu
>>>>
>>>>
>>> 
>>> 
>>> --
>>> http://home.apache.org/~lewismc/
>>> @hectorMcSpector
>>> http://www.linkedin.com/in/lmcgibbney
>>> 
>> 
>> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney

Re: About how to use Jousha translator

2017-09-12 Thread Matt Post

This is the best I can do:

https://cwiki.apache.org/confluence/display/JOSHUA/RESTful+API


> On Sep 12, 2017, at 9:19 PM, Tehetena Alemu <tehet...@gmail.com> wrote:
> 
> Hi Matt,
> 
> Thanks for your response. Would you mind giving me a clue how I can use this 
> public API to translate from Amharic to English or another language?
> 
> On Tuesday, September 12, 2017, Matt Post <p...@cs.jhu.edu> wrote:
> Hi,
> 
> The mention of Google referred only to the public API. That is, Joshua's 
> server mode will answer to RESTful style queries. This is implemented
> 
> There are not any new language packs forthcoming in the near future that I am 
> aware of.
> 
> matt (from my phone)
> 
> > On Sep 12, 2017, at 2:44 PM, lewis john mcgibbney wrote:
> >
> > If I were you I would simply contact dev@joshua with that query then.
> > Someone on the list should hopefully see the comment and respond.
> > It looks like an update to this documentation is possibly required as I am
> > not sure if anyone is actively working on this... I may be wrong however!
> >
> >> On Tue, Sep 12, 2017 at 3:07 AM Tehetena Alemu wrote:
> >>
> >> https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> >>
> >> "*Version 3 Language Packs Coming Soon*
> >> (March 2017) Version 3 language packs with Kenlm (via Docker) and more
> >> complete Google Translate API support
> >> <https://cloud.google.com/translate/docs/reference/rest> are coming soon.
> >> If you have questions, comments, concerns, or wish to help, please post
> >> questions to the Joshua mailing list: d...@joshua.apache.org."
> >>
> >> Tehetena Alemu
> >>
> >> On Tue, Sep 12, 2017 at 1:45 AM, lewis john mcgibbney
> >> wrote:
> >>
> >>> Where did you get this information from?
> >>>
> >>> On Mon, Sep 11, 2017 at 12:28 PM, Tehetena Alemu
> >>> wrote:
> >>>
> >>>> Thank you very much Lewis, it is very kind of you. Your help means a
> >>>> lot. By the way, 2 weeks is the time I took on trying different options,
> >>>> but not for getting a response.
> >>>>
> >>>>
> >>>> On another note, I just found out Joshua pack 3 will be released soon,
> >>>> with Google translation. When will it be released? It will be a very
> >>>> good contribution to my paper.
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>> --
> >>>> Tehetena Alemu
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> http://home.apache.org/~lewismc/
> >>> @hectorMcSpector
> >>> http://www.linkedin.com/in/lmcgibbney
> >>>
> >>
> >> --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
> 
> 
> 
> -- 
> Tehetena Alemu
>

Re: [DISCUSS] Graduation (was Re: Path to TLP)

2017-09-25 Thread Matt Post

Hi everyone,

I think now is as good a time as any to mention my feelings about Joshua. 
You may have noticed that I haven't done much active development over the past 
year; you likely also know that the reason is that the research community has 
shifted entirely from work on statistical models to work on neural machine 
translation. On the research side, neural models now consistently outperform 
phrase-based systems on BLEU score on language pairs where there is enough data 
(roughly, around 15 million words of training), and work there has injected a 
lot of new life into a field that many had felt was starting to stagnate. From 
a production standpoint, neural systems are also a big win: the models do best 
with a GPU and take some time to train, but the architecture and pipeline are 
simpler, and the resulting models are constant-sized and on the order of a few 
gigabytes at most, instead of scaling with training data into the tens of 
gigabytes, as statistical systems do. Test-time inference can also be run 
fairly efficiently on CPUs where throughput demands are low enough. All 
commercial systems are now neural or are quickly moving in that direction, 
including relatively surprising places like Systran, which until recently was 
the world's best-known rule-based system. As GPUs become more 
ubiquitous and cheap, this situation is only going to get better, even for the 
end user. There is little doubt that neural MT has supplanted statistical 
approaches to machine translation, across both academic research and industry. 
And it is still in its relative infancy, with lots of interesting research 
problems and engineering issues to investigate and resolve.

It's somewhat sad for me because I've been working on or with Joshua for almost 
seven years, but I also find my feelings here interesting in contrast to a 
previous time I've felt tugged away from Joshua. As many of you know, Philipp 
Koehn joined JHU a few years ago, which brought some tension to JHU with 
respect to collaborating on research. There was pressure for me to switch. 
Moses had a much bigger development community and was much more feature rich, 
but despite this, I was reluctant to let go of Joshua, for a number of reasons. 
Java is nicer to work with than C++ (and not really that much slower); our code 
is better written, IMO; jar files are easier to distribute than C++ in compiled 
or source form; and, of course, I had much more familiarity with the codebase, 
not to mention something of a personal stake in Joshua. But with neural MT, I 
have none of these reservations. It's nice, for one, to have the Moses/Joshua 
tension resolved (sometimes, ignoring a problem does make it go away!), but for 
all the reasons I listed in the opening paragraph, NMT is now the clear way to 
go. And the bottom line for me is that I can no longer justify spending time on 
Joshua during my working hours, and with a young family and other interests 
that I want to pursue, I don't have time for it outside of work. I am happy to 
still linger on the project, but am unlikely to be much of an active 
participant unless I'm explicitly asked for something.

As I've written before here, I think there may still be some role for statistical 
systems, and therefore, for Joshua. In low-resource situations, StatMT may 
still be the right approach overall, or even simply the best way to quickly 
build up a working system. There is some promise I think in deploying models 
easily on older hardware that people have, and perhaps getting people to help 
contribute translations and translation memories that could be used to build 
and improve systems. There are surely more good ideas in this space in the vein 
of providing a good tool to users. 

It's been a great experience for me working with the Apache community on 
Joshua. I am grateful to Chris for convincing us to make Joshua an Apache 
incubator project, which put a lot of new life into the project. Lewis has been 
a lot of help throughout helping smooth over the transition; Tommaso has 
repeatedly helped with tasks large and small; and that is just three of you. 
It's too bad therefore that the timing just didn't work out, but neural MT 
ascended very rapidly. I know there are other members here who are also 
thinking along these lines. At the same time, I hope my departure from active 
development doesn’t mean the end of the project for those of you who wish to 
keep working on it. 

Sincerely,
matt


> On Sep 25, 2017, at 11:10 PM, Tommaso Teofili wrote:
> 
> I would also think we're ready for graduation.
> My only concern relates to how many of the current committers are willing
> to keep contributing to the project, basically if we have a PMC which is
> big enough for the graduation.
> 
> Regards,
> Tommaso
> 
> 
> On Sat, Sep 23, 2017 at 1:21 AM, Chris Mattmann wrote:
> 
>> Tom, glad you raised this issue, IMO, Joshua is ready for TLP.
>> 
>> We’ve:
>> 
>> 1. Added new PPMC/comm

Re: [DISCUSS] Graduation (was Re: Path to TLP)

2017-10-05 Thread Matt Post

Thanks Tommaso. Though, I should say, initial thanks goes to Zhifei Li. I just 
took it over.

I think I can stick around in the capacity Chris suggests. Thanks, all.

matt

> On Sep 27, 2017, at 9:20 AM, Tommaso Teofili  
> wrote:
> 
> +1 to Chris's proposal.
> 
> Let me also add my thanks to you Matt for making Joshua happen in the first
> place and for bringing it to the ASF and involving me and the rest of the
> team in such an interesting piece of sw and to machine translation in
> general. I do understand the need for you to move into the NMT stuff but at
> the same time I think Joshua is a very good resource (given also the so
> many language packs available) for people and / or projects that want to
> start with MT having reasonably good results so I can still see its value.
> 
> My 2 cents,
> Tommaso
> 
> 
> 
> On Tue, Sep 26, 2017 at 6:57 PM, Chris Mattmann wrote:
> 
>> Thanks Matt. My feeling is that we should make you the chair of the
>> project, if you are willing. It is really an administrative role: it takes
>> a willingness to submit a board report monthly, and then quarterly after
>> 3 months. This is to recognize your contributions and merit to the
>> project, which will never expire. Even if you are not actively developing,
>> I think you would make a great chair.
>> 
>> Apache Joshua works, has a release, and has a good community around it of
>> people like Lewis,
>> Tommaso, and others that I think it would withstand even your development
>> departure. It could
>> also make a good academic/learning tool and could be something we could
>> focus on getting new
>> GSOC projects to add in the NeuralMT stuff.
>> 
>> If you are OK with that I think we should proceed. Let me know and thanks.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> 
>> On 9/25/17, 11:24 PM, "Matt Post"  wrote:
>> 
>>Hi everyone,
>> 
>>    I think now is as good a time as any to mention my feelings about
>> Joshua. You may have noticed that I haven't done much active development
>> over the past year; you likely also know that the reason is that the
>> research community has shifted entirely from work on statistical models to
>> work on neural machine translation. On the research side, neural models now
>> consistently outperform phrase-based systems on BLEU score on language
>> pairs where there is enough data (roughly, around 15 million words of
>> training), and work there has injected a lot of new life into a field that
>> many had felt was starting to stagnate. From a production standpoint,
>> neural systems are also a big win: the models do best with a GPU and take
>> some time to train, but the architecture and pipeline are simpler, and the
>> resulting models are constant-sized and on the order of a few gigabytes at
>> most, instead of scaling with training data into the tens of gigabytes, as
>> statistical systems do. Test-time inference can also be run fairly
>> efficiently on CPUs where throughput demands are low enough. All commercial
>> systems are now neural or are quickly moving in that direction, including
>> relatively surprising places like Systran, which until recently was known
>> as the world's best-known rule-based system. As GPUs become more ubiquitous
>> and cheap, this situation is only going to get better, even for the end
>> user. There is little doubt that neural MT has supplanted statistical
>> approaches to machine translation, across both academic research and
>> industry. And it is still in its relative infancy, with lots of interesting
>> research problems and engineering issues to investigate and resolve.
>> 
>>It's somewhat sad for me because I've been working on or with Joshua
>> for almost seven years, but I also find my feelings here interesting in
>> contrast to a previous time I've felt tugged away from Joshua. As many of
>> you know, Philipp Koehn joined JHU a few years ago, which brought some
>> tension to JHU with respect to collaborating on research. There was
>> pressure for me to switch. Moses had a much bigger development community
>> and was much more feature rich, but despite this, I was reluctant to let go
>> of Joshua, for a number of reasons. Java is nicer to work with than C++
>> (and not really that much slower); our code is better written, IMO; jar
>> files are easier to distribute than C++ in compiled or source form; and, of
>> course, I had much more familiarity with the codebase, not to mention
>> something of a personal stake in Jos

Re: problems with LM loading

2017-10-16 Thread Matt Post

First I'd check, does the file exist?

It shouldn't be calling ArpaLM. That's for loading plain text files. 
".berkeleylm" files have been compiled into a special binary format that is 
more efficiently compacted and can be read quickly. There is logic for 
determining which type of file it is, and I wonder if it is going astray. Or 
maybe the file is not what it says it is (can you "head" it)?
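
One way to act on the "head" suggestion programmatically is to look for the ARPA header near the top of the file. This is a rough sketch, not the detection logic that Joshua or BerkeleyLM actually uses; the class name LmFileSniffer and the 4096-byte window are arbitrary choices for illustration. An ARPA text file contains a \data\ line before its n-gram sections, whereas a compiled binary file is not expected to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LmFileSniffer {

    /** Returns true if the first few kilobytes contain the ARPA "\data\" marker. */
    public static boolean looksLikeArpa(Path file) throws IOException {
        byte[] buf = new byte[4096];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // Read at most buf.length bytes; a single read() may return fewer.
            while (total < buf.length) {
                int r = in.read(buf, total, buf.length - total);
                if (r < 0) {
                    break;
                }
                total += r;
            }
        }
        // ISO-8859-1 maps every byte to a character, so binary data cannot
        // cause a decoding error here.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        return head.contains("\\data\\");
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + (looksLikeArpa(file)
                ? " looks like ARPA text"
                : " does not look like ARPA text"));
    }
}
```

A file named ".berkeleylm" that passes this check is probably plain text under a misleading name, which would match the second possibility raised above.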

matt


> On Oct 16, 2017, at 7:08 PM, kellen sunderland  
> wrote:
> 
> The feature function initialization message is just a general purpose 
> exception handler.  I’ve seen this quite often when language models fail to 
> load.  The most interesting part of the log to me is:
> 
>> Caused by: java.lang.RuntimeException: Something wrong with I/O.
>> 
>> at edu.berkeley.nlp.lm.io.ArpaLmReader.parseHeader(ArpaLmReader.java:114)
>> 
>> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:76)
> 
> 
> To me it looks like it could only be caused by the lack of the text 
> "\\1-grams:" in the file you’re opening.  Reference this function: 
> https://github.com/smilli/berkeleylm/blob/master/src/edu/berkeley/nlp/lm/io/ArpaLmReader.java#L105
> 
> Are you trying to load a binary lm with an Arpa reader by any chance?  Do you 
> have the quoted text in your text based LM?
> 
> -Kellen
> From: Tommaso Teofili
> Sent: Monday, October 16, 2017 4:09 PM
> To: dev@joshua.incubator.apache.org
> Subject: Re: problems with LM loading
> 
> p.s.:
> I've tried with other LPs (e.g. sd-en) and I get the same ...
> 
> On Mon, Oct 16, 2017 at 3:06 PM, Tommaso Teofili <
> tommaso.teof...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> I am trying to use the ES-EN language pack from our "Language Packs" page
>> with Joshua 6.1, but when I get to load the two language models I get an IO
>> exception.
>> The config looks like:
>> 
>> feature-function = LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
>> model/lm.berkeleylm
>> feature-function = Distortion
>> feature-function = LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
>> model/en.giga.twopercent.4.lm.berkeleylm
>> feature-function = PhrasePenalty
>> 
>> and I get the following:
>> 
>> java.lang.RuntimeException: java.lang.RuntimeException: Unable to
>> instantiate feature function 'LanguageModel -lm_type berkeleylm -lm_order 4
>> -lm_file model/lm.berkeleylm'!
>> 
>> ...
>> 
>> Caused by: java.lang.RuntimeException: Unable to instantiate feature
>> function 'LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file
>> model/lm.berkeleylm'!
>> 
>> at
>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:642)
>> 
>> at org.apache.joshua.decoder.Decoder.initialize(Decoder.java:394)
>> 
>> at org.apache.joshua.decoder.Decoder.<init>(Decoder.java:128)
>> 
>> Caused by: java.lang.reflect.InvocationTargetException: null
>> 
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> 
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> 
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> 
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>> 
>> at
>> org.apache.joshua.decoder.Decoder.initializeFeatureFunctions(Decoder.java:638)
>> 
>> ... 58 common frames omitted
>> 
>> Caused by: java.lang.RuntimeException: Something wrong with I/O.
>> 
>> at edu.berkeley.nlp.lm.io.ArpaLmReader.parseHeader(ArpaLmReader.java:114)
>> 
>> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:76)
>> 
>> at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
>> 
>> at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
>> 
>> at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
>> 
>> at
>> edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
>> 
>> at
>> edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:151)
>> 
>> at
>> org.apache.joshua.decoder.ff.lm.berkeley_lm.LMGrammarBerkeley.<init>(LMGrammarBerkeley.java:94)
>> 
>> at
>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.initializeLM(LanguageModelFF.java:158)
>> 
>> at
>> org.apache.joshua.decoder.ff.lm.LanguageModelFF.<init>(LanguageModelFF.java:132)
>> 
>> Any hints on what I could be doing wrong ? Encoding ?
>> Did anyone else experience such issue ?
>> 
>> BTW I am running this from within a Java application, Decoder is
>> initialized as follows:
>> 
>> JoshuaConfiguration configuration = new JoshuaConfiguration();
>> configuration.readConfigFile(pathToJoshuaConfig);
>> configuration.use_structured_output = true;
>> Decoder decoder = new Decoder(configuration, pathToJoshuaConfig);
>> 
>> Regards,
>> Tommaso
>> 
>

Re: [jira] [Commented] (JOSHUA-333) The English-English Language Pack download links are broken.

2018-01-08 Thread Matt Post

Hi folks,

Hope we can dig these up because they’ve been deleted from JHU’s servers. 

matt (from my phone)

> On Jan 5, 2018, at 5:51 PM, Lewis John McGibbney (JIRA) wrote:
> 
> 
>[ 
> https://issues.apache.org/jira/browse/JOSHUA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313425#comment-16313425
>  ] 
> 
> Lewis John McGibbney commented on JOSHUA-333:
> -
> 
> [~bugg_tb] were these files copied when we migrated from [~post]'s server to 
> Dropbox?
> 
>> The English-English Language Pack download links are broken.
>> 
>> 
>>Key: JOSHUA-333
>>URL: https://issues.apache.org/jira/browse/JOSHUA-333
>>Project: Joshua
>> Issue Type: Bug
>>   Reporter: David Gonzalez
>> 
>> On the Apache Joshua English-English wiki page the ruleset (PPDB v2) 
>> downloads are all broken (404).
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65142863
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)

Re: CJK LPs

2018-02-19 Thread Matt Post

I don’t think I ever built these. There is an additional step of properly and 
consistently segmenting Chinese which complicates things and creates an 
external dependency. 

matt (from my phone)

> On Feb 19, 2018, at 10:46 AM, Tommaso Teofili wrote:
> 
> Hi all,
> 
> I am not sure if I am missing something, but I somewhat recalled that
> language packs for Chinese (but also Japanese / Korean) existed at [1],
> however I can't find any.
> Reading through the comments it seems at least that was the plan.
> If that is a leftover from the recent LP migration we could try to fix it
> otherwise it'd be nice to build and provide such CJK LPs.
> Can anyone help clarify ?
> 
> Regards,
> Tommaso
> 
> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

Re: CJK LPs

2018-02-19 Thread Matt Post

You just have to make sure that the language pack makes it easy to apply the 
same pre-processing to test data that you applied at training time. Which means 
bundling the segmentation model with the language pack (or doing something 
simple, like single-character words—that degrades performance but would be 
easier). I typically use the Stanford segmenter but I'm not sure it would 
matter that much.
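
The "single-character words" fallback mentioned above can be sketched in a few lines. This is an illustrative guess at what such a preprocessing step might look like, not code from Joshua, the Stanford segmenter, or any language pack; the class name CharSegmenter is hypothetical. It puts spaces around every Han character and leaves other text alone, so the same function can be applied to training and test data alike.

```java
public class CharSegmenter {

    /** Separates each Han character into its own token; other text is unchanged. */
    public static String segment(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                out.append(' ').appendCodePoint(cp).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        // Collapse the extra spaces introduced above.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // \u6211\u7231 are two Han characters written as escapes so the
        // source file stays ASCII-only.
        System.out.println(segment("\u6211\u7231Joshua 2018"));
    }
}
```

Because the rule is deterministic and needs no model file, nothing extra has to be bundled with the language pack, which is the trade-off against the quality loss noted above.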

matt


> On Feb 19, 2018, at 1:45 PM, Tommaso Teofili  
> wrote:
> 
> thanks Matt.
> Would you be able to point out such additional step in a bit more detail
> when you have time ?
> Not sure what you used for segmentation, perhaps could use either Lucene's
> CJK [1] or Kuromoji [2] analyzers.
> 
> Regards,
> Tommaso
> 
> [1] :
> https://lucene.apache.org/core/7_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKAnalyzer.html
> [2] : https://lucene.apache.org/core/7_0_0/analyzers-kuromoji/
> 
> On Mon, Feb 19, 2018 at 12:12 PM, Matt Post wrote:
> 
>> I don’t think I ever built these. There is an additional step of properly
>> and consistently segmenting Chinese which complicates things and creates an
>> external dependency.
>> 
>> matt (from my phone)
>> 
>>> On Feb 19, 2018, at 10:46 AM, Tommaso Teofili wrote:
>>> 
>>> Hi all,
>>> 
>>> I am not sure if I am missing something, but I somewhat recalled that
>>> language packs for Chinese (but also Japanese / Korean) existed at [1],
>>> however I can't find any.
>>> Reading through the comments it seems at least that was the plan.
>>> If that is a leftover from the recent LP migration we could try to fix it
>>> otherwise it'd be nice to build and provide such CJK LPs.
>>> Can anyone help clarify ?
>>> 
>>> Regards,
>>> Tommaso
>>> 
>>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>>

Re: [DISCUSS] Graduation (was Re: Path to TLP)

2018-09-06 Thread Matt Post

Hi folks,

It is fine with me if you want to move to graduation, but at this point I will 
assert that I don't have the time to contribute, and do not wish to be involved 
as a committee member once that threshold is crossed. It has been a good run 
and I have only fond associations with the project, but it is time for me to 
move on, and I wish you all the best.

Sincerely,
Matt



> On Sep 6, 2018, at 11:36 AM, Chris Mattmann  wrote:
> 
> Coming back to this.
> 
> Sorry it took so long :/
> 
> Here is a proposed graduation template. I will call for a VOTE on it
> by mid-next week once the discussion comes to consensus.
> 
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software, for distribution at no charge to
> the public, related to statistical and other forms of machine
> translation.
> 
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache Joshua Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
> 
> RESOLVED, that the Apache Joshua Project be and hereby is
> responsible for the creation and maintenance of software
> related to statistical and other forms of machine translation;
> and be it further
> 
> RESOLVED, that the office of "Vice President, Apache Joshua" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache Joshua Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache Joshua Project; and be it further
> 
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache Joshua Project:
> 
> * Tom Barber
> * Thamme Gowda
> * Felix Hieber
> * Lewis John McGibbney
> * Chris Mattmann
> * Matt Post
> * Paul Ramirez
> * Henry Saputra
> * Kellen Sunderland
> * Tommaso Teofili
> 
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matt Post
> be appointed to the office of Vice President, Apache Joshua to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed; and be it further
> 
> RESOLVED, that the initial Apache Joshua PMC be and hereby is
> tasked with the creation of a set of bylaws intended to
> encourage open development and increased participation in the
> Apache Joshua Project; and be it further
> 
> RESOLVED, that the Apache Joshua Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Incubator Joshua podling; and be it further
> 
> RESOLVED, that all responsibilities pertaining to the Apache
> Incubator Joshua podling encumbered upon the Apache Incubator
> Project are hereafter discharged.
> 
> Cheers,
> Chris
> 
> 
> From: Thamme Gowda 
> Reply-To: "dev@joshua.incubator.apache.org" 
> Date: Saturday, February 3, 2018 at 7:51 PM
> To: "dev@joshua.incubator.apache.org" 
> Subject: Re: [DISCUSS] Graduation (was Re: Path to TLP)
> 
> Great news!
> 
> 2018-02-01 19:48 GMT-08:00 Mattmann, Chris A (1761) <chris.a.mattm...@jpl.nasa.gov>:
> 
> +1 I’ll draft the resolution and send shortly for community vote
> 
> Sent from my iPhone
> 
>> On Feb 1, 2018, at 7:22 PM, Tom Barber  wrote:
>> 
>> I'd just like to dig this one back. Seeing how Matt accepted the
>> proposal and there is action from Tommaso and Lewis to get stuff merged,
>> it seems like there is general consensus to get Joshua out of the incubator.
>> 
>> Tom
>> 

[jira] [Commented] (JOSHUA-248) Add Apache License headers to Joshua code

2016-03-07 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183229#comment-15183229
 ] 

Matt Post commented on JOSHUA-248:
--

Can you describe or point me to the conventions for this? Can we get by with 
LICENSE files in directories, or does it require modifying every file? If the 
latter, is it just source code, or also other files?

> Add Apache License headers to Joshua code
> -
>
> Key: JOSHUA-248
> URL: https://issues.apache.org/jira/browse/JOSHUA-248
> Project: Joshua
>  Issue Type: Task
>Reporter: Tommaso Teofili
> Fix For: 6.1
>
>
> Joshua source code should include standard headers with AL2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (JOSHUA-248) Add Apache License headers to Joshua code

2016-03-19 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197979#comment-15197979
 ] 

Matt Post commented on JOSHUA-248:
--

For what it's worth, the code in the whole subsample directory is not used at 
all, and probably hasn't been since 2012. If we had to, we could jettison it. I 
could also track down the original authors (I know where at least one is) and 
get permission to change the license.

> Add Apache License headers to Joshua code
> -
>
> Key: JOSHUA-248
> URL: https://issues.apache.org/jira/browse/JOSHUA-248
> Project: Joshua
>  Issue Type: Task
>Reporter: Tommaso Teofili
> Fix For: 6.1
>
> Attachments: JOSHUA-248.0.patch
>
>
> Joshua source code should include standard headers with AL2.0.




[jira] [Commented] (JOSHUA-248) Add Apache License headers to Joshua code

2016-04-05 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226639#comment-15226639
 ] 

Matt Post commented on JOSHUA-248:
--

My read is the same as Henri's: Joshua had a more restrictive license a while 
back, so they got special permission to include it as LGPL. I think we can just 
restore the Apache license.

> Add Apache License headers to Joshua code
> -
>
> Key: JOSHUA-248
> URL: https://issues.apache.org/jira/browse/JOSHUA-248
> Project: Joshua
>  Issue Type: Task
>Reporter: Tommaso Teofili
> Fix For: 6.1
>
> Attachments: JOSHUA-248.0.patch
>
>
> Joshua source code should include standard headers with AL2.0.




[jira] [Commented] (JOSHUA-253) Enable execution of Unit tests

2016-04-27 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260525#comment-15260525
 ] 

Matt Post commented on JOSHUA-253:
--

Yes, Tommaso is correct. test/ holds what I think are better termed regression 
tests. The test run executes any executable test*sh file under test/, where an 
exit status of 0 means success and anything else means failure.
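
For anyone who has not looked at the regression setup, the convention can be sketched roughly as below. This is a Python illustration of the idea, not the actual test driver; the function names are made up, and running each script from its own directory is an assumption.

```python
import fnmatch
import os
import subprocess
import sys


def find_tests(root):
    """Yield paths of executable files named test*sh anywhere under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if fnmatch.fnmatch(name, "test*sh") and os.access(path, os.X_OK):
                yield path


def run_tests(root):
    """Run every test script; exit status 0 is success, anything else fails."""
    results = {}
    for path in find_tests(root):
        # Assumption: each script expects to run from its own directory.
        proc = subprocess.run([os.path.abspath(path)],
                              cwd=os.path.dirname(path) or ".")
        results[path] = (proc.returncode == 0)
    return results


if __name__ == "__main__":
    outcome = run_tests(sys.argv[1] if len(sys.argv) > 1 else "test")
    for path, passed in sorted(outcome.items()):
        print("PASS" if passed else "FAIL", path)
    sys.exit(0 if all(outcome.values()) else 1)
```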

I know there are unit tests scattered throughout the code but I have never run 
them. It would be great to have those start to be run as well. I know the 
Amazon folks have been contributing some, so maybe they could let us know?

> Enable execution of Unit tests
> --
>
> Key: JOSHUA-253
> URL: https://issues.apache.org/jira/browse/JOSHUA-253
> Project: Joshua
>  Issue Type: Test
>Affects Versions: 6.0
>Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> As per our [discussion on this 
> topic|http://www.mail-archive.com/dev%40joshua.incubator.apache.org/msg00270.html],
>  [~teofili] correctly identified that unit level tests are not executed.
> We need to fix this such that they are.




[jira] [Commented] (JOSHUA-251) Address Website Branding Issues

2016-04-27 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261475#comment-15261475
 ] 

Matt Post commented on JOSHUA-251:
--

Related to this: after trying to use it a little, I think the current 
website-building approach is too much of an impediment to use. Recall that this 
process requires maintaining two branches in the repo: one for the source files 
(in Markdown format, mostly), and another for the generated website, which then 
gets pushed up. When the website was on Github, we just needed the source 
branch, because Github runs Jekyll for you. 

Another nice feature of Github was you could easily edit the files in a web 
browser on the site directly, and that would also trigger an update. It'd be 
nice to remove any barriers to documentation, since it's already kind of hard 
to get done.

I'm thinking about moving the website over to Joshua's Confluence page. Are 
there any drawbacks to this? I'm a little wary of putting the site in a 
proprietary CMS, but it seems that Apache is all-in on the software, and it 
provides a good user experience.

matt

> Address Website Branding Issues
> ---
>
> Key: JOSHUA-251
> URL: https://issues.apache.org/jira/browse/JOSHUA-251
> Project: Joshua
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 6.1
>
>
> We have a number of Website branding issues which we need to address.
> http://www.apache.org/foundation/marks/pmcs.html#introduction
> Let's work through them here. Please create child issues if appropriate.
> Thanks




[jira] [Commented] (JOSHUA-258) Add back penn-treebank-(de)tokenizer perl scripts

2016-04-28 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263258#comment-15263258
 ] 

Matt Post commented on JOSHUA-258:
--

Yes, it's been too-long neglected! I've been thinking this might be good to 
include in the language packs. I want to update it to implement the Google 
Translate API, with some extensions added by Philipp Koehn that allow it to 
work with CasmaCat, an interactive MT tool.

> Add back penn-treebank-(de)tokenizer perl scripts
> -
>
> Key: JOSHUA-258
> URL: https://issues.apache.org/jira/browse/JOSHUA-258
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 6.1
>
>
> I've been working with the 
> [joshua_translation_engine|https://github.com/joshua-decoder/joshua_translation_engine]
>  (which is friggin excellent, we will definitely be standing this up on 
> something more heavyweight in the near future) and recently reported [issue 
> 15|https://github.com/joshua-decoder/joshua_translation_engine/issues/15]
> This issue therefore proposes to add back in penn-treebank-(de)tokenizer perl 
> scripts which were removed between 6.0.4 and 6.0.5 




[jira] [Commented] (JOSHUA-259) Integration tests are failing

2016-05-02 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267080#comment-15267080
 ] 

Matt Post commented on JOSHUA-259:
--

I am having some failures, but not all of yours.

- OS X 10.11: test/server/http and test/server/tcp-text
- CentOS 6.7: test/thrax/extraction test/server/http test/server/tcp-text

(for test/decoder/too-long: did you recompile after pulling?)

Most of these fail often enough that I have just ignored them, which is bad 
practice. I can fix these later today.

> Integration tests are failing
> -
>
> Key: JOSHUA-259
> URL: https://issues.apache.org/jira/browse/JOSHUA-259
> Project: Joshua
>  Issue Type: Bug
>Reporter: Kellen Sunderland
>
> Several integration tests are currently failing with Joshua.  I have a quick 
> fix coming for one of the tests but just in case we need more discussion 
> around the failures I'll open a bug.
> The currently failing tests for me:
> test/decoder/too-long
> test/server/http
> test/server/tcp-text
> test/thrax/extraction
> and 
> test/decoder/moses-compat (but this is easy to fix, simple extra space in the 
> expected file)
> These are failing under OS X 10.11.  If working under other environments feel 
> free to post a 'works for me'.




[jira] [Commented] (JOSHUA-259) Integration tests are failing

2016-05-02 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267086#comment-15267086
 ] 

Matt Post commented on JOSHUA-259:
--

For the Hadoop test, it currently tests rolling out its own Hadoop cluster. 
This is something I'd like to remove from Joshua (the ability to set up its own 
infrastructure), so I am going to change it so that it just tests your current 
one, exiting without failure if $HADOOP is not defined. Unless there are any 
objections.

> Integration tests are failing
> -
>
> Key: JOSHUA-259
> URL: https://issues.apache.org/jira/browse/JOSHUA-259
> Project: Joshua
>  Issue Type: Bug
>Reporter: Kellen Sunderland
>
> Several integration tests are currently failing with Joshua.  I have a quick 
> fix coming for one of the tests but just in case we need more discussion 
> around the failures I'll open a bug.
> The currently failing tests for me:
> test/decoder/too-long
> test/server/http
> test/server/tcp-text
> test/thrax/extraction
> and 
> test/decoder/moses-compat (but this is easy to fix, simple extra space in the 
> expected file)
> These are failing under OS X 10.11.  If working under other environments feel 
> free to post a 'works for me'.




[jira] [Updated] (JOSHUA-145) Add truecasing

2016-05-02 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post updated JOSHUA-145:
-
Issue Type: New Feature  (was: Bug)

> Add truecasing
> --
>
> Key: JOSHUA-145
> URL: https://issues.apache.org/jira/browse/JOSHUA-145
> Project: Joshua
>  Issue Type: New Feature
>    Reporter: Matt Post
>Assignee: Matt Post
> Fix For: 6.1
>
>
> Joshua currently lowercases all data; a better approach is truecasing, where 
> the most frequent capitalization pattern is used for each token.
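
The truecasing idea in the issue description can be sketched in a few lines of Python. This is only an illustration of "most frequent capitalization pattern per token"; the function names are hypothetical, and it ignores details a real truecaser would handle, such as discounting sentence-initial capitals when counting.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased token to its most frequently observed casing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0]
            for lower, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite each token with its preferred casing; unknown tokens pass through."""
    return " ".join(model.get(token.lower(), token)
                    for token in sentence.split())


if __name__ == "__main__":
    corpus = ["we flew to Paris", "Paris is big", "they like Paris", "paris ?"]
    model = train_truecaser(corpus)
    print(truecase("paris is big", model))
```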




[jira] [Commented] (JOSHUA-145) Add truecasing

2016-05-02 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267521#comment-15267521
 ] 

Matt Post commented on JOSHUA-145:
--

Reclassified.

I recently added a related feature to Joshua. If you invoke the decoder with 
-lowercase, all the input sentence tokens will be lowercased, and the grammar 
lookups will use the lowercase version. It then adds an annotation on each 
token of the form

lettercase = {lower, upper, all-upper}

This is available to any feature function, for example. If you also invoke the 
decoder with "-project-case", it will use word-level alignments to project 
source-language case to the target language, according to the following logic:

- If aligned to the first word, case is only projected if it is "all-upper"
- Otherwise, project the source-language case

This does things like project all caps, and capitalization of names (including 
if they were OOVs). It's different from true-casing or re-casing. I haven't 
done a thorough comparison, but this was the method that helped put a 
relatively simple Joshua system in first place for WMT 2016 en-tr.
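
For reference, the projection logic above can be sketched as follows. This is a Python illustration, not the decoder's Java code: the function names, the (source index, target index) alignment format, and the treatment of single-letter tokens are assumptions.

```python
def lettercase(token):
    """Classify a token as 'all-upper', 'upper' (initial capital), or 'lower'."""
    if len(token) > 1 and token.isupper():
        return "all-upper"
    if token[:1].isupper():
        return "upper"
    return "lower"


def apply_case(token, case):
    """Apply a lettercase annotation to a lowercased target token."""
    if case == "all-upper":
        return token.upper()
    if case == "upper":
        return token[:1].upper() + token[1:]
    return token


def project_case(source_tokens, target_tokens, alignment):
    """Project source-side case onto target tokens via word alignments.

    alignment is a list of (source_index, target_index) pairs.
    """
    cases = [lettercase(token) for token in source_tokens]
    output = list(target_tokens)
    for src, tgt in alignment:
        case = cases[src]
        # A capital on the first source word is usually just sentence-initial,
        # so only all-caps is projected from position 0.
        if src == 0 and case != "all-upper":
            continue
        output[tgt] = apply_case(output[tgt], case)
    return output


if __name__ == "__main__":
    source = ["The", "NATO", "chief", "visited", "Ankara"]
    target = ["le", "chef", "de", "nato", "a", "visité", "ankara"]
    links = [(0, 0), (1, 3), (2, 1), (3, 5), (4, 6)]
    print(" ".join(project_case(source, target, links)))
```

Because the case comes from the source token rather than a learned model, this also works for OOVs that are copied through, which is the behaviour described above for names.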

> Add truecasing
> --
>
> Key: JOSHUA-145
> URL: https://issues.apache.org/jira/browse/JOSHUA-145
> Project: Joshua
>      Issue Type: New Feature
>    Reporter: Matt Post
>Assignee: Matt Post
> Fix For: 6.1
>
>
> Joshua currently lowercases all data; a better approach is truecasing, where 
> the most frequent capitalization pattern is used for each token.




[jira] [Commented] (JOSHUA-172) Speed up grammar file reading with memory-mapped files

2016-05-02 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267568#comment-15267568
 ] 

Matt Post commented on JOSHUA-172:
--

Agreed.

> Speed up grammar file reading with memory-mapped files
> --
>
> Key: JOSHUA-172
> URL: https://issues.apache.org/jira/browse/JOSHUA-172
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
> Fix For: 6.1
>
>
> [This 
> document|http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly]
>  should be helpful.




[jira] [Commented] (JOSHUA-260) Integrate IoC (Inversion of Control) into Joshua

2016-05-02 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267566#comment-15267566
 ] 

Matt Post commented on JOSHUA-260:
--

This looks cool. I am not going to be able to look into it until June, but we 
could chat about it next week. 

Can you say more about how this interacts with the config system? I'd love to 
see that overhauled. It would be really nice to do better argument processing. 
The features I like in the current system are:

- being able to list all parameters in a config file, but then to override them 
on the command line
- (nice but less important) collapsing different arguments to equivalence 
classes (e.g., "top-n" = "topn" = "topN", etc.)
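
To pin down what I mean by those two features, here is a rough Python sketch. It is not Joshua's config parser: the "key = value" file format, the "-key value" command-line shape, and the function names are all assumptions made for illustration.

```python
def normalize(key):
    """Collapse spelling variants: 'top-n', 'top_n', 'topN' all become 'topn'."""
    return key.strip().replace("-", "").replace("_", "").lower()


def parse_config(lines):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        params[normalize(key)] = value.strip()
    return params


def apply_overrides(params, argv):
    """Apply '-key value' pairs from the command line on top of the config.

    Assumption: every flag is followed by exactly one value.
    """
    merged = dict(params)
    for flag, value in zip(argv[0::2], argv[1::2]):
        merged[normalize(flag)] = value
    return merged


if __name__ == "__main__":
    config = parse_config(["top-n = 300", "# beam settings", "pop-limit = 100"])
    print(apply_overrides(config, ["-topN", "1", "-threads", "4"]))
```

Normalizing keys before both the config lookup and the override step is what lets "-topN 1" on the command line replace "top-n = 300" from the file.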

It would be nice to have:

- builtin documentation to each parameter
- the ability to invoke the decoder with -help

My 20-second look at Guice seems to suggest this is something quite 
different, though?

> Integrate IoC (Inversion of Control) into Joshua
> 
>
> Key: JOSHUA-260
> URL: https://issues.apache.org/jira/browse/JOSHUA-260
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Kellen Sunderland
>
> I'd like to propose we investigate looking into using guice 
> (https://github.com/google/guice) in conjunction with joshua's configuration 
> system.  I believe it would give us a nice way to map what is in the 
> configuration to the code paths, and implementations used within Joshua.  It 
> also would go a long way to allowing us to integrate unit tests throughout 
> all the important classes in Joshua.  What does everyone think?  Would IoC be 
> a good pattern to adopt?  Is everyone ok with using guice (versus say some 
> other IoC library).




[jira] [Closed] (JOSHUA-172) Speed up grammar file reading with memory-mapped files

2016-05-02 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post closed JOSHUA-172.

Resolution: Done
  Assignee: Matt Post

This is implemented with packed grammars.

> Speed up grammar file reading with memory-mapped files
> --
>
> Key: JOSHUA-172
> URL: https://issues.apache.org/jira/browse/JOSHUA-172
> Project: Joshua
>  Issue Type: Bug
>    Reporter: Matt Post
>Assignee: Matt Post
> Fix For: 6.1
>
>
> [This 
> document|http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly]
>  should be helpful.




[jira] [Commented] (JOSHUA-261) Remove ext directory from source tree

2016-05-06 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274386#comment-15274386
 ] 

Matt Post commented on JOSHUA-261:
--

Can we include KenLM? It's LGPL 2.1+.

BerkeleyLM is fine, so it's just GIZA++ that has to go.

> Remove ext directory from source tree
> -
>
> Key: JOSHUA-261
> URL: https://issues.apache.org/jira/browse/JOSHUA-261
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Right now we have a bunch of code bundled in to the 
> [ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
> don't think any of this code can be shipped with an Apache Joshua 
> (Incubating) release so we need to think about a mechanism for removing it 
> and making Joshua work in other ways.




[jira] [Commented] (JOSHUA-250) Dockerized the test CI ti specify system libraries

2016-05-08 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275838#comment-15275838
 ] 

Matt Post commented on JOSHUA-250:
--

FYI there has been some independent work on docker-izing Joshua:

https://github.com/aglahe/docker-joshua-decoder

> Dockerized the test CI ti specify system libraries
> --
>
> Key: JOSHUA-250
> URL: https://issues.apache.org/jira/browse/JOSHUA-250
> Project: Joshua
>  Issue Type: Bug
>Affects Versions: 6.1
>Reporter: Henry Saputra
>Assignee: Henry Saputra
>Priority: Minor
>
> Since Joshua needs system libraries like Boost, they may need to be 
> installed to do a build.
> It is better to Dockerize the test environment so we do not need to update 
> the CI machines to have them.




[jira] [Commented] (JOSHUA-250) Dockerized the test CI ti specify system libraries

2016-05-13 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282663#comment-15282663
 ] 

Matt Post commented on JOSHUA-250:
--

I'm not sure that he'd be able to contribute. But if we build one, that would 
be a good starting point.

> Dockerized the test CI ti specify system libraries
> --
>
> Key: JOSHUA-250
> URL: https://issues.apache.org/jira/browse/JOSHUA-250
> Project: Joshua
>  Issue Type: Bug
>Affects Versions: 6.1
>Reporter: Henry Saputra
>Assignee: Henry Saputra
>Priority: Minor
>
> Since Joshua needs system libraries like Boost, they may need to be 
> installed to do a build.
> It is better to Dockerize the test environment so we do not need to update 
> the CI machines to have them.




[jira] [Commented] (JOSHUA-267) Java seems to swallow C exceptions

2016-05-16 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284457#comment-15284457
 ] 

Matt Post commented on JOSHUA-267:
--

Is this with a model you built, or one of the examples? If you give me the 
steps you used I can take a look.

(Also, it looks like you're not using the development Joshua; are you using a 
release?)

> Java seems to swallow C exceptions
> --
>
> Key: JOSHUA-267
> URL: https://issues.apache.org/jira/browse/JOSHUA-267
> Project: Joshua
>  Issue Type: Bug
>Reporter: Tom Barber
>Priority: Minor
>
> I compiled joshua on Ubuntu and copied it to another box of the same type, 
> but missing various C bits that were required at build time, but Joshua 
> doesn't run and tells me:
> Input 0:  berkeley works fine , but the pipeline fails in next steps 
> Input 0: Collecting options took 0.000 seconds
> Input 0: FATAL UNCAUGHT EXCEPTION: null
> java.lang.NullPointerException
> at joshua.decoder.phrase.Candidate.score(Candidate.java:214)
> at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:136)
> at joshua.decoder.phrase.Candidate.compareTo(Candidate.java:19)
> at java.util.HashMap.compareComparables(HashMap.java:371)
> at java.util.HashMap$TreeNode.treeify(HashMap.java:1920)
> at java.util.HashMap.treeifyBin(HashMap.java:771)
> at java.util.HashMap.putVal(HashMap.java:643)
> at java.util.HashMap.put(HashMap.java:611)
> at java.util.HashSet.add(HashSet.java:219)
> at joshua.decoder.phrase.Stack.addCandidate(Stack.java:125)
> at joshua.decoder.phrase.Stacks.search(Stacks.java:166)
> at joshua.decoder.DecoderThread.translate(DecoderThread.java:113)
> at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:218)
> Looking at the code, it's where it passes off to a decoder, which, if it 
> doesn't appear, must surely throw some error that we don't see?




[jira] [Commented] (JOSHUA-262) Implement all logging as Slf4j over Log4j

2016-05-20 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294655#comment-15294655
 ] 

Matt Post commented on JOSHUA-262:
--

It's not very systematic. 

- 0: print nothing. This mostly works, except the flag is not passed to KenLM, 
which sometimes prints something
- 1: print "basic" information, like progress loading models, information about 
each sentence, and so on
- 2+: extended debugging and detail

I think it's useful to have a "silent" mode (-v 0), and 1 should probably be 
for INFO. I'm not sure what the standard practice is when you want extra 
debugging output?
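
One possible mapping, sketched with Python's standard logging module purely as a stand-in for Slf4j/Log4j: treating 2+ as DEBUG is an assumption on my part (it is a common choice, but not something we have agreed on), and the function name is made up.

```python
import logging


def level_for_verbosity(verbosity):
    """Map the decoder's -v value to a logging threshold."""
    if verbosity <= 0:
        # Above every standard level, so nothing is printed ("silent" mode).
        return logging.CRITICAL + 1
    if verbosity == 1:
        return logging.INFO
    # Assumption: anything above 1 simply means extended debugging output.
    return logging.DEBUG


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s %(message)s")
    log = logging.getLogger("decoder")
    log.setLevel(level_for_verbosity(1))
    log.info("loaded language model")        # shown at -v 1
    log.debug("chart cell (3, 7) expanded")  # hidden until -v 2
```

Note that this only controls what goes to STDERR through the logger; the translations themselves would still be written to STDOUT regardless of the level.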

> Implement all logging as Slf4j over Log4j
> -
>
> Key: JOSHUA-262
> URL: https://issues.apache.org/jira/browse/JOSHUA-262
> Project: Joshua
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Assignee: Thamme Gowda N
> Fix For: 6.1
>
>
> [~hsaputra] suggested that we implement all logging as Slf4j over Log4j. If 
> we use [parameterized logging 
> notation|http://www.slf4j.org/faq.html#logging_performance] we can have good 
> logging in place.




[jira] [Commented] (JOSHUA-262) Implement all logging as Slf4j over Log4j

2016-05-20 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294673#comment-15294673
 ] 

Matt Post commented on JOSHUA-262:
--

I'm not sure if that's right (also not sure it's wrong). Logging is just for 
things that go to STDERR, right? It seems like there should be the following:

- Decoder output that cannot be turned off. This is the most basic information, 
the translation. This isn't logging information, this is system output.

- Informative output that is useful for diagnostics, making sure things are 
running, etc. Things like reporting each sentence as it comes in, how long it 
took to translate, etc. This to me is INFO.

- WARN is for warnings, things out of the ordinary that could trip the user up

- ERROR is for errors.

> Implement all logging as Slf4j over Log4j
> -
>
> Key: JOSHUA-262
> URL: https://issues.apache.org/jira/browse/JOSHUA-262
> Project: Joshua
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Assignee: Thamme Gowda N
> Fix For: 6.1
>
>
> [~hsaputra] suggested that we implement all logging as Slf4j over Log4j. If 
> we use [parameterized logging 
> notation|http://www.slf4j.org/faq.html#logging_performance] we can have good 
> logging in place.




[jira] [Commented] (JOSHUA-270) pipeline.pl needs major refactoring

2016-05-24 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299141#comment-15299141
 ] 

Matt Post commented on JOSHUA-270:
--

The pipeline is a huge mess, probably not worth salvaging. I'm hoping (maybe 
this year?) to rewrite it, perhaps using this: 
https://github.com/jhclark/ducttape/

> pipeline.pl needs major refactoring
> ---
>
> Key: JOSHUA-270
> URL: https://issues.apache.org/jira/browse/JOSHUA-270
> Project: Joshua
>  Issue Type: Bug
>  Components: pipeline
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> Right now 
> [pipeline.pl|https://github.com/apache/incubator-joshua/blob/master/scripts/training/pipeline.pl]
>  is well over 2000 lines long and extremely difficult to navigate. 
> I propose the following
>  * All ENV is refactored into a pipeline_environment file
>  * All Command line parsing and definitions are refactored into a 
> pipeline_cli file
>  * Sanity checking is refactored into a pipeline_sanity_check file
>  * Dependent Variable Checking is refactored into 
> pipeline_dependent_variable_setting file
>  * filter and preprocess corpora is refactored into 
> pipeline_filter_preprocess_corpora
>  * pipeline_subsampling becomes a file
>  * pipeline_alignment becomes a file
>  * pipeline_parsing becomes a file
>  * pipeline_thrax becomes a file
>  * pipeline_tuning becomes a file
>  * pipeline_testing becomes a file
>  * pipeline_subroutines becomes a file




[jira] [Commented] (JOSHUA-252) Make it possible to use Maven to build Joshua

2016-05-24 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299142#comment-15299142
 ] 

Matt Post commented on JOSHUA-252:
--

Are there any updates on this? I'd love to get this pulled into master. We have 
one other change for master, and then we might call that the first Apache 
release, version 6.1. We have a lot of other ideas for the version 7 roadmap; 
I'll create a ticket tomorrow.

> Make it possible to use Maven to build Joshua
> -
>
> Key: JOSHUA-252
> URL: https://issues.apache.org/jira/browse/JOSHUA-252
> Project: Joshua
>  Issue Type: Improvement
>  Components: build
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 6.1
>
>
> As per discussion on the dev@ list for now Ant is the official build tool for 
> Joshua however we would like to possibly switch to Maven if / when someone is 
> able to do so.
> Assigning to me for now as I could be able to look into this.




[jira] [Commented] (JOSHUA-271) Thrax invocation should not reply upon $HADOOP being set

2016-05-24 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299181#comment-15299181
 ] 

Matt Post commented on JOSHUA-271:
--

It looks like there are just two lines where this occurs. I will remove the 
"$HADOOP/bin/" portions of the invocation and push to master soon.
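
The behaviour being asked for (find hadoop on the PATH instead of under $HADOOP/bin) looks roughly like this. The pipeline itself is Perl, so this Python sketch only illustrates the lookup; the function name and the injectable `which` parameter are hypothetical.

```python
import shutil


def hadoop_command(args, which=shutil.which):
    """Build a hadoop invocation from whatever 'hadoop' is on the PATH.

    Returns None when no hadoop executable is found, so the caller can
    decide what to do (for example, fall back to starting a local cluster).
    """
    executable = which("hadoop")
    if executable is None:
        return None
    return [executable] + list(args)


if __name__ == "__main__":
    command = hadoop_command(["fs", "-ls", "/"])
    if command is None:
        print("hadoop is not on the PATH")
    else:
        print("would run:", " ".join(command))
```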

> Thrax invocation should not reply upon $HADOOP being set
> 
>
> Key: JOSHUA-271
> URL: https://issues.apache.org/jira/browse/JOSHUA-271
> Project: Joshua
>  Issue Type: Bug
>  Components: pipeline, thrax
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
> Fix For: 6.1
>
>
> Right now one cannot run Thrax unless the $HADOOP env variable is defined. 
> Every invocation of the hadoop script hard-codes the path as 
> $HADOOP/bin/hadoop. But what happens if you are using a VM (Vagrant) to 
> connect to a cluster for which no $HADOOP env variable is defined? 
> The hadoop script should be on the path and available to use from there. The 
> only check that should be made is whether it is available from the path; 
> if it is not, the start_hadoop_cluster subroutine can be called. This 
> reduces code and makes more sense.
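
For illustration, the check described above might look like the following.
This is a minimal sketch in Python, not the pipeline's actual code:
resolve_hadoop and the start_cluster callback are hypothetical names, with
start_cluster standing in for the start_hadoop_cluster subroutine mentioned
in the description.

```python
import shutil


def resolve_hadoop(start_cluster, command="hadoop"):
    """Return the path of `command` found on PATH, or fall back.

    `start_cluster` is a hypothetical callback standing in for the
    start_hadoop_cluster subroutine; it is called only when `command`
    cannot be found on PATH. No $HADOOP variable is consulted.
    """
    found = shutil.which(command)
    if found is not None:
        return found
    return start_cluster()
```

Because the lookup goes through PATH, the same call should work on a VM that
has only a hadoop client installed and no $HADOOP variable set.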




[jira] [Created] (JOSHUA-272) Simplify the packing and usage of phrase-based grammars

2016-05-25 Thread Matt Post (JIRA)

Matt Post created JOSHUA-272:


 Summary: Simplify the packing and usage of phrase-based grammars
 Key: JOSHUA-272
 URL: https://issues.apache.org/jira/browse/JOSHUA-272
 Project: Joshua
  Issue Type: Improvement
Reporter: Matt Post
Assignee: Matt Post
 Fix For: 6.1


For historical reasons, phrase-based grammars add some complexity to decoding. 
The complete tree under each top-level trie node in packed grammars has to fit 
within a single packed grammar slice, which is limited to 2 GB due to 
constraints on the size of Java byte[] arrays. We used to sort on just the 
first item in the trie, which was a problem for phrase-based decoding, since 
phrase-based rules are implemented as left-branching hierarchical rules. In 
order to pack large grammars, we packed them without the leading [X,1], and 
then added it when loading the grammars, both for the packed and memory-based 
grammars. This was a real mess.

This was all fixed with a commit a while ago that packs and reads packed 
grammars based on the first two symbols on the source side. So we should remove 
all the complexity associated with phrases. They should just be regular rules. 
There is also a lot of redundancy across the codebase in parsing rules, 
converting them to different formats, and so on.
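
To see why keying on the first two source symbols helps, consider the sketch
below. It is a toy illustration, not Joshua's packer: the rule tuples,
slice_key, and group_rules are hypothetical, and the rules are reduced to
plain source and target strings. Every phrase-based rule starts with [X,1],
so grouping on one symbol puts the whole grammar in a single group, while
grouping on two symbols splits it by the first word.

```python
from collections import defaultdict


def slice_key(source_side, depth):
    """Return the first `depth` source-side symbols as a grouping key."""
    return tuple(source_side.split()[:depth])


def group_rules(rules, depth):
    """Group (source, target) rules by their leading `depth` source symbols."""
    groups = defaultdict(list)
    for source, target in rules:
        groups[slice_key(source, depth)].append((source, target))
    return dict(groups)


# Toy phrase-based rules, written as left-branching rules with a leading [X,1].
rules = [
    ("[X,1] el gato", "the cat"),
    ("[X,1] el perro", "the dog"),
    ("[X,1] la casa", "the house"),
]

by_one = group_rules(rules, 1)  # every rule shares the key ("[X,1]",)
by_two = group_rules(rules, 2)  # keys ("[X,1]", "el") and ("[X,1]", "la")
print(len(by_one), len(by_two))
```

With these three toy rules, a depth of 1 yields one group and a depth of 2
yields two, which is the property that lets a large phrase table be spread
over several slices without stripping the leading [X,1].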




[jira] [Commented] (JOSHUA-266) Refactor key interfaces and core code for a future release.

2016-05-25 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299609#comment-15299609
 ] 

Matt Post commented on JOSHUA-266:
--

Closing this as a duplicate of JOSHUA-255.

> Refactor key interfaces and core code for a future release. 
> 
>
> Key: JOSHUA-266
> URL: https://issues.apache.org/jira/browse/JOSHUA-266
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Kellen Sunderland
>Priority: Minor
>
> We've discussed making some modifications to the key interfaces.  This ticket 
> can focus on making large changes to the codebase for a future release.  This 
> work will likely take some time and some collaboration.  I'd suggest that the 
> code for this be contained in a separate release branch.
> Some issues we can work on:
> *  I'd propose we conform to the SOLID principles for our major interfaces.  
> https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)
> *  We can look at Sparse / Dense feature vectors and how to handle them 
> naturally in Joshua.
> *  Refactor objects that may now be used more broadly than was originally 
> intended (for example Vocabulary class).
> *  We should have a general discussion around what parts of the codebase are 
> responsible for what functions.  We should clearly define what logic should 
> be a part of the Grammar versus the Feature Functions for example, and make 
> sure logic doesn't leak from one of these objects to the others.
>  




[jira] [Closed] (JOSHUA-266) Refactor key interfaces and core code for a future release.

2016-05-25 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post closed JOSHUA-266.

Resolution: Duplicate

> Refactor key interfaces and core code for a future release. 
> 
>
> Key: JOSHUA-266
> URL: https://issues.apache.org/jira/browse/JOSHUA-266
> Project: Joshua
>  Issue Type: Improvement
>Reporter: Kellen Sunderland
>Priority: Minor
>
> We've discussed making some modifications to the key interfaces.  This ticket 
> can focus on making large changes to the codebase for a future release.  This 
> work will likely take some time and some collaboration.  I'd suggest that the 
> code for this be contained in a separate release branch.
> Some issues we can work on:
> *  I'd propose we conform to the SOLID principles for our major interfaces.  
> https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)
> *  We can look at Sparse / Dense feature vectors and how to handle them 
> naturally in Joshua.
> *  Refactor objects that may now be used more broadly than was originally 
> intended (for example Vocabulary class).
> *  We should have a general discussion around what parts of the codebase are 
> responsible for what functions.  We should clearly define what logic should 
> be a part of the Grammar versus the Feature Functions for example, and make 
> sure logic doesn't leak from one of these objects to the others.
>  




[jira] [Updated] (JOSHUA-261) Remove ext directory from source tree

2016-05-25 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post updated JOSHUA-261:
-
Description: 
Right now we have a bunch of code bundled into the 
[ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
don't think any of this code can be shipped with an Apache Joshua (Incubating) 
release so we need to think about a mechanism for removing it and making Joshua 
work in other ways.

Here is a partial roadmap:

[ ] remove GIZA++ and symal
[ ] update [the developer 
documentation](https://cwiki.apache.org/confluence/display/JOSHUA/Development) 
to describe how to install them and put them in the path
[ ] update the pipeline scripts to not be hard-coded to $JOSHUA/bin
[ ] update the build files to not try to build them

  was:Right now we have a bunch of code bundled into the 
[ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
don't think any of this code can be shipped with an Apache Joshua (Incubating) 
release so we need to think about a mechanism for removing it and making Joshua 
work in other ways.


> Remove ext directory from source tree
> -
>
> Key: JOSHUA-261
> URL: https://issues.apache.org/jira/browse/JOSHUA-261
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Right now we have a bunch of code bundled into the 
> [ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
> don't think any of this code can be shipped with an Apache Joshua 
> (Incubating) release so we need to think about a mechanism for removing it 
> and making Joshua work in other ways.
> Here is a partial roadmap:
> [ ] remove GIZA++ and symal
> [ ] update [the developer 
> documentation](https://cwiki.apache.org/confluence/display/JOSHUA/Development)
>  to describe how to install them and put them in the path
> [ ] update the pipeline scripts to not be hard-coded to $JOSHUA/bin
> [ ] update the build files to not try to build them




[jira] [Updated] (JOSHUA-261) Remove ext directory from source tree

2016-05-25 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post updated JOSHUA-261:
-
Description: 
Right now we have a bunch of code bundled into the 
[ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
don't think any of this code can be shipped with an Apache Joshua (Incubating) 
release so we need to think about a mechanism for removing it and making Joshua 
work in other ways.

Here is a partial roadmap:

[ ] remove GIZA++ and symal
[ ] update [the developer 
documentation|https://cwiki.apache.org/confluence/display/JOSHUA/Development] 
to describe how to install them and put them in the path
[ ] update the pipeline scripts to not be hard-coded to $JOSHUA/bin
[ ] update the build files to not try to build them

  was:
Right now we have a bunch of code bundled into the 
[ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
don't think any of this code can be shipped with an Apache Joshua (Incubating) 
release so we need to think about a mechanism for removing it and making Joshua 
work in other ways.

Here is a partial roadmap:

[ ] remove GIZA++ and symal
[ ] update [the developer 
documentation](https://cwiki.apache.org/confluence/display/JOSHUA/Development) 
to describe how to install them and put them in the path
[ ] update the pipeline scripts to not be hard-coded to $JOSHUA/bin
[ ] update the build files to not try to build them


> Remove ext directory from source tree
> -
>
> Key: JOSHUA-261
> URL: https://issues.apache.org/jira/browse/JOSHUA-261
> Project: Joshua
>  Issue Type: Task
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Priority: Blocker
> Fix For: 6.1
>
>
> Right now we have a bunch of code bundled into the 
> [ext|https://github.com/apache/incubator-joshua/tree/master/ext] directory. I 
> don't think any of this code can be shipped with an Apache Joshua 
> (Incubating) release so we need to think about a mechanism for removing it 
> and making Joshua work in other ways.
> Here is a partial roadmap:
> [ ] remove GIZA++ and symal
> [ ] update [the developer 
> documentation|https://cwiki.apache.org/confluence/display/JOSHUA/Development] 
> to describe how to install them and put them in the path
> [ ] update the pipeline scripts to not be hard-coded to $JOSHUA/bin
> [ ] update the build files to not try to build them




[jira] [Resolved] (JOSHUA-271) Thrax invocation should not rely upon $HADOOP being set

2016-05-25 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-271.
--
Resolution: Fixed
  Assignee: Matt Post

> Thrax invocation should not rely upon $HADOOP being set
> 
>
> Key: JOSHUA-271
> URL: https://issues.apache.org/jira/browse/JOSHUA-271
> Project: Joshua
>  Issue Type: Bug
>  Components: pipeline, thrax
>Affects Versions: 6.0.5
>Reporter: Lewis John McGibbney
>Assignee: Matt Post
> Fix For: 6.1
>
>
> Right now one cannot run Thrax unless the $HADOOP env variable is defined. 
> Every invocation of the hadoop script hard-codes the path as 
> $HADOOP/bin/hadoop. But what happens if you are using a VM (Vagrant) to 
> connect to a cluster for which no $HADOOP env variable is defined? 
> The hadoop script should be on the path and available to use from there. The 
> only check that should be made is whether it is available from the path; 
> if it is not, the start_hadoop_cluster subroutine can be called. This 
> reduces code and makes more sense.




[jira] [Resolved] (JOSHUA-272) Simplify the packing and usage of phrase-based grammars

2016-05-25 Thread Matt Post (JIRA)


 [ 
https://issues.apache.org/jira/browse/JOSHUA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Post resolved JOSHUA-272.
--
Resolution: Fixed

Fixed with [a recent 
comment|https://github.com/apache/incubator-joshua/commit/aef0b2dbe4555070aec9f15bb2c8d9dcb5671dcd].

> Simplify the packing and usage of phrase-based grammars
> ---
>
> Key: JOSHUA-272
> URL: https://issues.apache.org/jira/browse/JOSHUA-272
> Project: Joshua
>  Issue Type: Improvement
>    Reporter: Matt Post
>Assignee: Matt Post
> Fix For: 6.1
>
>
> For historical reasons, phrase-based grammars add some complexity to 
> decoding. The complete tree under each top-level trie node in packed grammars 
> has to fit within a single packed grammar slice, which is limited to 2 GB 
> due to constraints on the size of Java byte[] arrays. We used to sort on just 
> the first item in the trie, which was a problem for phrase-based decoding, 
> since phrase-based rules are implemented as left-branching hierarchical 
> rules. In order to pack large grammars, we packed them without the leading 
> [X,1], and then added it when loading the grammars, both for the packed and 
> memory-based grammars. This was a real mess.
> This was all fixed with a commit a while ago that packs and reads packed 
> grammars based on the first two symbols on the source side. So we should 
> remove all the complexity associated with phrases. They should just be 
> regular rules. There is also a lot of redundancy across the codebase in 
> parsing rules, converting them to different formats, and so on.




[jira] [Comment Edited] (JOSHUA-272) Simplify the packing and usage of phrase-based grammars

2016-05-25 Thread Matt Post (JIRA)


[ 
https://issues.apache.org/jira/browse/JOSHUA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300566#comment-15300566
 ] 

Matt Post edited comment on JOSHUA-272 at 5/25/16 6:15 PM:
---

Fixed with [a recent 
commit|https://github.com/apache/incubator-joshua/commit/aef0b2dbe4555070aec9f15bb2c8d9dcb5671dcd].


was (Author: post):
Fixed with [a recent 
comment|https://github.com/apache/incubator-joshua/commit/aef0b2dbe4555070aec9f15bb2c8d9dcb5671dcd].

> Simplify the packing and usage of phrase-based grammars
> ---
>
> Key: JOSHUA-272
> URL: https://issues.apache.org/jira/browse/JOSHUA-272
> Project: Joshua
>  Issue Type: Improvement
>    Reporter: Matt Post
>Assignee: Matt Post
> Fix For: 6.1
>
>
> For historical reasons, phrase-based grammars add some complexity to 
> decoding. The complete tree under each top-level trie node in packed grammars 
> has to fit within a single packed grammar slice, which is limited to 2 GB 
> due to constraints on the size of Java byte[] arrays. We used to sort on just 
> the first item in the trie, which was a problem for phrase-based decoding, 
> since phrase-based rules are implemented as left-branching hierarchical 
> rules. In order to pack large grammars, we packed them without the leading 
> [X,1], and then added it when loading the grammars, both for the packed and 
> memory-based grammars. This was a real mess.
> This was all fixed with a commit a while ago that packs and reads packed 
> grammars based on the first two symbols on the source side. So we should 
> remove all the complexity associated with phrases. They should just be 
> regular rules. There is also a lot of redundancy across the codebase in 
> parsing rules, converting them to different formats, and so on.


