Re: Apache Joshua Project

2016-12-16 Thread Matt Post
There is not enough information for me to answer your question. I don't see any 
problems.

$ echo "i'll give you 10% of the asking price" | ./prepare.sh | ./joshua
I'll give you 10 % of the asking price
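
For illustration only, here is a minimal Python sketch of the kind of tokenization that would turn "10%" into "10 %", as in the output above. The `split_percent` function and its regular expression are assumptions made for this example; they are not taken from prepare.sh, which may tokenize differently.

```python
import re

def split_percent(text: str) -> str:
    """Put a space between a digit and a percent sign that follows it."""
    # Hypothetical rule for illustration; not the actual logic of prepare.sh.
    return re.sub(r"(\d)%", r"\1 %", text)

print(split_percent("i'll give you 10% of the asking price"))
```

If a percent sign is still mishandled, checking whether the input really went through prepare.sh before reaching joshua is probably the first thing to try.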


> On Dec 16, 2016, at 3:22 AM, Aliaksei Rudak wrote:
> 
> Also there is a problem with parsing (%) sign in sentences. Do you know how 
> to solve this ?
> 
> 2016-12-15 10:57 GMT+03:00 Aliaksei Rudak:
> Hi Matt,
> 
> English-Russian language pack has broken link
> https://cwiki.apache.org/confluence/home.apache.org/~lewismc/language-pack-en-ru-2016-10-28.tar.gz
>  
> 
> 
> When do you plan to create and upload other languages ?
> 
> Regards,
> Alexei
> 
> 2016-12-14 21:50 GMT+03:00 Matt Post:
> 1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
> builds KenLM and the KenLM library. If you can do that, that is all you need 
> to do.
> 
> 2. http://opus.lingfil.uu.se is a great place to get parallel data; it's
> where we got all the data we use.
> 
> 3. Joshua has a Java API (undocumented) but not a C++ one.
> 
> 
>> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak wrote:
>> 
>> 1) Can you estimate approximate date of releasing language packs with kenlm 
>> model ? I have a teammate who know c++ well so If we have more information 
>> (or tutorial) of how to do that by ourselves we can share the result with 
>> others. So it will be benefit for all.
>> 
>> 2) Where can I get or buy parallel corpora for other languages ? Where did 
>> you get data for current huge language packs? I found several sources but 
>> they so small in size.
>> 
>> 3) Is there any document of how to create offline translation system based 
>> on Joshua and make it as c++ library for example ?
>> 
>> 
>> 
>> 
>> 2016-12-14 14:33 GMT+03:00 Matt Post:
>> 1. The LM cannot be used with Moses. We have it in BerkeleyLM format; you
>> need KenLM. We are releasing KenLM soon. KenLM is better, but it requires
>> the user to compile C++ code, which can be tricky.
>> 
>> 2/3. Please see the README in each language pack. You need to pass input
>> text through "prepare.sh", which does tokenization.
>> 
>> Matt (from my phone)
>> 
>> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
>> 
>>> Hi Matt, 
>>> Thanks for answers.
>>> 
>>> 1) Can language models inside Joshua language packs work with Moses MT ? If 
>>> yes - can you give me the link how to run them on it ? 
>>> 
>>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>>> have the same strange issue. Trying to translate one sentence. 
>>> 
>>> For example from Spanish to English
>>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>>> del español. "
>>> 
>>> Translates as
>>> "Also auditions, you'll find texts with different accents of español"
>>> 
>>> It means that one word in sentence (español) is not translated correct. But 
>>> it's ok if you translating single word ( español )
>>> 
>>> Same for other languages (German, Russian). All words (except one or 
>>> sometimes 2 words) are not translated. Do you know how to fix this ?
>>> 
>>> 3) How to translate sentences with punctuation marks (comma, exclamation, 
>>> question marks etc) ?
>>> 
>>> Translating from Spanish to English gives error
>>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>>> pregunta."
>>> 
>>> If you try to translate words separated with commas it not translates these 
>>> words
>>> "inglés, francés, alemán y portugués"
>>> 
>>> output
>>> "Inglés, francés, german and portuguese"
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-13 17:44 GMT+03:00 Matt Post:
>>> 
>>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
>>>> 
>>>> 1) If English-German pair will be recompiled to German-English 
>>>> (vice-versa) do I need a separate instance to process back translation ? 
>>>> Or they can work in one instance in both directions ?
>>>> 
>>> A whole new model needs to be trained. You need a separate model for each 
>>> direction.
>>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>>> from German-English to English-German ?
>>>> 
>>>> At this page under the “Project Info” title links “Community page” and 
>>>> “Current Documentation” not working
>>>> 
>>>> http://incubator.apache.org/projects/joshua.html 
>>>> 
>>> This document on running the pipeline:
>>> 
>>> 
>>> 

Re: Apache Joshua Project

2016-12-14 Thread Matt Post
1. If you download Joshua from GitHub, and run "download_dependencies.sh", it 
builds KenLM and the KenLM library. If you can do that, that is all you need to 
do.

2. http://opus.lingfil.uu.se is a great place to get parallel data; it's where 
we got all the data we use.

3. Joshua has a Java API (undocumented) but not a C++ one.


> On Dec 14, 2016, at 10:30 AM, Aliaksei Rudak wrote:
> 
> 1) Can you estimate approximate date of releasing language packs with kenlm 
> model ? I have a teammate who know c++ well so If we have more information 
> (or tutorial) of how to do that by ourselves we can share the result with 
> others. So it will be benefit for all.
> 
> 2) Where can I get or buy parallel corpora for other languages ? Where did 
> you get data for current huge language packs? I found several sources but 
> they so small in size.
> 
> 3) Is there any document of how to create offline translation system based on 
> Joshua and make it as c++ library for example ?
> 
> 
> 
> 
> 2016-12-14 14:33 GMT+03:00 Matt Post:
> 1. The LM cannot be used with Moses. We have it in BerkeleyLM format; you
> need KenLM. We are releasing KenLM soon. KenLM is better, but it requires
> the user to compile C++ code, which can be tricky.
> 
> 2/3. Please see the README in each language pack. You need to pass input text
> through "prepare.sh", which does tokenization.
> 
> Matt (from my phone)
> 
> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
> 
>> Hi Matt, 
>> Thanks for answers.
>> 
>> 1) Can language models inside Joshua language packs work with Moses MT ? If 
>> yes - can you give me the link how to run them on it ? 
>> 
>> 2) I installed several instances (German, Spanish, Russian) and all of them 
>> have the same strange issue. Trying to translate one sentence. 
>> 
>> For example from Spanish to English
>> "Además podrás encontrar las audiciones de los textos con distintos acentos 
>> del español. "
>> 
>> Translates as
>> "Also auditions, you'll find texts with different accents of español"
>> 
>> It means that one word in sentence (español) is not translated correct. But 
>> it's ok if you translating single word ( español )
>> 
>> Same for other languages (German, Russian). All words (except one or 
>> sometimes 2 words) are not translated. Do you know how to fix this ?
>> 
>> 3) How to translate sentences with punctuation marks (comma, exclamation, 
>> question marks etc) ?
>> 
>> Translating from Spanish to English gives error
>> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
>> pregunta."
>> 
>> If you try to translate words separated with commas it not translates these 
>> words
>> "inglés, francés, alemán y portugués"
>> 
>> output
>> "Inglés, francés, german and portuguese"
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 
>> 2016-12-13 17:44 GMT+03:00 Matt Post:
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html 
>>> 
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630 
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incubator.apache.org/6.0/faq.html 
>>> 
>> Yes but it's complicated. The best way is to add data, but there are lots of 
>> other models and parameter variations that could be tried.
>> 
>>> 4) How can I reduce the amount of memory each language pair instance use 
>>> without losing process speed and quality?
>>> 
>> If you can find German–French parallel data, use that. Otherwise, pivot 
>> through another language.
>>> 5) To make translation from German to French do I need to make translation 
>>> via English conversion ? (like German to English first and then English to 
>>> French) 
>>> 
>>> I mean for the case without German-French parallel data.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 

Re: Apache Joshua Project

2016-12-14 Thread Matt Post
1. The LM cannot be used with Moses. We have it in BerkeleyLM format; you need
KenLM. We are releasing KenLM soon. KenLM is better, but it requires the user
to compile C++ code, which can be tricky.

2/3. Please see the README in each language pack. You need to pass input text
through "prepare.sh", which does tokenization.

matt (from my phone)
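
The tokenize-then-decode step can also be driven from a script. The following is a rough sketch, not part of Joshua: `pipe_through` is a hypothetical helper that feeds text through a list of commands in order, the way a shell pipeline would. It assumes "prepare.sh" and "joshua" are executable in the current language-pack directory.

```python
import subprocess

def pipe_through(text, commands):
    """Feed text through each command in turn, like a shell pipeline."""
    data = text.encode("utf-8")
    for cmd in commands:
        # check=True raises CalledProcessError if a stage exits non-zero.
        result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")

# Assumed usage from inside an unpacked language pack:
# print(pipe_through("inglés, francés, alemán y portugués\n",
#                    [["./prepare.sh"], ["./joshua"]]))
```

Skipping the first stage is the likely cause of untranslated words next to commas and other punctuation, since the decoder then sees "español." or "francés," as a single unknown token.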

> On Dec 14, 2016, at 06:16, Aliaksei Rudak wrote:
> 
> Hi Matt, 
> Thanks for answers.
> 
> 1) Can language models inside Joshua language packs work with Moses MT ? If 
> yes - can you give me the link how to run them on it ? 
> 
> 2) I installed several instances (German, Spanish, Russian) and all of them 
> have the same strange issue. Trying to translate one sentence. 
> 
> For example from Spanish to English
> "Además podrás encontrar las audiciones de los textos con distintos acentos 
> del español. "
> 
> Translates as
> "Also auditions, you'll find texts with different accents of español"
> 
> It means that one word in sentence (español) is not translated correct. But 
> it's ok if you translating single word ( español )
> 
> Same for other languages (German, Russian). All words (except one or 
> sometimes 2 words) are not translated. Do you know how to fix this ?
> 
> 3) How to translate sentences with punctuation marks (comma, exclamation, 
> question marks etc) ?
> 
> Translating from Spanish to English gives error
> "¿Se puede aprender a escribir? ¿El escritor nace o se hace? La vieja 
> pregunta."
> 
> If you try to translate words separated with commas it not translates these 
> words
> "inglés, francés, alemán y portugués"
> 
> output
> "Inglés, francés, german and portuguese"
> 
> Regards,
> Alexei
> 
> 
> 
> 
> 
> 2016-12-13 17:44 GMT+03:00 Matt Post:
>> 
>>> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
>>> 
>>> 1) If English-German pair will be recompiled to German-English (vice-versa) 
>>> do I need a separate instance to process back translation ? Or they can 
>>> work in one instance in both directions ?
>>> 
>> A whole new model needs to be trained. You need a separate model for each 
>> direction.
>>> 2) Are there any documents about how to recompile model to work vice-versa 
>>> from German-English to English-German ?
>>> 
>>> At this page under the “Project Info” title links “Community page” and 
>>> “Current Documentation” not working
>>> 
>>> http://incubator.apache.org/projects/joshua.html
>>> 
>> 
>> This document on running the pipeline:
>> 
>>  
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
>>> 3) Are there ways of increasing translation quality without changing 
>>> (extending) language model?  
>>> 
>>> At this page under “How do I make Joshua produce better results? at second 
>>> option (Joshua directly) link not working
>>>  
>>> http://joshua.incubator.apache.org/6.0/faq.html
>>> 
>> 
>> Yes but it's complicated. The best way is to add data, but there are lots of 
>> other models and parameter variations that could be tried.
>> 
>>> 4) How can I reduce the amount of memory each language pair instance use 
>>> without losing process speed and quality?
>>> 
>> If you can find German–French parallel data, use that. Otherwise, pivot 
>> through another language.
>>> 5) To make translation from German to French do I need to make translation 
>>> via English conversion ? (like German to English first and then English to 
>>> French) 
>>> 
>>> I mean for the case without German-French parallel data.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Regards,
>>> 
>>> Alexei
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 2016-12-12 17:58 GMT+03:00 Matt Post:
>>>> No, each has to be run separately. But not all are equally good, so I 
>>>> suggest starting with a few and building up.
>>>> 
>>>> If you get KenLM working in place of BerkeleyLM, the language models will 
>>>> be shared between them if they are on the same machine. I will post 
>>>> instructions soon.
>>>> 
>>>> Yes, each one has two language models that are interpolated.
>>>> 
>>>> 
>>>> 
>>>>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> You was right about increasing memory. Spanish works fine now but need 
>>>>> about 16GB to run. Is it possible to use one Joshua instance for all 
>>>>> language pairs simultaneously ? Right now I use one instance for each 
>>>>> pair at it takes about 4GB, so for all 60 languages I need 240 GB of RAM 
>>>>> memory and 60 running instances. But may be it's possible to process all 
>>>>> language translation with one instance and use for example 32 GB ?
>>>>> 
>>>>> Also I found that every language pair archive has 2 language models ( 
>>>>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one 
>>>>> of them depending on some parameters ?
>>>>> 
>>>>> Regards,
>>>>> Alexei
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2016-12-07 15:51 GMT+03:00 Matt Post:
>>>>>> I fixed the 

Re: Apache Joshua Project

2016-12-13 Thread Matt Post

> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html 
> 
This document on running the pipeline:


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630 
> 3) Are there ways of increasing translation quality without changing 
> (extending) language model?
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html 
> 
Yes but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
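
To make the pivoting idea concrete, here is a minimal Python sketch. It only shows the chaining: the output of a German-English system becomes the input of an English-French system. The `pivot` and `word_lookup` helpers and the tiny word tables are hypothetical stand-ins invented for this example; in practice each step would be a separately running language pack.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]

def pivot(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Chain two translators so the first one's output feeds the second."""
    def translate(text: str) -> str:
        return pivot_to_target(source_to_pivot(text))
    return translate

def word_lookup(table: Dict[str, str]) -> Translator:
    """Toy stand-in for a translation system: word-by-word table lookup."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate

# Stand-ins for a German-English and an English-French system (made-up data).
de_en = word_lookup({"guten": "good", "morgen": "morning"})
en_fr = word_lookup({"good": "bon", "morning": "matin"})

de_fr = pivot(de_en, en_fr)
print(de_fr("guten morgen"))  # prints "bon matin"
```

Errors made in the first step are passed on to the second, which is why direct German-French data is preferable when it can be found.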
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak wrote:
>> 
>> Hi Matt,
>> 
>> You was right about increasing memory. Spanish works fine now but need about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously ? Right now I use one instance for each pair at it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM memory and 60 
>> running instances. But may be it's possible to process all language 
>> translation with one instance and use for example 32 GB ?
>> 
>> Also I found that every language pair archive has 2 language models ( 
>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one of 
>> them depending on some parameters ?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check Czech-English language pack, it has broken link. 
>>> Spanish-English pair not works, throws exceptions
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 >:
>>> Hi Matt, what time (total price ) will be to record video of how to make 
>>> translation vice-versa (from german to english)  to english to german pair
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> 
>>>>> 
>>>>> We need to prepare all infrastructure now so you can make changes in 
>>>>> future. Preparation will take time. Right now I have several questions 
>>>>> about all this things.
>>>>> 
>>>>> 1) Does Joshua has language auto-detect feature ? If yes – how to use it? 
>>>>> If not – is it hard to do it ?
>>>>> 
>>>> This feature is called LID ("language ID"). It is not in Joshua currently 
>>>> but we have talked about it, and it wouldn't be too difficult to add in.
>>>>> 2) On this page
>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/JOSHUA/Notes+on+Language+Pack+Creation
>>>>> 
>>>>> 
>>>>> 

Re: Apache Joshua Project

2016-12-13 Thread Matt Post

> On Dec 12, 2016, at 3:04 PM, Aliaksei Rudak wrote:
> 
> 1) If English-German pair will be recompiled to German-English (vice-versa) 
> do I need a separate instance to process back translation ? Or they can work 
> in one instance in both directions ?
> 
A whole new model needs to be trained. You need a separate model for each 
direction.
> 2) Are there any documents about how to recompile model to work vice-versa 
> from German-English to English-German ?
> 
> At this page under the “Project Info” title links “Community page” and 
> “Current Documentation” not working
> 
> http://incubator.apache.org/projects/joshua.html 
> 
This document on running the pipeline:


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65871630
> 3) Are there ways of increasing translation quality without changing 
> (extending) language model?  
> 
> At this page under “How do I make Joshua produce better results? at second 
> option (Joshua directly) link not working
>  
> http://joshua.incubator.apache.org/6.0/faq.html 
> 
Yes but it's complicated. The best way is to add data, but there are lots of 
other models and parameter variations that could be tried.

> 4) How can I reduce the amount of memory each language pair instance use 
> without losing process speed and quality?
> 
If you can find German–French parallel data, use that. Otherwise, pivot through 
another language.
> 5) To make translation from German to French do I need to make translation 
> via English conversion ? (like German to English first and then English to 
> French) 
> 
> I mean for the case without German-French parallel data.
> 
> 
> 
> 
> 
> Regards,
> 
> Alexei
> 
> 
> 
> 
> 
> 
> 2016-12-12 17:58 GMT+03:00 Matt Post:
> No, each has to be run separately. But not all are equally good, so I suggest 
> starting with a few and building up.
> 
> If you get KenLM working in place of BerkeleyLM, the language models will be 
> shared between them if they are on the same machine. I will post instructions 
> soon.
> 
> Yes, each one has two language models that are interpolated.
> 
> 
> 
>> On Dec 12, 2016, at 9:20 AM, Aliaksei Rudak wrote:
>> 
>> Hi Matt,
>> 
>> You was right about increasing memory. Spanish works fine now but need about 
>> 16GB to run. Is it possible to use one Joshua instance for all language 
>> pairs simultaneously ? Right now I use one instance for each pair at it 
>> takes about 4GB, so for all 60 languages I need 240 GB of RAM memory and 60 
>> running instances. But may be it's possible to process all language 
>> translation with one instance and use for example 32 GB ?
>> 
>> Also I found that every language pair archive has 2 language models ( 
>> Berkeley and KenLM ) Do I need them two at once ? Or Joshua selects one of 
>> them depending on some parameters ?
>> 
>> Regards,
>> Alexei
>> 
>> 
>> 
>> 
>> 2016-12-07 15:51 GMT+03:00 Matt Post:
>> I fixed the Czech link.
>> 
>> For Spanish–English, what is the error? I imagine you have to provide more 
>> memory. Edit the "joshua" script and double or triple the amount of memory.
>> 
>> 
>>> On Dec 7, 2016, at 7:14 AM, Aliaksei Rudak wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Can you check Czech-English language pack, it has broken link. 
>>> Spanish-English pair not works, throws exceptions
>>> 
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> 2016-11-28 17:30 GMT+03:00 >:
>>> Hi Matt, what time (total price ) will be to record video of how to make 
>>> translation vice-versa (from german to english)  to english to german pair
>>> 
>>> Regards,
>>> Alexei
>>> 
>>> On Nov 28, 2016, at 17:59, Matt Post wrote:
>>> 
>>>> Inline below:
>>>> 
>>>>> On Nov 26, 2016, at 11:12 AM, Aliaksei Rudak wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> 
>>>>> 
>>>>> We need to prepare all infrastructure now so you can make changes in 
>>>>> future. Preparation will take time. Right now I have several questions 
>>>>> about all this things.
>>>>> 
>>>>> 1) Does Joshua has language auto-detect feature ? If yes – how to use it? 
>>>>> If not – is it hard to do it ?
>>>>> 
>>>> This feature is called LID ("language ID"). It is not in Joshua currently 
>>>> but we have talked about it, and it wouldn't be too difficult to add in.
>>>>> 2) On this page
>>>>> 
>>>>> https://cwiki.apache.org/confluence/display/JOSHUA/Notes+on+Language+Pack+Creation
>>>>> 
>>>>> 
>>>>> In first sentence there is link to “Corpus” at the end where language 
>>>>> datasets should be