[Moses-support] EMS LM corpus setting

2014-10-24 Thread Prasanth K
Hi,

I am trying to use the target side of the parallel corpora to train my LM
in the EMS. Following the example on the EMS webpage, I have defined my LM
section as

[LM:multiun]
lowercased-corpus = [CORPUS:multiun:lowercased]

The LM training crashes because the pipeline does not link this to the
target side of the corpus. Is there some way to specify this setting in
the EMS?

Thanks.

-- 
"Theories have four stages of acceptance. i) this is worthless nonsense;
ii) this is an interesting, but perverse, point of view, iii) this is true,
but quite unimportant; iv) I always said so."

  --- J.B.S. Haldane
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] OOV translations

2014-10-24 Thread Tom Hoar
It's been a while since I worked with alternate Moses command options. 
The default behavior is to pass-through any OOV tokens to the translated 
text. The -output-unknowns option sends the OOV to the output path given.

Is there a way to force the stdout translation to leave out the OOV token 
and insert some alternate token instead? Or are we left to parse/merge the 
-output-unknowns file with the stdout translations?

Thanks,
Tom
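
The parse/merge step mentioned above could be done with a small
post-processing script. This is only a sketch under assumptions: it assumes
the file written by -output-unknowns has one line per input sentence listing
that sentence's OOV tokens separated by spaces, which should be checked
against the actual decoder output. The function name replace_oovs and the
<unk> placeholder are made up for illustration.

```python
def replace_oovs(translations, unknowns, placeholder="<unk>"):
    """Replace passed-through OOV tokens in each translation line.

    translations: list of decoder output lines (stdout), one per sentence.
    unknowns: list of lines, assumed parallel to translations, each holding
              the OOV tokens of that sentence separated by spaces
              (assumed file layout, not confirmed by the thread).
    """
    merged = []
    for hyp, oov_line in zip(translations, unknowns):
        oovs = set(oov_line.split())
        tokens = [placeholder if tok in oovs else tok for tok in hyp.split()]
        merged.append(" ".join(tokens))
    return merged


if __name__ == "__main__":
    hyps = ["the house is schön", "hello world"]
    oovs = ["schön", ""]
    for line in replace_oovs(hyps, oovs):
        print(line)
```

Note that matching by surface form also replaces a genuinely translated
token that happens to be identical to an OOV token of the same sentence.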


Re: [Moses-support] ar-en system build

2014-10-24 Thread Barry Haddow
Hi Emna

First, you will not get good results on Arabic with an English 
tokeniser. Try MADA, which does tokenisation and morphological 
segmentation for Arabic.

Secondly, you will need to extract the text from the xml before passing 
to Moses. You may find something suitable in m4loc 
(http://code.google.com/p/m4loc/) but in general there are many tools 
for handling xml.

cheers - Barry
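
As an illustration of the XML extraction step, here is a minimal sketch
using Python's standard xml.etree.ElementTree module. It assumes that every
sentence sits in an <s> element, as in the sample quoted below; the <text>
root element in the demo and the function name extract_sentences are
assumptions made for illustration, and the real corpus files may be
structured differently.

```python
import xml.etree.ElementTree as ET


def extract_sentences(xml_text):
    """Return the text of every <s> element, one string per sentence."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # itertext() also collects text inside any nested inline tags;
        # split/join normalises runs of whitespace to single spaces.
        sentences.append(" ".join("".join(s.itertext()).split()))
    return sentences


if __name__ == "__main__":
    # Hypothetical wrapper document around the sample from the thread.
    sample = """<text>
      <p n="2">
        <s n="2">Agenda item 116</s>
      </p>
    </text>"""
    for line in extract_sentences(sample):
        print(line)
```

The extraction should run before tokenisation, so that the tokeniser never
sees the tags. Keeping the two language sides aligned line by line is a
separate step that this sketch does not handle.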

On 24/10/14 02:30, emna hkiri wrote:
> Dear Friends
> I'm trying to build an ar-en system. I have downloaded the Arabic-English 
> parallel corpora from http://www.euromatrixplus.eu/multi-un/
> First, the Moses tokenizer does not include Arabic, so I ran it 
> with the English settings.
> The second problem is that the corpus is in XML format, so the English 
> (and Arabic) texts look like this after tokenization because of the 
> XML tags:
>
>
> < p n = " 2 " >
> < s n = " 2 " > Agenda item 116 < / s >
> < / p >
>
> So what should I do? Could you help me, please? I'm stuck at this point.
> Thank you for your help.
>
>


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: [Moses-support] Help in Moses Incremental Retraining

2014-10-24 Thread Sandipan Dandapat
Hi Ulrich,
Sorry for sending my questions to you directly. I will keep in mind to post
my future queries to moses-support.

Thanks a lot for the clarification. Let me experiment with MERT.

Thanks and regards,
sandipan

On 24 October 2014 01:51, Ulrich Germann wrote:

> Hi Sandipan,
>
> first, please post Moses-related questions to moses-support@mit.edu, not
> individual contributors.
>
> second, the current seven features used by Mmsapt /
> PhraseDictionaryBitextSampling are (for details, see my recent paper on
> this phrase table implementation:
> https://www.researchgate.net/publication/267270863_Dynamic_Phrase_Tables_for_Machine_Translation_in_an_Interactive_Post-editing_Scenario
> )
>
> THE STANDARD SET OF FEATURES MAY CHANGE AT ANY TIME, as this is still work
> in progress.
>
> - forward and backward lexically smoothed phrase scores (2 scores; same as
> standard features)
> - rarity penalty (1/(x+1)), where x is the number of phrase pair
> occurrences in the corpus/sample (1 score)
> - the lower bound on forward and backward phrase-level probabilities, with
> confidence level .99 (2 scores)
> - 2 provenance features (x/(x+1)), where x is the number of phrase pair
> occurrences in the (static) background and (dynamic) foreground corpus (2
> scores)
>
> third, you need to retrain the feature weights for good performance with
> any of the standard techniques, but with the  I usually use MERT. The
> executable simulate-pe allows you to feed in references and word
> alignments one sentence at a time; there are additional parameters
> --spe-src, --spe-trg, --spe-aln to specify source, target, and alignment
> (symal output format). Source and target files are one sentence per line,
> tokenized. Michael Denkowski is currently in the process of integrating
> online tuning into Moses, but I'm not sure whether that's ready to be
> deployed yet.
>
> Regards - Uli
>
>
>
> On Thu, Oct 23, 2014 at 1:47 AM, Sandipan Dandapat <
> sandipandanda...@gmail.com> wrote:
>
>> Dear Ulrich,
>> I got your reference from Prashanta Mathur. I am a postdoctoral
>> researcher at CNGL, DCU, and I am working with Moses incremental
>> retraining. It would be great if you could help me with a couple of questions:
>>
>> 1. I found there are 7 weights to define for PT0 (PT0 is the Mmsapt name)
>> i.e.
>>
>> Mmsapt name=PT0 output-factor=0 num-features=7
>> base=/home/sandipan/inc_retrain/MT_sys/En-Fr/dgt/50_i/mmsa_pt/train. L1=en
>> L2=fr
>> [weight]
>> PT0= 0.1 0.2 0.3 0.4 0.5 0.6 0.7
>>
>> num-features in the PBSMT model is 4, which does not work with Mmsapt. What
>> are these 7 weights? Can I use uniform weights for all 7 features? If not,
>> how do I adjust these values?
>>
>> 2. I found there is a significant difference in BLEU score between the
>> standard PBSMT model and the MMST based model. Is this because
>> of the weights I am using, or am I doing something wrong?
>>
>> It would be a great help if you could help me understand the above issues.
>> Thank you.
>>
>> Regards,
>> sandipan
>>
>>
>>
>
>
> --
> Ulrich Germann
> Research Associate
> School of Informatics
> University of Edinburgh
>
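
The rarity penalty and provenance formulas quoted above can be tried out
numerically. The sketch below only evaluates 1/(x+1) and x/(x+1) for a few
counts; the function names are made up for illustration, and whether Moses
applies a further transformation (such as a logarithm) to these values
before weighting them is not stated in the thread.

```python
def rarity_penalty(x):
    """1 / (x + 1), where x is the number of phrase pair occurrences."""
    return 1.0 / (x + 1)


def provenance(x):
    """x / (x + 1), where x is the phrase pair count in one corpus."""
    return x / (x + 1.0)


if __name__ == "__main__":
    # Rare pairs get a rarity penalty near 1, frequent pairs near 0;
    # the provenance score moves the other way.
    for count in (0, 1, 9, 99):
        print(count, rarity_penalty(count), provenance(count))
```

For any count the two values sum to 1, so a phrase pair seen only in the
foreground corpus has a background provenance score of 0 and a foreground
score that approaches 1 as its count grows.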


Re: [Moses-support] Phrase pair generation at run time using Source-Pivot and Pivot-Target phrase tables

2014-10-24 Thread prajdabre
I will get the blow-up, but since it is generating phrases for one sentence 
only, it can fit in memory.
Of course the time would be longer, but I'm eager to see what happens.




Re: [Moses-support] Fix for make-factor-brown-cluster-mkcls.perl - encoding issue

2014-10-24 Thread Tomas Fulajtar
Hi Hieu,

Find the file attached.

Hope it helps ☺

Tomas


make-factor-brown-cluster-mkcls.perl
Description: make-factor-brown-cluster-mkcls.perl


Re: [Moses-support] Fix for make-factor-brown-cluster-mkcls.perl - encoding issue

2014-10-24 Thread Hieu Hoang
hi tomas,

thanks for the bug report. You can do a git pull request.
  https://help.github.com/articles/using-pull-requests/
However, if it's just a small change to 1 file, then just send the changed
file to me & i'll check it in for you

On 24 October 2014 04:29, Tomas Fulajtar wrote:

> Hi team,
>
> I have found that the file-open mode is not specified in
> make-factor-brown-cluster-mkcls.perl.
>
> See the read_cluster_from_mkcls function in
> scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl.
>
> The fix is to add the following code below line 36:
>
> binmode(CLUSTER_FILE, "utf8");
>
> This resolves encoding problems when the user wants to process text
> containing extended characters.
>
> Please let me know how I can get the fix into the GitHub repository.
>
> Thank you,
>
> Tomas


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


[Moses-support] Fix for make-factor-brown-cluster-mkcls.perl - encoding issue

2014-10-24 Thread Tomas Fulajtar
Hi team,

I have found that the file-open mode is not specified in 
make-factor-brown-cluster-mkcls.perl.
See the read_cluster_from_mkcls function in 
scripts/training/wrappers/make-factor-brown-cluster-mkcls.perl.

The fix is to add the following code below line 36:

binmode(CLUSTER_FILE, "utf8");

This resolves encoding problems when the user wants to process text containing 
extended characters.

Please let me know how I can get the fix into the GitHub repository.

Thank you,

Tomas
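
For readers who think in Python rather than Perl, the idea behind the fix
(declare the encoding when reading instead of relying on the default) can
be sketched as follows. The file layout used here, a word followed by a
class number, and the function name read_clusters are assumptions for
illustration only and are not taken from the Moses script.

```python
import os
import tempfile


def read_clusters(path):
    """Read whitespace-separated fields from each line, decoding as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [line.split() for line in handle if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clusters.txt")
        # Write a word containing a non-ASCII character.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("café 12\n")
        print(read_clusters(path))
```

With the encoding declared, the word is read back as four characters
rather than as its five underlying bytes.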


Re: [Moses-support] Phrase pair generation at run time using Source-Pivot and Pivot-Target phrase tables

2014-10-24 Thread Barry Haddow
On 24/10/14 11:06, Raj Dabre wrote:
> That is a good starting  point suggestion.
> Many thanks.
>
> Perhaps an easier option would be to do the synthesis offline. In 
> other words, take your two tables and create a pivot table from them, 
> and then use it like a normal phrase table.
>
> This I have been doing for 6 months but the phrase tables generated 
> are super-huge where the size almost is a square of the original size. 
> I end up having to keep a threshold frequency and kill potentially 
> good phrase pairs. Thats why I want to generate it online and keep all 
> the pairs.

But if you synthesise the phrase pairs inside the decoder, then won't 
you get the same n^2 blow-up? I think you have to find a good way to 
prune however you implement the synthesis.
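
One simple form of pruning is to keep only the best few target phrases for
each source phrase. The sketch below is an illustration, not something the
thread prescribes: the function name prune_top_k, the (source, target,
score) tuple layout and the use of a single score per pair are all
assumptions.

```python
import heapq
from collections import defaultdict


def prune_top_k(phrase_pairs, k):
    """Keep the k highest-scoring target phrases for each source phrase.

    phrase_pairs: iterable of (source, target, score) tuples.
    Returns a dict mapping source -> list of (target, score), best first.
    """
    by_source = defaultdict(list)
    for source, target, score in phrase_pairs:
        by_source[source].append((score, target))
    pruned = {}
    for source, candidates in by_source.items():
        best = heapq.nlargest(k, candidates)
        pruned[source] = [(target, score) for score, target in best]
    return pruned


if __name__ == "__main__":
    pairs = [("a", "x", 0.5), ("a", "y", 0.3), ("a", "z", 0.2), ("b", "w", 1.0)]
    print(prune_top_k(pairs, 2))
```

The same cut-off can be applied per sentence inside the decoder or once
offline; in both cases it bounds the number of candidates per source phrase
by k.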




Re: [Moses-support] Phrase pair generation at run time using Source-Pivot and Pivot-Target phrase tables

2014-10-24 Thread Raj Dabre
That is a good starting  point suggestion.
Many thanks.

> Perhaps an easier option would be to do the synthesis offline. In other
> words, take your two tables and create a pivot table from them, and then
> use it like a normal phrase table.

I have been doing this for 6 months, but the generated phrase tables are
huge: their size is almost the square of the original size. I end up
having to apply a frequency threshold and discard potentially good phrase
pairs. That's why I want to generate them online and keep all the pairs.

Thanks again.



-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014


Re: [Moses-support] Phrase pair generation at run time using Source-Pivot and Pivot-Target phrase tables

2014-10-24 Thread Barry Haddow
Hi Raj

You could create a custom phrase table implementation to produce your 
synthesised phrase pairs. Have a look at the existing phrase table 
implementations in moses/TranslationModel. In particular, you need to 
subclass PhraseDictionary. The method GetTargetPhraseCollectionLEGACY() 
returns a collection of phrase pairs, given a source phrase.

Perhaps an easier option would be to do the synthesis offline. In other 
words, take your two tables and create a pivot table from them, and then 
use it like a normal phrase table.

cheers - Barry
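
The offline synthesis can be sketched as a join of the two tables on the
pivot phrase. The sketch assumes the common triangulation formula
p(target | source) = sum over pivot phrases of
p(target | pivot) * p(pivot | source), which is not stated in this thread,
and it works on plain dictionaries with one probability per pair. Real
Moses phrase tables carry several scores and alignment information, so
this only illustrates the idea; the function name triangulate is made up.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine two phrase tables through a shared pivot language.

    src_pivot: dict mapping source phrase -> {pivot phrase: p(pivot | source)}
    pivot_tgt: dict mapping pivot phrase -> {target phrase: p(target | pivot)}
    Returns a dict mapping source phrase -> {target phrase: p(target | source)},
    summing p(target | pivot) * p(pivot | source) over all shared pivots.
    """
    src_tgt = {}
    for source, pivots in src_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            for target, p_target in pivot_tgt.get(pivot, {}).items():
                scores[target] += p_pivot * p_target
        if scores:
            src_tgt[source] = dict(scores)
    return src_tgt


if __name__ == "__main__":
    sp = {"maison": {"house": 0.8, "home": 0.2}}
    pt = {"house": {"Haus": 1.0}, "home": {"Haus": 0.5, "Heim": 0.5}}
    print(triangulate(sp, pt))
```

Every source phrase is paired with every target phrase reachable through
any shared pivot phrase, which is where the growth in table size comes
from.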

On 21/10/14 11:14, Raj Dabre wrote:
> Hello,
> I am currently doing research on using pivot languages for Phrase 
> based SMT.
>
> My current method involves the usage of alternate decoding paths 
> feature to combine multiple synthesized Source-Target phrase tables. 
> (I have noticed that not many people exploit this method or even if 
> they do they don't mention it clearly).
>
> However, pre-synthesized phrase tables need to be pruned to remove low 
> probability phrase pairs, and I would like to generate phrase pairs via 
> a pivot at run time. I am OK with taking additional decoding time.
>
> I am aware that Bertoldi (2008) had already mentioned that he had used 
> this method but this is not present in the moses decoder release.
> I would very much like to implement this but do not know where to start.
> If someone could tell me the section of code that reads in phrase 
> pairs given a source phrase I think I might be able to do something.
> Any help would be appreciated.
>
> Thanks in advance.
>
> -- 
> Raj Dabre.
> Research Student,
> Graduate School of Informatics,
> Kyoto University.
> CSE MTech, IITB., 2011-2014
>
>
>

