Re: [Moses-support] Sparse features and overfitting

2015-01-15 Thread HOANG Cong Duy Vu
Thanks for your replies!

Hi Prashant,

> there is definitely an option for sparse l1/l2 regularization with mira. I
> don't know how to call it through command line though.


Yes. For MIRA, we can set the *C* parameter to control its regularization.
I tried different C values (0.01, 0.001), but neither helped in my case.

Hi Matthias,

> Do the sparse features give you any large improvement on the tuning set?


Yes. The improvement is around 2-3 BLEU points on the tuning set.

> Does this mean that there are hundreds of sentences in your original
> tuning and test sets that are equal on the source side but have
> different references? That sounds a bit odd. Maybe it indicates that
> something about your data is generally problematic.


Yes, I agree it's quite odd. But this data (Chinese-to-English) was
extracted from an official competition.
I will probably have to remove the overlap before moving on to other
kinds of features.
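For reference, a minimal sketch (not a Moses tool; the file names are
placeholders) of how the source-side overlap could be filtered out of the
tuning set:

# Sketch: drop tuning sentence pairs whose source side also occurs in the test set.
# "tune.zh", "tune.en", "test.zh" are hypothetical file names.
def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

tune_src = read_lines("tune.zh")
tune_tgt = read_lines("tune.en")
test_src = set(read_lines("test.zh"))

kept = [(s, t) for s, t in zip(tune_src, tune_tgt) if s not in test_src]
print("kept %d of %d tuning sentence pairs" % (len(kept), len(tune_src)))

with open("tune.filtered.zh", "w", encoding="utf-8") as fs, \
     open("tune.filtered.en", "w", encoding="utf-8") as ft:
    for s, t in kept:
        fs.write(s + "\n")
        ft.write(t + "\n")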

--
Cheers,
Vu

On Fri, Jan 16, 2015 at 6:31 AM, Matthias Huck  wrote:

> On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:
>
>
> > - tune & test
> > (based on source)
> > size of overlap set = 624
> > (based on target)
> > size of overlap set = 386
>
> >
> > (tune & test have high overlapping parts based on source sentences,
> > but half of them have different target sentences)
>
>
>
> Does this mean that there are hundreds of sentences in your original
> tuning and test sets that are equal on the source side but have
> different references? That sounds a bit odd. Maybe it indicates that
> something about your data is generally problematic.
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Sparse features and overfitting

2015-01-15 Thread Matthias Huck
On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:


> - tune & test
> (based on source)
> size of overlap set = 624
> (based on target)
> size of overlap set = 386

> 
> (tune & test have high overlapping parts based on source sentences,
> but half of them have different target sentences)



Does this mean that there are hundreds of sentences in your original
tuning and test sets that are equal on the source side but have
different references? That sounds a bit odd. Maybe it indicates that
something about your data is generally problematic.



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Sparse features and overfitting

2015-01-15 Thread Matthias Huck
We typically try to increase the tuning set in order to obtain more
reliable sparse feature weights. But in your case it's rather the test
set that seems a bit small for trusting the BLEU scores. 

Do the sparse features give you any large improvement on the tuning set?



On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:

> I used sparse features such as: TargetWordInsertionFeature,
> SourceWordDeletionFeature, WordTranslationFeature,
> PhraseLengthFeature.
> Sparse features are used only for the top source and target words (100,
> 150, 200, 250, ...).
> 
> 
> My parallel data include: train(201K); tune(6214); test(641).

> 
> Is there any way to prevent over-fitting when applying the sparse
> features? Or is it the case that sparse features will not generalize well
> to "unseen" data?




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] phrase table

2015-01-15 Thread Matthias Huck
Hi,


The data is sentence-segmented.

Assume you train your model with a training corpus which contains a
single parallel sentence pair. Your training sentence has length L on
both source and target side, and it's aligned along the diagonal. 
If n > L, you cannot extract any phrase of length n from this training
corpus. If n <= L, you can extract L - n + 1 phrases of length n. 

Example: for L = 5 you can extract five phrases of length n = 1, four of
length n = 2, ... , one of length n = 5, and none of length n > 5.


Also, bilingual blocks are valid (i.e. extractable) phrases only if they are
consistent with respect to the word alignment. Larger blocks are more likely
to be inconsistent.
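
To make both conditions concrete, here is a minimal sketch of
consistency-based phrase extraction (a simplified illustration, not the Moses
extractor; it ignores the extension over unaligned target words):

# Simplified sketch of consistent phrase-pair extraction (not the Moses extractor).
# 'alignment' is a set of (src_idx, tgt_idx) links, 0-based.
def extract_phrase_spans(src_len, alignment, max_len=7):
    spans = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no link may connect the target span to a source word outside the span
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for (i, j) in alignment):
                continue
            spans.append(((i1, i2), (j1, j2)))
    return spans

# L = 5, purely diagonal alignment: 5 + 4 + 3 + 2 + 1 = 15 phrase pairs,
# i.e. L - n + 1 source spans of each length n <= 5, and none longer.
diagonal = {(k, k) for k in range(5)}
print(len(extract_phrase_spans(5, diagonal)))  # 15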


Of course you should consider some more aspects, e.g.:

- training settings 
  (there won't be any 8-grams if you set the max. phrase length to 7; 
  long phrases will be affected more by a count cutoff because of sparsity)
- vocabulary sizes limit the number of possible combinations
- n-gram entropy of the language 
  [http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf]


Analyzing such things in detail is surely a fun pastime. You can start
with vocabulary sizes, number of running words of your corpus,
histograms of source-side training sentence lengths, number of distinct
n-grams that appear in the source side of the corpus vs. number of
distinct n-grams that are source sides of valid phrases, number of
distinct n-grams that appear in the source side of the corpus if you
undo the sentence segmentation (replace all line breaks by spaces), etc.
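
A rough sketch of the first of these counts (the corpus file name and the
n-gram range are placeholders): distinct source-side n-gram types per order,
once per sentence and once with the sentence segmentation undone.

# Sketch: distinct source-side n-gram types per order; "corpus.src" is a placeholder.
import itertools

def distinct_ngrams(token_lists, n):
    types = set()
    for toks in token_lists:
        for k in range(len(toks) - n + 1):
            types.add(tuple(toks[k:k + n]))
    return len(types)

with open("corpus.src", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# undo the sentence segmentation: treat the corpus as one long token stream
unsegmented = [list(itertools.chain.from_iterable(sentences))]

for n in range(1, 9):
    print(n, distinct_ngrams(sentences, n), distinct_ngrams(unsegmented, n))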

Cheers,
Matthias



On Thu, 2015-01-15 at 16:39 +, Read, James C wrote:
> Hi,
>
> I just ran a count of different sized n-grams in the source side of my
> phrase table and this is what I got.
>
> unigrams  85,233
> bigrams   991,701
> trigrams  2,697,341
> 4-grams   3,876,180
> 5-grams   4,209,094
> 6-grams   3,702,813
> 7-grams   2,560,251
> 8-grams   0
>
> So, up until the 5-grams the results are what I expected: the number is
> increasing. But then it drops for the 6-grams and drops again for the
> 7-grams.
>
> Does anybody know why?
>
> James
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] phrase table

2015-01-15 Thread John D Burger
I've observed this as well. It seems to me there are several competing
pressures affecting the number of ngram types in a corpus. On the one hand, as
the size of the corpus increases, so does the vocabulary. This obviously
increases the number of unigram types (which is the same as the vocabulary
size), but it also increases the counts for all of the other ngram sizes. On
the other hand, language is hugely constrained by context, and the longer the
context (i.e. the longer the ngram), the less freedom there is in what one can
reasonably say next. If I say "the big", there are lots of reasonable choices
for the third word, but if I say "I was frightened by the barking of the big",
there are very few sensible completions.

You could quantify this by computing perplexity at various ngram sizes, but 
that's just another way of measuring the same effect you see with your ngram 
counts.
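
One crude way to put numbers on this (a sketch assuming a whitespace-tokenized
corpus file, here called "corpus.txt"): for each order n, count how many
distinct next words follow each observed (n-1)-gram context; the average
shrinks quickly as the context grows.

# Sketch: average number of distinct continuations per (n-1)-gram context.
from collections import defaultdict

def avg_continuations(token_lists, n):
    followers = defaultdict(set)
    for toks in token_lists:
        for k in range(len(toks) - n + 1):
            followers[tuple(toks[k:k + n - 1])].add(toks[k + n - 1])
    return sum(len(v) for v in followers.values()) / max(len(followers), 1)

with open("corpus.txt", encoding="utf-8") as f:
    corpus = [line.split() for line in f]

for n in range(2, 8):
    print(n, round(avg_continuations(corpus, n), 2))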

Of course this could be complete nonsense - I'm eager to hear what other people 
think.

- John Burger
 MITRE

On Jan 15, 2015, at 11:39 , Read, James C  wrote:

> Hi,
> 
> I just ran a count of different sized n-grams in the source side of my phrase 
> table and this is what I got.
> 
> unigrams 85,233
> bigrams   991,701
> trigrams   2,697,341
> 4-grams3,876,180
> 5-grams4,209,094
> 6-grams3,702,813
> 7-grams2,560,251
> 8-grams   0
> 
> So, up until the 5-grams the results are what I expected: the number is 
> increasing. But then it drops for the 6-grams and drops again for the 7-grams.
> 
> Does anybody know why?
> 
> James 
> 
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] phrase table

2015-01-15 Thread Read, James C
Hi,


I just ran a count of different sized n-grams in the source side of my phrase 
table and this is what I got.


unigrams  85,233
bigrams   991,701
trigrams  2,697,341
4-grams   3,876,180
5-grams   4,209,094
6-grams   3,702,813
7-grams   2,560,251
8-grams   0


So, up until the 5-grams the results are what I expected: the number is
increasing. But then it drops for the 6-grams and drops again for the 7-grams.


Does anybody know why?


James

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Last Call for Papers: TALN 2015 | RÉCITAL 2015 Conference

2015-01-15 Thread Paul Martin

[ Apologies for cross postings ]

TALN 2015 | RÉCITAL 2015 Conference
Call for Papers TALN 2015 | RÉCITAL 2015


22nd conference on Natural Language Processing
17th Meeting of Student Researchers in Computer Science for Natural Language 
Processing

University of Caen Lower Normandy
June 22-25, 2015  -  Caen, France
http://taln2015.greyc.fr


IMPORTANT DATES


TALN Long paper (10 to 12 pages)
-Paper submission deadline: Friday, January 30, 2015 (23:59 Paris time)
-Notification: Monday, March 30, 2015
-Camera ready paper due: Friday, May 8, 2015

TALN Short paper (4 to 6 pages)
-Paper submission deadline: Friday, April 3, 2015 (23:59 Paris time)
-Notification: Wednesday, May 13, 2015
-Camera ready paper due: Friday, May 22, 2015

TALN Demonstration (1 to 2 pages)
-Submission deadline: Friday, May 8, 2015 (23:59 Paris time)
-Notification: Friday, May 15, 2015
-Camera ready paper due: Friday, May 22, 2015

RÉCITAL paper (10 to 12 pages)
-Paper submission deadline: Friday, March 20, 2015 (23:59 Paris time)
-Notification: Tuesday, April 28, 2015
-Camera ready paper due: Friday, May 15, 2015 

Workshop proposal
-Workshop submission deadline: Friday, January 30, 2015 (23:59 Paris time)
-Program Committee response: Friday, February 6, 2015
-Camera ready paper due: Friday, May 8, 2015


ORGANIZING COMMITTEE


-TALN Chair: Nadine Lucas GREYC, Université Caen Basse Normandie, France
-TALN Co-Chair: Gaël Dias GREYC, Université Caen Basse Normandie, France
-RÉCITAL Chair: Charlotte Lecluze GREYC, Université Caen Basse Normandie, France
-RÉCITAL Co-Chair: Jose G Moreno GREYC, Université Caen Basse Normandie, France


CONTACTS


nadine.lu...@unicaen.fr
charlotte.lecl...@unicaen.fr

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] how to align some new parallel sentences using a trained model

2015-01-15 Thread Christophe Servan
Hello,
As far as I know, you can use the forced alignment process or
incremental GIZA++.
More info here:
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc58

Cheers,

Christophe

2015-01-15 1:54 GMT+01:00 iamzcy_hit iamzcy_hit :

> Hi all,
>   If I've trained an alignment model on a huge parallel corpus with
> the help of GIZA++, MGIZA, or fast-align, and I am now given some new sentence
> pairs and want to align the words in those sentences, how should I do it?
>   Best regards
>
> --
> We are destined to be branded with the mark of this era.
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Tokenization problem

2015-01-15 Thread Ihab Ramadan
Many thanks to all of you.
As you mentioned, the problem is not in the script; it was in the text sent to
the terminal from my web app. I found that some characters do not come through
as-is and arrive with weird Unicode.
Thanks everybody

-Original Message-
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu]
On Behalf Of moses-support-requ...@mit.edu
Sent: Thursday, January 15, 2015 3:39 AM
To: moses-support@mit.edu
Subject: Moses-support Digest, Vol 99, Issue 28

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-requ...@mit.edu

You can reach the person managing the list at
moses-support-ow...@mit.edu

When replying, please edit your Subject line so it is more specific than
"Re: Contents of Moses-support digest..."


Today's Topics:

   1. how to align some new parallel sentences using a trained
  model (iamzcy_hit iamzcy_hit)
   2. Re: Tokenization problem (Tom Hoar)
   3. Re: Tokenization problem (Kenneth Heafield)


--

Message: 1
Date: Thu, 15 Jan 2015 08:54:06 +0800
From: iamzcy_hit iamzcy_hit 
Subject: [Moses-support] how to align some new parallel sentences
using a trained model
To: "moses-support@mit.edu" 
Message-ID:

Content-Type: text/plain; charset="utf-8"

Hi all,
  If I've trained an alignment model on a huge parallel corpus with the
help of GIZA++, MGIZA, or fast-align, and I am now given some new sentence pairs
and want to align the words in those sentences, how should I do it?
  Best regards

--
We are destined to be branded with the mark of this era.
-- next part --
An HTML attachment was scrubbed...
URL:
http://mailman.mit.edu/mailman/private/moses-support/attachments/20150115/9f3850f8/attachment-0001.htm

--

Message: 2
Date: Thu, 15 Jan 2015 08:33:17 +0700
From: Tom Hoar 
Subject: Re: [Moses-support] Tokenization problem
To: moses-support@mit.edu
Message-ID: <54b718dd.4030...@precisiontranslationtools.com>
Content-Type: text/plain; charset="windows-1252"

I just ran the same sentence through the newest github clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you using
for your "in the file" results? You'll find the problem in either your shell
or your custom tool chain(s) before you run tokenizer.perl.
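
If it helps, here is a small, Moses-independent sketch that prints the code
point of every non-ASCII character in whatever the web app sends, which makes
curly apostrophes and similar lookalikes easy to spot (the script name
check_chars.py is just an example):

# Sketch: show non-ASCII characters (e.g. a curly apostrophe, U+2019) in the input.
# Example usage (hypothetical):  echo "printer’s" | python3 check_chars.py
import sys
import unicodedata

for ch in sys.stdin.read():
    if ord(ch) > 127:
        print("U+%04X %s %r" % (ord(ch), unicodedata.name(ch, "UNKNOWN"), ch))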



On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>
> Dears,
>
> I still have this problem. To avoid confusing the decoder I used the
> "-no-escape" parameter of the tokenizer.perl script, but I still have the
> problem of an extra space being added after apostrophes when tokenizing
> files, whereas when tokenizing a single segment there is no extra space.
>
> For example
>
> In the file:
>
> "which will guide you through connecting and configuring your
> printer's wireless connection." -> "which will guide you through
> connecting and configuring your printer ' s wireless connection ."
>
> As a segment:
>
> "which will guide you through connecting and configuring your
> printer's wireless connection." -> "which will guide you through
> connecting and configuring your printer 's wireless connection ."
>
> I wonder, if it is the same script, why it generates two different
> outputs.
>
> I have no experience in Perl, so I could not find the line of code that
> behaves differently depending on whether the segments are in a file or a
> single segment is passed as a parameter to the script.
>
> Please help
>
> *From:*Ihab Ramadan [mailto:i.rama...@saudisoft.com]
> *Sent:* Monday, January 5, 2015 10:09 AM
> *To:* moses-support@mit.edu
> *Subject:* Tokenization problem
>
> Dears,
>
> Using the tokenizer on the training files replaces the apostrophes
> with "' s" (with a space), but if I use the same script to tokenize
> a single sentence it turns the apostrophes into "'s" (without a space).
>
> This problem confuses the decoder during translation.
>
> How can I solve this problem?
>
> Thanks
>
> Best Regards
>
>