I already answered this question in another post. Apologies for double posting. 
Here is the code I used for filtering. I filtered based on the fourth score 
only.

#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(     'min=f'         => \$min,
                'out=s'         => \$filtered_table,
                'in=s'          => \$phrase_table)
        or die "ERROR: could not parse command line options\n";
# use defined() for the threshold so that --min 0 is accepted
die "ERROR: must give threshold and phrase table input file and output file\n"
        unless (defined $min && $phrase_table && $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);
open (PHRASETABLE, '<', $phrase_table)
        or die "FATAL: Could not open phrase table $phrase_table: $!\n";
open (FILTEREDTABLE, '>', $filtered_table)
        or die "FATAL: Could not open output file $filtered_table: $!\n";

while (my $line = <PHRASETABLE>)
{
        chomp $line;
        my @columns = split ('\|\|\|', $line);

        # check that the line is a well formatted phrase table entry
        if (scalar @columns < 4)
        {
                die "ERROR: input file is not a well formatted phrase table. "
                  . "A phrase table must have at least four columns, "
                  . "each column separated by |||\n";
        }

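        # get the fourth score and keep the pair only if it is above the threshold
        # (split ' ' skips the leading space in the score column, so index 3
        # really is the fourth score)
        my @scores = split ' ', $columns[2];
        if (defined $scores[3] && $scores[3] > $min)
        {
                print FILTEREDTABLE $line."\n";
        }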
}
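
One detail worth checking when picking a score by index: after splitting on
|||, the score column usually starts with a space, and a regex split on
whitespace then returns an empty first field. The sketch below shows the
difference on a single line. The phrase pair and its scores are made up for
illustration and do not come from a real phrase table.

```perl
use strict;
use warnings;

# Hypothetical phrase table line; the four scores are invented for illustration.
my $line = 'das Haus ||| the house ||| 0.5 0.4 0.6 0.3 ||| 0-0 1-1 ||| 10 12 8';

# Same column split as the script: the score column keeps its surrounding spaces.
my @columns = split /\|\|\|/, $line;

# A regex split keeps an empty leading field when the string starts with whitespace.
my @regex_scores = split /\s+/, $columns[2];

# The special pattern ' ' skips leading whitespace, as awk does.
my @awk_scores = split ' ', $columns[2];

print "split /\\s+/ : index 3 is $regex_scores[3]\n";
print "split ' '   : index 3 is $awk_scores[3]\n";
```

On this line the regex split gives 0.6 at index 3 (the third score) and the
' ' split gives 0.3 (the fourth score), so the choice of split decides which
score the threshold is applied to.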


________________________________________
From: moses-support-boun...@mit.edu <moses-support-boun...@mit.edu> on behalf 
of Rico Sennrich <rico.sennr...@gmx.ch>
Sent: Wednesday, June 17, 2015 7:17 PM
To: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

Read, James C <jcread@...> writes:

>
> Actually the approximation I expect to be:
>
> p(e|f)=p(f|e)
>
> Why would you expect this to give poor results if the TM is well trained?
> Surely the results of my filtering experiments prove otherwise.
>
> James

I recommend you read the following:
https://en.wikipedia.org/wiki/Confusion_of_the_inverse

You don't explain which score you use for filtering (do you take one of the
scores, their sum, their product, or something else?), but I expect you
(mostly) keep the phrase pairs with a high p(e|f), which is the best thing
to do when you don't have a language model.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
