I already answered this question in another post. Apologies for double posting. Here is the code I used for filtering. I filtered based on the fourth score only.
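To make clear which number "the fourth score" means, here is a minimal sketch of pulling it out of a single phrase-table line. It assumes the usual layout `source ||| target ||| scores ||| ...`; the helper name `fourth_score` and the sample line are made up for illustration and are not part of the filtering script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the fourth score of one phrase-table line,
# or nothing if the line has fewer than three |||-separated fields.
# Assumes the layout: source ||| target ||| scores ||| ...
sub fourth_score {
    my ($line) = @_;
    my @columns = split /\|\|\|/, $line;
    return if @columns < 3;

    # split ' ' ignores the space that follows "|||", so index 3 is the
    # fourth score; split /\s+/ would put an empty string at index 0.
    my @scores = split ' ', $columns[2];
    return $scores[3];
}

# Made-up example line (not taken from a real phrase table)
my $sample = 'das Haus ||| the house ||| 0.5 0.4 0.6 0.3 ||| 0-0 1-1 ||| 10 12 8';
print fourth_score($sample), "\n";
```

Note that the score field starts with a space, so how the field is split decides which number ends up at index 3.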
#!/usr/bin/perl -w
#
# Program filters phrase table to leave only phrase pairs
# with probability above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions(
    'min=f' => \$min,
    'out=s' => \$filtered_table,
    'in=s'  => \$phrase_table,
);

# use defined() so that a threshold of 0 is accepted
die "ERROR: must give threshold and phrase table input file and output file\n"
    unless (defined $min && defined $phrase_table && defined $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open(my $in, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table: $!\n";
open(my $out, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table: $!\n";

while (my $line = <$in>) {
    chomp $line;
    my @columns = split /\|\|\|/, $line;

    # check that file is a well formatted phrase table
    if (scalar @columns < 4) {
        die "ERROR: input file is not a well formatted phrase table. A phrase table must have at least four columns, each column separated by |||\n";
    }

    # the score field starts with a space; split ' ' skips it, so
    # $scores[3] is the fourth score (split /\s+/ would return an
    # empty first element and shift every index by one)
    my @scores = split ' ', $columns[2];

    # keep the phrase pair only if its fourth score is above the threshold
    if ($scores[3] > $min) {
        print $out $line . "\n";
    }
}

close($in);
close($out) or die "FATAL: Could not close $filtered_table: $!\n";

________________________________________
From: moses-support-boun...@mit.edu <moses-support-boun...@mit.edu> on behalf of Rico Sennrich <rico.sennr...@gmx.ch>
Sent: Wednesday, June 17, 2015 7:17 PM
To: moses-support@mit.edu
Subject: Re: [Moses-support] Major bug found in Moses

Read, James C <jcread@...> writes:

>
> Actually the approximation I expect to be:
>
> p(e|f)=p(f|e)
>
> Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering
> experiments prove otherwise.
>
> James

I recommend you read the following:
https://en.wikipedia.org/wiki/Confusion_of_the_inverse

You don't explain which score you use for filtering (do you take one of the scores, their sum, their product, or something else?), but I expect you (mostly) keep the phrase pairs with a high p(e|f), which is the best thing to do when you don't have a language model.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support