Hi again,
Sorry, never mind!
"We recommend using the last quarter of 2000 for testing (2000-10 until
2000-12) for consistency in reporting research results on this data."
Kenneth
On 01/06/15 17:20, Kenneth Heafield wrote:
> Hi,
>
> It seems that the WMT release is missing data.
Hi,
This is done on purpose: Q4 2000 is used for test sets, so it is excluded
from the parallel and monolingual training corpora.
-phi
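
[Illustration, not part of the original message.] The exclusion described above can be sketched as a filter on file names. This is a minimal Python sketch, not the Perl script that was actually used to build the corpus. It assumes Europarl source files are named ep-YY-MM-DD.txt (optionally with a trailing part number), which is not stated in this thread; the function names is_held_out and training_files are hypothetical.

```python
import re

# Assumption: files are named ep-YY-MM-DD.txt, optionally with a suffix
# such as ep-YY-MM-DD-NNN.txt. This pattern is not confirmed by the thread.
_NAME = re.compile(r"ep-(\d{2})-(\d{2})-(\d{2})")


def is_held_out(filename):
    """Return True if the file falls in Q4 2000 (2000-10 to 2000-12)."""
    m = _NAME.search(filename)
    if m is None:
        # Unrecognised name: do not treat it as test data.
        return False
    yy, mm = int(m.group(1)), int(m.group(2))
    # Assumption: two-digit years 90-99 are 19xx, everything else is 20xx.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return year == 2000 and 10 <= mm <= 12


def training_files(filenames):
    """Keep only the files that are outside the held-out test period."""
    return [f for f in filenames if not is_held_out(f)]
```

Under these assumptions, a file dated 2 October 2000 would be dropped from the training set even if its text mentions an earlier date such as "21 September 2000", because the filter looks at the session date in the file name, not at dates inside the text.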
On Tue, Jan 6, 2015 at 2:20 PM, Kenneth Heafield wrote:
> Hi,
>
> It seems that the WMT release is missing data. For example, why does
> en/ep-10-
Hi,
It seems that the WMT release is missing data. For example, why does
en/ep-10-02.txt line 4 contain "21 September 2000" but this does not
appear in the WMT europarl-v7.en file from the WMT site?
Kenneth
On 01/06/15 14:24, Philipp Koehn wrote:
> Hi,
>
> the Perl script that was used
Hi,
the Perl script that was used to build this corpus is:
#!/usr/bin/perl -w
use strict;
my ($l) = @ARGV;
my $data = "/home/pkoehn/statmt/data/europarl-v7";
my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
my $preprocessor = "$tools/split-sentences.perl -q";
die("ERROR: no data for l
Dear Moses,
Where does this data come from?
http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
Specifically, if I want non-WMT languages, I can download
Europarl from http://www.statmt.org/europarl/ .
There are some tools, like a Perl script to strip XML, bu