Re: [Moses-support] Europarl monolingual pipeline

2015-01-06 Thread Kenneth Heafield
Hi again,

Sorry, never mind! "We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data."

Kenneth

On 01/06/15 17:20, Kenneth Heafield wrote:
> Hi,
>
> It seems that the WMT release is missing data.

Re: [Moses-support] Europarl monolingual pipeline

2015-01-06 Thread Philipp Koehn
Hi,

this is done on purpose - the Q4 2000 is used for test sets, so it is excluded from the parallel and monolingual training corpora.

-phi

On Tue, Jan 6, 2015 at 2:20 PM, Kenneth Heafield wrote:
> Hi,
>
> It seems that the WMT release is missing data. For example, why does
> en/ep-10-
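
For anyone rebuilding a training set for a non-WMT language, the exclusion described above could be sketched roughly as follows. This is only an illustration, not the script that produced the WMT release. It assumes session files are named `ep-YY-MM-DD.txt` (optionally with a numeric part suffix), and the helper names `session_date`, `is_held_out` and `training_files` are made up for this example.

```python
import re
from datetime import date

# Assumption: files are named ep-YY-MM-DD.txt, optionally ep-YY-MM-DD-NNN.txt.
_NAME = re.compile(r"^ep-(\d{2})-(\d{2})-(\d{2})(?:-\d+)?\.txt$")

# Held-out test period mentioned in the thread: 2000-10 until 2000-12.
HELD_OUT_START = date(2000, 10, 1)
HELD_OUT_END = date(2000, 12, 31)


def session_date(filename):
    """Parse the session date from a file name, or return None if it does not match."""
    m = _NAME.match(filename)
    if m is None:
        return None
    yy, mm, dd = (int(g) for g in m.groups())
    # Assumption: two-digit years from 90 upward belong to the 1990s.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return date(year, mm, dd)


def is_held_out(filename):
    """True if the file's session date falls in the last quarter of 2000."""
    d = session_date(filename)
    return d is not None and HELD_OUT_START <= d <= HELD_OUT_END


def training_files(filenames):
    """Keep every file that is not in the held-out quarter, preserving order."""
    return [f for f in filenames if not is_held_out(f)]
```

Files whose names do not match the assumed pattern are kept rather than dropped, so an unexpected naming scheme shows up as extra data instead of silently missing data.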

Re: [Moses-support] Europarl monolingual pipeline

2015-01-06 Thread Kenneth Heafield
Hi,

It seems that the WMT release is missing data. For example, why does en/ep-10-02.txt line 4 contain "21 September 2000" but this does not appear in the WMT europarl-v7.en file from the WMT site?

Kenneth

On 01/06/15 14:24, Philipp Koehn wrote:
> Hi,
>
> the Perl script that was used
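
A check like the one described above (does a given phrase occur anywhere in a large corpus file?) can be done without loading the file into memory. The following is a minimal sketch; the function name `find_phrase` is hypothetical and UTF-8 encoding is an assumption.

```python
def find_phrase(path, phrase, encoding="utf-8"):
    """Return the 1-based numbers of all lines in `path` that contain `phrase`."""
    hits = []
    # Iterate line by line so that multi-gigabyte corpora are never held in memory.
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            if phrase in line:
                hits.append(number)
    return hits
```

An empty result for the released corpus, together with a hit in the raw session file, would point to the sentence having been left out on purpose or lost somewhere in preprocessing.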

Re: [Moses-support] Europarl monolingual pipeline

2015-01-06 Thread Philipp Koehn
Hi,

the Perl script that was used to build this corpus is:

#!/usr/bin/perl -w
use strict;
my ($l) = @ARGV;
my $data = "/home/pkoehn/statmt/data/europarl-v7";
my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
my $preprocessor = "$tools/split-sentences.perl -q";
die("ERROR: no data for l

[Moses-support] Europarl monolingual pipeline

2015-01-06 Thread Kenneth Heafield
Dear Moses,

Where does this data come from?
http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz

Specifically, if I wanted non-WMT languages, then I can download Europarl from http://www.statmt.org/europarl/ . There are some tools, like a perl script to strip XML, bu