Hello, just getting back to this issue, since I bumped into it again:
$ tr -d -c '\n' < news-commentary-v12.de-en.de | wc -c
270769
$ tr -d -c '\r' < news-commentary-v12.de-en.de | wc -c
3920
$ tr -d -c '\n' < news-commentary-v12.de-en.en | wc -c
270769
$ tr -d -c '\r' < news-commentary-v12.de-en.en | wc -c
4099

So v12 is somehow broken: the two files have the same number of '\n'
characters but different numbers of '\r' characters. Reading it goes
wrong with some tools / primitives, but it works with others.

Just to let you know.

On 14/09/2017 at 08:48, Vincent Nguyen wrote:
> Okay, really weird.
> wc gives me the same numbers as you, but gedit gives two other, different
> numbers for the two files. Must be special characters somewhere.
>
>
> On 13/09/2017 at 18:52, Barry Haddow wrote:
>> Hi Vincent
>>
>> Looks fine to me:
>>
>>> wc -l news-commentary-v12.de-en.*
>>> 270769 news-commentary-v12.de-en.de
>>> 270769 news-commentary-v12.de-en.en
>>> 541538 total
>> What are you running that shows you different line numbers?
>>
>> cheers - Barry
>>
>> On 12/09/17 10:06, Vincent Nguyen wrote:
>>> Hi,
>>> Is there an updated version of NCv12 for this
>>> http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
>>>
>>>
>>> the number of lines for de-en is not the same in the 2 languages.
>>>
>>> Cheers,
>>> Vincent
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
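
PS: for anyone who wants to see the effect on a small scale, here is a minimal sketch of the same tr/wc counting applied to a throwaway sample file. The helper name count_chars and the sample text are made up for illustration; the only assumption about the real corpus is that the extra '\r' characters are bare carriage returns inside lines, which I have not verified beyond the counts in this thread.

```shell
# Hypothetical helper: count how many characters from the set $1 occur in file $2.
count_chars() {
    # Delete everything except the requested character, then count what is left.
    # The final tr strips the padding some wc implementations add.
    tr -d -c "$1" < "$2" | wc -c | tr -d ' '
}

# Made-up sample: two LF-terminated lines, the first one containing a bare CR.
sample=$(mktemp)
printf 'first half\rsecond half\nplain line\n' > "$sample"

lf=$(count_chars '\n' "$sample")   # line feeds, which is what wc -l counts
cr=$(count_chars '\r' "$sample")   # carriage returns

echo "LF: $lf  CR: $cr"            # prints "LF: 2  CR: 1"
rm -f "$sample"
```

A tool that counts only '\n' sees 2 lines in this sample, while a tool that also treats a bare '\r' as a line break would see 3. If the two corpus files behave the same way, that would be consistent with their different '\r' counts (3920 vs. 4099) showing up as different line counts in some tools.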