On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> VIVEK VICKY <vivekvicky...@gmail.com> wrote:
> > Hello everyone,
> >
> > The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
> > http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
> > in either language, caused by a sentence being split in two or two
> > sentences being merged in the translation; this causes errors during
> > lexical training. Is this common in parallel corpora, or is there a clean
> > parallel corpus out there?
> >
> > Right now I am translating the sentences around [above and below] the
> > empty lines and manually merging/splitting them. Is there a better way
> > to do this?
>
> Can you give an example? I took a look at that corpus and haven't found
> any unmatched lines yet.

In Europarl's spa-eng corpus, "now he is doing just the same" is on line
104 of the English text but shifted to line 105 of the Spanish text. That
is just one example [look for empty lines in both languages]. There are
around 5-8 such sentences for every 2000.

> Make sure you use the es-en.en file when pairing es with en (that is,
> don't use cs-en.en with es-en.es).

Yes, indeed.

> (It *is* common to find semi-parallel corpora out there, but I suppose
> we can leave sentence alignment out of the GSoC task unless there's
> extra time, and assume corpora will be fairly clean.)

We won't get valid rules if we train on semi-parallel corpora, right? Our
script assumes the sentences are perfectly aligned.

PS: These corpora are perfectly sentence-aligned, except for a FEW
sentences that are split or merged in the other language; hence the
blank lines.
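Since the blank lines keep the two files in step (a sentence split in one language maps to sentence + blank line in the other), one option is to detect and drop the pairs where exactly one side is empty instead of fixing them by hand. A minimal Python sketch of that idea, assuming one sentence per line; note this discards the split/merged sentences rather than re-aligning them, and the function names are just illustrative:

```python
def find_empty_pairs(src_lines, trg_lines):
    """Return 1-based line numbers where exactly one side is blank."""
    mismatches = []
    for i, (s, t) in enumerate(zip(src_lines, trg_lines), start=1):
        # XOR: flag the pair if one side is empty and the other is not
        if bool(s.strip()) != bool(t.strip()):
            mismatches.append(i)
    return mismatches

def drop_unpaired(src_lines, trg_lines):
    """Keep only the pairs where both sides are non-empty."""
    return [(s, t) for s, t in zip(src_lines, trg_lines)
            if s.strip() and t.strip()]
```

Running find_empty_pairs first gives you the line numbers to inspect manually (like the 104/105 case above), and drop_unpaired gives a clean corpus for lexical training at the cost of losing those few sentences.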
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff