On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
> VIVEK VICKY <vivekvicky...@gmail.com> wrote:
> > Hello everyone,
> >
> > The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
> > http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
> > in either language, caused by a sentence being split in two or two
> > sentences being merged in the translation; this causes errors during
> > lexical training. Is this common in parallel corpora, or is there a clean
> > parallel corpus out there?
> >
> > Right now I am translating the sentences around [above and below] the
> > empty lines and manually merging/splitting them. Is there a better way
> > to do this?
>
> Can you give an example? I took a look at that corpus and haven't found
> any unmatched lines yet.

In Europarl's spa-eng corpus, "now he is doing just the same" is on line
104 of the English text but shifted to line 105 of the Spanish text. That
is just one example [look for empty lines in both languages]. There are
around 5-8 such sentences for every 2000.

> Make sure you use the es-en.en file when pairing es with en (that is,
> don't use cs-en.en with es-en.es).

Yes, indeed.

> (It *is* common to find semi-parallel corpora out there, but I suppose
> we can leave sentence alignment out of the GSoC task unless there's
> extra time, and assume corpora will be fairly clean.)

We won't get valid rules if we train on semi-parallel corpora, right? Our
script assumes the sentences are perfectly aligned.

PS: These corpora are perfectly sentence-aligned, except for a FEW
sentences that are split or merged in the other language; hence the
blank lines.
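Since the blank lines keep the two files in step (a sentence split in one language maps to sentence + blank line in the other), one option is to detect and drop the pairs where exactly one side is empty instead of fixing them by hand. A minimal Python sketch of that idea, assuming one sentence per line; note this discards the split/merged sentences rather than re-aligning them, and the function names are just illustrative:

```python
def find_empty_pairs(src_lines, trg_lines):
    """Return 1-based line numbers where exactly one side is blank."""
    mismatches = []
    for i, (s, t) in enumerate(zip(src_lines, trg_lines), start=1):
        # XOR: flag the pair if one side is empty and the other is not
        if bool(s.strip()) != bool(t.strip()):
            mismatches.append(i)
    return mismatches

def drop_unpaired(src_lines, trg_lines):
    """Keep only the pairs where both sides are non-empty."""
    return [(s, t) for s, t in zip(src_lines, trg_lines)
            if s.strip() and t.strip()]
```

Running find_empty_pairs first gives you the line numbers to inspect manually (like the 104/105 case above), and drop_unpaired gives a clean corpus for lexical training at the cost of losing those few sentences.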
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff