Re: [Apertium-stuff] Cleaning Parallel Corpus
Awesome, I will try it out. Thanks!!

On Thu, 29 Apr 2021, 11:31 pm Tanmai Khanna wrote:
> Since you have only about 5-8 such sentences for every 2000 lines, and
> empty lines seem to be a reliable marker for these kinds of situations,
> I would prune the corpus: remove every empty line, along with the two
> lines before it and the two lines after it, from both the English and
> the Spanish corpus. [...]

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
Re: [Apertium-stuff] Cleaning Parallel Corpus
Since you have only about 5-8 such sentences for every 2000 lines, and
empty lines seem to be a reliable marker for these kinds of situations,
what I would do is prune the corpus: remove every empty line, along with
the two lines before it and the two lines after it, from both the English
and the Spanish corpus. You'd lose some sentences to train on, but the
loss would be negligible, and the remaining corpus would be aligned.

Just a thought.

*तन्मय खन्ना*
*Tanmai Khanna*

On Thu, Apr 29, 2021 at 6:23 PM VIVEK VICKY wrote:
> We won't get valid rules if we train on semi-parallel corpora, right?
> Our script assumes the sentences are perfectly aligned.
> PS: These corpora are perfectly sentence-aligned, except for a few that
> are split or merged in the other language. Hence the blank lines. [...]
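The pruning recipe above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch (the function name and the ±2-line window parameter are illustrations of the suggestion, not an existing Apertium script):

```python
def prune_parallel(src_lines, trg_lines, window=2):
    """Drop every line pair where either side is empty, together with
    the `window` line pairs before and after it, from BOTH sides, so
    the two files stay aligned."""
    assert len(src_lines) == len(trg_lines), "files must have equal length"
    bad = set()
    for i, (s, t) in enumerate(zip(src_lines, trg_lines)):
        if not s.strip() or not t.strip():
            for j in range(i - window, i + window + 1):
                if 0 <= j < len(src_lines):
                    bad.add(j)
    src_kept = [s for i, s in enumerate(src_lines) if i not in bad]
    trg_kept = [t for i, t in enumerate(trg_lines) if i not in bad]
    return src_kept, trg_kept
```

With 5-8 blank lines per 2000, this drops at most about 40 lines per 2000 (5 lines per blank, fewer where windows overlap), i.e. roughly 2% of the corpus, which supports the "negligible" estimate above.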
Re: [Apertium-stuff] Cleaning Parallel Corpus
On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer wrote:
> Can you give an example? I took a look at that corpus and haven't found
> any unmatched lines yet.

In Europarl's spa-eng corpus, line 104 of the English text, "now he is
doing just the same", is shifted to line 105 in the Spanish text. This is
just one example (look for empty lines in both languages); there are
around 5-8 such sentences for every 2000.

> Make sure you use the es-en.en file when pairing es with en (that is,
> don't use cs-en.en with es-en.es).

Yes, indeed.

> (It *is* common to find semi-parallel corpora out there, but I suppose
> we can leave sentence alignment out of the GSoC task unless there's
> extra time, and assume corpora will be fairly clean.)

We won't get valid rules if we train on semi-parallel corpora, right? Our
script assumes the sentences are perfectly aligned.
PS: These corpora are perfectly sentence-aligned, except for a FEW that
are just split or merged in the other language. Hence the blank lines.
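The "look for empty lines in both languages" check is easy to script. A hypothetical helper (the name is illustrative) that flags line pairs where exactly one side is blank, which is the split/merge pattern described above:

```python
def find_desync_lines(src_lines, trg_lines):
    """Return 1-based numbers of lines that are empty on exactly one
    side - likely points where a sentence was split or merged in
    translation."""
    return [i for i, (s, t) in enumerate(zip(src_lines, trg_lines), 1)
            if bool(s.strip()) != bool(t.strip())]
```

Running this over each paired file (e.g. es-en.en against es-en.es, read with `splitlines()`) gives the exact positions to inspect, instead of scanning for blanks by hand.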
Re: [Apertium-stuff] Cleaning Parallel Corpus
VIVEK VICKY wrote:
> [...] Is there any better way to do this?

Can you give an example? I took a look at that corpus and haven't found
any unmatched lines yet. Make sure you use the es-en.en file when pairing
es with en (that is, don't use cs-en.en with es-en.es).

(It *is* common to find semi-parallel corpora out there, but I suppose we
can leave sentence alignment out of the GSoC task unless there's extra
time, and assume corpora will be fairly clean.)
[Apertium-stuff] Cleaning Parallel Corpus
Hello everyone,

The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
in either language, caused by a sentence being split in two, or two
sentences being merged, in the translation; this causes errors during
lexical training. Is this common in parallel corpora, or is there a clean
parallel corpus out there?

Right now I am translating the sentences around (above and below) the
empty lines and manually merging or splitting them. Is there a better way
to do this?

Regards,
Vivek Vardhan Adepu
IRC: vivekvelda / naan_dhaan
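The manual workflow described in this first message (inspecting the lines above and below each empty line) can be partly automated. A hypothetical helper, with an illustrative name and a ±2-line window, that collects the context around each blank line for review:

```python
def context_around_blanks(lines, window=2):
    """Map each blank line's 1-based number to the surrounding lines
    (up to `window` lines on each side), for manual merging/splitting."""
    out = {}
    for i, line in enumerate(lines):
        if not line.strip():
            lo = max(0, i - window)
            hi = min(len(lines), i + window + 1)
            out[i + 1] = lines[lo:hi]
    return out
```

This only gathers the passages to look at; deciding whether to merge or split the neighbouring sentences still needs a human (or a real sentence aligner).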