Re: [Apertium-stuff] Cleaning Parallel Corpus

2021-04-29 Thread VIVEK VICKY
Awesome I will try it out. Thanks!!

On Thu, 29 Apr, 2021, 11:31 pm Tanmai Khanna, 
wrote:

> Since you have only about 5-8 such sentences for every 2,000 lines, and it
> seems like empty lines are a reliable marker for this kind of situation,
> something I would do is prune the corpus: remove every empty line, along
> with the two lines before and the two lines after it, from both the English
> and Spanish corpora. You'd lose some sentences to train on, but that loss
> would be negligible, and the remaining corpus would be aligned.
>
> Just a thought
>
> *तन्मय खन्ना *
> *Tanmai Khanna*
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Cleaning Parallel Corpus

2021-04-29 Thread Tanmai Khanna
Since you have only about 5-8 such sentences for every 2,000 lines, and it
seems like empty lines are a reliable marker for this kind of situation,
something I would do is prune the corpus: remove every empty line, along
with the two lines before and the two lines after it, from both the English
and Spanish corpora. You'd lose some sentences to train on, but that loss
would be negligible, and the remaining corpus would be aligned.
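If it helps, the pruning step could be sketched roughly like this (a rough
sketch only, not tested on the actual corpus; the function name and the
two-line window parameter are my own choices, and it assumes both halves are
already loaded as lists of lines of equal length):

```python
def prune_parallel(src_lines, tgt_lines, window=2):
    """Drop every position where either side is blank, plus `window`
    lines on each side of it, from both halves of the parallel corpus."""
    assert len(src_lines) == len(tgt_lines)
    bad = set()
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        if not s.strip() or not t.strip():
            # Mark the blank position and its neighbours for removal.
            bad.update(range(max(0, i - window),
                             min(len(src_lines), i + window + 1)))
    keep = [i for i in range(len(src_lines)) if i not in bad]
    return [src_lines[i] for i in keep], [tgt_lines[i] for i in keep]
```

You'd read es-en.en and es-en.es with `splitlines()`, run this on the two
lists, and write the results back out.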

Just a thought

*तन्मय खन्ना *
*Tanmai Khanna*




Re: [Apertium-stuff] Cleaning Parallel Corpus

2021-04-29 Thread VIVEK VICKY
On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer 
wrote:

> Can you give an example? I took a look at that corpus and haven't found
> any unmatched lines yet.

In Europarl's spa-eng corpus, the English sentence at line 104, "now he is
doing just the same", appears at line 105 in the Spanish text. This is just
one example (look for empty lines in both languages). There are around 5-8
such sentences for every 2,000 lines.
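For what it's worth, a quick way to list where the blank lines sit in each
half is something like this (a hypothetical helper of my own; line numbers
are 1-based, as in a text editor):

```python
def empty_line_numbers(lines):
    """Return 1-based line numbers of blank (whitespace-only) lines."""
    return [i for i, line in enumerate(lines, start=1) if not line.strip()]
```

Comparing the lists for es-en.en and es-en.es shows exactly where the two
files drift out of alignment.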

> Make sure you use the es-en.en file when
> pairing es with en (that is, don't use cs-en.en with es-en.es).
>
Yes, indeed.


> (It *is* common to find semi-parallel corpora out there, but I suppose
> we can leave sentence alignment out of the GsoC task unless
> there's extra time, and assume corpora will be fairly clean.)
>
We won't get valid rules if we train on semi-parallel corpora, right? Our
script assumes the sentences are perfectly aligned.
PS: These corpora are perfectly sentence-aligned, except for a few sentences
that are split or merged in the other language, hence the blank lines.



Re: [Apertium-stuff] Cleaning Parallel Corpus

2021-04-29 Thread Kevin Brubeck Unhammer
VIVEK VICKY wrote:

> Hello everyone,
> The eng-spa parallel corpora I am using(http://www.statmt.org/europarl/,
> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz), have empty lines
> in either languages due to splitting of a sentence into two or merging of
> two sentences after the translation, which is causing errors during
> lexical-training. Is it common in parallel corpora? or is there any clean
> parallel corpus out there?
> Right now, I am translating the sentences around[up and below] the empty
> lines and manually merging/splitting them. Is there any better way to do
> this?

Can you give an example? I took a look at that corpus and haven't found
any unmatched lines yet. Make sure you use the es-en.en file when
pairing es with en (that is, don't use cs-en.en with es-en.es).

(It *is* common to find semi-parallel corpora out there, but I suppose
we can leave sentence alignment out of the GSoC task unless
there's extra time, and assume corpora will be fairly clean.)




[Apertium-stuff] Cleaning Parallel Corpus

2021-04-28 Thread VIVEK VICKY
Hello everyone,
The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
in either language, where a sentence was split into two or two sentences
were merged in the translation, and this causes errors during lexical
training. Is this common in parallel corpora, or is there a clean parallel
corpus out there?
Right now, I am translating the sentences around (above and below) the
empty lines and manually merging/splitting them. Is there a better way to
do this?
Regards,
Vivek Vardhan Adepu
IRC: vivekvelda*/naan_dhaan*