Hi John,

I'm afraid word alignment tools like GIZA++ aren't really designed to
be run on paragraph-length input. That is probably one reason why you're
getting bad alignments.

I don't know whether tweaking the parameters, or using another word
alignment tool, would make it any better.
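
If you do try the preprocessing you describe below (splitting paragraphs
only where the sentence counts match), a minimal sketch might look like
this. It is only an illustration under stated assumptions: the regex
sentence splitter is naive, the function names are hypothetical, and the
max_ratio length check is a crude guess at catching pairs whose counts
match but whose sentences were joined or split differently. It is not
part of GIZA++ or Moses.

```python
import re

# Naive splitter: break on whitespace that follows '.', '!' or '?'.
# Assumption for illustration only; abbreviations and other punctuation
# conventions need a proper sentence splitter.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split one paragraph into a list of non-empty sentences."""
    return [s for s in _SENT_END.split(paragraph.strip()) if s]


def split_pair(src_par, tgt_par, max_ratio=3.0):
    """Return (source, target) training units for one paragraph pair.

    The pair is split into sentence pairs only when both sides have the
    same number of sentences and no pair differs in word count by more
    than max_ratio; otherwise the paragraph pair is kept whole.
    """
    src = split_sentences(src_par)
    tgt = split_sentences(tgt_par)
    whole = [(src_par.strip(), tgt_par.strip())]
    if len(src) != len(tgt):
        return whole
    for s, t in zip(src, tgt):
        ls, lt = len(s.split()), len(t.split())
        # Very uneven lengths suggest the sentences do not correspond.
        if max(ls, lt) > max_ratio * min(ls, lt):
            return whole
    return list(zip(src, tgt))
```

Falling back to the whole paragraph keeps all of the text in the corpus,
so nothing is lost when a pair looks suspicious.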

On Wed, Jul 8, 2020, 6:19 PM John Thompson <
john.thompson.jtsoftw...@gmail.com> wrote:

> Hi,
>
> I'm using a 7162-line paragraph-aligned corpus. Unfortunately the
> translations within a paragraph sometimes don't have their sentences
> aligned, i.e. in one language there could be one long sentence, and
> in the other language its clauses could be broken up into multiple
> sentences, hence I'm running GIZA++ on paragraphs.
>
> It works partially, but the alignment of words is often wrong or it's
> missing matches that should have been made.
>
> I set the "maxsentencelength" configuration file parameter to 350, though
> most of the paragraphs are around 100 or fewer words.
>
> Q1: What difference do you estimate I should expect between using
> paragraphs vs. sentences?
>
> Q2: Are there GIZA++ parameters I could tune to improve the alignment?
>
> Q3: If I concatenated multiple corpora, would the alignment output likely
> improve?
>
> I could preprocess the corpus, breaking up the paragraphs where the number
> of sentences match, but there may be some cases where the sentences don't
> align, where multiple sentences within the paragraph were joined or split
> differently, such that the sentence count of the paragraph is the same, but
> the sentences don't align.
>
> Q4: How big an effect would these bad sentence alignments have on the
> rest of the alignments?
>
> Q5: Any ideas for how to get better word alignment with these corpora that
> I have, either with GIZA++ or a different tool?
>
> I'm using the word alignment in a language study tool. For example, I have
> the text for a book in both English and Marshallese, but language resources
> for Marshallese are scarce. In my tool I associate the alignment
> information with the text, and also generate a dictionary using the
> alignment output (or optionally the *.dict.actual.ti.final dictionary list
> output). In one page I show the aligned sentences, and in another you can
> click on words to get both the alignment definition, and the dictionary
> definitions.  For each source word dictionary entry, I sort the target
> definitions by descending frequency (or probability if using the
> *.dict.actual.ti.final dictionary list output), and then chop off the list
> after a certain number, as otherwise there will be a lot of bad or spurious
> definitions included.
>
> Thanks!
>
> -John
>
> --
> John Thompson
> john.thompson.jtsoftw...@gmail.com
> https://www.jtlanguage.com
> 1-909-283-4364 (home)
> 1-909-283-5642 (cell)
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
