Hi John,

I'm afraid word alignment tools like GIZA++ aren't really designed to be run against paragraph-length input. That is probably one reason why you're getting bad alignments.
I don't know if tweaking the parameters, or using a different word alignment tool, would make it any better.

On Wed, Jul 8, 2020, 6:19 PM John Thompson <john.thompson.jtsoftw...@gmail.com> wrote:

> Hi,
>
> I'm using a 7162-line paragraph-aligned corpus. Unfortunately the
> translations within a paragraph sometimes don't have their sentences
> aligned, i.e. in one language the text could be one long sentence, and
> in the other language it could have clauses broken up into multiple
> sentences, hence I'm running GIZA++ on paragraphs.
>
> It works partially, but the alignment of words is often wrong or it's
> missing matches that should have been made.
>
> I set the "maxsentencelength" configuration file parameter to 350, though
> most of the paragraphs are around 100 or fewer words.
>
> Q1: What difference do you estimate I should expect between using
> paragraphs vs. sentences?
>
> Q2: Are there GIZA++ parameters I could tune to improve the alignment?
>
> Q3: If I concatenated multiple corpora, would the alignment output likely
> improve?
>
> I could preprocess the corpus, breaking up the paragraphs where the number
> of sentences matches, but there may be some cases where the sentences don't
> align, where multiple sentences within the paragraph were joined or split
> differently, such that the sentence count of the paragraph is the same, but
> the sentences don't align.
>
> Q4: How big an effect would these bad sentence alignments have on the rest
> of the alignments?
>
> Q5: Any ideas for how to get better word alignment with these corpora that
> I have, either with GIZA++ or a different tool?
>
> I'm using the word alignment in a language study tool. For example, I have
> the text for a book in both English and Marshallese, but language resources
> for Marshallese are scarce. In my tool I associate the alignment
> information with the text, and also generate a dictionary using the
> alignment output (or optionally the *.dict.actual.ti.final dictionary list
> output). In one page I show the aligned sentences, and in another you can
> click on words to get both the alignment definition and the dictionary
> definitions. For each source word dictionary entry, I sort the target
> definitions by descending frequency (or probability if using the
> *.dict.actual.ti.final dictionary list output), and then chop off the list
> after a certain number, as otherwise there will be a lot of bad or spurious
> definitions included.
>
> Thanks!
>
> -John
>
> --
> John Thompson
> john.thompson.jtsoftw...@gmail.com
> https://www.jtlanguage.com
> 1-909-283-4364 (home)
> 1-909-283-5642 (cell)
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
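
For reference, the preprocessing idea in the quoted message (breaking a paragraph pair into sentence pairs only when both sides have the same number of sentences) could be sketched roughly as follows. This is only an illustration, not something from the thread: the function names are made up, and the regex splitter is a naive assumption that sentences end in ".", "!" or "?" followed by whitespace, which may not hold for every text.

```python
import re

# Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences using the naive rule above."""
    return [s for s in _SENT_END.split(paragraph.strip()) if s]


def split_matching_paragraphs(pairs):
    """Turn (source, target) paragraph pairs into finer-grained pairs.

    A paragraph pair is split into sentence pairs only when both sides
    yield the same number of sentences; otherwise it is kept whole.
    """
    out = []
    for src, tgt in pairs:
        src_sents = split_sentences(src)
        tgt_sents = split_sentences(tgt)
        if len(src_sents) == len(tgt_sents):
            out.extend(zip(src_sents, tgt_sents))
        else:
            out.append((src.strip(), tgt.strip()))
    return out
```

As the quoted message itself points out, equal sentence counts do not guarantee that the sentences actually correspond, so some of the resulting pairs may still be misaligned.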
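
The dictionary step John describes (sort each source word's candidate translations by descending frequency, then cut the list off after a fixed number) might look something like the sketch below. It assumes the aligned word pairs have already been extracted from the alignment output as (source, target) tuples. The function name, the input shape, and the default cutoff are all hypothetical, and nothing here reads GIZA++ files directly.

```python
from collections import Counter, defaultdict


def build_dictionary(aligned_pairs, max_entries=3):
    """Build a pruned source -> [target, ...] dictionary.

    aligned_pairs: iterable of (source_word, target_word) tuples,
                   assumed to be extracted from the word alignments.
    max_entries:   hypothetical cutoff; keep only this many targets
                   per source word, most frequent first.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1

    # most_common() returns targets in descending order of frequency.
    return {
        src: [tgt for tgt, _ in targets.most_common(max_entries)]
        for src, targets in counts.items()
    }
```

A small cutoff drops rare, likely spurious pairings at the cost of also dropping genuine but infrequent translations, so the right value probably depends on the corpus size.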