John, any updates on this?
> On Nov 23, 2016, at 12:28 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> I think it will be much less of a headache. The GIZA++ code is notorious for
> being unreadable, and the Perl piece of that pipeline only hurts (even though
> Philipp's Perl is unusually clear). I think adding atools to your port is the
> way to go, and that it's written in C++ should facilitate that.
>
>> On Nov 23, 2016, at 12:25 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>>
>> It'll be a headache because it also has no documentation, but to be fair it
>> may be less of a headache / a better long-term solution than trying to move
>> forward with this hackier solution.
>>
>> I'll keep the symal use on the backburner and start putting together an
>> atools port.
>>
>> -John
>>
>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>
>>> John, I suggest trying to ditch those GIZA++ tools entirely. fast_align
>>> indeed replaced them with "atools"; how much work would it be to port that?
>>>
>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it
>>>> into the pipeline.
>>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>>> that I haven't ported to Java.
>>>> We package symal (which symmetricizes alignments) with Joshua right now
>>>> for GIZA++, so I'm attempting to re-use that.
>>>> However, symal uses the .bal format, which it fails to describe.
>>>> It gets away with this because files from GIZA++ are piped through
>>>> giza2bal.pl, which itself is not well documented.
>>>> I'm attempting to write, say, fastalign2bal.py.
>>>> With a bit of tinkering, I got at the .bal format:
>>>>
>>>> 1
>>>>
>>>> 7 jehovah said to moses and aaron : # 3 2 2 4 5 6 8
>>>>
>>>> 8 i řekl hospodin mojžíšovi a aronovi takto : # 2 2 1 4 5 6 6 7
>>>>
>>>> A template for which would be
>>>>
>>>> 1
>>>>
>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>>
>>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>>> some fastalign2bal.py output.
>>>> A few hours with gdb made some progress (for as far as I can tell, the
>>>> formats are identical) but if anyone has experience with symal, I would
>>>> greatly appreciate some consultation.
>>>>
>>>> -John
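
For reference, here is a minimal sketch of what a fastalign2bal.py-style converter might look like, based only on the template quoted above. Everything beyond that template is an assumption: that each alignment entry is the 1-based position of the aligned token in the other sentence, that 0 marks an unaligned token, that the two inputs are "i-j" source-target pairs with 0-based indices (one run per direction), and that the blank lines in the quoted sample come from the mail client rather than the format. The function names are hypothetical and not part of Joshua, symal, or fast_align.

```python
def parse_pairs(line):
    """Parse 'i-j' pairs (assumed 0-based source-target indices) into tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def positions(num_tokens, links):
    """For each token, the 1-based index of its partner; 0 if unaligned (assumption)."""
    out = [0] * num_tokens
    for own, other in links:
        out[own] = other + 1
    return out


def to_bal(src_tokens, tgt_tokens, fwd_links, rev_links):
    """Render one sentence pair in the .bal layout inferred in the thread.

    fwd_links: (src, tgt) pairs with at most one link per target token.
    rev_links: (src, tgt) pairs with at most one link per source token.
    """
    tgt_align = positions(len(tgt_tokens), [(t, s) for s, t in fwd_links])
    src_align = positions(len(src_tokens), [(s, t) for s, t in rev_links])
    lines = [
        "1",
        f"{len(tgt_tokens)} {' '.join(tgt_tokens)} # {' '.join(map(str, tgt_align))}",
        f"{len(src_tokens)} {' '.join(src_tokens)} # {' '.join(map(str, src_align))}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Links chosen by hand so the output matches the sample quoted in the thread.
    src = "i řekl hospodin mojžíšovi a aronovi takto :".split()
    tgt = "jehovah said to moses and aaron :".split()
    fwd = parse_pairs("2-0 1-1 1-2 3-3 4-4 5-5 7-6")
    rev = parse_pairs("0-1 1-1 2-0 3-3 4-4 5-5 6-5 7-6")
    print(to_bal(src, tgt, fwd, rev), end="")
```

Diffing this kind of output byte-for-byte against what giza2bal.pl emits for the same sentence pair (trailing whitespace, blank lines, line endings) may help narrow down where symal trips.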