Hey everyone, I'm packaging up a Java port of Fast Align for Joshua and integrating it into the pipeline. Fast Align does not produce symmetrical alignments -- it relies on a tool that I haven't ported to Java. We already package symal (which symmetrizes alignments) with Joshua for GIZA++, so I'm attempting to re-use that. However, symal reads the .bal format, which it doesn't document. It gets away with this because files from GIZA++ are piped through giza2bal.pl, which itself is not well documented. I'm attempting to write an equivalent script, say fastalign2bal.py. With a bit of tinkering, I worked out the .bal format:

1
7 jehovah said to moses and aaron : # 3 2 2 4 5 6 8
8 i řekl hospodin mojžíšovi a aronovi takto : # 2 2 1 4 5 6 6 7

A template for which would be

1
NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]

However, I'm hitting some pretty nasty errors with symal when I pipe in some fastalign2bal.py output. A few hours with gdb got me part of the way (as far as I can tell, the formats are identical), but if anyone has experience with symal, I would greatly appreciate some consultation.

-John
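
For reference, here is a minimal sketch of the conversion described above. It is not the actual fastalign2bal.py, and it rests on several assumptions: that the aligner output is one line of zero-based "i-j" (source-target) pairs per sentence pair for each direction, that alignment positions in .bal are one-based, that 0 marks an unaligned token, and that each record is laid out on three lines. The function names (parse_pharaoh, links_to_array, to_bal) are hypothetical.

```python
def parse_pharaoh(line):
    """Parse 'i-j' pairs (assumed zero-based source-target indices)."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def links_to_array(links, length, key, value):
    """Build a per-token array of one-based positions in the other sentence.

    key/value select which side of each link indexes the array and which
    side is stored. 0 is assumed to mean "unaligned". If a token has
    several links, the last one wins (a simplification).
    """
    arr = [0] * length
    for link in links:
        arr[link[key]] = link[value] + 1
    return arr


def to_bal(tgt_tokens, src_tokens, fwd_line, rev_line):
    """Format one sentence pair as a .bal record following the template above."""
    fwd = parse_pharaoh(fwd_line)
    rev = parse_pharaoh(rev_line)
    # For each target token, the source position (taken from the forward run).
    tgt_align = links_to_array(fwd, len(tgt_tokens), key=1, value=0)
    # For each source token, the target position (taken from the reverse run).
    src_align = links_to_array(rev, len(src_tokens), key=0, value=1)
    return "\n".join([
        "1",
        f"{len(tgt_tokens)} {' '.join(tgt_tokens)} # {' '.join(map(str, tgt_align))}",
        f"{len(src_tokens)} {' '.join(src_tokens)} # {' '.join(map(str, src_align))}",
    ])


if __name__ == "__main__":
    tgt = "jehovah said to moses and aaron :".split()
    src = "i řekl hospodin mojžíšovi a aronovi takto :".split()
    print(to_bal(tgt, src,
                 "2-0 1-1 1-2 3-3 4-4 5-5 7-6",
                 "0-1 1-1 2-0 3-3 4-4 5-5 6-5 7-6"))
```

With the link lines shown in the usage block, this reproduces the example record from the post, which makes it easy to diff a converter's output against known-good giza2bal.pl output.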