John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed 
replaced them with "atools"; how much work would it be to port that?
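
For a rough sense of scale, the two simplest symmetrization heuristics come down to set operations on the link pairs. The sketch below is not atools itself. It assumes both alignment lines are in the "i-j" (0-based, space-separated) format and use the same source-target orientation. The function names are made up for illustration, and the grow-diag family of heuristics would need more than this.

```python
def parse_links(line):
    """Parse 'i-j' pairs into a set of (i, j) integer tuples."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def format_links(links):
    """Render a set of (i, j) links back into sorted 'i-j' text."""
    return " ".join("{}-{}".format(i, j) for i, j in sorted(links))


def symmetrize(forward, reverse, heuristic="intersect"):
    """Combine two directional alignments for one sentence pair.

    Assumes both inputs list links in the same (source, target) order.
    """
    f, r = parse_links(forward), parse_links(reverse)
    if heuristic == "intersect":
        return format_links(f & r)   # keep links both directions agree on
    if heuristic == "union":
        return format_links(f | r)   # keep links proposed by either direction
    raise ValueError("unsupported heuristic: " + heuristic)
```

Calling symmetrize(forward_line, reverse_line, "union") once per sentence pair would give one symmetrized line per pair.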


> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu> wrote:
> 
> Hey everyone,
> 
> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> the pipeline.
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetrizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
> However, symal uses the .bal format, which it fails to describe.
> It gets away with this because files from GIZA++ are piped through
> giza2bal.pl, which itself is not well documented.
> I'm attempting to write, say, fastalign2bal.py.
> With a bit of tinkering, I got at the .bal format:
> 
> 1
> 
> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> 
> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> 
> A template for which would be
> 
> 1
> 
> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
> 
> 
> However, I'm hitting some pretty nasty errors with symal when I pipe in
> some fastalign2bal.py output.
> A few hours with gdb made some progress (as far as I can tell, the
> formats are identical), but if anyone has experience with symal, I would
> greatly appreciate some consultation.
> 
> -John
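
To make the quoted template concrete, here is a minimal sketch of how one .bal record could be built from two directional alignment lines. It is inferred only from the sample record above, so treat the details as assumptions: each number after the "#" is read as the 1-based position of the aligned token in the other sentence, 0 is assumed to mean "unaligned", and both alignment lines are assumed to list links as source-target pairs ("i-j", 0-based). The function names are hypothetical.

```python
def parse_pharaoh(line):
    """Parse 'i-j' pairs (0-based) into a list of (i, j) integer tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def one_link_per_token(n_tokens, links):
    """Map (token_index, other_index) links to one number per token.

    Output values are 1-based; 0 marks an unaligned token (an assumption).
    If a token has several links, the last one wins.
    """
    out = [0] * n_tokens
    for tok, other in links:
        out[tok] = other + 1
    return out


def bal_record(tgt_tokens, src_tokens, forward, reverse):
    """Build one .bal record shaped like the sample in the quoted message."""
    fwd = parse_pharaoh(forward)   # (src, tgt) links, read per target token
    rev = parse_pharaoh(reverse)   # (src, tgt) links, read per source token
    tgt_align = one_link_per_token(len(tgt_tokens), [(j, i) for i, j in fwd])
    src_align = one_link_per_token(len(src_tokens), rev)

    def line(tokens, align):
        # Two spaces before '#', matching the sample record.
        return "{} {}  # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align))

    return "1\n{}\n{}\n".format(line(tgt_tokens, tgt_align),
                                line(src_tokens, src_align))
```

Comparing the output of something like this byte for byte against giza2bal.pl output for the same sentence pair may help narrow down what symal is objecting to.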
