Re: Any symal experts?
John, any updates on this?

> On Nov 23, 2016, at 12:28 PM, Matt Post wrote:
>
> I think it will be much less of a headache. The GIZA++ code is notorious for
> being unreadable, and the Perl piece of that pipeline only hurts (even though
> Philipp's Perl is unusually clear). I think adding atools to your port is the
> way to go, and that it's written in C++ should facilitate that.
Re: Any symal experts?
I think it will be much less of a headache. The GIZA++ code is notorious for being unreadable, and the Perl piece of that pipeline only hurts (even though Philipp's Perl is unusually clear). I think adding atools to your port is the way to go, and that it's written in C++ should facilitate that.

> On Nov 23, 2016, at 12:25 PM, John Hewitt wrote:
>
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
>
> I'll keep the symal use on the backburner and start putting together an
> atools port.
>
> -John
Re: Any symal experts?
It'll be a headache because it also has no documentation, but to be fair it may be less of a headache / a better long-term solution than trying to move forward with this hackier solution.

I'll keep the symal use on the backburner and start putting together an atools port.

-John

On Wed, Nov 23, 2016 at 12:18 PM, Matt Post wrote:

> John, I suggest trying to ditch those GIZA++ tools entirely. fast_align
> indeed replaced them with "atools"; how much work would it be to port that?
Re: Any symal experts?
John, I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed replaced them with "atools"; how much work would it be to port that?

> On Nov 23, 2016, at 12:11 PM, John Hewitt wrote:
>
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetrizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
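To give a rough sense of what a symmetrization port involves, here is a minimal sketch of the two simplest heuristics, intersection and union, applied to one sentence pair. It is not atools' code and does not cover the grow-diag family of heuristics. It assumes that both alignment directions are written as space-separated "i-j" pairs with the same index orientation; the function names are hypothetical.

```python
def parse_links(line):
    """Parse space-separated 'i-j' pairs into a set of (i, j) tuples."""
    return {tuple(int(n) for n in pair.split("-")) for pair in line.split()}


def symmetrize(forward_line, reverse_line, heuristic="intersect"):
    """Combine two directional alignments for one sentence pair.

    Assumption: both lines use the same i-j orientation, so links can be
    compared directly without swapping indices.
    """
    forward = parse_links(forward_line)
    reverse = parse_links(reverse_line)
    if heuristic == "intersect":
        merged = forward & reverse   # keep links both directions agree on
    elif heuristic == "union":
        merged = forward | reverse   # keep links proposed by either direction
    else:
        raise ValueError("unsupported heuristic: " + heuristic)
    return " ".join("{}-{}".format(i, j) for i, j in sorted(merged))


if __name__ == "__main__":
    print(symmetrize("0-0 1-1 2-1", "0-0 1-1 1-2", "intersect"))  # 0-0 1-1
    print(symmetrize("0-0 1-1 2-1", "0-0 1-1 1-2", "union"))      # 0-0 1-1 1-2 2-1
```

Intersection favors precision and union favors recall; the heuristics in between start from the intersection and add selected links from the union.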
Any symal experts?
Hey everyone,

I'm packaging up a Java port of Fast Align for Joshua and integrating it into the pipeline. Fast Align does not produce symmetrical alignments -- it relies on a tool that I haven't ported to Java. We package symal (which symmetrizes alignments) with Joshua right now for GIZA++, so I'm attempting to re-use that.

However, symal uses the .bal format, which it fails to describe. It gets away with this because files from GIZA++ are piped through giza2bal.pl, which itself is not well documented. I'm attempting to write, say, fastalign2bal.py. With a bit of tinkering, I got at the .bal format:

    1

    7 jehovah said to moses and aaron : # 3 2 2 4 5 6 8

    8 i řekl hospodin mojžíšovi a aronovi takto : # 2 2 1 4 5 6 6 7

A template for which would be

    1

    NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
    NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]

However, I'm hitting some pretty nasty errors with symal when I pipe in some fastalign2bal.py output. A few hours with gdb made some progress (as far as I can tell, the formats are identical), but if anyone has experience with symal, I would greatly appreciate some consultation.

-John
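The template above can be sketched in Python roughly as follows. This is not the actual fastalign2bal.py, which is not shown in the thread, and all function names are hypothetical. It assumes that a record is three consecutive lines (treating the blank lines in the message as mail formatting), that each alignment number is the 1-based position of the linked token in the other sentence, that 0 marks an unaligned token, and that the input links are 0-based "i-j" pairs.

```python
def parse_links(line):
    """Parse space-separated 'i-j' pairs (0-based) into (i, j) tuples."""
    return [tuple(int(n) for n in pair.split("-")) for pair in line.split()]


def per_token_alignment(num_tokens, links, own, other):
    """For each token on one side, return the 1-based position of its
    linked token on the other side. `own` and `other` select which
    element of each (i, j) link belongs to which side.

    Assumptions: 0 means unaligned, and if a token has several links
    the last one wins.
    """
    aligned = [0] * num_tokens
    for link in links:
        aligned[link[own]] = link[other] + 1
    return aligned


def bal_record(tgt_tokens, tgt_align, src_tokens, src_align):
    """Format one sentence pair following the template in the message."""
    lines = ["1"]
    for tokens, align in ((tgt_tokens, tgt_align), (src_tokens, src_align)):
        lines.append("{} {} # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Rebuild the example record from the message.
    tgt = "jehovah said to moses and aaron :".split()
    src = "i řekl hospodin mojžíšovi a aronovi takto :".split()
    print(bal_record(tgt, [3, 2, 2, 4, 5, 6, 8],
                     src, [2, 2, 1, 4, 5, 6, 6, 7]), end="")
```

Comparing output like this byte for byte against what giza2bal.pl emits for the same sentence pair (whitespace, blank lines, handling of unaligned tokens) may be a quicker way to find a format mismatch than stepping through symal in gdb.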