John, any updates on this?

> On Nov 23, 2016, at 12:28 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> I think it will be much less of a headache. The GIZA++ code is notorious for 
> being unreadable, and the Perl piece of that pipeline only hurts (even though 
> Philipp's Perl is unusually clear). I think adding atools to your port is the 
> way to go, and that it's written in C++ should facilitate that.
> 
> 
> 
> 
>> On Nov 23, 2016, at 12:25 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>> 
>> It'll be a headache because it also has no documentation, but to be fair it
>> may be less of a headache / a better long-term solution than trying to move
>> forward with this hackier solution.
>> 
>> I'll keep the symal use on the backburner and start putting together an
>> atools port.
>> 
>> -John
>> 
>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>>> indeed replaced them with "atools"; how much work would it be to port that?
>>> 
>>> 
>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt <john...@seas.upenn.edu> wrote:
>>>> 
>>>> Hey everyone,
>>>> 
>>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
>>>> the pipeline.
>>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>>> that I haven't ported to Java.
>>>> We package symal (which symmetrizes alignments) with Joshua right now for
>>>> GIZA++, so I'm attempting to re-use that.
>>>> However, symal uses the .bal format, which it fails to describe.
>>>> It gets away with this because files from GIZA++ are piped through
>>>> giza2bal.pl, which itself is not well documented.
>>>> I'm attempting to write, say, fastalign2bal.py.
>>>> With a bit of tinkering, I got at the .bal format:
>>>> 
>>>> 1
>>>> 
>>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>>> 
>>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>>> 
>>>> A template for which would be
>>>> 
>>>> 1
>>>> 
>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>> 
>>>> 
>>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>>> some fastalign2bal.py output.
>>>> A few hours with gdb made some progress (as far as I can tell, the
>>>> formats are identical), but if anyone has experience with symal, I would
>>>> greatly appreciate some consultation.
>>>> 
>>>> -John
>>> 
>>> 
> 
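
For reference, the .bal layout John describes in the quoted message can be written out as a small Python sketch. This is not the actual fastalign2bal.py, which is not shown in the thread; the function names `format_bal_line` and `format_bal_record` are hypothetical. The layout is inferred only from the single example above: a leading "1" line whose meaning is undocumented here, then one line per sentence holding the token count, the tokens, two spaces, "#", and one alignment value per token. It is an assumption that the blank lines in the quoted example come from mail formatting and are not part of the format.

```python
def format_bal_line(tokens, links):
    """Render one sentence line: token count, tokens, '#', one link per token.

    `links` holds one integer per token. In the quoted example these look
    like 1-based positions in the other sentence; that reading is an
    assumption, not something the thread confirms.
    """
    if len(tokens) != len(links):
        raise ValueError("need exactly one alignment value per token")
    # The two spaces before '#' are copied from the quoted example.
    return "{} {}  # {}".format(
        len(tokens),
        " ".join(tokens),
        " ".join(str(link) for link in links),
    )


def format_bal_record(first_tokens, first_links, second_tokens, second_links):
    """Render one sentence pair as a three-line record."""
    lines = [
        "1",  # copied from the example; its meaning is not documented in the thread
        format_bal_line(first_tokens, first_links),
        format_bal_line(second_tokens, second_links),
    ]
    return "\n".join(lines) + "\n"
```

Calling `format_bal_line` with the seven English tokens and the values 3 2 2 4 5 6 8 reproduces the first sentence line of the example. Whether symal accepts the result is exactly the open question in the thread, so byte-level details such as spacing and blank lines should be checked against real giza2bal.pl output.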
