Re: Any symal experts?

2017-01-03 Thread Matt Post
John, any updates on this?


> On Nov 23, 2016, at 12:28 PM, Matt Post  wrote:
> 
> I think it will be much less of a headache. The GIZA++ code is notorious for 
> being unreadable, and the Perl piece of that pipeline only hurts (even though 
> Philipp's Perl is unusually clear). I think adding atools to your port is the 
> way to go, and that it's written in C++ should facilitate that.
> 
> 
> 
> 
>> On Nov 23, 2016, at 12:25 PM, John Hewitt  wrote:
>> 
>> It'll be a headache because it also has no documentation, but to be fair it
>> may be less of a headache / a better long-term solution than trying to move
>> forward with this hackier solution.
>> 
>> I'll keep the symal use on the backburner and start putting together an
>> atools port.
>> 
>> -John
>> 
>> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:
>> 
>>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>>> indeed replaced them with "atools"; how much work would it be to port that?
>>> 
>>> 
>>>> On Nov 23, 2016, at 12:11 PM, John Hewitt wrote:
>>>> 
>>>> Hey everyone,
>>>> 
>>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
>>>> the pipeline.
>>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>>> that I haven't ported to Java.
>>>> We package symal (which symmetrizes alignments) with Joshua right now for
>>>> GIZA++, so I'm attempting to re-use that.
>>>> However, symal uses the .bal format, which it fails to describe.
>>>> It gets away with this because files from GIZA++ are piped through
>>>> giza2bal.pl, which itself is not well documented.
>>>> I'm attempting to write, say, fastalign2bal.py.
>>>> With a bit of tinkering, I got at the .bal format:
>>>> 
>>>> 1
>>>> 
>>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>>> 
>>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>>> 
>>>> A template for which would be
>>>> 
>>>> 1
>>>> 
>>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
>>>> 
>>>> 
>>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>>> some fastalign2bal.py output.
>>>> A few hours with gdb made some progress (as far as I can tell, the
>>>> formats are identical) but if anyone has experience with symal, I would
>>>> greatly appreciate some consultation.
>>>> 
>>>> -John
>>> 
>>> 
> 



Re: Any symal experts?

2016-11-23 Thread Matt Post
I think it will be much less of a headache. The GIZA++ code is notorious for 
being unreadable, and the Perl piece of that pipeline only hurts (even though 
Philipp's Perl is unusually clear). I think adding atools to your port is the 
way to go, and that it's written in C++ should facilitate that.




> On Nov 23, 2016, at 12:25 PM, John Hewitt  wrote:
> 
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
> 
> I'll keep the symal use on the backburner and start putting together an
> atools port.
> 
> -John
> 
> On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:
> 
>> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
>> indeed replaced them with "atools"; how much work would it be to port that?
>> 
>> 
>>> On Nov 23, 2016, at 12:11 PM, John Hewitt wrote:
>>> 
>>> Hey everyone,
>>> 
>>> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
>>> the pipeline.
>>> Fast Align does not produce symmetrical alignments -- it relies on a tool
>>> that I haven't ported to Java.
>>> We package symal (which symmetrizes alignments) with Joshua right now for
>>> GIZA++, so I'm attempting to re-use that.
>>> However, symal uses the .bal format, which it fails to describe.
>>> It gets away with this because files from GIZA++ are piped through
>>> giza2bal.pl, which itself is not well documented.
>>> I'm attempting to write, say, fastalign2bal.py.
>>> With a bit of tinkering, I got at the .bal format:
>>> 
>>> 1
>>> 
>>> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
>>> 
>>> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
>>> 
>>> A template for which would be
>>> 
>>> 1
>>> 
>>> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
>>> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
>>> 
>>> 
>>> However, I'm hitting some pretty nasty errors with symal when I pipe in
>>> some fastalign2bal.py output.
>>> A few hours with gdb made some progress (as far as I can tell, the
>>> formats are identical) but if anyone has experience with symal, I would
>>> greatly appreciate some consultation.
>>> 
>>> -John
>> 
>> 



Re: Any symal experts?

2016-11-23 Thread John Hewitt
It'll be a headache because it also has no documentation, but to be fair it
may be less of a headache / a better long-term solution than trying to move
forward with this hackier solution.

I'll keep the symal use on the backburner and start putting together an
atools port.

-John

On Wed, Nov 23, 2016 at 12:18 PM, Matt Post  wrote:

> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
> indeed replaced them with "atools"; how much work would it be to port that?
>
>
> > On Nov 23, 2016, at 12:11 PM, John Hewitt wrote:
> >
> > Hey everyone,
> >
> > I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> > the pipeline.
> > Fast Align does not produce symmetrical alignments -- it relies on a tool
> > that I haven't ported to Java.
> > We package symal (which symmetrizes alignments) with Joshua right now for
> > GIZA++, so I'm attempting to re-use that.
> > However, symal uses the .bal format, which it fails to describe.
> > It gets away with this because files from GIZA++ are piped through
> > giza2bal.pl, which itself is not well documented.
> > I'm attempting to write, say, fastalign2bal.py.
> > With a bit of tinkering, I got at the .bal format:
> >
> > 1
> >
> > 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> >
> > 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> >
> > A template for which would be
> >
> > 1
> >
> > NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
> > NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
> >
> >
> > However, I'm hitting some pretty nasty errors with symal when I pipe in
> > some fastalign2bal.py output.
> > A few hours with gdb made some progress (as far as I can tell, the
> > formats are identical) but if anyone has experience with symal, I would
> > greatly appreciate some consultation.
> >
> > -John
>
>


Re: Any symal experts?

2016-11-23 Thread Matt Post
John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed 
replaced them with "atools"; how much work would it be to port that?
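
For context, the two simplest symmetrization heuristics can be sketched in a few lines of Python. This is only an illustration, not the atools implementation: it assumes the forward and reverse alignments are given as 0-based "i-j" link lines in the same source-target orientation, and the names parse_links and symmetrize are hypothetical.

```python
def parse_links(line):
    """Parse a line of 'i-j' pairs into a set of (src, tgt) index tuples."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def symmetrize(fwd_line, rev_line, mode="intersect"):
    """Combine forward and reverse links by set intersection or union."""
    fwd = parse_links(fwd_line)
    rev = parse_links(rev_line)
    if mode == "intersect":
        links = fwd & rev   # keep only links both directions agree on
    elif mode == "union":
        links = fwd | rev   # keep every link proposed by either direction
    else:
        raise ValueError("unknown mode: " + mode)
    # Sort for a stable, readable output order.
    return " ".join("{}-{}".format(i, j) for i, j in sorted(links))


if __name__ == "__main__":
    print(symmetrize("0-0 1-1 2-1", "0-0 1-1 1-2", mode="intersect"))
    print(symmetrize("0-0 1-1 2-1", "0-0 1-1 1-2", mode="union"))
```

Heuristics such as grow-diag start from the intersection and add selected links from the union, so a port would likely build on these two set operations.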


> On Nov 23, 2016, at 12:11 PM, John Hewitt  wrote:
> 
> Hey everyone,
> 
> I'm packaging up a Java port of Fast Align for Joshua and integrating it into
> the pipeline.
> Fast Align does not produce symmetrical alignments -- it relies on a tool
> that I haven't ported to Java.
> We package symal (which symmetrizes alignments) with Joshua right now for
> GIZA++, so I'm attempting to re-use that.
> However, symal uses the .bal format, which it fails to describe.
> It gets away with this because files from GIZA++ are piped through
> giza2bal.pl, which itself is not well documented.
> I'm attempting to write, say, fastalign2bal.py.
> With a bit of tinkering, I got at the .bal format:
> 
> 1
> 
> 7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8
> 
> 8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7
> 
> A template for which would be
> 
> 1
> 
> NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
> NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
> 
> 
> However, I'm hitting some pretty nasty errors with symal when I pipe in
> some fastalign2bal.py output.
> A few hours with gdb made some progress (as far as I can tell, the
> formats are identical) but if anyone has experience with symal, I would
> greatly appreciate some consultation.
> 
> -John



Any symal experts?

2016-11-23 Thread John Hewitt
Hey everyone,

I'm packaging up a Java port of Fast Align for Joshua and integrating it into
the pipeline.
Fast Align does not produce symmetrical alignments -- it relies on a tool
that I haven't ported to Java.
We package symal (which symmetrizes alignments) with Joshua right now for
GIZA++, so I'm attempting to re-use that.
However, symal uses the .bal format, which it fails to describe.
It gets away with this because files from GIZA++ are piped through
giza2bal.pl, which itself is not well documented.
I'm attempting to write, say, fastalign2bal.py.
With a bit of tinkering, I got at the .bal format:

1

7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8

8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7

A template for which would be

1

NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenN] # [alignment1 alignment2 ... alignmentN]
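
A minimal Python sketch of what a fastalign2bal.py converter might look like under the template above. Several things here are assumptions rather than facts from the symal or giza2bal.pl sources: that the inputs are 0-based "i-j" source-target link lines from a forward and a reverse aligner run, that .bal alignment values are 1-based positions in the other sentence, and that 0 marks an unaligned token. The names parse_links and to_bal are hypothetical, and the blank lines that appear in the example above are not reproduced.

```python
def parse_links(line):
    """Parse a line of 0-based 'i-j' pairs into (src, tgt) index tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def to_bal(src_tokens, tgt_tokens, fwd_line, rev_line):
    """Render one sentence pair as a .bal record following the template."""
    # One slot per token; 0 means "unaligned" (an assumption, see lead-in).
    tgt_align = [0] * len(tgt_tokens)
    src_align = [0] * len(src_tokens)

    # Forward run: each target token points at one source position (1-based).
    # If a token has several links, the last one wins.
    for i, j in parse_links(fwd_line):
        tgt_align[j] = i + 1
    # Reverse run: each source token points at one target position (1-based).
    for i, j in parse_links(rev_line):
        src_align[i] = j + 1

    def row(tokens, align):
        return "{} {} # {}".format(
            len(tokens), " ".join(tokens), " ".join(str(a) for a in align)
        )

    return "1\n{}\n{}\n".format(row(tgt_tokens, tgt_align),
                                row(src_tokens, src_align))


if __name__ == "__main__":
    src = "i řekl hospodin mojžíšovi a aronovi takto :".split()
    tgt = "jehovah said to moses and aaron :".split()
    print(to_bal(src, tgt,
                 "2-0 1-1 1-2 3-3 4-4 5-5 7-6",
                 "0-1 1-1 2-0 3-3 4-4 5-5 6-5 7-6"), end="")
```

The link lines in the usage example were chosen so that the output carries the same token counts and alignment values as the jehovah/hospodin record above.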


However, I'm hitting some pretty nasty errors with symal when I pipe in
some fastalign2bal.py output.
A few hours with gdb made some progress (as far as I can tell, the
formats are identical) but if anyone has experience with symal, I would
greatly appreciate some consultation.

-John