Re: Issue Building LM on master branch

2016-07-20 Thread Matt Post
I believe I cloned it there a while back because of the Google code expiration. 
Until recently the jar was just checked into Joshua. I'm not sure what the 
situation is now.

Probably we should put the Berkeley Aligner in the Maven codebase? Or just 
switch to fast_align (which is Apache licensed).

matt





Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
@Matt,
I noticed that you have a clone of the berkeleyaligner code in
your GitHub repos
https://github.com/mjpost/berkeleyaligner
Is this the most up-to-date code?
Also, I see that berkeleyaligner.jar is called at the following line
https://github.com/apache/incubator-joshua/blob/master/scripts/training/paralign.pl#L81




-- 
*Lewis*


Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
OK, so as it turns out, passing '--aligner berkeley' to the pipeline.pl
invocation does not currently work on the master branch.
My log simply prints:

Error: Unable to access jarfile
/usr/local/incubator-joshua/lib/berkeleyaligner.jar

I'll get this sorted out and submit a PR to try and fix.
Thanks
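
The "Unable to access jarfile" message is printed by the java launcher when the file given to -jar cannot be opened, so a quick existence check before re-running the pipeline can save a long wait. A minimal sketch in Python, assuming the jar is expected under lib/ of the Joshua checkout (as the path in the error suggests); the helper name check_jar and the fallback install path are illustrative and not part of Joshua:

```python
import os


def check_jar(joshua_root, jar_name="berkeleyaligner.jar"):
    """Return (path, ok); ok is True when the jar exists and is readable.

    Assumes the jar lives in <joshua_root>/lib, mirroring the path in the
    error message above.
    """
    path = os.path.join(joshua_root, "lib", jar_name)
    ok = os.path.isfile(path) and os.access(path, os.R_OK)
    return path, ok


if __name__ == "__main__":
    # Fallback path is the one from the log above; adjust for your install.
    root = os.environ.get("JOSHUA", "/usr/local/incubator-joshua")
    path, ok = check_jar(root)
    print(("found: " if ok else "missing or unreadable: ") + path)
```

If the check reports the jar as missing, the aligner step will fail the same way no matter which corpus is passed in.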




-- 
*Lewis*


Re: Issue Building LM on master branch

2016-07-20 Thread Lewis John Mcgibbney
Hi Kellen and Matt,

On Tue, Jul 19, 2016 at 8:20 PM, <
dev-digest-h...@joshua.incubator.apache.org> wrote:

> From: Matt Post 
> To: dev@joshua.incubator.apache.org
> Cc:
> Date: Sun, 17 Jul 2016 23:30:33 -0400
> Subject: Re: Issue Building LM on master branch
> Lewis — This is a good-sized dataset, and on a single desktop machine, I
> expect it would take at least a day to go all the way through alignment,
> model-building, and tuning.
>

OK thanks for the estimate.


>
> fast_align is a good idea, though it isn't integrated into the pipeline
> (shouldn't be too hard, and is on the list). You could also just try
> "--aligner berkeley" and see if that works.
>

I'll do exactly that. Starting with berkeley first and then moving on to
fast_align. I'll update here with any progress.


>
> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
> Sometimes GIZA doesn't compile correctly, and this could be an error where
> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
>
>
AFAICT I don't see any errors prior to the bottom dozen or so lines. I've
put the log below and would greatly appreciate if you could have a look
through it and provide some feedback.
http://home.apache.org/~lewismc/giza.log
I'll update this thread on the Berkeley alignment outcome before moving on
to fast_align.
Thanks both again.
Lewis


Re: Issue Building LM on master branch

2016-07-17 Thread Matt Post
Lewis — This is a good-sized dataset, and on a single desktop machine, I expect 
it would take at least a day to go all the way through alignment, 
model-building, and tuning.

fast_align is a good idea, though it isn't integrated into the pipeline 
(shouldn't be too hard, and is on the list). You could also just try "--aligner 
berkeley" and see if that works. 

Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)? Sometimes 
GIZA doesn't compile correctly, and this could be an error where it doesn't 
find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
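
One way to rule out the missing-binary case is to look for the three executables by name. A rough sketch in Python, assuming they all sit in one directory that you pass in; where Joshua actually looks for them is not stated in this thread, so the search_dir argument and the helper name missing_binaries are hypothetical:

```python
import shutil

# Binary names taken from the paragraph above.
GIZA_BINARIES = ("GIZA++", "mkcls", "snt2cooc.out")


def missing_binaries(names=GIZA_BINARIES, search_dir=None):
    """Return the names that cannot be found as executables.

    search_dir is a hypothetical directory to look in; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    return [name for name in names
            if shutil.which(name, path=search_dir) is None]


if __name__ == "__main__":
    for name in missing_binaries():
        print("not found on PATH:", name)
```

An empty result only shows the files exist and are executable; it says nothing about whether they were compiled correctly.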

matt





Re: Issue Building LM on master branch

2016-07-16 Thread kellen sunderland
Hey Lewis, as an alternative you can try fast_align. It's been working well
for us.  10 minutes seems a little bit faster than what I'd expect.  IIRC
it may take a few hours (4-8?) to align that much data.

https://github.com/clab/fast_align
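
fast_align reads a single file with one sentence pair per line, the two sides separated by " ||| ", rather than separate .en and .ru files. A minimal sketch in Python of building those lines, assuming the two inputs are already tokenized and line-aligned; the function name and the choice to drop pairs with an empty side are assumptions for illustration, not something from this thread:

```python
def to_fast_align_lines(src_lines, tgt_lines):
    """Yield 'source ||| target' lines from two line-aligned iterables.

    Pairs where either side is empty after stripping are skipped
    (an assumption made here to avoid emitting one-sided pairs).
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield f"{src} ||| {tgt}"


if __name__ == "__main__":
    # Tiny in-memory demo; real input would be the parallel corpus files.
    for line in to_fast_align_lines(["a b\n", "c\n"], ["x y\n", "z\n"]):
        print(line)
```

The fast_align README then runs the aligner along the lines of `fast_align -i <file> -d -o -v > forward.align`. Note that dropping empty pairs shifts line numbers relative to the original corpus, so the same filter would have to be applied wherever the alignments are consumed.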

-Kellen



Issue Building LM on master branch

2016-07-16 Thread Lewis John Mcgibbney
Hi Folks,
When attempting to build a hiero model using 5K sentences for tuning, many
many more than that for testing and again many many more than that for the
actual corpus (~880K), I get the following error within the GIZA alignment
pipeline phase.

Anyone have a clue what this means? I have the full GIZA logs if they are
useful.
I did find a thread on a VERY similar issue at [0]. The solution seems to
be to use absolute paths to all input data for the pipeline; however, that is
exactly what I've done, e.g.

$JOSHUA/bin/pipeline.pl  --rundir . --type hiero --corpus
/usr/local/joshua_input/commoncrawl.ru-en --tune
/usr/local/joshua_input/commoncrawl.ru-en.tune --test
/usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
--rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to
English Translation model" --mbr

Where the parallel .en and .ru sentence files exist for all of the above
corpus, tune and test paths respectively.

[0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489

I have been having trouble consistently when generating models using
GIZA... is there a suggested alignment substitute which I should be trying
out?

One last question... roughly how long should building a Hiero-based LM for a
corpus of ~880K sentences take on, say, a MacBook Pro (2.7GHz Intel Core i7,
16GB memory)? I remember reading a while ago on the old Joshua site that a
pipeline would run in 10 or so minutes... this is clearly not the case, and I
would like to share/compare some results, if possible, with others who are in
the business of generating LMs and language packs.

Thanks

==
Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
Waiting for second GIZA process...
(3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
Combining forward and inverted alignment from files:
  alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
  alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
Executing: bash -c mkdir -p alignments/0/model
Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
<(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
|/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
-diagonal="yes" -final="yes" -both="no"
-o=alignments/0/model/aligned.grow-diag-final
symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
skip=<0> counts=<817962>
symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
bash: line 1:  9080 Done
/usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
  9081 Abort trap: 6   |
/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
-diagonal="yes" -final="yes" -both="no"
-o=alignments/0/model/aligned.grow-diag-final
Exit code: 134
ERROR: Can't generate symmetrized alignment file
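
A note on reading the exit code in the log above: when a process started from bash is killed by a signal, bash reports its status as 128 plus the signal number. A small sketch in Python of that arithmetic; the function name describe_exit_code is illustrative only:

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-reported exit status using the 128 + N convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {signum}"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(134))
```

For this run, 134 - 128 = 6, which matches the "Abort trap: 6" line printed for the symal process. In other words, symal aborted after the malloc error rather than returning an error of its own.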



-- 
*Lewis*