Re: Issue Building LM on master branch
I believe I cloned it there a while back because of the Google Code shutdown. Until recently the jar was just checked into Joshua. I'm not sure what the situation is now.

We should probably either put the Berkeley Aligner in the Maven codebase or just switch to fast_align (which is Apache licensed).

matt
Re: Issue Building LM on master branch
@Matt,

I noticed that you have a clone of what appears to be the berkeleyaligner code in your GitHub repos:
https://github.com/mjpost/berkeleyaligner
Is this the most up-to-date code?

Also, I see that berkeleyaligner.jar is called at the following line:
https://github.com/apache/incubator-joshua/blob/master/scripts/training/paralign.pl#L81

--
*Lewis*
Re: Issue Building LM on master branch
OK, so as it turns out, passing '--aligner berkeley' to the pipeline.pl invocation does not currently work on the master branch. My log simply prints:

Error: Unable to access jarfile /usr/local/incubator-joshua/lib/berkeleyaligner.jar

I'll get this sorted out and submit a PR to try to fix it.

Thanks

--
*Lewis*
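The "Unable to access jarfile" message is what Java prints when the path given to `-jar` cannot be opened, so a quick pre-flight check can confirm whether the jar is simply missing from the checkout. The following is a minimal sketch only: the `lib/` location is taken from the error message above, and the function names are hypothetical, not part of Joshua.

```python
import os


def jar_is_accessible(jar_path):
    """Return True if jar_path is an existing, readable regular file."""
    return os.path.isfile(jar_path) and os.access(jar_path, os.R_OK)


def check_aligner_jar(joshua_home, jar_name="berkeleyaligner.jar"):
    """Build the lib/ path for the jar and report whether it can be opened.

    Assumption: the jar is expected under <joshua_home>/lib, as the path in
    the error message suggests. Both function names are illustrative only.
    """
    jar_path = os.path.join(joshua_home, "lib", jar_name)
    return jar_path, jar_is_accessible(jar_path)


if __name__ == "__main__":
    path, ok = check_aligner_jar(os.environ.get("JOSHUA", "."))
    print(f"{path}: {'found' if ok else 'missing or unreadable'}")
```

Running this with `$JOSHUA` set to the checkout that produced the log should report the jar as missing if the failure is a file that is no longer shipped, rather than a permissions problem.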
Re: Issue Building LM on master branch
Hi Kellen and Matt,

On Tue, Jul 19, 2016 at 8:20 PM, <dev-digest-h...@joshua.incubator.apache.org> wrote:

> From: Matt Post
> To: dev@joshua.incubator.apache.org
> Cc:
> Date: Sun, 17 Jul 2016 23:30:33 -0400
> Subject: Re: Issue Building LM on master branch
> Lewis, this is a good-sized dataset, and on a single desktop machine, I
> expect it would take at least a day to go all the way through alignment,
> model-building, and tuning.

OK, thanks for the estimate.

> fast_align is a good idea, though it isn't integrated into the pipeline
> (shouldn't be too hard, and is on the list). You could also just try
> "--aligner berkeley" and see if that works.

I'll do exactly that, starting with berkeley and then moving on to fast_align. I'll update here with any progress.

> Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)?
> Sometimes GIZA doesn't compile correctly, and this could be an error where
> it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).

AFAICT I don't see any errors prior to the bottom dozen or so lines. I've put the log at the link below and would greatly appreciate it if you could have a look through it and provide some feedback.
http://home.apache.org/~lewismc/giza.log

I'll update this thread on the berkeley alignment outcome before moving on to fast_align.

Thanks both again.
Lewis
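Because fast_align is not integrated into the pipeline, trying it means preparing its input by hand. fast_align reads one sentence pair per line, with the source and target sides separated by ` ||| `. The sketch below shows that conversion under stated assumptions: the function name is hypothetical, the sentences are assumed to be tokenized already, and dropping pairs with an empty side is a cautious choice made here, not something the thread prescribes.

```python
def to_fast_align_lines(source_sentences, target_sentences):
    """Pair up sentences as 'source ||| target' lines for fast_align.

    Assumptions (not from the thread): inputs are already tokenized, and
    pairs with an empty side are skipped rather than passed to the aligner.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must have the same number of lines")
    lines = []
    for src, tgt in zip(source_sentences, target_sentences):
        # Collapse runs of whitespace so each side is a clean token sequence.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if src and tgt:
            lines.append(f"{src} ||| {tgt}")
    return lines


if __name__ == "__main__":
    for line in to_fast_align_lines(["the house"], ["das Haus"]):
        print(line)
```

The resulting file can then be aligned with the command given in the fast_align README, `fast_align -i corpus.en-ru -d -o -v > forward.align`, where `corpus.en-ru` is a placeholder file name. Note that skipping pairs changes the line numbering, so any later pipeline step would have to use the filtered corpus.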
Re: Issue Building LM on master branch
Lewis, this is a good-sized dataset, and on a single desktop machine, I expect it would take at least a day to go all the way through alignment, model-building, and tuning.

fast_align is a good idea, though it isn't integrated into the pipeline (shouldn't be too hard, and is on the list). You could also just try "--aligner berkeley" and see if that works.

Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)? Sometimes GIZA doesn't compile correctly, and this could be an error where it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).

matt
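One way to rule out the "binary not found" case mentioned above is to look for the three executables directly. This is a rough sketch: the thread does not say where Joshua looks for them, so searching `PATH` plus optional extra directories is an assumption, and `find_missing_binaries` is a hypothetical helper name.

```python
import os
import shutil

# Binary names as given in the message above.
GIZA_BINARIES = ("GIZA++", "mkcls", "snt2cooc.out")


def find_missing_binaries(names=GIZA_BINARIES, extra_dirs=()):
    """Return the names that cannot be found as executables.

    Assumption: the binaries live on PATH or in one of extra_dirs. Where the
    pipeline really searches is not stated in the thread.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return [name for name in names if shutil.which(name, path=search_path) is None]


if __name__ == "__main__":
    missing = find_missing_binaries()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the files exist and are executable. It does not show that they were compiled correctly, so the GIZA error logs are still worth reading.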
Re: Issue Building LM on master branch
Hey Lewis, as an alternative you can try fast_align. It's been working well for us. 10 minutes seems a little bit faster than what I'd expect. IIRC it may take a few hours (4-8?) to align that much data.

https://github.com/clab/fast_align

-Kellen
Issue Building LM on master branch
Hi Folks,

When attempting to build a Hiero model using 5K sentences for tuning, many more than that for testing, and many more again for the actual corpus (~880K), I get the error shown at the bottom of this message within the GIZA alignment phase of the pipeline.

Does anyone have a clue what this means? I have the full GIZA logs if they are useful.

I did find a thread on a VERY similar issue at [0]. The solution seems to be to use absolute paths to all input data for the pipeline; however, that is exactly what I've done, e.g.

$JOSHUA/bin/pipeline.pl --rundir . --type hiero --corpus /usr/local/joshua_input/commoncrawl.ru-en --tune /usr/local/joshua_input/commoncrawl.ru-en.tune --test /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru --rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to English Translation model" --mbr

where the parallel .en and .ru sentence files exist for all of the above corpus, tune and test paths respectively.

[0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489

I have consistently been having trouble when generating models using GIZA. Is there a suggested alignment substitute which I should be trying out?

One last question: roughly how long should a Hiero-based LM for a corpus of ~880K sentences take on, say, a MacBook Pro with a 2.7GHz Intel Core i7 and 16GB of memory? I remember reading a while ago on the old Joshua site that a pipeline would run in 10 or so minutes. This is clearly not the case, and I would like to share/compare some results if possible with others who are in the business of generating LMs and language packs.

Thanks

==
Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
Waiting for second GIZA process...
(3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
Combining forward and inverted alignment from files:
alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
Executing: bash -c mkdir -p alignments/0/model
Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz) |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow" -diagonal="yes" -final="yes" -both="no" -o=alignments/0/model/aligned.grow-diag-final
symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
skip=<0> counts=<817962>
symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
bash: line 1: 9080 Done /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
9081 Abort trap: 6 | /usr/local/incubator-joshua/ext/symal/symal -alignment="grow" -diagonal="yes" -final="yes" -both="no" -o=alignments/0/model/aligned.grow-diag-final
Exit code: 134
ERROR: Can't generate symmetrized alignment file

--
*Lewis*
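The claim that the parallel .en and .ru files exist for each of the corpus, tune and test prefixes can be checked mechanically before a long run. The sketch below is illustrative only: it assumes each prefix expands to `<prefix>.<lang>` as described in the message, and the helper names are hypothetical. It also compares line counts, since parallel files are expected to have one sentence per line on both sides.

```python
import os


def count_lines(path):
    """Count lines without decoding, so any text encoding is accepted."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_parallel_prefix(prefix, source="en", target="ru"):
    """Return a list of problems for the files <prefix>.<source> and <prefix>.<target>.

    Assumption: the file naming follows the description in the message above.
    This helper is hypothetical and not part of Joshua.
    """
    paths = [f"{prefix}.{lang}" for lang in (source, target)]
    problems = []
    for path in paths:
        if not os.path.isabs(path):
            problems.append(f"not an absolute path: {path}")
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
    if all(os.path.isfile(path) for path in paths):
        counts = [count_lines(path) for path in paths]
        if counts[0] != counts[1]:
            problems.append(f"line count mismatch: {counts[0]} vs {counts[1]}")
    return problems


if __name__ == "__main__":
    for prefix in ("/usr/local/joshua_input/commoncrawl.ru-en",
                   "/usr/local/joshua_input/commoncrawl.ru-en.tune",
                   "/usr/local/joshua_input/commoncrawl.ru-en.test"):
        print(prefix, check_parallel_prefix(prefix) or "ok")
```

A clean result here would not explain the symal abort in the log, but it would rule out missing or mismatched input files as the cause.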