Re: Issue Building LM on master branch

2016-07-16 Thread kellen sunderland
Hey Lewis, as an alternative you can try fast_align. It's been working well
for us.  10 minutes seems a little bit faster than what I'd expect.  IIRC
it may take a few hours (4-8?) to align that much data.

https://github.com/clab/fast_align
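
A minimal sketch of preparing input for fast_align, assuming the corpus is
stored as two line-aligned files like the .en/.ru pair described below.
fast_align reads one sentence pair per line in the form "source ||| target".
The function name and the choice to drop pairs with an empty side are
illustrative assumptions, not part of Joshua or fast_align:

```python
from itertools import zip_longest


def write_fast_align_input(src_path, tgt_path, out_path):
    """Merge two line-aligned files into 'source ||| target' lines.

    Returns the number of sentence pairs written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for src_line, tgt_line in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, so a None here
            # means the two sides do not have the same number of lines.
            if src_line is None or tgt_line is None:
                raise ValueError("source and target files differ in length")
            # Normalize whitespace so tokens are separated by single spaces.
            src_tokens = src_line.split()
            tgt_tokens = tgt_line.split()
            # Assumption: pairs with an empty side are skipped, not aligned.
            if not src_tokens or not tgt_tokens:
                continue
            out.write(" ".join(src_tokens) + " ||| " + " ".join(tgt_tokens) + "\n")
            written += 1
    return written
```

Per the fast_align README, the merged file can then be aligned with something
like `fast_align -i corpus.en-ru -d -o -v > forward.align`. Because this sketch
skips pairs with an empty side, the alignment line numbers only match the
corpus if the same pairs are also removed from the .en and .ru files.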

-Kellen

On Sun, Jul 17, 2016 at 12:01 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> When attempting to build a hiero model using 5K sentences for tuning, many
> more than that for testing, and many more again for the
> actual corpus (~880K), I get the following error within the GIZA alignment
> pipeline phase.
>
> Anyone have a clue what this means? I have the full GIZA logs if they are
> useful.
> I did find a thread on a VERY similar issue at [0]. The solution seems to
> be to use absolute paths to all input data for the pipeline however that is
> exactly what I've done e.g.
>
> $JOSHUA/bin/pipeline.pl  --rundir . --type hiero --corpus
> /usr/local/joshua_input/commoncrawl.ru-en --tune
> /usr/local/joshua_input/commoncrawl.ru-en.tune --test
> /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
> --rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to
> English Translation model" --mbr
>
> Where the parallel .en and .ru sentence files exist for all of the above
> corpus, tune and test paths respectively.
>
> [0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
>
> I have been having trouble consistently when generating models using
> GIZA... is there a suggested alignment substitute which I should be trying
> out?
>
> One last question... roughly how long should a Hiero-based LM for a corpus
> of ~880K sentences take on, say, a MacBook Pro with a 2.7GHz Intel Core i7
> and 16GB of memory? I remember reading a while ago on the old Joshua site
> that a pipeline would run in 10 or so minutes... this is clearly not the
> case and I would like to share/compare some results if possible with others
> who are in the business of generating LM and language packs.
>
> Thanks
>
> ==
> Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
> Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
> Combining forward and inverted alignment from files:
>   alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
>   alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
> Executing: bash -c mkdir -p alignments/0/model
> Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
> <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> skip=<0> counts=<817962>
> symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> bash: line 1:  9080 Done
> /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
>   9081 Abort trap: 6   |
> /usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> Exit code: 134
> ERROR: Can't generate symmetrized alignment file
>
>
>
> --
> *Lewis*
>


Re: Russian Language Model for Joshua

2016-07-16 Thread Tom Barber
I  can host it: http://meteorite.bi/downloads/ru.kenlm

Tom

--

Director Meteorite.bi - Saiku Analytics Founder
Tel: +44(0)5603641316

On 16 July 2016 at 22:45, Mcgibbney, Lewis J (398M) <
lewis.j.mcgibb...@jpl.nasa.gov> wrote:

> Can you make this public for good? Or is it the size which is the issue?
> Is this build using master branch Matt? I am having issues building models
> with master... I'll post my issues on another thread.
>
> Dr. Lewis John McGibbney Ph.D., B.Sc.
> Data Scientist II
> Computer Science for Data Intensive Applications Group 398M
> Jet Propulsion Laboratory
> California Institute of Technology
> 4800 Oak Grove Drive
> Pasadena, California 91109-8099
> Mail Stop : 158-256C
> Tel:  (+1) (818)-393-7402
> Cell: (+1) (626)-487-3476
> Fax:  (+1) (818)-393-1190
> Email: lewis.j.mcgibb...@jpl.nasa.gov
>
>
>
>  Dare Mighty Things
>
> On 7/16/16, 1:09 PM, "Matt Post"  wrote:
>
> >Done:
> >
> >   http://cs.jhu.edu/~post/tmp/ru.kenlm
> >   4106251755 bytes, sha1sum: 5c894e24dafa42bc44a5bb6822812d6234eda791
> >
> >Let me know when you have it so I can delete it.
> >
> >matt
> >
> >
> >> On Jul 15, 2016, at 4:42 PM, Matt Post  wrote:
> >>
> >> All right, started trying to recompile. If you have a machine with >
> >>256 GB of memory, it might be more efficient for me to give you the raw
> >>ARPA file and for you to compile it. We'll see how it goes. Ping me in a
> >>day if you don't hear from me.
> >>
> >> matt
> >>
> >>
> >>> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980)
> >>> wrote:
> >>>
> >>> Yes please! :)
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On Jul 15, 2016, at 1:39 PM, Matt Post wrote:
> >>>>
> >>>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM
> >>>> compiles of it failed in the past, but I'll try again. I expect it to
> >>>> be about 8 GB when that's done. Do you want it?
> >>>>
> >>>> matt
> >>>>
> >>>>
> >>>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980)
> >>>>> wrote:
> >>>>>
> >>>>> Hey Folks,
> >>>>>
> >>>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
> >>>>> one, not sure if he has it but just broadening the question.
> >>>>>
> >>>>> Cheers,
> >>>>> Chris
> >>>>>
> >>>>> ++
> >>>>> Chris Mattmann, Ph.D.
> >>>>> Chief Architect
> >>>>> Instrument Software and Science Data Systems Section (398)
> >>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>>>> Office: 168-519, Mailstop: 168-527
> >>>>> Email: chris.a.mattm...@nasa.gov
> >>>>> WWW:  http://sunset.usc.edu/~mattmann/
> >>>>> ++
> >>>>> Director, Information Retrieval and Data Science Group (IRDS)
> >>>>> Adjunct Associate Professor, Computer Science Department
> >>>>> University of Southern California, Los Angeles, CA 90089 USA
> >>>>> WWW: http://irds.usc.edu/
> >>>>> ++
> >>>>
> >>
> >
>
>


Re: Russian Language Model for Joshua

2016-07-16 Thread Mcgibbney, Lewis J (398M)
Can you make this public for good? Or is it the size which is the issue?
Is this build using master branch Matt? I am having issues building models
with master... I'll post my issues on another thread.

Dr. Lewis John McGibbney Ph.D., B.Sc.
Data Scientist II
Computer Science for Data Intensive Applications Group 398M
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasadena, California 91109-8099
Mail Stop : 158-256C
Tel:  (+1) (818)-393-7402
Cell: (+1) (626)-487-3476
Fax:  (+1) (818)-393-1190
Email: lewis.j.mcgibb...@jpl.nasa.gov

   

 Dare Mighty Things

On 7/16/16, 1:09 PM, "Matt Post"  wrote:

>Done:
>
>   http://cs.jhu.edu/~post/tmp/ru.kenlm
>   4106251755 bytes, sha1sum: 5c894e24dafa42bc44a5bb6822812d6234eda791
>
>Let me know when you have it so I can delete it.
>
>matt
>
>
>> On Jul 15, 2016, at 4:42 PM, Matt Post  wrote:
>> 
>> All right, started trying to recompile. If you have a machine with >
>>256 GB of memory, it might be more efficient for me to give you the raw
>>ARPA file and for you to compile it. We'll see how it goes. Ping me in a
>>day if you don't hear from me.
>> 
>> matt
>> 
>> 
>>> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980)
>>> wrote:
>>> 
>>> Yes please! :)
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Jul 15, 2016, at 1:39 PM, Matt Post wrote:
>>>>
>>>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM
>>>> compiles of it failed in the past, but I'll try again. I expect it to
>>>> be about 8 GB when that's done. Do you want it?
>>>>
>>>> matt
>>>>
>>>>
>>>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980)
>>>>> wrote:
>>>>>
>>>>> Hey Folks,
>>>>>
>>>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
>>>>> one, not sure if he has it but just broadening the question.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> ++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: chris.a.mattm...@nasa.gov
>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>> ++
>>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> WWW: http://irds.usc.edu/
>>>>> ++
>>>>
>> 
>



Re: Russian Language Model for Joshua

2016-07-16 Thread Matt Post
Done:

http://cs.jhu.edu/~post/tmp/ru.kenlm
4106251755 bytes, sha1sum: 5c894e24dafa42bc44a5bb6822812d6234eda791

Let me know when you have it so I can delete it.
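
Before confirming, it may be worth checking the download against the size and
SHA-1 listed above. A minimal Python sketch (the function names are
illustrative, and the path passed in is wherever the file was saved locally):

```python
import hashlib
import os

# Values quoted in this message for ru.kenlm.
EXPECTED_SIZE = 4106251755
EXPECTED_SHA1 = "5c894e24dafa42bc44a5bb6822812d6234eda791"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size=EXPECTED_SIZE, expected_sha1=EXPECTED_SHA1):
    """Return True only if both the byte count and the SHA-1 match."""
    # Compare sizes first: it is cheap and catches truncated downloads
    # without hashing several gigabytes.
    if os.path.getsize(path) != expected_size:
        return False
    return sha1_of_file(path) == expected_sha1
```

Calling `verify_download("ru.kenlm")` returns False for a truncated or
corrupted copy, in which case the file should be fetched again before the
original is deleted.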

matt


> On Jul 15, 2016, at 4:42 PM, Matt Post  wrote:
> 
> All right, started trying to recompile. If you have a machine with > 256 GB 
> of memory, it might be more efficient for me to give you the raw ARPA file 
> and for you to compile it. We'll see how it goes. Ping me in a day if you 
> don't hear from me.
> 
> matt
> 
> 
>> On Jul 15, 2016, at 4:40 PM, Mattmann, Chris A (3980) 
>>  wrote:
>> 
>> Yes please! :)
>> 
>> Sent from my iPhone
>> 
>>> On Jul 15, 2016, at 1:39 PM, Matt Post  wrote:
>>> 
>>> I have one built on Common Crawl. It's 25 GB uncompressed. My KenLM 
>>> compiles of it failed in the past, but I'll try again. I expect it to be 
>>> about 8 GB when that's done. Do you want it?
>>> 
>>> matt
>>> 
>>> 
>>>> On Jul 15, 2016, at 3:50 PM, Mattmann, Chris A (3980)
>>>> wrote:
>>>>
>>>> Hey Folks,
>>>>
>>>> Anyone have a Russian Language Model for Joshua? Lewis was working on
>>>> one, not sure if he has it but just broadening the question.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: chris.a.mattm...@nasa.gov
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++
>>>> Director, Information Retrieval and Data Science Group (IRDS)
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> WWW: http://irds.usc.edu/
>>>> ++
>>> 
>