I'm working on a laptop as well, with 512MB of memory. I wouldn't advise
trying to run giza on a whole corpus. I tried it and, 3 days later, gave up
waiting for it to finish (page files = sloooow). My feeling about giza is
that it needs some kind of dynamic way of figuring out how much physical
memory it actually has available and working within that.
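
For what it's worth, you can at least eyeball how much physical memory is
actually free before kicking off a run, and watch whether the machine starts
paging while giza works. A rough sketch, assuming Linux (the /proc/meminfo
layout and vmstat columns differ on other platforms, and as far as I know
GIZA++ has no memory-limit option of its own):

    grep MemFree /proc/meminfo    # free physical memory right now
    vmstat 5                      # while giza runs: non-zero si/so means paging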

In order to get as far as I have (stage 7), I have been working with the
first 1000 lines of the English-French sentence-aligned corpus. I definitely
think it's a good idea to use a smaller data set to begin with. I would stick
to a dataset that doesn't cause giza or its companion programs to start
paging.
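
In case it helps, this is roughly how I carved out my subset (the file names
are just placeholders for your own cleaned corpus; the important thing is to
take the same lines from both sides so the files stay aligned):

    head -n 1000 europarl.clean.fr > small.fr
    head -n 1000 europarl.clean.en > small.en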

I don't have a clue why you're getting the signal 11, but I would try using
another data set to make sure you're doing the right thing and that you don't
have messy data.
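
If you do want to chase the signal 11 itself, one option is to let the crash
leave a core file and pull a backtrace out of it. A sketch, assuming your
shell permits core dumps and gdb is installed (binary and core file names may
differ on your system):

    ulimit -c unlimited    # allow a core file to be written
    # ... re-run the failing GIZA++ step, then:
    gdb GIZA++ core        # type 'bt' at the gdb prompt for a backtrace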

Can you email me the first thousand lines of your sentence-aligned and
cleaned corpus? That way we can narrow the problem down a bit.
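
Before you send it, a couple of quick checks with standard shell tools will
catch the most common kinds of mess (corpus.fr / corpus.en here stand in for
your actual file names):

    wc -l corpus.fr corpus.en            # the two line counts must match
    grep -c '^[[:space:]]*$' corpus.fr   # empty lines are a common culprit
    awk 'NF > max { max = NF } END { print max }' corpus.fr  # longest sentence (words)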

Quoting Hubert Crépy <[EMAIL PROTECTED]>:

> J C Read wrote:
> > According to Wikipedia (http://en.wikipedia.org/wiki/SIGSEGV), signal 11
> > indicates an invalid memory reference.
> >   
> Yes, definitely, what we also call a "coredump" under AIX.
> > I eventually figured out that this was because of the data I was using.
> >   
> That's often the case: an unfortunate data condition that is unexpected
> and unaccounted for in error recovery. That's usually hard to track down,
> though...
> > Things to check:
> >
> > Is the data sentence aligned?
> >   
> Yes, europarl.lowercased.0-0.fr has 73835 lines:
>     reprise de la session
>     je déclare reprise la session du parlement européen qui avait (...)
>     (...)
>     des paroles , pas d' action .
>     en attendant , deux mille personnes ont perdu la vie inutilement , (...)
> and europarl.lowercased.0-0.en has 73835 lines:
>     resumption of the session
>     i declare resumed the session of the european parliament adjourned 
> on (...)
>     (...)
>     more talk . no action .
>     meanwhile , two thousand people in the last year have needlessly (...)
> > Has the data been cleaned with the clean script? (try using sentences of
> > min 1 and max 100)
> >   
> Yes, it went through the script, with the recommended parameters:
> 
>     bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl \
>         working-dir/corpus/europarl.tok fr en \
>         working-dir/corpus/europarl.clean 1 40
> 
> which reduced the number of sentences from the initial 100K to 73835.
> 
> Any other suggestions?
> 
> Say, it couldn't be that the very smallness of my training data (only
> 73K sentences) is causing unexpected underflows or whatever in GIZA,
> could it?
> Does it not make sense to try and run the whole process on a small
> dataset to start with (I don't have access to powerful machines at the
> moment, running this on my personal laptop...)?
> 
> Thanks for your support, much appreciated.
> 
> -- 
> Hubert Crépy


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
