Dear All,
I'm trying to build an LM from a large dataset (11 M sentences). I have
generated the ARPA file with IRSTLM and now I'd like to binarize it using
KenLM.
I called build_binary to estimate memory usage, and I got this:
Memory estimate:
type MB
probing 16129 assuming -p
Hi,
This looks like a bug in the trie implementation due to some recent
changes I made for left state minimization. I'll fix it soon. A
workaround is to pass a large -m option to build_binary.
Sorry,
Kenneth
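
For anyone scripting the workaround above, here is a minimal sketch that only assembles the build_binary command line. It assumes build_binary is on the PATH, that options come before the data structure type and the file names, and that -m takes a value in megabytes as used in this thread. The helper names and file names are hypothetical, and the option may be named differently in other KenLM revisions, so check the usage message of your build first.

```python
import shlex


def build_binary_command(arpa_path, binary_path, trie_memory_mb=None,
                         structure="trie"):
    """Assemble an argv list for KenLM's build_binary (hypothetical helper).

    Assumption: options precede the structure type, which precedes the
    input ARPA file and the output binary file.
    """
    cmd = ["build_binary"]
    if trie_memory_mb is not None:
        if trie_memory_mb <= 0:
            raise ValueError("trie_memory_mb must be positive")
        # -m is the option named in this thread for the trie workaround.
        cmd += ["-m", str(trie_memory_mb)]
    cmd += [structure, arpa_path, binary_path]
    return cmd


def format_command(cmd):
    """Render the argv list as a shell-quoted string for logging."""
    return shlex.join(cmd)
```

To actually run it, pass the list to subprocess.run(cmd, check=True). Whether 2048 is large enough depends on the model, so treat the value as a starting point.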
On 10/08/11 11:34, marco turchi wrote:
> Dear All,
> I'm trying to build a lm
Hi,
thanks a lot for the answer.
Great, so I can use -m 2048 to build it. Do you think it is enough?
Thanks again
Marco
On Sat, Oct 8, 2011 at 12:46 PM, Kenneth Heafield mo...@kheafield.com wrote:
> Hi,
> This looks like a bug in the trie implementation due to some recent
> changes I made
Fixed in revision 4314. There's still an issue with some SRILM models
failing to build that I'll get to soon.
On 10/08/11 11:52, marco turchi wrote:
> Hi,
> thanks a lot for the answer.
> Great, so I can use -m 2048 to build it. Do you think it is enough?
> Thanks again
> Marco
> On Sat, Oct 8,
Thanks!
I'm going to update my version.
Cheers
Marco
On Sat, Oct 8, 2011 at 1:01 PM, Kenneth Heafield mo...@kheafield.com wrote:
> Fixed in revision 4314. There's still an issue with some SRILM models
> failing to build that I'll get to soon.
> On 10/08/11 11:52, marco turchi wrote:
>> Hi,
Hi,
yes, you should prepare the output data with two factors, the lowercased
form and the recased form. You can then train a factored model with a
translation step (lowercase to lowercase) and a generation step (lowercase
to realcase).
-phi
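
As an illustration of the two-factor output data described above, here is a minimal sketch that turns a tokenized, correctly cased sentence into factored tokens. It assumes the usual factored corpus notation, where the factors of a word are joined with "|", with the lowercased form as factor 0 and the recased form as factor 1. The function name is hypothetical, and tokens that already contain "|" are rejected rather than escaped.

```python
def to_factored(sentence):
    """Return 'lowercase|realcase' tokens for one tokenized sentence.

    Assumption: factor 0 is the lowercased form and factor 1 is the
    original (recased) form, joined with '|'.
    """
    factored = []
    for token in sentence.split():
        if "|" in token:
            # The separator must not appear inside a factor; escape or
            # filter such tokens before calling this function.
            raise ValueError("token contains the factor separator: " + token)
        factored.append(token.lower() + "|" + token)
    return " ".join(factored)
```

Running this over the target side of the training corpus gives the data for the translation step (lowercase to lowercase) and the generation step (lowercase to realcase).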
On Fri, Oct 7, 2011 at 12:17 AM, Panos Kanavos
Hi,
Thanks for your suggestion. The problem is that our server is not networked,
so svn is not available. Anyway, I downloaded the GNU tarball and it works.
For those who want to try the 20100813, someone says it will work after
installing Boost. Good luck.
And for