I am trying to train a topic model over a filtered Wikipedia dump. I have tried running the LDA implementation in Mahout (Hadoop), but I always run out of memory. Here is my setup:
1) ~47,000 selected articles from Wikipedia, released by them as a supposedly well-curated set.
2) Preprocessing of the documents to plain-text format, followed by Porter stemming, stop-word removal, and filtering out some errors; this left 41,000 non-empty documents.
3) The unique vocabulary count from the output of the last step is about 440,000 terms (dictionary size).

LDA options chosen: I chose about 1,000 topics to fit the model, with a smoothing parameter of 0.05 (50/numTopics), and decided to use Mahout and MapReduce. The space required by the big dense matrix that LDA uses is 440,000 (vocab) * 1,000 (topics) * 8 (sizeof int64) = 3.52 GB.

Now, is this matrix kept in memory all at once? How is it implemented? Is that space calculation correct? If not, please correct me and notify me of the limitations. I tried reducing the vocabulary count to 220,000 just to test, but even that does not quite work.

At the moment I am running a simple Hadoop setup with 2.00 GB of heap space per machine, on a 2-node cluster with 20 map processes and 4 reducers, but I can change this. By the way, what defines the upper limit on the heap space? I can't go beyond 2 GB; Hadoop says it is above the valid limit.
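For what it's worth, here is my back-of-envelope calculation as a small script, assuming one 8-byte value per (term, topic) cell of the dense topic-term matrix, as described above (the 8-bytes-per-cell figure is my assumption; Mahout's actual internal representation may differ):

```python
# Rough memory estimate for a dense topic-term matrix.
# Assumes 8 bytes per (term, topic) cell; Mahout's real layout may differ.
def lda_matrix_bytes(vocab_size, num_topics, bytes_per_cell=8):
    return vocab_size * num_topics * bytes_per_cell

full = lda_matrix_bytes(440_000, 1000)    # 3,520,000,000 bytes ~= 3.52 GB
halved = lda_matrix_bytes(220_000, 1000)  # 1,760,000,000 bytes ~= 1.76 GB
print(full, halved)
```

Even with the vocabulary halved, the estimate is close to my 2 GB per-machine heap, which is consistent with it "not quite working".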

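In case it matters, this is roughly how I am setting the per-task heap (a sketch of my mapred-site.xml; property name as in Hadoop 1.x, the -Xmx value is the one I am trying to raise):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```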