I am trying to train a topic model over a filtered Wikipedia dump. I have tried running the LDA implementation in Mahout (Hadoop), but I always run out of memory. Here is my setup:
1) ~47,000 selected articles from Wikipedia, released by them as a supposedly well-curated set.
2) Preprocessing of the documents to plain-text format, followed by Porter stemming, stop-word removal, and filtering out some errors; this left 41,000 non-empty documents.
3) The unique vocabulary count from the output of the last step is about 440,000 terms (dictionary size).

LDA options chosen: I chose about 1,000 topics to fit the model, with a smoothing parameter of 0.05 (50/numTopics), and decided to use Mahout and MapReduce. The space required by the big dense matrix that LDA uses is 440,000 (vocab) * 1,000 (topics) * 8 (sizeof int64) = 3.52 GB.

Now, is this matrix kept in memory all at once? How is it implemented? Is that space calculation correct? If not, please correct me and notify me of the limitations. I tried reducing the vocabulary count to 220,000 just to test, but even that does not quite work.

At the moment I am running a simple Hadoop setup with 2.00 GB of heap space per machine, on a 2-node cluster with 20 map processes and 4 reducers, but I can change this. By the way, what defines the upper limit on the heap space? I can't go beyond 2 GB; Hadoop says it is above the valid limit.
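For what it's worth, here is my back-of-envelope calculation as a small script, assuming one 8-byte value per (term, topic) cell of the dense topic-term matrix, as described above (the 8-bytes-per-cell figure is my assumption; Mahout's actual internal representation may differ):

```python
# Rough memory estimate for a dense topic-term matrix.
# Assumes 8 bytes per (term, topic) cell; Mahout's real layout may differ.
def lda_matrix_bytes(vocab_size, num_topics, bytes_per_cell=8):
    return vocab_size * num_topics * bytes_per_cell

full = lda_matrix_bytes(440_000, 1000)    # 3,520,000,000 bytes ~= 3.52 GB
halved = lda_matrix_bytes(220_000, 1000)  # 1,760,000,000 bytes ~= 1.76 GB
print(full, halved)
```

Even with the vocabulary halved, the estimate is close to my 2 GB per-machine heap, which is consistent with it "not quite working".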

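In case it matters, this is roughly how I am setting the per-task heap (a sketch of my mapred-site.xml; property name as in Hadoop 1.x, the -Xmx value is the one I am trying to raise):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```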