On Thu, Feb 21, 2013 at 11:48 AM, David LaBarbera <davidlabarb...@localresponse.com> wrote:
> Is there a rule of thumb for determining "leveling off" of perplexity? Is
> this value controlled by the convergence delta?

The iteration at which the driver will automatically stop issuing new iterations is determined by the convergence delta (stop when (perplexity(iteration_n) / perplexity(iteration_n-1)) - 1 < delta), but what to set the convergence delta to is hard to say; it has to be found empirically.

> Sorry for the table view. I reformatted it with just spaces.

Ah ok, much more readable.

> Document Count   Corpus Size (MB)   Topic Count   Perplexity (every 2 iterations)           Dictionary Size   Runtime (min/iteration)
> 40,044           3.2                10            16.326, 15.418, 15.191, 15.088, 15.028    14,097            1.5
> 40,044           3.2                20            26.461, 24.517, 23.996, 23.805, 23.882    14,097            6
> 40,044           3.2                40            19.722, 18.185, 17.823, 17.680, 17.608    14,097            11.5
> 40,046           3.7                10            19.286, 18.373, 18.092, 17.958, 17.865    98,283            5.5
> 40,046           3.7                20            18.574, 17.448, 17.143, 17.018, 16.940    98,283            10.5
> 44,767           4                  10            19.928, 18.815, 18.521, 18.350, 18.225    31,727            2.5
> 44,767           4                  20            21.838, 20.421, 20.087, 19.963, 19.903    31,727            4.5
> 616,957          58.5               10            14.467, 13.830, 13.583, 13.435, 13.381    151,807           8.5
> 616,957          58.5               20            13.590, 12.787, 12.605, 12.522, 12.476    151,807           16
> 616,972          58.4               10            14.646, 13.904, 13.646, 13.573, 13.543    54,280            4
> 616,967          54.1               10            13.363, 12.634, 12.432, 12.345, 12.283    32,101            2.5
> 616,967          54.1               20            13.195, 12.307, 12.065, 11.764, 11.732    32,101            4.5

If you could pick one of these corpora and topic sizes, run it out to 25-50 iterations, and graph the perplexity after every 2 iterations, you should be able to see visually where the perplexity levels off. Alternately, look at the topics themselves at some of these iterations (say iterations 10, 15, 20, 25, 30), and see where they start to visually gel into something sensible. After some point, they won't appear to change much at all (i.e. if you're inspecting using vectordump --sort, the top 50 terms per topic will typically stop changing after around 20-30 iterations); at that point they're pretty much converged. This latter method (looking at your final output topic clusters) tends to be what I've used to know when I've converged "enough", at least until I've developed an intuition, for my corpora, of how far this algorithm usually needs to go.

> On Feb 21, 2013, at 12:00 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
>> I really can't read your results here; the formatting of your columns is
>> pretty destroyed... It looks like you've got results for 20 topics, as
>> well as for 10, with different-sized corpora?
>>
>> You can't compare convergence between corpus sizes - the perplexity will
>> vary by orders of magnitude between them. The only thing you should be
>> comparing is: for a single fixed corpus, as you run it for 5, 10, 15,
>> 20, ... iterations, what does the (held-out) perplexity look like after
>> each of these? Does it start to level off? At some point you may start
>> overfitting and see the perplexity go back up. Your convergence
>> happened before that.
>>
>> I don't think I've ever needed to run more than 50 iterations, and
>> usually I stop after 20-30. The bigger the corpus, the more this
>> becomes true.
>>
>> On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <davidlabarb...@localresponse.com> wrote:
>>
>>> I've been running some performance tests with the LDA algorithm and I'm
>>> unsure how to gauge them. I ran 10 iterations each time and collected the
>>> perplexity value every 2 iterations with the test fraction set to 0.1. These
>>> were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers). I'm
>>> not sure about the memory or CPU specs. I also stored the documents on HDFS
>>> in 1MB blocks to get some parallelization.
>>> The documents I have were very
>>> short - 10-100 words each. Hopefully these results are clear.
>>>
>>> [results table - column formatting garbled in transit; reposted legibly above]
>>>
>>> The question is how to interpret the results. In particular, is there
>>> anything telling me when to stop running LDA? I've tried running until
>>> convergence, but I've never had the patience to see it finish. Does the
>>> perplexity give some hint to the quality of the results? In attempting to
>>> reach convergence, I saw runs going to 200 iterations. If an iteration
>>> takes around 5.5 minutes, that's 18 hours of processing - and that
>>> doesn't include overhead between iterations.
>>>
>>> David
>>
>> --
>>   -jake

--
  -jake
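For reference, the stopping rule described at the top of the thread (stop once the relative change in held-out perplexity between consecutive iterations falls below the convergence delta) can be sketched in a few lines. This is an illustrative snippet, not Mahout's actual driver code; in particular, taking the absolute value of the relative change is an assumption here, so the test stays meaningful while perplexity is still decreasing.

```python
def has_converged(perplexities, delta=0.001):
    """perplexities: held-out perplexity per evaluated iteration, oldest first."""
    if len(perplexities) < 2:
        return False
    prev, curr = perplexities[-2], perplexities[-1]
    # Relative change between the last two evaluations, compared to delta.
    return abs(curr / prev - 1.0) < delta

def iterations_to_converge(perplexities, delta=0.001):
    """First index at which the sequence would be declared converged, or None."""
    for n in range(1, len(perplexities)):
        if abs(perplexities[n] / perplexities[n - 1] - 1.0) < delta:
            return n
    return None

# Example on one row of the table above (40,044 docs, 10 topics,
# perplexity sampled every 2 iterations):
row = [16.326, 15.418, 15.191, 15.088, 15.028]
print(iterations_to_converge(row, delta=0.01))  # -> 3
```

With delta = 0.01 this row "converges" at the fourth sample (the drop from 15.191 to 15.088 is under 1%); a tighter delta would keep iterating, which is exactly why the delta has to be chosen empirically per corpus.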