On Thu, Feb 21, 2013 at 11:48 AM, David LaBarbera <
davidlabarb...@localresponse.com> wrote:

> Is there a rule of thumb for determining "leveling off" of perplexity? Is
> this value controlled by the convergence delta?
>

The point at which the driver automatically stops issuing new
iterations is determined by the convergence delta (if
|perplexity(iteration_n) / perplexity(iteration_n-1) - 1| < delta, stop -
that is, stop when the relative change in perplexity between successive
iterations falls below delta), but what to set the convergence delta to is
hard to say in general, and must be found empirically.
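
In code, that check amounts to something like the following (a minimal
sketch, not Mahout's actual driver code - the method and parameter names
here are made up for illustration):

    // Minimal sketch of the stopping rule described above -- not
    // Mahout's actual driver code; names are illustrative only.
    static boolean hasConverged(double previousPerplexity,
                                double currentPerplexity,
                                double convergenceDelta) {
      // Relative change in held-out perplexity between successive
      // iterations; small values mean the model has stopped improving.
      double relativeChange =
          Math.abs(currentPerplexity / previousPerplexity - 1.0);
      return relativeChange < convergenceDelta;
    }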


> Sorry for the table view. I reformatted it with just spaces.
>

Ah ok, much more readable.

> Document Count  Corpus (MB)  Topics  Perplexity (every 2 iterations)         Dictionary Size  Runtime (min/iter)
> 40,044          3.2          10      16.326, 15.418, 15.191, 15.088, 15.028  14,097           1.5
> 40,044          3.2          20      26.461, 24.517, 23.996, 23.805, 23.882  14,097           6
> 40,044          3.2          40      19.722, 18.185, 17.823, 17.680, 17.608  14,097           11.5
> 40,046          3.7          10      19.286, 18.373, 18.092, 17.958, 17.865  98,283           5.5
> 40,046          3.7          20      18.574, 17.448, 17.143, 17.018, 16.940  98,283           10.5
> 44,767          4            10      19.928, 18.815, 18.521, 18.350, 18.225  31,727           2.5
> 44,767          4            20      21.838, 20.421, 20.087, 19.963, 19.903  31,727           4.5
> 616,957         58.5         10      14.467, 13.830, 13.583, 13.435, 13.381  151,807          8.5
> 616,957         58.5         20      13.590, 12.787, 12.605, 12.522, 12.476  151,807          16
> 616,972         58.4         10      14.646, 13.904, 13.646, 13.573, 13.543  54,280           4
> 616,967         54.1         10      13.363, 12.634, 12.432, 12.345, 12.283  32,101           2.5
> 616,967         54.1         20      13.195, 12.307, 12.065, 11.764, 11.732  32,101           4.5
>

If you could pick one of these corpora and topic sizes, run it out to
25-50 iterations, and graph the perplexity after every 2 iterations, you
should be able to see visually where the perplexity levels off.
Alternatively, look at the topics themselves for some of these iterations
(say iterations 10, 15, 20, 25, 30), and see where they start to gel into
something sensible.  After some point, they won't appear to change very
much at all (i.e. if you're inspecting them using vectordump --sort, the
top 50 terms per topic will typically stop changing after around 20-30
iterations); at that point they're pretty much converged.
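
If you'd rather not eyeball a graph, the same leveling-off point can be
found numerically. Here's a rough sketch (again, not part of Mahout; the
names are made up) that scans a series of perplexity samples for the first
one whose relative improvement over the previous sample falls below a
threshold:

    // Rough sketch (not part of Mahout): given perplexity sampled
    // every 2 iterations, return the index of the first sample whose
    // relative improvement over the previous one is below 'threshold'.
    static int levelingOffIndex(double[] perplexities, double threshold) {
      for (int i = 1; i < perplexities.length; i++) {
        double relativeChange =
            Math.abs(perplexities[i] / perplexities[i - 1] - 1.0);
        if (relativeChange < threshold) {
          return i;  // index into the sampled iterations
        }
      }
      return -1;  // never leveled off within this run
    }

On your first series above (16.326, 15.418, 15.191, 15.088, 15.028), a
threshold of 0.01 would return index 3, since the drop from 15.191 to
15.088 is only about 0.7%.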

This latter method (inspecting the final output topics) is what I've
tended to use to decide when a run has converged "enough" - at least
until, for my own corpora, I've built up an intuition for how many
iterations this algorithm usually needs.


>
>
>
> On Feb 21, 2013, at 12:00 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
> > I really can't read your results here; the formatting of your columns is
> > pretty destroyed.  It looks like you've got results for 20 topics, as
> > well as for 10, with different sized corpora?
> >
> > You can't compare convergence between corpus sizes - the perplexity will
> > vary by orders of magnitude between them.  The only thing you should be
> > comparing is, for a single fixed corpus, as you run it for 5, 10, 15,
> > 20, ... iterations, what does the (held-out) perplexity look like after
> > each of these?  Does it start to level off?  At some point you may start
> > overfitting and the perplexity will go back up.  Your convergence
> > happened before that.
> >
> > I don't think I've ever needed to run more than 50 iterations, and
> > usually I stop after 20-30.  The bigger the corpus, the more this becomes
> > true.
> >
> >
> > On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
> > davidlabarb...@localresponse.com> wrote:
> >
> >> I've been running some performance tests with the LDA algorithm and I'm
> >> unsure how to gauge them. I ran 10 iterations each time and collected
> >> the perplexity value every 2 iterations, with the test fraction set to
> >> 0.1. These were all run on an AWS cluster with 10 nodes (70 mappers,
> >> 30 reducers). I'm not sure about the memory or CPU specs. I also stored
> >> the documents on HDFS in 1MB blocks to get some parallelization. The
> >> documents I have were very short - 10-100 words each.  Hopefully these
> >> results are clear.
> >>
> >> Document Count  Corpus (MB)  Topics  Perplexity (every 2 iterations)         Dictionary Size  Runtime (min/iter)
> >> 40,044          3.2          10      16.326, 15.418, 15.191, 15.088, 15.028  14,097           1.5
> >> 40,044          3.2          20      26.461, 24.517, 23.996, 23.805, 23.882  14,097           6
> >> 40,044          3.2          40      19.722, 18.185, 17.823, 17.680, 17.608  14,097           11.5
> >> 40,046          3.7          10      19.286, 18.373, 18.092, 17.958, 17.865  98,283           5.5
> >> 40,046          3.7          20      18.574, 17.448, 17.143, 17.018, 16.940  98,283           10.5
> >> 44,767          4            10      19.928, 18.815, 18.521, 18.350, 18.225  31,727           2.5
> >> 44,767          4            20      21.838, 20.421, 20.087, 19.963, 19.903  31,727           4.5
> >> 616,957         58.5         10      14.467, 13.830, 13.583, 13.435, 13.381  151,807          8.5
> >> 616,957         58.5         20      13.590, 12.787, 12.605, 12.522, 12.476  151,807          16
> >> 616,972         58.4         10      14.646, 13.904, 13.646, 13.573, 13.543  54,280           4
> >> 616,967         54.1         10      13.363, 12.634, 12.432, 12.345, 12.283  32,101           2.5
> >> 616,967         54.1         20      13.195, 12.307, 12.065, 11.764, 11.732  32,101           4.5
> >>
> >> The question is how to interpret the results. In particular, is there
> >> anything telling me when to stop running LDA? I've tried running until
> >> convergence, but I've never had the patience to see it finish. Does the
> >> perplexity give some hint about the quality of the results? In attempting
> >> to reach convergence, I saw runs going to 200 iterations. If an iteration
> >> takes around 5.5 minutes, that's 18 hours of processing - and that
> >> doesn't include overhead between iterations.
> >>
> >> David
> >
> >
> >
> >
> > --
> >
> >  -jake
>
>


-- 

  -jake
