So I'm in the process of getting the LDA improvements I've got brewing
over on GitHub ready, doing my due diligence with more unit tests and
so forth, and I'm trying to figure out the best way to unit test
something like this, so I wanted to get y'all's advice:

Unit tests don't really test for correctness of a complex problem.

Integration tests can, but they need to run against some known dataset
which is small and for which you have an alternate means of computing
the result (or a reference set of results computed by another program,
assuming the algorithm is completely deterministic, etc.).  Integration
tests also take a really long time to run, and our tests already take
forever.
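For the small-and-deterministic case, here's roughly what I have in
mind: a tiny two-vocabulary corpus where the topics are obvious, a
fixed seed, and an assertion that the learned topics separate the
vocabularies.  This is only a sketch -- trainLda() is a hypothetical
stand-in for whatever entry point the patch ends up exposing:

  import org.junit.Test;
  import static org.junit.Assert.assertTrue;

  public class TinyLdaTest {

    // Hypothetical stand-in for the real LDA entry point: takes docs
    // as term-id arrays, returns a numTopics x vocabSize topic-term
    // matrix.
    private static double[][] trainLda(int[][] docs, int numTopics,
                                       int vocabSize, long seed) {
      throw new UnsupportedOperationException("wire to real trainer");
    }

    @Test
    public void twoObviousTopicsSeparate() {
      // Terms 0-2 only ever co-occur with each other, likewise 3-5,
      // so a 2-topic model should put nearly all of each topic's
      // mass on one of the two vocabularies.
      int[][] docs = {
          {0, 1, 2, 0, 1, 2}, {1, 2, 0, 2, 1, 0},
          {3, 4, 5, 3, 4, 5}, {4, 5, 3, 5, 4, 3},
      };
      double[][] topics = trainLda(docs, 2, 6, 42L);
      for (double[] topic : topics) {
        double left = topic[0] + topic[1] + topic[2];
        double right = topic[3] + topic[4] + topic[5];
        // nearly all the mass should land on one side
        assertTrue(Math.max(left, right) > 0.9);
      }
    }
  }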

Internally, when testing on public data, I've been running the LDA code
against 20newsgroups, which provides a really good check of the
distributed implementation against simple sequential in-memory results,
and you can eyeball the topics and see that they really make sense.
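To turn the eyeballing into an automated check, one comparison that
doesn't care about topic ordering is the model's log-likelihood on the
same corpus: train both ways from the same seed and assert the scores
agree within a tolerance.  Again just a sketch -- trainSequential(),
trainDistributed(), and logLikelihood() are hypothetical hooks, and the
tight tolerance assumes both code paths are fully deterministic:

  import org.junit.Test;
  import static org.junit.Assert.assertEquals;

  public class DistributedVsSequentialLdaTest {

    // Hypothetical hooks: both trainers return a numTopics x
    // vocabSize topic-term matrix from the same corpus and seed.
    static double[][] trainSequential(String corpus, int k, long seed) {
      throw new UnsupportedOperationException("wire to in-memory LDA");
    }
    static double[][] trainDistributed(String corpus, int k, long seed) {
      throw new UnsupportedOperationException("wire to distributed LDA");
    }
    static double logLikelihood(double[][] model, String corpus) {
      throw new UnsupportedOperationException("wire to scoring code");
    }

    @Test
    public void distributedMatchesSequential() {
      String corpus = "src/test/resources/mini-corpus.seq"; // fixture
      double[][] seq  = trainSequential(corpus, 20, 42L);
      double[][] dist = trainDistributed(corpus, 20, 42L);
      // Topic indices can come back permuted between runs, so compare
      // a permutation-invariant score, not the matrices directly.
      assertEquals(logLikelihood(seq, corpus),
                   logLikelihood(dist, corpus), 1e-3);
    }
  }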

But I can't really write unit tests against 20newsgroups, as it's
probably too big, right?  Timewise, the tests could be slow, but
spacewise, maybe not: the vectorized corpus matrix is 8MB, and the
serialized topic model (20 topics, natch) is about 1.7MB.
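If even that's too big to check in, one option is a seeded subsample: a
fixed handful of docs per group, committed as a fixture, so the test
data stays small and the sampling is reproducible.  A sketch in plain
java.io/java.nio, assuming the usual one-file-per-message layout of the
20newsgroups distribution:

  import java.io.File;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  public class SampleNewsgroups {
    public static void main(String[] args) throws IOException {
      File src = new File(args[0]); // 20news root, one dir per group
      File dst = new File(args[1]); // where the fixture gets written
      int docsPerGroup = 20;        // keeps the whole fixture tiny
      Random rnd = new Random(42L); // fixed seed => reproducible sample
      for (File group : src.listFiles()) {
        if (!group.isDirectory()) continue;
        List<File> docs = Arrays.asList(group.listFiles());
        Collections.shuffle(docs, rnd);
        File outDir = new File(dst, group.getName());
        outDir.mkdirs();
        for (File doc : docs.subList(0,
            Math.min(docsPerGroup, docs.size()))) {
          Files.copy(doc.toPath(),
                     new File(outDir, doc.getName()).toPath());
        }
      }
    }
  }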

Any thoughts on the proper way to write unit tests for something like this?

I guess I could just start putting up patches on JIRA tickets and
asking for suggestions on where to put unit tests, given the code in
the patches.

  -jake
