So I'm in the process of getting the LDA improvements I've got brewing over on GitHub ready, doing my due diligence by writing more unit tests and so forth, and I'm trying to figure out the best way to unit test something like this. I wanted to get y'all's advice:
Unit tests don't really test for correctness of a complex problem. Integration tests can, but they need to be run against some known data set which is small and for which you have an alternate means of computing the result (or a reference set of computed results from another program, with a completely deterministic algorithm, etc.). Integration tests also take a really long time to run, and our tests already take forever.

Internally, when testing on public data, I've been running the LDA code against 20newsgroups, which provides a really good check of distributed correctness vs. simple sequential in-memory results, and you can eyeball the topics and see that they really make sense. But I can't really write unit tests against 20newsgroups, as it's probably too big, right? Timewise, running the tests could be slow, but spacewise, maybe not: the vectorized corpus matrix is 8MB, and the serialized topic model (20 topics, natch) is about 1.7MB.

Any thoughts on the proper way to write unit tests for something like this? I guess I could just start putting up patches on JIRA tickets and ask for suggestions of where to put unit tests, given the code in the patches.

-jake
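For what it's worth, one way to make the "distributed vs. sequential" check cheap enough for a unit test is to skip 20newsgroups entirely and compare the two learned topic-word matrices on a tiny synthetic corpus with a couple of obvious topics. Since LDA topics come back in arbitrary order, the comparison has to align topics first. Here's a minimal sketch of that idea (plain Python for illustration, not Mahout code; `align_and_compare` and the fixture data are hypothetical):

```python
def align_and_compare(ref_topics, test_topics):
    """Greedily match each reference topic to its closest remaining test
    topic by L1 distance; return the worst matched distance."""
    remaining = list(range(len(test_topics)))
    worst = 0.0
    for ref in ref_topics:
        # distance from this reference topic to each unmatched test topic
        dists = [(sum(abs(r - t) for r, t in zip(ref, test_topics[j])), j)
                 for j in remaining]
        d, j = min(dists)
        remaining.remove(j)
        worst = max(worst, d)
    return worst

# Tiny two-topic fixture: word distributions over a 4-word vocabulary.
# In a real test these would come from the sequential and distributed runs.
sequential = [[0.45, 0.45, 0.05, 0.05],   # topic concentrated on words 0,1
              [0.05, 0.05, 0.45, 0.45]]   # topic concentrated on words 2,3
distributed = [[0.06, 0.04, 0.44, 0.46],  # same topics, permuted and noisy
               [0.44, 0.46, 0.06, 0.04]]

# The distributed result should match the sequential one up to topic
# permutation and a small tolerance.
assert align_and_compare(sequential, distributed) < 0.1
```

With a corpus that small, the fixture fits in the test source itself, so there's no data file to check in, and the run takes milliseconds instead of chewing through 8MB of vectorized 20newsgroups.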