On Nov 2, 2011, at 5:13 AM, Jake Mannix wrote:

> So I'm in the process of getting the LDA improvements I've got brewing over
> on GitHub ready, doing my due diligence by adding more unit tests and so
> forth. I'm trying to figure out the best way to unit test something like
> this, and I wanted to get y'all's advice:
> 
> Unit tests don't really test for correctness of a complex problem.
> 
> Integration tests can, but they need to be run against some known dataset
> which is small and for which you have an alternate means of computing the
> result (or a reference set of results computed by another program, provided
> the algorithm is completely deterministic, etc.).  Integration tests also
> take a really long time to run, and our tests already take forever.
> 
> Internally, when testing on public data, I've been running the LDA code
> against 20newsgroups, which provides a really good check of the distributed
> results against simple sequential in-memory ones, and you can eyeball the
> topics and see that they really make sense.
> 
> But I can't really write unit tests against 20newsgroups, as it's probably
> too big, right?  Time-wise, running the tests could be slow, but
> space-wise, maybe not: the vectorized corpus matrix is 8MB, and the
> serialized topic model (20 topics, natch) is about 1.7MB.
> 
> Any thoughts on the proper way to write unit tests for something like this?
> 

8MB doesn't strike me as too big these days.  We might also look at TestNG,
which I think lets you mark tests as integration tests so that they can be
executed separately.  For instance, we might set up Jenkins to run them every
hour or something.  I've also got a machine here just waiting to do more
regular testing.  I guess the bigger question is: what is the license on the
data?
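
For concreteness, a rough sketch of how that split could look with TestNG
(the class and method names here are hypothetical; the groups attribute and
Surefire's -Dgroups switch are the actual mechanism):

    import org.testng.annotations.Test;

    public class LDAIntegrationTest {

      // Cheap check: runs with every build.
      @Test(groups = "unit")
      public void modelSerializationRoundTrips() {
        // ... build a tiny model, write it out, read it back, compare ...
      }

      // Expensive check against a real corpus: excluded from the default
      // run and executed separately, e.g. hourly on Jenkins via
      //   mvn test -Dgroups=integration
      @Test(groups = "integration")
      public void distributedLDAMatchesSequentialOn20Newsgroups() {
        // ... train both ways on the vectorized corpus, compare topics ...
      }
    }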

Alternatively, the ASF email data is license-free.  We could take and use a
chunk of that.  You can pretty much have as much or as little as you want.
Since it's broken down by project, it has the rough look and feel of
20newsgroups at a much bigger scale.

Dawid has also been doing some cool things w/ testing over in Lucene:
http://www.lucidimagination.com/sites/default/files/file/Eurocon2011/dweiss-Eurocon_testing.pdf
We aren't ready for it yet, but I would love to get some of those ideas
incorporated.

Ted and I also had a discussion (over beers) once about the notion of a project
simply titled "good fake data".  It would be nice to be able to generate
reasonable data automatically, at whatever volume you need and with the right
"shape", but of course this is non-trivial.

We should also look into Maven's parallelization of tests.  Parallel test
execution has made a big difference in Lucene, but I've heard Maven's support
for it isn't as good.
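
For reference, the Surefire knob for this looks roughly like the following
(the thread count is a placeholder; parallel also accepts "methods" and
"both"):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- Run test classes concurrently on 4 threads. -->
        <parallel>classes</parallel>
        <threadCount>4</threadCount>
      </configuration>
    </plugin>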


> I guess I could just start putting up patches on JIRA tickets, and ask for
> suggestions of where to put unit tests given the code in the patches.
> 

+1
