Hi,
recently I spent a week in Sweden; getting there involved spending six hours on a ferry. I spent that time on some (potentially crazy) refactoring ideas I had wanted to try out with Mahout for quite some time. This is to share the results with you and maybe get some feedback. When looking at the final result, keep in mind that it is the result of a spike, so it's more a proof of concept than anything clean and polished.

The goals I wanted to accomplish were two-fold:

a) Running the Mahout tests takes ages, at least on my machine (I admit that one is pretty dated; however, even running them on a faster laptop with an SSD and three times more memory did not help).

b) When telling the story of Mahout I am faced with comments along the lines of "I don't want to install Hadoop to run it" over and over again. It seems non-obvious that Mahout actually aims to be a stable, scalable Java library that comes with the added benefit of having quite a few of its algorithms implemented on Hadoop.

The third goal was pretty selfish, namely to get some experience with the random testing framework used over in Lucene-land.

Here's what I did:

- introduce the RandomTestRunner
- fix all issues that bubbled up as a result (I ran into quite a few problems due to threads not being shut down, either as part of the test (not too critical) or even during regular computation; I'm happy to isolate what I ran into here and file separate issues (including fixes, obviously) for each of these)
- mark all long running tests with the nightly annotation; my goal here is not to switch them off forever but rather to draw contributors' attention to those running particularly long (>20s) and fix them
- convert our current core module into a parent, move any code in that into a submodule called stuff
- move anything out of stuff and into module write that concerns serialization and is reasonably algorithm independent
- move anything out of stuff and into module hadoop that really needs mapreduce to run
- move anything out of stuff and into cli that offers just a command line interface to implementations (I might have missed some jobs here that still contain logic in addition to the command line stuff; all I did was go through and fix failing tests. For several jobs I factored the parameters into separate beans to deal with default values; I suppose some of Frank's work could come in handy when doing that right.)
- factor some of the unit testing utils into their own modules (those two could be collapsed, actually) to avoid depending on running the tests just for compiling all the source code

Here's where you can look at the results:

https://github.com/MaineC/mahout/tree/swedish-refactoring

Unfortunately there's one huge commit on June 26th; I was naive enough to believe in a "let me just do a tiny refactoring and commit the successfully building code back" kind of assumption...

If there's any interest in any of the results above, I'd be happy to go through the refactoring work in a non-feasibility-study mode and change things piece by piece.

There are still several open questions, like enabling tests to run in parallel (the changes I made to get that into the parent pom are disabled right now, as I ran into errors due to tests failing when running in parallel with other tests). Our tests could also benefit a lot from using more of the random testing functionality. Naming is more than sub-optimal.
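
To illustrate the kind of thread leak mentioned in the list above, here is a minimal sketch. It is not code from the branch: ParallelSum and parallelSum are made-up names, and the 10 second timeout is an arbitrary assumption. The point is only that the executor is shut down in a finally block, so no worker threads outlive the call, whether it returns normally or throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical example, not taken from Mahout.
public class ParallelSum {

  /** Sums the values on a fixed thread pool and always shuts the pool down. */
  public static long parallelSum(int[] values, int numThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      // Split the input into roughly equal chunks, one task per chunk.
      int chunk = (values.length + numThreads - 1) / numThreads;
      List<Future<Long>> parts = new ArrayList<>();
      for (int start = 0; start < values.length; start += chunk) {
        final int from = start;
        final int to = Math.min(values.length, start + chunk);
        Callable<Long> task = () -> {
          long sum = 0;
          for (int i = from; i < to; i++) {
            sum += values[i];
          }
          return sum;
        };
        parts.add(pool.submit(task));
      }
      long total = 0;
      for (Future<Long> part : parts) {
        total += part.get();
      }
      return total;
    } finally {
      // Without this, the pool's non-daemon threads stay alive after the
      // computation and show up as leaked threads in the test run.
      pool.shutdown();
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        pool.shutdownNow();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    System.out.println(parallelSum(values, 3)); // prints 55
  }
}
```

A runner that checks for leftover threads after each test would flag the same method if the finally block were missing.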

In the end I'd love to have an "all Mahout included" jar in addition to the individual modules, etc.

Feedback welcome; also feel free to ignore in case that stuff is just too far from the current roadmap,

Isabel

PS: If you read this: Alan, thanks for getting us to drive to the coast instead of into the mountains close to the Arctic Circle. That recommendation was awesome!
