Hi,

recently I spent a week in Sweden - getting there involved spending 6h on a 
ferry. I spent that time on some (potentially crazy) refactoring ideas I had 
wanted to try out with Mahout for quite some time. This is to share the results 
with you and maybe get some feedback. When looking at the final result, keep in 
mind that it comes out of a spike - so it's more a proof of concept than 
anything clean and polished.

The goals I wanted to accomplish were two-fold:

a) Running the Mahout tests takes ages, at least on my machine (I admit that 
that one is pretty dated, however even running them on a faster laptop with ssd 
and three times more memory did not help).

b) When telling the story of Mahout I am faced with comments along the lines of 
"I don't want to install Hadoop to run it" over and over again. It seems 
non-obvious that Mahout actually aims to be a stable, scalable Java library that 
comes with the added benefit of having quite a few of its algorithms 
implemented on Hadoop.

The third goal was pretty selfish, namely to get some experience with the 
random testing framework used over in Lucene-land.

Here's what I did:

- introduce the RandomTestRunner

- fix all issues that bubbled up as a result (I ran into quite a few problems 
due to threads not being shut down, either as part of the test (not too 
critical) or even during regular computation - happy to isolate what I ran into 
here and file separate issues (including fixes obviously) for each of these)

- mark all long-running tests with the nightly annotation - my goal here is not 
to switch them off forever but rather to draw contributors' attention to those 
running particularly long (>20s) so that they get fixed

- convert our current core module into a parent, move any code in that into a 
submodule called stuff

- move anything out of stuff and into module write that concerns serialization 
and is reasonably algorithm independent

- move anything out of stuff and into module hadoop that really needs mapreduce 
to run

- move anything out of stuff and into cli that offers just a command line 
interface to implementations (I might have missed some jobs here that still 
contain logic in addition to the command line stuff; all I did was to go 
through and fix failing tests. For several jobs I factored the parameters into 
separate beans to deal with default values; I suppose some of Frank's work 
could come in handy when doing that right.)

- factor some of the unit testing utils into their own modules (those two could 
be collapsed actually) to avoid depending on running the tests just for 
compiling all the source code.
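
To illustrate the parameter beans mentioned for the cli module, here is a 
minimal sketch in plain Java. It is not the code from the branch: the class 
names ClusteringParams and ClusteringParamsCli, the fields, the default values 
and the option names are all made up for illustration. The only point is that 
defaults live in the bean, while the command line layer merely overrides them.

```java
// Hypothetical parameter bean: names, fields and defaults are illustrative
// assumptions, not taken from Mahout or from the refactoring branch.
final class ClusteringParams {
    private String inputPath;                 // required, no sensible default
    private int numClusters = 10;             // defaults apply unless overridden
    private int maxIterations = 20;
    private double convergenceDelta = 0.001;

    String getInputPath() { return inputPath; }
    int getNumClusters() { return numClusters; }
    int getMaxIterations() { return maxIterations; }
    double getConvergenceDelta() { return convergenceDelta; }

    void setInputPath(String inputPath) { this.inputPath = inputPath; }
    void setNumClusters(int numClusters) { this.numClusters = numClusters; }
    void setMaxIterations(int maxIterations) { this.maxIterations = maxIterations; }
    void setConvergenceDelta(double delta) { this.convergenceDelta = delta; }
}

// Thin command line layer: it only maps "--option value" pairs onto the bean,
// so the algorithm code never has to look at String[] args.
final class ClusteringParamsCli {
    static ClusteringParams parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Expected --option value pairs");
        }
        ClusteringParams params = new ClusteringParams();
        for (int i = 0; i < args.length; i += 2) {
            String value = args[i + 1];
            switch (args[i]) {
                case "--input":
                    params.setInputPath(value);
                    break;
                case "--numClusters":
                    params.setNumClusters(Integer.parseInt(value));
                    break;
                case "--maxIter":
                    params.setMaxIterations(Integer.parseInt(value));
                    break;
                case "--convergenceDelta":
                    params.setConvergenceDelta(Double.parseDouble(value));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        if (params.getInputPath() == null) {
            throw new IllegalArgumentException("--input is required");
        }
        return params;
    }

    public static void main(String[] args) {
        ClusteringParams params = parse(args);
        System.out.println("input=" + params.getInputPath()
            + " numClusters=" + params.getNumClusters()
            + " maxIterations=" + params.getMaxIterations()
            + " convergenceDelta=" + params.getConvergenceDelta());
    }
}
```

With a split like this, a bean could stay with the algorithm code while the 
parsing class moves to the cli module; whether that matches the layout in the 
branch is an assumption.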


Here's where you can look at the results:

https://github.com/MaineC/mahout/tree/swedish-refactoring

Unfortunately there's one huge commit on June 26th - I was naive enough to 
believe in the "let me just do a tiny refactoring and commit the successfully 
building code back" kind of assumption...

If there's any interest in any of the results above, I'd be happy to go through 
the refactoring work in a non-feasibility-study mode and change things piece by 
piece. There are still several open questions: Running tests in parallel is not 
enabled yet (the changes I made to get that into the parent pom are disabled 
right now, as I ran into errors due to tests failing when running in parallel 
to other tests). Our tests could benefit a lot from using more of the random 
testing functionality. Naming is more than sub-optimal. In the end I'd love to 
have an "all mahout included" jar in addition to the individual modules, etc.
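
For readers who have not seen the random testing idea before, the core of it 
can be sketched in plain Java. This is deliberately not the API of the 
framework used in Lucene; the class SeededRandomCheck and all of its methods 
are hypothetical. It only shows the principle: draw test data from a seeded 
source of randomness and report the seed on failure, so a failing run can be 
repeated exactly.

```java
import java.util.Random;

// Plain-Java sketch of seed-based random testing. It does not use the
// Lucene random testing framework; all names here are hypothetical.
final class SeededRandomCheck {

    // Stand-in for real library code under test.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Generates test data from the given Random only, so the same seed
    // always yields the same vector.
    static double[] randomVector(Random random, int size) {
        double[] v = new double[size];
        for (int i = 0; i < size; i++) {
            v[i] = random.nextDouble() * 200.0 - 100.0;
        }
        return v;
    }

    // Checks a simple property on many random inputs. On failure the
    // message carries the seed, so the exact run can be reproduced.
    static void checkDotIsSymmetric(long seed, int iterations) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            int size = 1 + random.nextInt(50);
            double[] a = randomVector(random, size);
            double[] b = randomVector(random, size);
            if (Math.abs(dot(a, b) - dot(b, a)) > 1e-9) {
                throw new AssertionError(
                    "dot(a, b) != dot(b, a); reproduce with seed " + seed);
            }
        }
    }

    public static void main(String[] args) {
        // A fresh seed per run unless one is passed in to replay a failure.
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("Running with seed " + seed);
        checkDotIsSymmetric(seed, 1000);
    }
}
```

Passing a previously printed seed as the first argument replays the same 
inputs.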


Feedback welcome - also feel free to ignore in case that stuff is just too far 
from the current roadmap,

Isabel


PS: If you read this - Alan, thanks for getting us to drive to the coast 
instead of into the mountains close to the arctic circle - that recommendation 
was awesome!
