Hey Mike! :)
- Classloading - I have had many problems with NutchConf due to the
way it loads its resources. In a J2EE scenario, it's simply evil :)
Would there be any great problem with switching its classloader to
Thread.currentThread().getContextClassLoader() instead of the current
static classloader? It's a lot 'friendlier' to do it this way. I can
submit a patch to do this very quickly if others are keen (or anyone
can do it - I've done it locally, takes about 30 keystrokes!)
The only code I know of that uses class loading is the plugin system.
But we have a very simple model there: each plugin has its own class
loader, and dependent plugins share its jar files through a
URLClassLoader. The parent class loader of a plugin class loader is
the static class loader of the plugin class, but there are no other
hierarchies.
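For reference, the context class loader lookup proposed above could be sketched roughly like this. ResourceLoaders is a made-up helper name, not an existing Nutch class, and the fallback order is only an assumption about what a container-friendly lookup might want:

```java
import java.io.InputStream;

// Hypothetical helper (not part of Nutch): picks a class loader in a
// container-friendly order instead of always using the static one.
final class ResourceLoaders {
    private ResourceLoaders() {}

    // Prefer the thread context class loader, which a J2EE container sets
    // per web application; fall back to this class's own loader, then to
    // the system class loader.
    static ClassLoader resolve() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ResourceLoaders.class.getClassLoader();
        }
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader;
    }

    // Opens a named resource through the resolved loader.
    // Returns null when the resource cannot be found.
    static InputStream open(String name) {
        return resolve().getResourceAsStream(name);
    }
}
```

Whether the plugin class loaders should also take their parent from this lookup is a separate question; the sketch only covers the resource lookup side.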
- Statics - On that issue, there are an awful lot of static classes
and methods around. This makes configuring and using Nutch in 'non
standard' ways difficult as things are hard coded together (for
example I can't easily swap out NutchConf to do my own configuration
mechanism as it's all static accesses!). Is there any interest in
removing / refactoring these statics out to make Nutch more flexible?
Are you using 0.7 or 0.8? One of the current issues in 0.8 is moving
to non-static access to the Nutch configuration.
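To illustrate what non-static access could look like in general, here is a small sketch. All names in it (Conf, MapConf, Fetcher, the key "example.threads") are invented for the example and are not the actual 0.8 API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical configuration abstraction; callers depend on this
// interface rather than on static methods of a concrete class.
interface Conf {
    String get(String key, String defaultValue);
}

// One possible implementation backed by an in-memory map. A J2EE
// deployment could supply its own implementation instead.
final class MapConf implements Conf {
    private final Map<String, String> values = new HashMap<>();

    MapConf set(String key, String value) {
        values.put(key, value);
        return this;
    }

    @Override
    public String get(String key, String defaultValue) {
        String value = values.get(key);
        return value != null ? value : defaultValue;
    }
}

// A component receives its configuration through the constructor,
// so the configuration mechanism can be swapped without touching it.
final class Fetcher {
    private final Conf conf;

    Fetcher(Conf conf) {
        this.conf = conf;
    }

    int threads() {
        // "example.threads" is an illustrative key, not a real Nutch property.
        return Integer.parseInt(conf.get("example.threads", "10"));
    }
}
```

With something along these lines, swapping in your own configuration mechanism means passing a different Conf instance, for example new Fetcher(new MapConf().set("example.threads", "4")).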
- Plugins / physical files - Quite a lot of stuff in Nutch seems to
rely on physical files (for example plugins are loaded by looking for
the "/plugins" directory on disk IIRC). In a J2EE environment, this
means you can't deploy the WAR as a non-expanded WAR for example. Can
we switch from loading files directly to loading resources as streams?
This means you can load a file from the classloader regardless of
whether or not it exists as a physical file.
Why not? The problem back in the day was loading the physical jars and
plugin.xml files.
If you see other possibilities, let us know.
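As a rough sketch of the stream-based approach, something like the following reads a resource by name without requiring it to exist as a file on disk. ClasspathResources is a hypothetical helper, not existing Nutch code, and it does not solve discovery: a class loader cannot list a directory, so the plugin.xml locations would have to be known some other way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of Nutch): reads a named resource through
// a class loader, so it works from an expanded directory, a jar, or an
// unexpanded WAR alike.
final class ClasspathResources {
    private ClasspathResources() {}

    // Returns the resource bytes, or null when the loader cannot find it.
    static byte[] read(ClassLoader loader, String name) {
        try (InputStream in = loader.getResourceAsStream(name)) {
            if (in == null) {
                return null;
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int count;
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }
}
```

The plugin jars themselves are the harder part, since a URLClassLoader expects URLs it can open, which is presumably why the physical directory was used in the first place.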
More as I play more tomorrow - great work so far though, I love what I
see. I know I'm using things as they're "not meant to be used" but I'm
a big fan of flexible, simple systems and I think Nutch could get
there with only a little work.
Such things were already discussed in the context of putting Nutch on
top of JMX; there is a mail from Doug somewhere about plans to port
the Nutch tools to so-called actions.
I think that is somewhat out of scope today, since 0.8 introduces
MapReduce and your processing tasks are executed in a kind of
container (the tasktracker).
As mentioned, I guess 0.8 would be more interesting for you if you are
planning to integrate Nutch into a J2EE environment, since you only
need to bring the tasktracker (worker) and jobtracker (manager) into
the container as (JMX) beans. One problem could be that the
tasktracker executes its tasks in a new JVM, but this can be changed
easily. However, executing tasks in a new JVM makes a lot of sense for
us, since it provides the kind of stability we need in such a
grid-style distributed application.
Stefan