I have never looked at how Nutch works, nor have I used it. My questions might just be RTFM-related.

Lately people have asked me to help them out with simple domain-specific web indexing services. The requirements are, as usual when I'm involved, to run on very limited resources. What I did was combine my very simple and minimalistic servlet engine <http://sf.net/project/servlet> with Lucene and NekoHTML, extracting only the content "frame" from the static design of the site.

This made me think of two things:

It would be nice to use the features of Nutch instead of my own hacky stuff. How tightly is Nutch bound to the J2EE container? Would it be a big job to make it run on an alternative GUI? Or is the container used for more than the GUI? I.e., do all services (crawler, etc.) run within the container? Do they have to?

It would be nice to automatically detect the content "frame" by analyzing the DOM tree of the pages on a site. Is there such a feature in Nutch, either built in or contributed, or publicly available in some other project? I'd be more than happy to discuss, write and contribute it back if I end up making one.

-- 
karl