I have never looked at how Nutch works, nor have I used it. My questions
might just be RTFM-related.

Lately people have asked me to help them out with simple domain-specific
web-indexing services. The requirements are, as usual when I'm involved,
to run on very limited resources. What I did was combine my very
simple and minimalistic servlet engine <http://sf.net/project/servlet>
with Lucene and NekoHTML, extracting only the content "frame" from
the static design of the site.

This made me think of two things:

It would be nice to use the features of Nutch instead of my own hacky
stuff. How tightly bound is Nutch to the J2EE container? Would it be a
big job to make it run with an alternative GUI? Or is the container used
for more than the GUI? I.e. do all services (crawler, etc.) run within
the container? Do they have to?

It would be nice to automatically detect the content "frame" by
analyzing the DOM tree of the pages on a site. Is there such a feature
in Nutch, as a contribution to it, or publicly available in some other
project?
I'd be more than happy to discuss, write and contribute it back if I end
up making one.
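
To make the idea concrete, here is a minimal sketch of one way such
detection could work. It is not a Nutch feature and not the code
described above: it assumes two pages from the same site have already
been cleaned into well-formed XHTML (for example by NekoHTML), and it
uses only the JDK's DOM API. The class and method names
(ContentFrameSketch, findContentFrame) are hypothetical. The approach is
to walk both trees in parallel and descend for as long as exactly one
child subtree differs; the element where that stops encloses everything
that changes between the pages.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class ContentFrameSketch {

    /** Parses a well-formed XHTML string and returns its root element. */
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    /**
     * Returns the deepest element of page a that encloses every difference
     * between a and b, or null if the two trees are identical.
     */
    static Element findContentFrame(Element a, Element b) {
        if (a.isEqualNode(b)) return null;
        // Different tag name or attributes: the difference is at this level.
        if (!a.cloneNode(false).isEqualNode(b.cloneNode(false))) return a;

        NodeList ca = a.getChildNodes();
        NodeList cb = b.getChildNodes();
        if (ca.getLength() != cb.getLength()) return a;

        Node diffA = null;
        Node diffB = null;
        int differing = 0;
        for (int i = 0; i < ca.getLength(); i++) {
            if (!ca.item(i).isEqualNode(cb.item(i))) {
                differing++;
                diffA = ca.item(i);
                diffB = cb.item(i);
            }
        }
        // Descend only while a single element child carries all the changes.
        if (differing == 1 && diffA instanceof Element && diffB instanceof Element) {
            return findContentFrame((Element) diffA, (Element) diffB);
        }
        return a;
    }

    public static void main(String[] args) {
        String head = "<html><body><div id=\"nav\">Home</div><div id=\"main\">";
        String tail = "</div><div id=\"foot\">(c)</div></body></html>";
        Element frame = findContentFrame(
                parse(head + "<p>First article</p>" + tail),
                parse(head + "<p>Second article</p><p>More</p>" + tail));
        System.out.println(frame.getTagName() + " id=" + frame.getAttribute("id"));
    }
}
```

Two pages are rarely enough in practice: if they differ only in one
paragraph, the sketch returns that paragraph rather than the whole
content area. Comparing several page pairs and keeping the shallowest
result would be one way to make it more robust.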

-- 
karl
