About two months ago John Kleven posted asking about using Nutch just to crawl.
I have essentially the same question. One possible development tack I can
take with my project is to use Nutch for crawling, then use Xapian for
tokenization, indexing, etc. Over time we will need to spider a lot of sites,
so I'm disinclined to use wget.
Does Nutch have an out-of-the-box capability to spider sites and write the output
to HTML files? If not, can someone give me a quick summary of how I would
properly modify or subclass the Nutch code?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general