Hello, I have a database (RDBMS) with URLs that I need to periodically fetch in order to determine things like page language, character set, HTTP status code, and size, and eventually to index the content (although not in one big index, but in a number of small ones). I am not interested in using Nutch to build one big index of fetched pages.
I am wondering if I could make use of Nutch for this, or at least some of Nutch's functionality. I believe I could dump URLs from my RDBMS and create a WebDB using WebDBInjector (bin/nutch inject ...). Next, I believe I could generate a fetch list containing all URLs in my WebDB and have the fetcher download them all. Is the above correct?

What follows is less clear to me, especially the plugins. Where and when do downloaded pages get processed by the plugins, and where do the plugins write their output?

I have a number of indices in my application (think lots of users, each with their own Lucene index -- see http://www.simpy.com/ ), so I need to do something like this:

1. for each user in my RDBMS
2.   get all of the user's URLs from my RDBMS
3.   for each URL, get its language, size, etc. from Nutch (WebDB? Fetcher output? plugin output?)
4.   add this metadata plus the text of the fetched page to the user's index
5.   update some RDBMS columns
6. end
7. end

Step 3 is the one that is unclear. Where do I get all of the data I need (page size, HTTP status code, language, and the text of the page)?

If I missed a relevant Wiki page, please point me to it.

Thanks,
Otis

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
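For reference, the inject / generate / fetch sequence described above maps onto the Nutch command-line tools roughly as follows. This is a sketch based on the Nutch 0.x tutorial-era tooling; command names and flags vary between releases, and the `urls.txt` filename and `db`/`segments` directory names here are only placeholders, so check the usage output of `bin/nutch` for your version.

```shell
# Create an empty WebDB (Nutch 0.x; flags may differ in your release).
bin/nutch admin db -create

# Inject the URLs dumped from the RDBMS (one URL per line in urls.txt).
bin/nutch inject db -urlfile urls.txt

# Generate a fetch list (a new segment) from the URLs in the WebDB.
bin/nutch generate db segments

# Fetch the newest segment; during the fetch, the protocol and parse
# plugins run and write their output into the segment directory.
s=`ls -d segments/* | tail -1`
bin/nutch fetch $s

# Fold the fetch results (status, outlinks) back into the WebDB.
bin/nutch updatedb db $s
```

After this, the per-URL data the loop above asks about (HTTP status, size, parsed text, and any parse-plugin output) lives in the segment directory, while the WebDB holds the link/page metadata.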
