Hello everyone, I'm not going to post my config files, so as not to spam you all, but I have a general question: I'm trying to index the pages of a website (obviously), and I've created a special page with links to all the pages I want to index. I then pointed Nutch at this special link page. I set max_outlinks appropriately, and I do see all the page URLs I expect go by in the log during the fetching stage.
When Nutch gets to indexing, however, not all the documents appear in the log--it looks as if not all of the fetched pages are being indexed. Searching for terms I know are on the missing pages also turns up nothing, so they really aren't in the index. Can anybody tell me what factors affect the indexing stage? I want Nutch to index *all* the documents it fetches. How can I do this? Any tips/ideas/things to configure?

Thanks in advance,
-Jeff
