Conal Tuohy wrote:

I'm creating a Lucene index using an XSP based on the sample, but I have a strange problem.

Some of the pages are crawled, but some are not crawled, and I can't see why.

I have DEBUG logging for the core.search components, so I can see the crawler crawling the site. I can see it read the links for each page, and I can see that it doesn't exclude any of the links. Yet it doesn't actually follow those links - the crawl simply comes to an end at some point, with some of the links uncrawled.


Have you enabled the "link view" for all the pages you want to crawl?


HTH

Michael


It seems to me that for every log entry from SimpleCocoonCrawlerImpl that says "Add URL: http://blah..." I should also have an entry from SimpleLuceneXMLIndexerImpl that says "Indexing http://blah..."


The home page is crawled, and all of the pages off that page, and SOME of the pages off those pages, and SOME of the pages off THOSE pages. I can't see why some pages are crawled and others are not. Perhaps the crawler simply stops at some point before it has finished its list of URLs. But why would it stop crawling without logging any error? BTW, the last entry in the log is always SimpleLuceneXMLIndexerImpl reporting that it has indexed a page, e.g.:

DEBUG (2003-06-09) 17:32.05:388 [core.search.lucene] (/search/reindex.xml) HttpProcessor[80][4]/SimpleLuceneXMLIndexerImpl: Indexing http://localhost:80/etexts/JCB-016/full.html?cocoon-view=content (text/xml)
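
The "Add URL" versus "Indexing" comparison described above could be automated with a small log scan. What follows is only a sketch under assumptions: the two message shapes are taken from the entries quoted in this message, the function names are hypothetical, and it assumes the crawler and indexer log the same URL apart from the cocoon-view parameter, which may not hold for every setup.

```python
import re

# Assumed message shapes, based only on the log entries quoted in this thread:
#   ...SimpleCocoonCrawlerImpl: Add URL: <url>
#   ...SimpleLuceneXMLIndexerImpl: Indexing <url> (<mime type>)
ADDED = re.compile(r"Add URL: (\S+)")
INDEXED = re.compile(r"Indexing (\S+)")


def normalize(url):
    """Drop the cocoon-view parameter so crawler and indexer URLs compare equal."""
    url = re.sub(r"([?&])cocoon-view=[^&]*&?", r"\1", url)
    return url.rstrip("?&")


def unindexed_urls(log_lines):
    """Return URLs that were added by the crawler but never reported as indexed."""
    added = []       # keeps the order in which the crawler added URLs
    indexed = set()
    for line in log_lines:
        match = ADDED.search(line)
        if match:
            added.append(normalize(match.group(1)))
        match = INDEXED.search(line)
        if match:
            indexed.add(normalize(match.group(1)))

    missing = []
    for url in added:
        if url not in indexed and url not in missing:
            missing.append(url)
    return missing
```

Passing an open log file (or any iterable of lines) to unindexed_urls() gives the list of URLs that have an "Add URL" entry but no "Indexing" entry, in the order the crawler added them. That list shows whether the missing pages share a pattern or whether the crawl just stops partway through.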

Does anyone have any ideas where I could start looking?

I'm using version RELEASE_2_1_M_2.

Thanks

Con

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]









