Hey all,

I'm trying to figure out how to add functionality to Nutch that would let users specify regex-based substitutions (search and replace) for URLs. This is more useful in an intranet crawl than in a large-scale internet crawl. The main use would probably be stripping session IDs from URLs. They are unnecessary and can result in many "duplicate" URLs (URLs that are identical except for the session ID).
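To make the idea concrete, here is a rough sketch of the kind of substitution I have in mind. Nothing here is existing Nutch code: the class name UrlRegexRewriter, its methods, and the three example rules (jsessionid, PHPSESSID, sid) are all made up for illustration, and a real rule set would come from user configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies user-supplied regex
// substitutions to a URL, in the order the rules were added.
class UrlRegexRewriter {
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlRegexRewriter addRule(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    // Runs every rule over the URL; each rule sees the previous rule's output.
    String rewrite(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    // Example rule set (an assumption, not a complete list of session ID styles).
    static UrlRegexRewriter sessionIdExample() {
        return new UrlRegexRewriter()
            // Path-style session ID, e.g. page.jsp;jsessionid=ABC123
            .addRule("(?i);jsessionid=[^?#]*", "")
            // Query-style session ID; keep the leading '?' or '&' so the
            // remaining parameters stay attached.
            .addRule("(?i)([?&])(?:PHPSESSID|sid)=[^&#]*&?", "$1")
            // Drop a '?' or '&' left dangling at the end of the URL.
            .addRule("[?&]$", "");
    }
}
```

With these rules, UrlRegexRewriter.sessionIdExample().rewrite("http://example.com/index.php?x=1&PHPSESSID=abc") gives back "http://example.com/index.php?x=1", so both variants of the page collapse to a single URL.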

Where would be the best place to put such functionality?
I've thought of a few places, but I wonder about the scalability of some of them.
* create it as some sort of tool that analyzes the WebDB before generating the fetch list
* do the replace as we discover URLs
* do the replace just before the URLs are actually fetched


Also, I'm curious about automatic detection of things like session IDs. Does anyone have good ideas about that? My only idea is to do extra fetches for each page (removing different parameters as we go) and compare the resulting pages' content with one another. Again, this extra work probably won't scale beyond an intranet crawl. I mention it now because if we want to allow for automatic detection, it might affect where the URL regex functionality should go.
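A minimal sketch of that detection idea, under several assumptions of mine: it removes one query parameter at a time, treats byte-identical content as "same page", and ignores URL fragments. The class RedundantParamDetector and the fetcher argument are hypothetical; the fetcher stands in for whatever actually downloads a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not Nutch code: finds query parameters that can be
// removed without changing the fetched content.
class RedundantParamDetector {

    // fetcher maps a URL to its page content (assumed to be supplied by the caller).
    // Costs one extra fetch per query parameter.
    static List<String> detect(String url, Function<String, String> fetcher) {
        List<String> redundant = new ArrayList<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return redundant; // no query string, nothing to test
        }
        String base = url.substring(0, q);
        String[] params = url.substring(q + 1).split("&");
        String original = fetcher.apply(url);

        for (int i = 0; i < params.length; i++) {
            // Rebuild the query string without parameter i.
            StringBuilder query = new StringBuilder();
            for (int j = 0; j < params.length; j++) {
                if (j == i) continue;
                if (query.length() > 0) query.append('&');
                query.append(params[j]);
            }
            String candidate = query.length() == 0 ? base : base + "?" + query;

            // Exact equality is a simplification; real pages often differ in
            // timestamps or embedded session links and would need a fuzzier comparison.
            String content = fetcher.apply(candidate);
            if (original != null && original.equals(content)) {
                int eq = params[i].indexOf('=');
                redundant.add(eq < 0 ? params[i] : params[i].substring(0, eq));
            }
        }
        return redundant;
    }
}
```

Parameter names flagged this way could then be turned into substitution rules, which is why the placement of the regex step and the detection step are related.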

Thanks for the pointers,

Luke Baker


Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
