Hi. I have a possible project where I'm looking at extracting information from various public/college websites. I don't need to index the text/content of the sites, but I do need to extract specific pieces of information.
As an example, a site might have a course schedule page, which links to a departments page, which in turn links to the class information pages. As a tree structure this would be:

course listings by semester
  departments for the semester
    classes of each department
      class information

Obviously Nutch/Lucene has the ability to crawl a given site. Does it also have the ability to somehow 'link'/maintain a relationship to the upstream page for a given piece of information? For my needs, I need to keep track of the semester I started from as I follow the link to the department, etc. This approach allows me to store the complete course information in a db, so I can then iterate through it.

I can accomplish this now by creating a unique crawling app for each school. My curiosity is whether Nutch/Lucene can provide a basic crawling engine that I could plug into for my specific needs. I'm also curious about the amount of additional development that would be required for my needs...

thanks

-bruce
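
P.S. To make the requirement concrete, here is a rough Python sketch of the behaviour I'm after. It is not Nutch code and makes no claim about how Nutch works internally. The `SITE` table, `fetch`, and `crawl` are made-up names, and the in-memory table stands in for real HTTP fetching and HTML parsing. The only point is that every queued link carries the context (semester, department) collected from the pages above it.

```python
from collections import deque

# Hypothetical stand-in for a school site: URL -> (page kind, label, child links).
# A real crawler would fetch and parse HTML here instead.
SITE = {
    "/schedule": ("root", None, ["/fall", "/spring"]),
    "/fall": ("semester", "Fall", ["/fall/math"]),
    "/spring": ("semester", "Spring", ["/spring/math"]),
    "/fall/math": ("department", "Math", ["/fall/math/101"]),
    "/spring/math": ("department", "Math", ["/spring/math/102"]),
    "/fall/math/101": ("class", "MATH 101", []),
    "/spring/math/102": ("class", "MATH 102", []),
}


def fetch(url):
    """Return (kind, label, links) for a URL from the fake site."""
    return SITE[url]


def crawl(start_url):
    """Breadth-first crawl that passes upstream context down to each link."""
    records = []
    seen = set()
    # Each queue entry pairs a URL with the context inherited from its parent.
    queue = deque([(start_url, {})])
    while queue:
        url, context = queue.popleft()
        if url in seen:
            # Note: a page reachable from two parents keeps only the first context.
            continue
        seen.add(url)
        kind, label, links = fetch(url)
        if kind != "root":
            # Copy rather than mutate, so sibling branches do not share state.
            context = {**context, kind: label}
        if kind == "class":
            records.append(context)
        for link in links:
            queue.append((link, context))
    return records


if __name__ == "__main__":
    for row in crawl("/schedule"):
        print(row)
```

Each record that comes out holds the semester, department, and class together, which is the row I would want to write to the db.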