Hi, you said that you are crawling college websites and using XPath to extract class or course information. That's good. But how do you determine whether a web page is about classes or not?
If you simply crawl the whole website, that must be a complete crawl and thus a complete tree. If you use some kind of automatic mechanism to classify web pages, you run into the precision/recall problem, which bears directly on the completeness of the tree. So it depends on how you crawl the website.

Good luck,
yanky

2009/3/5 bruce <[email protected]>

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of.
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in Python to crawl
> the site. I crawl at the top level, and work my way down to getting the
> required course/class schedule. The app works. I can consistently run it
> and extract the information. The required information is based upon an
> XPath analysis of the DOM for the given pages that I'm parsing.
>
> My issue is that now that I have a "basic" app that works, I need to
> figure out how to guarantee that I'm correctly crawling the site. How do I
> know when I've got an error at a given node/branch, so that the app knows
> that it's not going to fetch the underlying branches/nodes of the tree?
>
> When running the app, I can get 5000 classes on one run, 4700 on another,
> etc. So I need some method of determining when I get a "complete" tree.
>
> How do I know when I have a complete "tree"?
>
> I'm looking for someone, or some group/prof, that I can talk to about these
> issues. I've been searching Google, LinkedIn, etc. for someone to bounce
> thoughts off.
>
> My goal is to eventually look at using Nutch/Lucene, if at all applicable.
>
> Any pointers, or people, or papers, etc. would be helpful.
>
> Thanks
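
To make the "complete tree" question a bit more concrete, here is a rough Python sketch, not a description of bruce's actual app. It assumes a hypothetical `fetch(url)` callable that returns `(child_urls, records)` and raises on any download or parse failure; `SITE` and `make_fetch` are made-up stand-ins for a real site so the example runs offline. The idea is that a run only counts as complete if no node failed to expand, and that two runs can be diffed to see which records appear in only one of them.

```python
from collections import deque

def crawl(root, fetch):
    """Breadth-first crawl that records every node it could not expand.

    `fetch(url)` is a hypothetical callable returning (child_urls, records)
    and raising an exception on any download or parse failure.
    """
    seen = {root}
    failed = {}        # url -> error text; each entry hides a whole subtree
    records = set()
    queue = deque([root])
    while queue:
        url = queue.popleft()
        try:
            children, found = fetch(url)
        except Exception as exc:
            failed[url] = repr(exc)
            continue
        records.update(found)
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return records, failed

def compare_runs(run_a, run_b):
    """Return the records that show up in only one of two runs."""
    return run_a - run_b, run_b - run_a

# Made-up site used only for illustration (assumption, not real data).
SITE = {
    "/": (["/math", "/cs"], []),
    "/math": ([], ["MATH101", "MATH202"]),
    "/cs": ([], ["CS101"]),
}

def make_fetch(broken=()):
    """Build a fake fetch that fails on the URLs listed in `broken`."""
    def fetch(url):
        if url in broken:
            raise IOError("simulated timeout")
        return SITE[url]
    return fetch
```

Calling `crawl("/", make_fetch(broken={"/cs"}))` returns a non-empty `failed` dict, which signals that the tree under `/cs` was never visited. Retrying just those URLs, or diffing against an earlier run with `compare_runs`, would be one way to explain a 5000 versus 4700 gap. It does not address the separate precision/recall question of pages that were fetched but misclassified.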
