Hi yanky... My issue is that I can create an app to parse the multiple levels I need for a given college site. This requires a manual inspection process for each site, but it's doable. A lot of effort for ~1000 sites, but it's not too bad.
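As a side note, that manual per-site inspection step can at least be captured as per-site data driving one generic parser, rather than a separate app per site. A minimal sketch using only the standard library; the site name, URL, and XPath strings below are invented placeholders, not taken from any real college site:

```python
# Per-site config: one generic extractor, driven by data.
# All names/URLs/XPaths here are hypothetical examples.
import xml.etree.ElementTree as ET

SITE_CONFIGS = {
    "example-college": {
        "start_url": "http://example.edu/schedule",   # hypothetical
        "course_xpath": ".//td[@class='course']",     # hypothetical
    },
}

def extract_courses(page_xhtml, xpath):
    """Pull course names out of one (well-formed) schedule page."""
    root = ET.fromstring(page_xhtml)
    return [el.text.strip() for el in root.findall(xpath)]

# Tiny stand-in for a fetched schedule page:
PAGE = """<table>
  <tr><td class="course">CS 101</td><td>MWF 9:00</td></tr>
  <tr><td class="course">MATH 210</td><td>TTh 10:30</td></tr>
</table>"""

courses = extract_courses(
    PAGE, SITE_CONFIGS["example-college"]["course_xpath"])
print(courses)  # → ['CS 101', 'MATH 210']
```

Note that `xml.etree` only supports a limited XPath subset and requires well-formed markup; for real-world HTML, a forgiving parser (e.g. lxml with full XPath) would be the practical choice, but the config-driven shape stays the same.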
However, when I parse a site, at least for my first few test sites, I'm getting different results on different runs. I don't know if this is because I'm slamming the sites, or there's a network issue, or what. Furthermore, since this is automated, I really don't have a priori knowledge of what the section of the site I'm parsing should return. All I can really do is run the parse N times, take the union of all the returned courses, and hope that gives me a close-to-100% list of classes. If the site/section were static, I could see creating some sort of count of the expected item list you're shooting for, but in my case I'm dealing with a dynamic site that might change over time.

In researching articles, the net, etc., I haven't run across anyone attempting to solve this issue. I.e., how do you build a system that lets you know, with 100% accuracy, that you've captured everything you intend to capture?

Thanks

-----Original Message-----
From: yanky young [mailto:[email protected]]
Sent: Wednesday, March 04, 2009 7:47 PM
To: [email protected]
Subject: Re: Parsing/Crawler Questions..

Check your Yahoo email box and find the email sent from the nutch-user mailing list; you will find any links you want there.

2009/3/5 Edward Chen <[email protected]>

> Hi, how do I remove myself from this mailing list? Please give me a link.
>
> Good Luck.
>
> ________________________________
> From: yanky young <[email protected]>
> To: [email protected]
> Sent: Wednesday, March 4, 2009 8:48:33 PM
> Subject: Re: Parsing/Crawler Questions..
>
> Hi:
>
> You said that you are crawling college websites and using XPath to extract
> class/course information. That's good. But how do you determine whether a
> web page is about classes or not?
>
> If you just crawled the whole web site, that would be a complete crawl and
> thus a complete tree.
> If you use some kind of automatic mechanism to classify web pages, there is
> the precision/recall problem, which relates to the completeness of the
> tree.
>
> So it depends on how you crawl the website.
>
> Good luck
>
> yanky
>
> 2009/3/5 bruce <[email protected]>
>
> > Hi...
> >
> > Sorry that this is a bit off track. Ok, maybe way off track!
> >
> > But I don't have anyone to bounce this off of..
> >
> > I'm working on a crawling project, crawling a college website to extract
> > course/class information. I've built a quick test app in Python to crawl
> > the site. I crawl at the top level and work my way down to the required
> > course/class schedule. The app works. I can consistently run it and
> > extract the information. The required information is based on an XPath
> > analysis of the DOM for the given pages that I'm parsing.
> >
> > My issue is that now that I have a "basic" app that works, I need to
> > figure out how I guarantee that I'm correctly crawling the site. How do I
> > know when I've got an error at a given node/branch, so that the app knows
> > it's not going to fetch the underlying branches/nodes of the tree?
> >
> > When running the app, I can get 5000 classes on one run, 4700 on another,
> > etc. So I need some method of determining when I have a "complete" tree.
> >
> > How do I know when I have a complete "tree"?
> >
> > I'm looking for someone, or some group/prof, that I can talk to about
> > these issues. I've been searching Google, LinkedIn, etc. for someone to
> > bounce thoughts with! My goal is to eventually look at using Nutch/Lucene
> > if at all applicable.
> >
> > Any pointers, people, papers, etc. would be helpful.
> >
> > Thanks
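Coming back to the question at the top of the thread: the "run N times and take the union" idea can at least be given a concrete stopping rule, namely stop once several consecutive runs add nothing new. This is a convergence heuristic, not a guarantee of 100% coverage (an always-missing branch will never show up, however many runs you do). A sketch, where `flaky_run` is a simulated stand-in for one full crawl of a site:

```python
import random

def crawl_until_stable(run_once, max_runs=20, stable_runs=3):
    """Union the results of repeated runs; stop once `stable_runs`
    consecutive runs add no new items. A convergence heuristic,
    not a proof of completeness."""
    seen = set()
    unchanged = 0
    for _ in range(max_runs):
        new = set(run_once()) - seen
        if new:
            seen |= new
            unchanged = 0
        else:
            unchanged += 1
            if unchanged >= stable_runs:
                break
    return seen

# Simulated flaky crawl: each run independently drops ~10% of the
# true class list, mimicking the 5000-vs-4700 variation described.
TRUE_CLASSES = {f"CS{n}" for n in range(100, 150)}
rng = random.Random(42)

def flaky_run():
    return {c for c in TRUE_CLASSES if rng.random() > 0.1}

classes = crawl_until_stable(flaky_run)
print(len(classes), "classes recovered out of", len(TRUE_CLASSES))
```

If the per-run miss probability for an item is p, the chance it is still missing after k unioned runs is p^k, so the union converges quickly when failures are random (network flakiness) but never converges on systematic failures (a branch that always errors out), which is exactly the precision/recall caveat yanky raises above.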
