Re: Parsing/Crawler Questions..

yanky young Wed, 04 Mar 2009 17:49:02 -0800

Hi:

You said that you are crawling college websites and using XPath to extract
class or course information. That's good. But how do you determine whether a
web page is about classes or not?


If you simply crawl the whole web site, then that should be a complete crawl
and thus a complete tree.
If you use some kind of automatic mechanism to classify web pages, then you
have the precision/recall problem, which directly affects the completeness of
the tree.
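
To make the precision/recall point concrete, here is a rough sketch in Python.
It assumes you have hand-labelled a small sample of pages; the function name
and the example URLs are made up for illustration and are not from your app.

```python
def precision_recall(predicted, actual):
    """Compare classifier output against a hand-labelled sample.

    predicted: set of URLs the classifier flagged as course pages
    actual:    set of URLs a human labelled as course pages
    """
    true_positives = len(predicted & actual)
    # Precision: how many flagged pages really are course pages.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many real course pages were flagged at all.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical sample data, for illustration only.
predicted = {"/courses/cs101", "/courses/cs102", "/news/2009"}
actual = {"/courses/cs101", "/courses/cs102", "/courses/cs103", "/courses/cs104"}

p, r = precision_recall(predicted, actual)
print("precision=%.2f recall=%.2f" % (p, r))
```

Low recall means course pages are being skipped, so the tree is incomplete no
matter how reliably the crawler itself runs.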

So it depends on how you crawl the website.
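
If you do crawl the whole site, one possible way to tell whether a run was
complete is to record every node whose fetch failed, and to compare the set of
pages seen across runs. The sketch below is only an illustration under my own
assumptions: `crawl`, `fetch_links`, and the tiny fake site are hypothetical
stand-ins for your real fetching and XPath code.

```python
from collections import deque


def crawl(root, fetch_links):
    """Breadth-first crawl that remembers which nodes failed.

    fetch_links(url) is assumed to return a list of child URLs,
    or to raise an exception when the page cannot be fetched/parsed.
    """
    seen = {root}
    failed = {}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        try:
            children = fetch_links(url)
        except Exception as exc:  # broad on purpose: any failure loses the subtree
            failed[url] = repr(exc)
            continue
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen, failed


# Hypothetical in-memory "site" so the sketch runs without a network.
site = {
    "/": ["/dept/cs", "/dept/math"],
    "/dept/cs": ["/cs101"],
    "/dept/math": ["/m101"],
    "/cs101": [],
    "/m101": [],
}


def good_fetch(url):
    return site[url]


def flaky_fetch(url):
    if url == "/dept/math":
        raise IOError("timeout")  # simulate one failed branch
    return site[url]


full_seen, _ = crawl("/", good_fetch)
partial_seen, failed_nodes = crawl("/", flaky_fetch)

# Pages that appear in one run but not the other.
missing = full_seen ^ partial_seen
print(sorted(failed_nodes), sorted(missing))
```

A run with an empty `failed` dict is at least a candidate for a complete tree,
while a non-empty one tells you exactly which branches to retry.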

Good luck.

yanky

2009/3/5 bruce <[email protected]>

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl
> the site. I crawl at the top level, and work my way down to getting the
> required course/class schedule. The app works. I can consistently run it
> and extract the information. The required information is based upon an
> XPath analysis of the DOM for the given pages that I'm parsing.
>
> My issue is that, now that I have a "basic" app that works, I need to
> figure out how to guarantee that I'm correctly crawling the site. How do I
> know when I've got an error at a given node/branch, so that the app knows
> that it's not going to fetch the underlying branches/nodes of the tree?
>
> When running the app, I can get 5000 classes on one run, 4700 on another,
> etc... So I need some method of determining when I get a "complete" tree...
>
> How do I know when I have a complete "tree"?
>
> I'm looking for someone, or some group/prof that I can talk to about these
> issues. I've been searching Google, LinkedIn, etc. for someone to bounce
> thoughts off! My goal is to eventually look at using Nutch/Lucene if at
> all applicable.
>
> Any pointers, or people, or papers, etc... would be helpful.
>
> Thanks
