Check your Yahoo inbox for the email sent from the nutch-user mailing
list; you will find the links you need there.


2009/3/5 Edward Chen <[email protected]>

> Hi, how can I be removed from this mailing list? Please give me a link.
>
> Good Luck.
>
>
>
>
> ________________________________
> From: yanky young <[email protected]>
> To: [email protected]
> Sent: Wednesday, March 4, 2009 8:48:33 PM
> Subject: Re: Parsing/Crawler Questions..
>
> Hi:
>
> You said that you are crawling college websites and using XPath to extract
> class or course information. That's good. But how do you determine whether a
> web page is about classes or not?
>
> If you just crawl the whole website, that must be a complete crawl and
> thus a complete tree.
> If you use some kind of automatic mechanism to classify web pages, there is
> the precision/recall problem, which affects the completeness of
> the tree.
>
> So it depends on how you crawl the website.
>
> good luck
>
> yanky
>
> 2009/3/5 bruce <[email protected]>
>
> > Hi...
> >
> > Sorry that this is a bit off track. Ok, maybe way off track!
> >
> > But I don't have anyone to bounce this off of..
> >
> > I'm working on a crawling project, crawling a college website, to extract
> > course/class information. I've built a quick test app in python to crawl
> > the site. I crawl at the top level, and work my way down to getting the
> > required course/class schedule. The app works. I can consistently run it
> > and extract the information. The required information is based upon an
> > XPath analysis of the DOM for the given pages that I'm parsing.
> >
> > My issue is now that I have a "basic" app that works, I need to figure out
> > how I guarantee that I'm correctly crawling the site. How do I know when
> > I've got an error at a given node/branch, so that the app knows that it's
> > not going to fetch the underlying branch/nodes of the tree..
> >
> > When running the app, I can get 5000 classes on one run, 4700 on another,
> > etc... So I need some method of determining when I get a "complete" tree...
> >
> > How do I know when I have a complete "tree"!
> >
> > I'm looking for someone, or some group/prof that I can talk to about these
> > issues. I've been searching google, linkedin, etc.. for someone to bounce
> > thoughts with..! My goal is to eventually look at using nutch/lucene if at
> > all applicable.
> >
> > Any pointers, or people, or papers, etc... would be helpful.
> >
> > Thanks
> >
> >
> >
> >
>
>
>
>
>
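
On the "complete tree" question quoted above, here is a rough sketch of one
possible approach, not a description of the original app: record every node
whose fetch fails, so that a run counts as complete only when nothing failed,
and compare runs to see which branch went missing. All names here (crawl,
fetch_children, SITE, make_fetcher) are made up for illustration, and the
in-memory SITE dictionary stands in for real pages; a real crawler would plug
in its own fetch and XPath parsing code.

```python
from collections import deque


def crawl(root, fetch_children, max_retries=2):
    """Breadth-first crawl that records every node whose fetch failed.

    fetch_children(url) is a hypothetical callable: it returns the list of
    child URLs of a page, or raises if the page could not be fetched/parsed.
    """
    visited = set()          # nodes fetched successfully
    failed = {}              # url -> last error message
    seen = {root}            # nodes already queued, to avoid duplicates
    queue = deque([root])
    while queue:
        url = queue.popleft()
        for _attempt in range(max_retries + 1):
            try:
                children = fetch_children(url)
            except Exception as exc:
                failed[url] = str(exc)       # remember, then retry
            else:
                failed.pop(url, None)        # a retry succeeded
                visited.add(url)
                for child in children:
                    if child not in seen:
                        seen.add(child)
                        queue.append(child)
                break
    return visited, failed


# Hypothetical stand-in for a college site: page -> child pages.
SITE = {
    "/": ["/math", "/physics"],
    "/math": ["/math/101", "/math/201"],
    "/physics": ["/physics/101"],
    "/math/101": [],
    "/math/201": [],
    "/physics/101": [],
}


def make_fetcher(broken=()):
    """Build a fake fetcher that always fails on the URLs in `broken`."""
    def fetch_children(url):
        if url in broken:
            raise IOError("timeout fetching " + url)
        return SITE[url]
    return fetch_children


visited_ok, failed_ok = crawl("/", make_fetcher())
visited_bad, failed_bad = crawl("/", make_fetcher(broken={"/math"}))

# A run is "complete" only if no node failed.
print("run 1 complete:", not failed_ok, "nodes:", len(visited_ok))
print("run 2 complete:", not failed_bad, "failed:", sorted(failed_bad))
# Comparing runs shows what the failed branch cost.
print("missing in run 2:", sorted(visited_ok - visited_bad))
```

With this kind of bookkeeping, a run that returns fewer classes can be
explained by its list of failed nodes instead of being noticed only from the
totals. It does not prove that the site itself exposes every class, only that
every link the crawler discovered was followed successfully.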
