hi yanky...

my issue is that I can create an app to parse the multiple levels I need for
a given college site. this requires a manual inspection/process for each
site.. but it's doable. a lot of effort for ~1000 sites.. but it's not too
bad..

however, when I parse a site, at least for my first few test sites, I'm
getting different results on different runs. now I don't know if this is
because I'm slamming the sites, or there's a network issue, or something
else...
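one thing worth trying first is ruling out the slamming/network hypothesis: wrap the fetch in a retry with exponential backoff, so a transient failure doesn't silently prune a whole branch. a minimal sketch — the actual `fetch` function is hypothetical here, plug in whatever does your HTTP GET (e.g. a urllib wrapper that raises on failure):

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Try fetch(url) up to `retries` times, backing off exponentially.

    `fetch` is a placeholder for whatever does the real HTTP GET; it
    should raise on failure and return the page body on success.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # give up: caller can mark this branch as incomplete
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

the key point is the final `raise`: instead of quietly returning a partial tree, a branch that still fails after N attempts gets flagged, so you know that run was incomplete rather than guessing later.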

furthermore, since this is automated.. I really don't have a priori knowledge
of what the section of the site I'm parsing should return... all I can
really do is run the parse N times.. and take the union of all the
returned courses... and hope that that gives me a close-to-100% list
of classes...

if the site/section were static.. then I could see that one might create some
sort of count of the expected item list that you're shooting for... but in my
case, I'm dealing with a dynamic site that might change over time...

in researching articles, the net, etc.. I haven't run across anyone
attempting to solve this issue... i.e., how do you build a system that lets
you know with 100% accuracy that you've captured everything you intend to
capture?
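one statistical angle that comes close (my suggestion, not something from the thread): treat two independent runs as a capture-recapture experiment. the Lincoln-Petersen estimator uses the overlap between the runs to estimate the true total, which tells you roughly how far your union is from 100%:

```python
def estimate_total(run_a, run_b):
    """Lincoln-Petersen capture-recapture estimate of the true class count.

    Two independent crawl runs act as two 'captures': if run_a found n1
    classes, run_b found n2, and m classes appear in both, the true
    total is roughly n1 * n2 / m.
    """
    a, b = set(run_a), set(run_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("runs share no classes; estimate undefined")
    return len(a) * len(b) / overlap
```

the estimate assumes the two runs miss classes independently, which a systematic parser bug would violate — but it at least turns "hope" into a number you can compare your union against.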

thanks



-----Original Message-----
From: yanky young [mailto:[email protected]]
Sent: Wednesday, March 04, 2009 7:47 PM
To: [email protected]
Subject: Re: Parsing/Crawler Questions..


check your Yahoo email box for the email sent from the nutch-user
mailing list, and you will find the links you want there


2009/3/5 Edward Chen <[email protected]>

> Hi, how do I remove myself from this mailing list? Please give me a link.
>
> Good Luck.
>
>
>
>
> ________________________________
> From: yanky young <[email protected]>
> To: [email protected]
> Sent: Wednesday, March 4, 2009 8:48:33 PM
> Subject: Re: Parsing/Crawler Questions..
>
> Hi:
>
> you said that you are crawling college websites and using XPath to extract
> class/course information. That's good. But how do you determine whether a
> web page is about classes or not?
>
> If you just crawled the whole web site, that would be a complete crawl and
> thus a complete tree.
> If you use some kind of automatic mechanism to classify web pages, there is
> a precision/recall problem that relates to the completeness of the tree.
>
> So it depends on how you crawl the website.
>
> good luck
>
> yanky
>
> 2009/3/5 bruce <[email protected]>
>
> > Hi...
> >
> > Sorry that this is a bit off track. Ok, maybe way off track!
> >
> > But I don't have anyone to bounce this off of..
> >
> > I'm working on a crawling project, crawling a college website, to extract
> > course/class information. I've built a quick test app in Python to crawl
> > the site. I crawl at the top level, and work my way down to getting the
> > required course/class schedule. The app works. I can consistently run it
> > and extract the information. The required information is based upon an
> > XPath analysis of the DOM for the given pages that I'm parsing.
> >
> > My issue is that now that I have a "basic" app that works, I need to
> > figure out how I guarantee that I'm correctly crawling the site. How do I
> > know when I've got an error at a given node/branch, so that the app knows
> > it's not going to fetch the underlying branches/nodes of the tree?
> >
> > When running the app, I can get 5000 classes on one run, 4700 on another,
> > etc... So I need some method of determining when I get a "complete"
> > tree...
> >
> > How do I know when I have a complete "tree"!
> >
> > I'm looking for someone, or some group/prof, that I can talk to about
> > these issues. I've been searching Google, LinkedIn, etc. for someone to
> > bounce thoughts with! My goal is to eventually look at using Nutch/Lucene
> > if at all applicable.
> >
> > Any pointers, or people, or papers, etc... would be helpful.
> >
> > Thanks
> >
