Hi:

your question is how to make sure that you have crawled the whole tree of
a website.

(1) you can't guarantee that you crawl the whole tree in a single pass,
because most of today's websites are dynamic; even websites you think of
as static may actually be generated by some kind of content management
system.

(2) you can periodically recrawl the websites that you have already
crawled, with a cron job or something like that:

http://wiki.apache.org/nutch/IntranetRecrawl
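a minimal sketch of driving a recrawl from cron, assuming you have saved
the recrawl script from that wiki page as /usr/local/nutch/bin/recrawl.sh
(the paths and script name here are assumptions, not something nutch ships
by default):

```shell
# crontab entry: re-run the intranet recrawl every night at 02:30,
# appending output to a log so failed fetches can be inspected later
30 2 * * * /usr/local/nutch/bin/recrawl.sh /usr/local/nutch/crawl >> /var/log/nutch-recrawl.log 2>&1
```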

(3) you can configure Nutch to refetch modified web pages periodically:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

if you want to roll your own refetch schedule, you can try the
FetchSchedule interface in 1.0 trunk:
http://www.netlikon.de/docs/javadoc-nutch/trunk/org/apache/nutch/crawl/FetchSchedule.html
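as for the union-of-N-runs approach you describe below: a minimal python
sketch, assuming each run yields a set of course identifiers (the helper
name and the sample data are made up for illustration):

```python
def union_of_runs(runs):
    """Combine the course sets produced by repeated crawls of one site.

    Each run may miss pages (timeouts, throttling, transient errors),
    so the union over N runs approximates the complete course list,
    and the per-run coverage ratio shows how unstable the crawl is.
    """
    combined = set()
    for run in runs:
        combined |= set(run)
    return combined

# hypothetical results from three crawls of the same catalog
runs = [
    {"CS101", "CS102", "MATH200"},
    {"CS101", "MATH200", "ENG150"},
    {"CS102", "ENG150"},
]
all_courses = union_of_runs(runs)
# coverage of each individual run relative to the union
coverage = [len(set(r)) / len(all_courses) for r in runs]
```

when the coverage ratios stop improving as N grows, that's a (heuristic,
not guaranteed) sign the union has converged on the full list.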

hope it helps

good luck

yanky

2009/3/5 bruce <[email protected]>

> hi yanky...
>
> my issue is that i can create an app to parse the multiple levels i need
> for
> the given college site. this requires a manual inspection/process for each
> site.. but it's doable. a lot of effort for ~1000 sites.. but it's not too
> bad..
>
> however, when i parse a site, at least for my first few test sites, i'm
> realizing that i'm getting different results on different runs. i don't
> know if this is because i'm slamming the sites, or there's a network
> issue, or what...
>
> furthermore, since this is automated, i really don't have a priori
> knowledge of what the entire section of the site i'm parsing should
> return... all i can really do is run the parse N times, take the union
> of all the returned courses, and hope that gives me a list that's close
> to 100% of the classes...
>
> if the site/section were static, then i could see that one might keep
> some sort of count of the expected item list you're shooting for... but
> in my case, i'm dealing with a dynamic site that might change over
> time...
>
> in researching articles, the net, etc., i haven't run across anyone
> attempting to solve this issue... i.e., how do you build a system that
> lets you know, with 100% accuracy, that you've captured everything you
> intend to capture?
>
> thanks
>
>
>
> -----Original Message-----
> From: yanky young [mailto:[email protected]]
> Sent: Wednesday, March 04, 2009 7:47 PM
> To: [email protected]
> Subject: Re: Parsing/Crawler Questions..
>
>
> check your yahoo mailbox for the email sent from the nutch-user mailing
> list; you will find the links you need there
>
>
> 2009/3/5 Edward Chen <[email protected]>
>
> > Hi, how to remove me from this mailing list ? please give me a link.
> >
> > Good Luck.
> >
> >
> >
> >
> > ________________________________
> > From: yanky young <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, March 4, 2009 8:48:33 PM
> > Subject: Re: Parsing/Crawler Questions..
> >
> > Hi:
> >
> > you said that you are crawling college websites and using XPath to
> > extract class/course information. That's good. But how do you determine
> > whether a web page is about classes or not?
> >
> > If you simply crawl the whole web site, that is by definition a complete
> > crawl and thus a complete tree.
> > If you use some kind of automatic mechanism to classify web pages, then
> > there is a precision/recall problem that relates to the completeness of
> > the tree.
> >
> > So it depends on how you crawl the website.
> >
> > good luck
> >
> > yanky
> >
> > 2009/3/5 bruce <[email protected]>
> >
> > > Hi...
> > >
> > > Sorry that this is a bit off track. Ok, maybe way off track!
> > >
> > > But I don't have anyone to bounce this off of..
> > >
> > > I'm working on a crawling project, crawling a college website, to
> extract
> > > course/class information. I've built a quick test app in python to
> crawl
> > > the
> > > site. I crawl at the top level, and work my way down to getting the
> > > required
> > > course/class schedule. The app works. I can consistently run it and
> > extract
> > > the information. The required information is based upon an XPath
> analysis
> > > of
> > > the DOM for the given pages that I'm parsing.
> > >
> > > My issue is now that I have a "basic" app that works, I need to figure
> > out
> > > how I guarantee that I'm correctly crawling the site. How do I know
> when
> > > I've got an error at a given node/branch, so that the app knows that
> it's
> > > not going to fetch the underlying branch/nodes of the tree..
> > >
> > > When running the app, I can get 5000 classes on one run, 4700 on
> > > another, etc... So I need some method of determining when I get a
> > > "complete" tree...
> > >
> > > How do I know when I have a complete "tree"!
> > >
> > > I'm looking for someone, or some group/prof that I can talk to about
> > these
> > > issues. I've been searching google, linkedin, etc.. for someone to
> bounce
> > > thoughts with..! My goal is to eventually look at using nutch/lucene if
> > at
> > > all applicable.
> > >
> > > Any pointers, or people, or papers, etc... would be helpful.
> > >
> > > Thanks
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
