URI consistency is not under our control. Perhaps we should attempt to identify 
these pages first.
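
A minimal sketch of one way such pages might be identified, offered as an illustration rather than anything Nutch provides. It assumes we can export (url, modified) pairs for one crawl cycle. The names url_template and find_shifting_groups, and both thresholds, are hypothetical. The idea is to collapse digits in URLs into a placeholder, then flag URL groups in which nearly every page was reported modified in the same cycle, which is the pattern described for news/archive/*.

```python
import re
from collections import defaultdict


def url_template(url):
    """Collapse every run of digits into a placeholder, so that
    news/archive/12345 and news/archive/12346 share one template."""
    return re.sub(r"\d+", "{n}", url)


def find_shifting_groups(fetches, min_group_size=3, min_modified_ratio=0.9):
    """Return URL templates whose pages were (almost) all modified together.

    fetches: iterable of (url, modified) pairs from a single crawl cycle,
             where modified is True if the page content changed.
    Both thresholds are illustrative guesses, not tuned values.
    """
    groups = defaultdict(list)
    for url, modified in fetches:
        groups[url_template(url)].append(bool(modified))

    flagged = []
    for template, flags in groups.items():
        if len(flags) < min_group_size:
            continue  # too few pages to say anything about the group
        if sum(flags) / len(flags) >= min_modified_ratio:
            flagged.append(template)
    return sorted(flagged)


if __name__ == "__main__":
    cycle = [
        ("http://example.com/news/archive/1", True),
        ("http://example.com/news/archive/2", True),
        ("http://example.com/news/archive/3", True),
        ("http://example.com/item/10", True),
        ("http://example.com/item/11", False),
        ("http://example.com/item/12", False),
    ]
    print(find_shifting_groups(cycle))
```

Groups flagged this way are only candidates: a site-wide template change would trip the same check, so the result would need to hold over several cycles before treating the group differently in the scheduler.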

Thanks

 
 
-----Original message-----
> From:Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> Sent: Thu 05-Jul-2012 10:56
> To: user@nutch.apache.org
> Subject: Re: Adaptive scheduling, but different
> 
> Hi Markus,
> This is a tricky one. I have personally had terrible headaches with
> similar problems where an update to a piece of legislation completely
> changes its URL, which makes the task of provenance hellishly
> complex... We addressed this by ensuring that legislation URIs stay
> consistent regardless of changes to textual content within any given
> artifact.
> 
> W.r.t. your specific problem, it is really outwith your control how and
> when the URLs change (as you've already described) and for that I am
> struggling to provide you with any reasonable input... sorry.
> Lewis
> 
> On Thu, Jul 5, 2012 at 8:51 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Any ideas?
> >
> >
> >
> > -----Original message-----
> >> From:Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: Mon 02-Jul-2012 23:05
> >> To: user@nutch.apache.org
> >> Subject: Adaptive scheduling, but different
> >>
> >> Hi,
> >>
> >> We use an adaptive scheduler for our crawl. This works fine for most cases, 
> >> but a specific type of page is crawled more often than it should be. These 
> >> are usually news or article archives such as news/archive/12345. Most 
> >> websites generate these pages dynamically. The problem is that whenever a 
> >> new item is posted, all news/archive/* pages become modified: every 
> >> article or item shifts one position and changes thousands of URLs.
> >>
> >> The problem of adaptive scheduling for these pages should be obvious by 
> >> now. I have given it some thought the past few weeks but I haven't figured 
> >> out a generic solution just yet, so any advice or out-of-the-box ideas are 
> >> very much appreciated!
> >>
> >> Thanks
> >> Markus
> >>
> 
> 
> 
> -- 
> Lewis
> 
