URI consistency is not under our control. Perhaps we should attempt to identify these pages first.
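As a rough starting point for identifying these pages, something like the sketch below could flag archive-style URLs such as news/archive/12345 by their shape alone. It is only an illustration: the function name `looks_like_archive_page` and the `LISTING_HINTS` set are made up for this example, they are not part of Nutch, and the hints would need tuning per site.

```python
from urllib.parse import urlparse

# Hypothetical path segments that hint at a listing/archive page.
# Assumption for illustration only; real sites will need their own list.
LISTING_HINTS = {"archive", "archives", "page"}


def looks_like_archive_page(url: str) -> bool:
    """Guess whether a URL points at a paginated archive/listing page.

    Heuristic: the last path segment is purely numeric and at least one
    earlier segment is a listing hint, e.g. news/archive/12345.
    """
    # Split the path into non-empty, lower-cased segments.
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    if len(segments) < 2:
        return False
    *parents, last = segments
    return last.isdigit() and any(p in LISTING_HINTS for p in parents)


if __name__ == "__main__":
    for candidate in (
        "http://example.com/news/archive/12345",
        "http://example.com/news/2012/07/some-article",
    ):
        print(candidate, looks_like_archive_page(candidate))
```

URLs flagged this way could then be handled separately from the adaptive logic, for instance with a fixed re-fetch interval. How best to hook that into the scheduler is left open here.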
Thanks

-----Original message-----
> From:Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> Sent: Thu 05-Jul-2012 10:56
> To: user@nutch.apache.org
> Subject: Re: Adaptive scheduling, but different
>
> Hi Markus,
> This is a tricky one, I have personally had terrible headaches with
> similar problems where an update to a piece of legislation completely
> changes its URL, which makes the task of provenance hellishly
> complex... We addressed this by ensuring that legislation URIs stay
> consistent regardless of changes to textual content within any given
> artifact.
>
> W.r.t. your specific problem, it is really outwith your control how and
> when the URLs change (as you've already described) and for that I am
> struggling to provide you with any reasonable input... sorry.
> Lewis
>
> On Thu, Jul 5, 2012 at 8:51 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Any ideas?
> >
> > -----Original message-----
> >> From:Markus Jelsma <markus.jel...@openindex.io>
> >> Sent: Mon 02-Jul-2012 23:05
> >> To: user@nutch.apache.org
> >> Subject: Adaptive scheduling, but different
> >>
> >> Hi,
> >>
> >> We use an adaptive scheduler for our crawl. This works fine for most cases,
> >> but a specific type of page is crawled more often than it should be. These
> >> are usually news or article archives such as news/archive/12345. Most
> >> websites generate these pages dynamically. The problem is that whenever a
> >> new item is posted, all news/archive/* pages become modified: every
> >> article or item shifts one position, and that changes thousands of URLs.
> >>
> >> The problem of adaptive scheduling for these pages should be obvious by
> >> now. I have given it some thought the past few weeks but I haven't figured
> >> out a generic solution just yet, so any advice or out-of-the-box ideas are
> >> very much appreciated!
> >>
> >> Thanks
> >> Markus
> >>
> >
>
> --
> Lewis
>