Hi Shiva,

Having looked at the specific site, I have to amend my recommended max-depth 
from 1 to 2, since I assume you want to fetch the stories themselves, not just 
the hubpages.

If you want to crawl continuously, as Markus suggested, I still think you 
should keep the depth at 2, but give the first hubpage(s) a very high priority 
and a very short recrawl (fetch) interval. This is because stories are always 
added on the first page and then get pushed back. I suspect that if you don't 
limit depth, and especially if you don't limit yourself to the domain, you will 
eventually find yourself crawling the whole internet. If you do limit the crawl 
to the domain, that won't be a problem, but unless you give special treatment 
to the first page(s), you will be continuously recrawling hundreds of thousands 
of static pages.
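
For the special treatment of the first page(s), one option is to inject them 
with per-URL metadata in the seed file. If I remember the injector's metadata 
keys correctly (please verify against your Nutch version), nutch.score and 
nutch.fetchInterval are recognized, with the fields tab-separated, e.g.:

    https://www.jagran.com/latest-news-page1.html    nutch.score=100    nutch.fetchInterval=900

A fetch interval of 900 seconds makes that page eligible for refetch every 15 
minutes, while everything else keeps db.fetch.interval.default (30 days out of 
the box), and the high score pushes it to the front of the generator's queue. 
To cap the depth at 2 you can look at the scoring-depth plugin 
(scoring.depth.max in nutch-site.xml), but again, check the property name 
against your version. The page1 URL above is only an illustration - I haven't 
checked what the first hub page is actually called on that site.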

        Yossi.

> -----Original Message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: 29 July 2018 00:53
> To: user@nutch.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hello,
> 
> Yossi's suggestion is excellent if your use case is to crawl everything once
> and never again. However, if you need to crawl future articles as well, and
> have to deal with mutations (changes to pages you have already crawled), then
> let the crawler run continuously without regard for depth.
> 
> The latter is the usual case, because after all, if you had been given this
> task a few months ago, you wouldn't have needed to go to a depth of 497342,
> right?
> 
> Regards,
> Markus
> 
> 
> 
> 
> -----Original message-----
> > From:Yossi Tamari <yossi.tam...@pipl.com>
> > Sent: Saturday 28th July 2018 23:09
> > To: user@nutch.apache.org; shivakarthik...@gmail.com;
> > nu...@lucene.apache.org
> > Subject: RE: Issues while crawling pagination
> >
> > Hi Shiva,
> >
> > My suggestion would be to programmatically generate a seeds file containing
> > these 497342 URLs (since you know them in advance), and then use a very low
> > max-depth (probably 1) and a high number of iterations, since only a small
> > number of URLs will be fetched in each iteration unless you set a very low
> > crawl-delay.
> > (Mathematically, if you fetch 1 URL per second from this domain, fetching
> > 497342 URLs will take about 138 hours.)
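> >
> > For example, a quick script along these lines could produce the seeds file
> > (a sketch - it assumes every hub page from 1 to 497342 follows the
> > latest-news-page<N>.html pattern seen in your example; verify a few by
> > hand, and adjust the range if the first page has a different URL):
> >
> >     # generate_seeds.py - write one hub-page URL per line for Nutch's inject step.
> >     import os
> >
> >     NUM_PAGES = 497342  # taken from the last page number in the example URL
> >
> >     os.makedirs("seeds", exist_ok=True)
> >     with open("seeds/seed.txt", "w") as f:
> >         for page in range(1, NUM_PAGES + 1):
> >             f.write(f"https://www.jagran.com/latest-news-page{page}.html\n")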
> >
> >     Yossi.
> >
> > > -----Original Message-----
> > > From: ShivaKarthik S <shivakarthik...@gmail.com>
> > > Sent: 28 July 2018 23:20
> > > To: nu...@lucene.apache.org; user@nutch.apache.org
> > > Subject: Reg: Issues while crawling pagination
> > >
> > >  Hi
> > >
> > > Can you help me figure out an issue I am facing while crawling a hub page
> > > that has pagination? The problem is what depth to give and how to handle
> > > the pagination.
> > > I have a hubpage which has a pagination of more than 4.95 lakh (495,000)
> > > pages.
> > > e.g. https://www.jagran.com/latest-news-page497342.html     <here 497342
> > > is the number of pages under the hubpage latest-news>
> > >
> > >
> > > --
> > > Thanks and Regards
> > > Shiva
> >
> >
