Hello,

Yossi's suggestion is excellent if your use case is to crawl everything once and 
never again. However, if you need to crawl future articles as well, and have to 
deal with mutations, then let the crawler run continuously without regard for 
depth.

The latter is the usual case; after all, if you had been given this task a few 
months ago, you would not have needed to go 497342 pages deep, right?
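If you do go with Yossi's pre-generated seeds file, a minimal sketch of generating 
it could look like the snippet below. This assumes every page, including the 
first, follows the -pageN pattern from Shiva's example, and the output path 
seeds/seed.txt is just a placeholder for whatever directory you inject from:

    # generate_seeds.py - write one pagination URL per line for injection
    # Assumes the hub pages follow the pattern shown in the example URL below.
    BASE = "https://www.jagran.com/latest-news-page{}.html"
    LAST_PAGE = 497342  # number of pagination pages known at the time of writing

    with open("seeds/seed.txt", "w") as f:
        for page in range(1, LAST_PAGE + 1):
            f.write(BASE.format(page) + "\n")

You would then inject that seeds directory as usual and keep the depth at 1, as 
Yossi described.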

Regards,
Markus

-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Saturday 28th July 2018 23:09
> To: user@nutch.apache.org; shivakarthik...@gmail.com; nu...@lucene.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hi Shiva,
> 
> My suggestion would be to programmatically generate a seeds file containing 
> these 497342 URLs (since you know them in advance), and then use a very low 
> max-depth (probably 1) and a high number of iterations, since only a small 
> number of URLs will be fetched in each iteration, unless you set a very low 
> crawl-delay.
> (Mathematically, if you fetch 1 URL per second from this domain, fetching 
> 497342 URLs will take about 138 hours.)
> 
>       Yossi.
> 
> > -----Original Message-----
> > From: ShivaKarthik S <shivakarthik...@gmail.com>
> > Sent: 28 July 2018 23:20
> > To: nu...@lucene.apache.org; user@nutch.apache.org
> > Subject: Reg: Issues while crawling pagination
> > 
> >  Hi
> > 
> > Can you help me figure out an issue while crawling a hub page that has
> > pagination? The problem I am facing is what depth to give and how to handle
> > the pagination.
> > I have a hub page with more than 4.95 lakh (495,000+) pagination pages.
> > e.g. https://www.jagran.com/latest-news-page497342.html     <here 497342 is
> > the number of pages under the hub page latest-news>
> > 
> > 
> > --
> > Thanks and Regards
> > Shiva
> 
> 
