Hi,

> I am only interested in the internal links.
Then
  db.ignore.external.links = true
is correct: with true, outlinks pointing to other hosts are
dropped and only the internal links are kept.
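For reference, a minimal nutch-site.xml fragment for "internal links only" could look like this (property names as in the stock Nutch configuration; the values assume you want to keep same-host links and drop everything else):

```xml
<!-- nutch-site.xml: keep internal links, drop external ones -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Drop outlinks that point to a different host.</description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>Keep outlinks within the same host.</description>
</property>
```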

From the information given, it's impossible to tell what's going wrong.
At first glance all seems ok except one thing:
plugin.includes contains "scoring-optic".
It should be "scoring-opic". I doubt that's
the reason, though.

For a finer analysis, more details are required:
- URL filters and normalizers:
  are the desired URLs accepted?
- CustomFetchSchedule.java:
  shouldFetch() may play a role
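To rule out the URL filters, you can run the filter checker directly against your configuration (a sketch, assuming a Nutch 1.x install; URLFilterChecker reads URLs from stdin and prints each one prefixed with '+' if accepted or '-' if rejected by the filter chain):

```shell
# Feed the start URL (or any internal outlink) through all
# configured URL filters; '-' in the output means a filter
# rejects it before it ever reaches the fetch list.
echo "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1" \
  | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```

The same can be done for the normalizers with org.apache.nutch.net.URLNormalizerChecker, to see whether a normalizer rewrites the internal links into something the filters then reject.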

You can try to find the reason by:

% bin/nutch parsechecker \
    "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
Are all desired outlinks extracted by parser?

(after the start URL has been fetched and CrawlDb updated)
% bin/nutch readdb .../crawldb -dump crawldb_dump
% less crawldb_dump/part-*
Are the internal links in CrawlDb?

Cheers,
Sebastian

On 10/13/2013 04:18 AM, S.L wrote:
> Hello All,
> 
> I am facing a problem with the URL
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL has
> many internal links present in the page and also many external links
> to other domains; I am only interested in the internal links.
> 
> However, when this page is crawled, the internal links in it are not added
> for fetching in the next round (I have given a depth of 100).
> I have already set db.ignore.internal.links to false, but for some
> reason the internal links are not getting added to the next round's fetch
> list.
> 
> 
> On the other hand, if I set db.ignore.external.links to false, it correctly
> picks up all the external links from the page.
> 
> This problem is not present in any other domains. Can someone tell me what
> it is with this particular page?
> 
> I have also attached the nutch-site.xml that I am using for your review;
> please advise.
> 
