Sebastian,

Thank you for the lead. After running ParseChecker I get the output below, and I can see that only two URLs are being parsed out of the page. *I see a pattern*: in this page almost all the URLs are enclosed in *<li></li>* tags, and those are *not* getting picked up. The two URLs that the parser does pick up are *not* enclosed in an <li> tag.
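To double-check the <li> observation independently of Nutch, something like the following could be run against the saved page source. It is only a minimal sketch: the names LiLinkCounter and split_links and the SAMPLE markup are made up for illustration (the sample is not the real eBay page), and it assumes every <li> is explicitly closed.

```python
from html.parser import HTMLParser


class LiLinkCounter(HTMLParser):
    """Collect href values of <a> tags, split by whether they sit inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0      # number of <li> elements currently open
        self.inside_li = []    # hrefs found inside some <li>
        self.outside_li = []   # hrefs found outside any <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = self.inside_li if self.li_depth > 0 else self.outside_li
                target.append(href)

    def handle_endtag(self, tag):
        # Assumption: <li> tags are explicitly closed in the page source.
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def split_links(html):
    """Return (links inside <li>, links outside <li>) for the given markup."""
    parser = LiLinkCounter()
    parser.feed(html)
    parser.close()
    return parser.inside_li, parser.outside_li


# Hypothetical sample markup for illustration only.
SAMPLE = """
<a href="http://example.com/">Home</a>
<ul>
  <li><a href="http://example.com/cat/a">Category A</a></li>
  <li><a href="http://example.com/cat/b">Category B</a></li>
</ul>
<a href="http://example.com/advanced">Advanced</a>
"""

if __name__ == "__main__":
    inside, outside = split_links(SAMPLE)
    print(len(inside), "links inside <li>,", len(outside), "outside")
```

If the count inside <li> is large for the real page while the Outlinks list stays short, that would support the pattern described above; it would not by itself say why the parser skips them.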
I have also attached the regex-urlfilter.txt along with the nutch-site.xml for your review. Please see the ParseChecker output below.

fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
contentType: text/html
signature: cb07f28617927cc0accb150b22f84649
--------- Url ---------------
http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title: All Categories
Outlinks: 12
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
  outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
  outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
  outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
  outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
--------- ParseText ---------
All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced

On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> > I am only interested in the internal links.
> Then
>   db.ignore.external.links = false
> is correct.
>
> It is impossible to decide what's going wrong.
> At a first glance, all seems ok except one:
>   plugin.includes contains "scoring-optic".
> Should be "scoring-opic". I don't know but
> that's hardly the reason.
>
> For a finer analysis, more details are required:
> - URL filter and normalizers:
>   are the desired URLs accepted
> - CustomFetchSchedule.java:
>   shouldFetch() may play a role
>
> You can try to find the reason by:
>
> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> Are all desired outlinks extracted by parser?
>
> (after fetch of start url)
> % bin/nutch readdb .../crawldb -dump crawldb_dump
> % less crawldb_dump/part-*
> Are they in CrawlDb?
>
> Cheers,
> Sebastian
>
> On 10/13/2013 04:18 AM, S.L wrote:
> > Hello All,
> >
> > I am facing this problem with the URL
> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 , this URL has
> > many internal links present in the page and also has many external links
> > to other domains, I am only interested in the internal links.
> >
> > However when this page is crawled the internal links in it are not added
> > for fetching in the next round of fetching (I have given a depth of 100).
> > I have already set the db.ignore.internal.links as false, but for some
> > reason the internal links are not getting added to the next round of fetch
> > list.
> >
> > On the other hand if I set the db.ignore.external.links as false, it correctly
> > picks up all the external links from the page.
> >
> > This problem is not present in any other domains, can someone tell me what is
> > it with this particular page?
> >
> > I have also attached the nutch-site.xml that I am using for your review,
> > please advise.
> >
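
On the URL filter question raised in the quoted reply (are the desired URLs accepted?), here is a rough way to reason about a regex-urlfilter.txt by hand. It is a sketch under assumptions: the RULES text is hypothetical and is NOT the attached file, load_rules and accepts are illustrative names, and it uses Python's re module, which may differ from Java regex in corner cases. As far as I understand the filter, each rule starts with + or -, the first matching rule decides, and a URL matching no rule is rejected.

```python
import re

# Hypothetical rules in regex-urlfilter.txt syntax; NOT the attached file.
RULES = """
# skip URLs containing certain characters as probable queries
-[?*!@=]
# accept anything else
+.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in (
        "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1",
        "http://www.ebay.com/sch/i.html",
    ):
        print("accepted" if accepts(rules, url) else "rejected", url)
```

With these assumed rules, any URL containing ? or = is rejected before the catch-all accept rule is reached, so it is worth checking whether the attached filter file contains a similar exclusion.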

