Sebastian,

Thank you. I just pulled a clean copy of 1.7 and changed nothing except adding the crawler name and the plugin.folders directory to nutch-default.xml, then ran it from Eclipse (I have been running in Eclipse all along). I only see 8 outlinks in the output posted below, far fewer than the 653 outlinks that you reported.
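
For what it's worth, a count like this could be double-checked by saving the parsechecker output to a file and tallying it with a short script. The following is only a sketch under assumptions: `summarize_outlinks`, `same_host`, and the `SAMPLE` text are illustrative names and data, not part of Nutch, and the regular expressions assume the `Outlinks: N` and `outlink: toUrl: ... anchor: ...` line shapes seen in this thread.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical helpers, not part of Nutch: they only scan text shaped like
# the parsechecker output quoted in this thread.
OUTLINK_RE = re.compile(r"outlink: toUrl: (\S+?)\s*anchor:")
COUNT_RE = re.compile(r"Outlinks:\s*(\d+)")


def summarize_outlinks(output: str) -> Tuple[Optional[int], List[str]]:
    """Return (count from the 'Outlinks: N' line, URLs of the listed outlinks)."""
    match = COUNT_RE.search(output)
    reported = int(match.group(1)) if match else None
    return reported, OUTLINK_RE.findall(output)


def same_host(urls: List[str], host: str) -> List[str]:
    """Keep only the URLs whose host equals `host` (the 'internal' links)."""
    return [u for u in urls if urlparse(u).hostname == host]


# Trimmed, illustrative sample in the same shape as parsechecker output.
SAMPLE = """Title: All Categories
Outlinks: 2
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
"""

if __name__ == "__main__":
    reported, urls = summarize_outlinks(SAMPLE)
    # Print the reported count, the number of listed outlinks, and the
    # subset that stays on www.ebay.com.
    print(reported, len(urls), same_host(urls, "www.ebay.com"))
```

Filtering by host this way gives a quick view of how many of the extracted links are internal to www.ebay.com, which is the part I care about.
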
I am going to try running this from the trunk code as well, as you did. In 1.7, though, it is definitely not behaving as expected. Please see the output below.

fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
contentType: text/html
signature: ae579f1bb9bac1953f8c94dcd55be9be
---------
Url
---------------
http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: All Categories
Outlinks: 8
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=GHCOLL&h=24850 anchor:
  outlink: toUrl: http://ir.ebaystatic.com/z/ic/1hsgocfebuyd3pnukphb3cmqz.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%2815b0-141c8c973a8-0xeb Date=Thu, 17 Oct 2013 23:39:07 GMT Content-Encoding=gzip Set-Cookie=lucky9=1220459;Domain=.ebay.com;Expires=Tue, 16-Oct-2018 23:39:07 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
---------
ParseText
---------
All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced

On Wed, Oct 16, 2013 at 5:00 PM, Sebastian Nagel wrote:

> Hi,
>
> when I run parsechecker with current trunk of 1.x there are 653 outlinks
> (including many "internal" ones):
>
> % nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> ...
> Title: All Categories
> Outlinks: 653
> ...
>
> Which Nutch version is used?
> Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or 2.2.1)
> without any custom extensions (parse filters, etc.)?
>
> Thanks,
> Sebastian
>
> On 10/16/2013 01:32 AM, S.L wrote:
> > Sebastian,
> >
> > Thank you for the lead. After I use the ParseChecker, I get the following
> > output. I can see that only two URLs are being parsed out of the page. *I
> > see a pattern that* in this page almost all the URLs are enclosed in
> > *<li></li>* tags and those are *not* getting picked up; the two URLs that
> > are being picked by the parser are *not* enclosed in a <li> tag.
> >
> > I have also attached the regex-urlfilter.txt along with the nutch-site.xml
> > for your review.
> >
> > Please see the ParseChecker output below.
> >
> > fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > contentType: text/html
> > signature: cb07f28617927cc0accb150b22f84649
> > ---------
> > Url
> > ---------------
> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > ---------
> > ParseData
> > ---------
> > Version: 5
> > Status: success(1,0)
> > Title: All Categories
> > Outlinks: 12
> >   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
> >   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
> >   outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
> >   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
> >   outlink: toUrl: http://www.ebay.com anchor: eBay
> >   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
> >   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
> >   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
> >   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
> >   outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
> >   outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
> >   outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> > Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
> > Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> > ---------
> > ParseText
> > ---------
> > All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced
> >
> > On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel wrote:
> >
> >> Hi,
> >>
> >>> I am only interested in the internal links.
> >> Then
> >>   db.ignore.external.links = false
> >> is correct.
> >>
> >> It is impossible to decide what's going wrong.
> >> At a first glance, all seems ok except one:
> >> plugin.includes contains "scoring-optic".
> >> Should be "scoring-opic". I don't know but
> >> that's hardly the reason.
> >>
> >> For a finer analysis, more details are required:
> >> - URL filter and normalizers:
> >>   are the desired URLs accepted
> >> - CustomFetchSchedule.java:
> >>   shouldFetch() may play a role
> >>
> >> You can try to find the reason by:
> >>
> >> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> >> Are all desired outlinks extracted by parser?
> >>
> >> (after fetch of start url)
> >> % bin/nutch readdb .../crawldb -dump crawldb_dump
> >> % less crawldb_dump/part-*
> >> Are they in CrawlDb?
> >>
> >> Cheers,
> >> Sebastian
> >>
> >> On 10/13/2013 04:18 AM, S.L wrote:
> >>> Hello All,
> >>>
> >>> I am facing this problem with the URL
> >>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL
> >>> has many internal links present in the page and also has many external
> >>> links to other domains. I am only interested in the internal links.
> >>>
> >>> However, when this page is crawled, the internal links in it are not added
> >>> for fetching in the next round of fetching (I have given a depth of 100).
> >>> I have already set db.ignore.internal.links to false, but for some
> >>> reason the internal links are not getting added to the next round's
> >>> fetch list.
> >>>
> >>> On the other hand, if I set db.ignore.external.links to false, it
> >>> correctly picks up all the external links from the page.
> >>>
> >>> This problem is not present in any other domains. Can someone tell me
> >>> what it is with this particular page?
> >>>
> >>> I have also attached the nutch-site.xml that I am using for your review,
> >>> please advise.

