I have now run a clean trunk checkout as well. Unfortunately, on the clean trunk checkout (with the name and plugin.folder values added to nutch-default.xml) I see the same behavior as with the clean 1.7 tag checkout. Obviously I am doing something wrong, and it must have to do with the config files, because I am not modifying the source in any way.
Would it be possible for you to share the config files you used in the clean trunk checkout with me, please? The following is the output from the trunk checkout execution.

fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
contentType: text/html
signature: 7026e09a97ff6df53f85d668bd86bcba
---------
Url
---------------
http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: All Categories
Outlinks: 8
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,GHCOLL,UX&h=24864 anchor:
  outlink: toUrl: http://ir.ebaystatic.com/z/ic/1hsgocfebuyd3pnukphb3cmqz.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141c972d0a9-0xeb Date=Fri, 18 Oct 2013 02:44:06 GMT Content-Encoding=gzip Set-Cookie=lucky9=5427514;Domain=.ebay.com;Expires=Wed, 17-Oct-2018 02:44:06 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
---------
ParseText
---------
All Categories Skip to main content eBay Shop by category Enter your search keyword All Categories Advanced

On Wed, Oct 16, 2013 at 5:00 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> when run parsechecker with current trunk of 1.x there are 653 outlinks
> (including many "internal" ones):
>
> % nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> ...
> Title: All Categories
> Outlinks: 653
> ...
>
> Which Nutch version is used?
> Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or 2.2.1)
> without any custom extensions (parse filters, etc.)?
>
> Thanks,
> Sebastian
>
> On 10/16/2013 01:32 AM, S.L wrote:
> > Sebastian,
> >
> > Thank you for the lead. After I use the ParseChecker, I get the following
> > output. I can see that only two URLs are being parsed out of the page. *I
> > see a pattern that* in this page almost all the URLs are enclosed in
> > *<li></li>* tags and those are *not* getting picked up; the two URLs that
> > are being picked by the parser are *not* enclosed in a <li> tag.
> >
> > I have also attached the regex-urlfilter.txt along with the nutch-site.xml
> > for your review.
> >
> > Please see the ParseChecker output below.
> >
> > fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > contentType: text/html
> > signature: cb07f28617927cc0accb150b22f84649
> > ---------
> > Url
> > ---------------
> >
> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: All Categories
> > Outlinks: 12
> >   outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
> >   outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
> >   outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
> >   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
> >   outlink: toUrl: http://www.ebay.com anchor: eBay
> >   outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
> >   outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
> >   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
> >   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
> >   outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
> >   outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
> >   outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> > Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
> > Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> > ---------
> > ParseText
> > ---------
> >
> > All Categories Skip to main content eBay Shop by category Enter your search
> > keyword All Categories Advanced
> >
> > On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>> I am only interested in the internal links.
> >> Then
> >> db.ignore.external.links = false
> >> is correct.
> >>
> >> It is impossible to decide what's going wrong.
> >> At a first glance, all seems ok except one:
> >> plugin.includes contains "scoring-optic".
> >> Should be "scoring-opic". I don't know but
> >> that hardly the reason.
> >>
> >> For a finer analysis, more details are required:
> >> - URL filter and normalizers:
> >>   are the desired URLs accepted
> >> - CustomFetchSchedule.java:
> >>   shouldFetch() may play a role
> >>
> >> You can try to find the reason by:
> >>
> >> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> >> Are all desired outlinks extracted by parser?
> >>
> >> (after fetch of start url)
> >> % bin/nutch readdb .../crawldb -dump crawldb_dump
> >> % less crawldb_dump/part-*
> >> Are they in CrawlDb?
> >>
> >> Cheers,
> >> Sebastian
> >>
> >> On 10/13/2013 04:18 AM, S.L wrote:
> >>> Hello All,
> >>>
> >>> I am facing this problem with the URL
> >>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 , this URL has
> >>> many internal links present in the page and also has many external links
> >>> to other domains , I am only interested in the internal links.
> >>>
> >>> However when this page is crawled the internal links in it are not added
> >>> for fetching in the next round of fetching ( I have given a depth of 100).
> >>> I have already set db.ignore.internal.links to false, but for some
> >>> reason the internal links are not getting added to the next round's
> >>> fetch list.
> >>>
> >>> On the other hand, if I set db.ignore.external.links to false, it
> >>> correctly picks up all the external links from the page.
> >>>
> >>> This problem is not present in any other domains. Can someone tell me
> >>> what it is with this particular page?
> >>>
> >>> I have also attached the nutch-site.xml that I am using for your review,
> >>> please advise.
> >>>
> >>
> >

