Sebastian,

Thank you for the lead. After running ParseChecker I get the output shown
below. I can see that only two URLs are being parsed out of the page. *I
see a pattern:* in this page almost all the URLs are enclosed in
*<li></li>* tags and those are *not* getting picked up; the two URLs that
are being picked up by the parser are *not* enclosed in an <li> tag.
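
To check this pattern independently of Nutch, a minimal sketch along the
following lines could be used. It assumes the page HTML has been saved
locally and that every <li> is explicitly closed; the class name, function
name and sample markup are made up for illustration and are not part of
Nutch.

```python
from html.parser import HTMLParser


class LinkContextCounter(HTMLParser):
    """Collect <a href> links found inside vs. outside <li> elements."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0      # number of <li> elements currently open
        self.inside_li = []    # hrefs seen while inside at least one <li>
        self.outside_li = []   # hrefs seen anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = self.inside_li if self.li_depth else self.outside_li
                target.append(href)

    def handle_endtag(self, tag):
        # Assumption: <li> tags are explicitly closed. Unclosed <li> tags
        # would leave the depth counter too high.
        if tag == "li" and self.li_depth:
            self.li_depth -= 1


def count_links(html):
    """Return (links inside <li>, links outside <li>) for an HTML string."""
    parser = LinkContextCounter()
    parser.feed(html)
    parser.close()
    return parser.inside_li, parser.outside_li


# Hypothetical sample markup, not taken from the eBay page.
sample = """
<a href="http://www.ebay.com">eBay</a>
<ul>
  <li><a href="/sch/Antiques">Antiques</a></li>
  <li><a href="/sch/Art">Art</a></li>
</ul>
"""
inside, outside = count_links(sample)
print(len(inside), "links inside <li>;", len(outside), "outside")
```

Comparing these two lists with the outlinks reported by ParseChecker would
show whether the missing links really are the ones inside <li> elements.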

I have also attached the regex-urlfilter.txt along with the nutch-site.xml
for your review.

Please see the ParseChecker output below.

fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
contentType: text/html
signature: cb07f28617927cc0accb150b22f84649
---------
Url
---------------

http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: All Categories
Outlinks: 12
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857 anchor:
  outlink: toUrl: http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
  outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your search keyword All Categories Advanced
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
  outlink: toUrl: http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857 anchor:
  outlink: toUrl: http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
Content Metadata: Content-Language=en-US RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close Content-Type=text/html;charset=utf-8 Server=eBay Server Cache-Control=private Pragma=no-cache
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
---------
ParseText
---------

All Categories Skip to main content eBay Shop by category Enter your search
keyword All Categories Advanced




On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> > I am only interested in the internal links.
> Then
>   db.ignore.external.links = false
> is correct.
>
> It is impossible to decide what's going wrong.
> At a first glance, all seems ok except one:
> plugin.includes contains "scoring-optic".
> Should be "scoring-opic". I don't know, but
> that is hardly the reason.
>
> For a finer analysis, more details are required:
> - URL filter and normalizers:
>   are the desired URLs accepted
> - CustomFetchSchedule.java:
>   shouldFetch() may play a role
>
> You can try to find the reason by:
>
> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
> Are all desired outlinks extracted by parser?
>
> (after fetch of start url)
> % bin/nutch readdb .../crawldb -dump crawldb_dump
> % less crawldb_dump/part-*
> Are they in CrawlDb?
>
> Cheers,
> Sebastian
>
> On 10/13/2013 04:18 AM, S.L wrote:
> > Hello All,
> >
> > I am facing a problem with the URL
> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL
> > has many internal links in the page and also many external links
> > to other domains; I am only interested in the internal links.
> >
> > However, when this page is crawled, the internal links in it are not added
> > for fetching in the next round (I have given a depth of 100).
> > I have already set db.ignore.internal.links to false, but for some
> > reason the internal links are not getting added to the next round's
> > fetch list.
> >
> > On the other hand, if I set db.ignore.external.links to false, it
> > correctly picks up all the external links from the page.
> >
> > This problem does not occur with any other domain. Can someone tell me
> > what it is about this particular page?
> >
> > I have also attached the nutch-site.xml that I am using for your review;
> > please advise.
> >
>
>
