Sebastian,

Thank you. I just pulled a clean version of 1.7 and changed nothing except
adding the crawler name and the plugin.folders directory to
nutch-default.xml, then ran it from Eclipse (I have been running in Eclipse
all along). I only see 8 outlinks in the output posted below, which is far
fewer than the 653 outlinks that you reported.

I am going to try running this from the trunk code as well, as you did. In
1.7, though, it is definitely not behaving as expected.

Please see the output below.


fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
contentType: text/html
signature: ae579f1bb9bac1953f8c94dcd55be9be
---------
Url
---------------

http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: All Categories
Outlinks: 8
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=GHCOLL&h=24850 anchor:
  outlink: toUrl: http://ir.ebaystatic.com/z/ic/1hsgocfebuyd3pnukphb3cmqz.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContent anchor: Skip to main content
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://p.ebaystatic.com/aw/pics/globalheader/spr11.png anchor: eBay
  outlink: toUrl: http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694 anchor: Shop by category
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
Content Metadata: Content-Language=en-US
RlogId=t6gfv%3D9un%7F4g66%60%2815b0-141c8c973a8-0xeb Date=Thu, 17 Oct 2013
23:39:07 GMT Content-Encoding=gzip Set-Cookie=lucky9=1220459;Domain=.
ebay.com;Expires=Tue, 16-Oct-2018 23:39:07 GMT;Path=/ Connection=close
Content-Type=text/html;charset=utf-8 Server=eBay Server
Cache-Control=private Pragma=no-cache
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
---------
ParseText
---------

All Categories Skip to main content eBay Shop by category Enter your search
keyword All Categories Advanced
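
To make it easier to compare the 1.7 run against the trunk run, here is a minimal sketch that summarizes a saved parsechecker report. It is not part of Nutch; the function name `summarize_outlinks` is hypothetical, and it assumes the report has one `outlink: toUrl: <url> anchor: <text>` entry per line (that is, before any mail-client line wrapping). It reads the declared `Outlinks:` count and tallies the outlinks per host, so internal www.ebay.com links can be told apart from static-asset hosts.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def summarize_outlinks(report_text):
    """Return (declared_count, hosts) for one parsechecker report.

    declared_count is the integer after "Outlinks:" (None if absent).
    hosts is a Counter mapping each outlink's host to its frequency.
    Assumes one unwrapped "outlink: toUrl: <url> ..." entry per line.
    """
    declared = None
    hosts = Counter()
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        count_match = re.match(r"Outlinks:\s*(\d+)$", line)
        if count_match:
            declared = int(count_match.group(1))
            continue
        link_match = re.match(r"outlink: toUrl:\s*(\S+)", line)
        if link_match:
            hosts[urlparse(link_match.group(1)).netloc] += 1
    return declared, hosts


# Small excerpt in the same format as the report above.
sample = """Outlinks: 3
  outlink: toUrl: http://www.ebay.com anchor: eBay
  outlink: toUrl: http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
  outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor: Advanced
"""

declared, hosts = summarize_outlinks(sample)
print("declared:", declared)
for host, n in hosts.most_common():
    print(host, n)
```

Running the same function over the report from each build should show at a glance whether the missing links are the internal www.ebay.com ones.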




On Wed, Oct 16, 2013 at 5:00 PM, Sebastian Nagel <[email protected]> wrote:

> Hi,
>
> when run parsechecker with current trunk of 1.x there are 653 outlinks
> (including many "internal" ones):
>
> % nutch parsechecker "
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1";
> ...
> Title: All Categories
> Outlinks: 653
> ...
>
>
> Which Nutch version is used?
> Can you try to reproduce the problem with a "clean" Nutch (either 1.7 or
> 2.2.1)
> without any custom extensions (parse filters, etc.)?
>
> Thanks,
> Sebastian
>
>
>
>
> On 10/16/2013 01:32 AM, S.L wrote:
> > Sebastian,
> >
> > Thank you for the lead, after I use the ParseChecker , I get the
> following
> > output , I can see that only two URLs are being parsed out of the page ,
> *I
> > see a pattern that* in this page almost all the URLs are enclosed in  *
> > <li></li>* tags and those are *not* getting picked up , the two URLs that
> > are being picked by the parser are *not* enclosed in a <li> tag.
> >
> > I have also attached the regex-urlfilter.txt along with the
> nutch-site.xml
> > for your review.
> >
> > Please see the ParseChecker output below.
> >
> > fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > contentType: text/html
> > signature: cb07f28617927cc0accb150b22f84649
> > ---------
> > Url
> > ---------------
> >
> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> > ---------
> > ParseData
> > ---------
> >
> > Version: 5
> > Status: success(1,0)
> > Title: All Categories
> > Outlinks: 12
> >   outlink: toUrl:
> > http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
> >   outlink: toUrl:
> >
> http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857anchor
> :
> >   outlink: toUrl:
> > http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
> >   outlink: toUrl:
> >
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContentanchor
> :
> > Skip to main content
> >   outlink: toUrl: http://www.ebay.com anchor: eBay
> >   outlink: toUrl:
> > http://p.ebaystatic.com/aw/pics/globalheader/spr11.pnganchor: eBay
> >   outlink: toUrl:
> >
> http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694anchor
> :
> > Shop by category
> >   outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your
> search
> > keyword All Categories Advanced
> >   outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor:
> > Advanced
> >   outlink: toUrl:
> > http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.jsanchor:
> >   outlink: toUrl:
> >
> http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857anchor
> :
> >   outlink: toUrl:
> > http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> > Content Metadata: Content-Language=en-US
> > RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct
> > 2013 23:12:51 GMT Content-Encoding=gzip
> Set-Cookie=lucky9=1113957;Domain=.
> > ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close
> > Content-Type=text/html;charset=utf-8 Server=eBay Server
> > Cache-Control=private Pragma=no-cache
> > Parse Metadata: CharEncodingForConversion=utf-8
> OriginalCharEncoding=utf-8
> > ---------
> > ParseText
> > ---------
> >
> > All Categories Skip to main content eBay Shop by category Enter your
> search
> > keyword All Categories Advanced
> >
> >
> >
> >
> > On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <
> [email protected]
> >> wrote:
> >
> >> Hi,
> >>
> >>> I am only interested in the internal links.
> >> Then
> >>   db.ignore.external.links = false
> >> is correct.
> >>
> >> It is impossible to decide what's going wrong.
> >> At a first glance, all seems ok except one:
> >> plugin.includes contains "scoring-optic".
> >> Should be "scoring-opic". I don't know but
> >> that hardly the reason.
> >>
> >> For a finer analysis, more details are required:
> >> - URL filter and normalizers:
> >>   are the desired URLs accepted
> >> - CustomFetchSchedule.java:
> >>   shouldFetch() may play a role
> >>
> >> You can try to find the reason by:
> >>
> >> % bin/nutch parsechecker "
> >> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1";
> >> Are all desired outlinks extracted by parser?
> >>
> >> (after fetch of start url)
> >> % bin/nutch readdb .../crawldb -dump crawldb_dump
> >> % less crawldb_dump/part-*
> >> Are they in CrawlDb?
> >>
> >> Cheers,
> >> Sebastian
> >>
> >> On 10/13/2013 04:18 AM, S.L wrote:
> >>> Hello All,
> >>>
> >>> I am facing this problem with the URL
> >>> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 , this
> URL
> >> has
> >>> many internal links present in  the page and also has many external
> links
> >>> to other domains , I am only interested in the internal links.
> >>>
> >>> However when this page is crawled the internal links in it are not
> added
> >>> for fetching in the next round of fetching ( I have given a depth of
> >> 100).
> >>> I have alread  set the db.ignore.internal.links as false ,but for some
> >>> reason the internal links are not getting added to the next round of
> >> fetch
> >>> list.
> >>>
> >>>
> >>> On the other hand if I set the db.ignore.external.links as false, it
> >> correctly
> >>> picks up all the external links from the page.
> >>>
> >>> This problem is not present in any other domains , can some tell me
> what
> >> is
> >>> it with this particular page ?
> >>>
> >>> I have also attached the nucth-site.xml that I am using for your
> review,
> >>> please advise.
> >>>
> >>
> >>
> >
>
>
