Oops! Attaching the regex-urlfilter.txt along with the nutch-site.xml for your review this time.
On Tue, Oct 15, 2013 at 7:32 PM, S.L <[email protected]> wrote:
> Sebastian,
>
> Thank you for the lead. After running the ParseChecker, I get the
> following output. I can see that only two URLs are being parsed out of
> the page. The pattern I see is that almost all the URLs in this page are
> enclosed in *<li></li>* tags, and those are *not* getting picked up; the
> two URLs that are picked up by the parser are *not* enclosed in a <li> tag.
>
> I have also attached the regex-urlfilter.txt along with the nutch-site.xml
> for your review.
>
> Please see the ParseChecker output below.
>
> fetching: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> parsing: http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> contentType: text/html
> signature: cb07f28617927cc0accb150b22f84649
> ---------
> Url
> ---------------
>
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1
> ---------
> ParseData
> ---------
>
> Version: 5
> Status: success(1,0)
> Title: All Categories
> Outlinks: 12
> outlink: toUrl:
> http://ir.ebaystatic.com/z/es/sbn2cgpp4y0s5ag0ptqhvfdcu.css anchor:
> outlink: toUrl:
> http://gh.ebaystatic.com/header/css/all.min?combo=11&ds=3&siteid=0&rvr=106&factor=AKAMIZEDAC,UX&h=24857anchor:
> outlink: toUrl:
> http://ir.ebaystatic.com/z/y2/pkp41uauqe0andx5iwudbddry.css anchor:
> outlink: toUrl:
> http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1#mainContentanchor:
> Skip to main content
> outlink: toUrl: http://www.ebay.com anchor: eBay
> outlink: toUrl:
> http://p.ebaystatic.com/aw/pics/globalheader/spr11.pnganchor: eBay
> outlink: toUrl:
> http://www.ebay.com/sch/allcategories/all-categories?_trksid=m570.l3694anchor:
> Shop by category
> outlink: toUrl: http://www.ebay.com/sch/i.html anchor: Enter your
> search keyword All Categories Advanced
> outlink: toUrl: http://www.ebay.com/sch/ebayadvsearch/?rt=nc anchor:
> Advanced
> outlink: toUrl:
> http://ir.ebaystatic.com/z/mh/zjkdj0vsquy3xj4jb1kvi20z3.js anchor:
> outlink: toUrl:
> http://gh.ebaystatic.com/header/js/rpt.min?combo=11&rvr=142&ds=3&siteid=0&factor=AKAMIZEDAC,UX&h=24857anchor:
> outlink: toUrl:
> http://rover.ebay.com/roversync/?site=0&stg=1&mpt=1381878771981 anchor:
> Content Metadata: Content-Language=en-US
> RlogId=t6gfv%3D9un%7F4g66%60%28d%3E75-141be64b10f-0xbb Date=Tue, 15 Oct
> 2013 23:12:51 GMT Content-Encoding=gzip Set-Cookie=lucky9=1113957;Domain=.
> ebay.com;Expires=Sun, 14-Oct-2018 23:12:52 GMT;Path=/ Connection=close
> Content-Type=text/html;charset=utf-8 Server=eBay Server
> Cache-Control=private Pragma=no-cache
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> ---------
> ParseText
> ---------
>
> All Categories Skip to main content eBay Shop by category Enter your
> search keyword All Categories Advanced
>
>
>
>
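The pattern observed above (links inside `<li>` tags missing from the outlinks) can be cross-checked against the raw page. A minimal sketch with Python's stdlib `html.parser`, recording for each `<a href>` whether an `<li>` is open around it; the inline sample HTML is hypothetical, in practice you would feed it the fetched page source:

```python
# Sketch: list each <a href> on a page together with whether it sits
# inside an <li> element, to confirm the pattern seen in the
# ParseChecker output. Sample HTML below is a hypothetical stand-in.
from html.parser import HTMLParser

class LiLinkAuditor(HTMLParser):
    """Records each href together with whether an <li> is currently open."""
    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.links = []  # list of (href, inside_li) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, self.li_depth > 0))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1

sample = """
<ul>
  <li><a href="http://www.ebay.com/sch/cat1">Category 1</a></li>
  <li><a href="http://www.ebay.com/sch/cat2">Category 2</a></li>
</ul>
<a href="http://www.ebay.com">eBay</a>
"""
auditor = LiLinkAuditor()
auditor.feed(sample)
for href, inside_li in auditor.links:
    print(("li " if inside_li else "top"), href)
```

Comparing this list against the ParseChecker outlinks would show exactly which hrefs the Nutch parser dropped.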
> On Tue, Oct 15, 2013 at 2:26 PM, Sebastian Nagel <
> [email protected]> wrote:
>
>> Hi,
>>
>> > I am only interested in the internal links.
>> Then
>> db.ignore.external.links = true
>> is correct.
>>
>> From this alone it is impossible to decide what's going wrong.
>> At first glance, all seems ok except one thing:
>> plugin.includes contains "scoring-optic".
>> It should be "scoring-opic". I don't know for sure, but
>> that's hardly the reason.
>>
>> For a finer analysis, more details are required:
>> - URL filter and normalizers:
>> are the desired URLs accepted
>> - CustomFetchSchedule.java:
>> shouldFetch() may play a role
>>
>> You can try to find the reason by:
>>
>> % bin/nutch parsechecker "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"
>> Are all desired outlinks extracted by parser?
>>
>> (after fetch of start url)
>> % bin/nutch readdb .../crawldb -dump crawldb_dump
>> % less crawldb_dump/part-*
>> Are they in CrawlDb?
>>
>> Cheers,
>> Sebastian
>>
>> On 10/13/2013 04:18 AM, S.L wrote:
>> > Hello All,
>> >
>> > I am facing this problem with the URL
>> > http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This
>> > URL has many internal links present in the page and also many
>> > external links to other domains; I am only interested in the
>> > internal links.
>> >
>> > However, when this page is crawled, the internal links in it are not
>> > added for fetching in the next round (I have given a depth of 100).
>> > I have already set db.ignore.internal.links to false, but for some
>> > reason the internal links are not getting added to the next round's
>> > fetch list.
>> >
>> >
>> > On the other hand, if I set db.ignore.external.links to false, it
>> > correctly picks up all the external links from the page.
>> >
>> > This problem is not present in any other domains. Can someone tell
>> > me what it is with this particular page?
>> >
>> > I have also attached the nutch-site.xml that I am using for your
>> > review; please advise.
>> >
>>
>>
>
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
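The first-match semantics described in the comments above can be sketched in Python. This is a simplified model of the urlfilter-regex plugin (which uses java.util.regex with unanchored find() semantics, approximated here by re.search); for illustration it enables the query-skip rule that is commented out in the attached file, since the problem URL ends in `?_rdc=1`:

```python
import re

def load_rules(text):
    """Parse regex-urlfilter.txt-style rules into (accept, pattern) pairs,
    skipping comments and blank lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def url_allowed(url, rules):
    """First matching pattern wins; no match means the URL is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# Hypothetical rule set: like the attached file, but with the
# query-character rule uncommented.
rules = load_rules(r"""
-^(file|ftp|mailto):
-[?*!@=]
+.
""")
print(url_allowed(
    "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1", rules))
# With -[?*!@=] active, the '?' in ?_rdc=1 rejects the URL before +. is tried.
```

In the attached file only `+.` is active, so every URL passes the filter; a check like this helps rule the filter in or out as the cause of dropped outlinks.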
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Test-Crawler</value>
<description>Test-Crawler</description>
</property>
<property>
<name>http.agent.description</name>
<value>Test-Crawler</value>
<description></description>
</property>
<property>
<name>http.robots.agents</name>
<value>Test-Crawler</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>fetcher.parse</name>
<value>true</value>
<description>If true, fetcher will parse content. Default is false, which means
that a separate parsing step is required after fetching is finished.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>temp</value>
<description></description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
</description>
</property>
<!-- web db properties -->
<property>
<name>db.fetch.interval.default</name>
<value>5</value>
<description>The default number of seconds between re-fetches of a page.
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<value>5</value>
<description>The maximum number of seconds between re-fetches of a page.
After this period every page in the db will be re-tried, no
matter what its status is.
</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>http.redirect.max</name>
<value>4</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
<!-- <property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.CustomFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property> -->
<property>
<name>http.timeout</name>
<value>50000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>http.max.delays</name>
<value>1000</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
<property>
<name>plugin.folders</name>
<value>/home/general/workspace/nutch/src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
<name>fetcher.threads.per.host.by.ip</name>
<value>false</value>
<description></description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>30000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>http.useHttp11</name>
<value>true</value>
<description>NOTE: at the moment this works only for protocol-httpclient.
If true, use HTTP 1.1; if false, use HTTP 1.0.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>50</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>30</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>5</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<!-- solr index properties -->
<property>
<name>solr.commit.size</name>
<value>100</value>
<description>
Defines the number of documents to send to Solr in a single update batch.
Decrease when handling very large documents to prevent Nutch from running
out of memory. NOTE: It does not explicitly trigger a server side commit.
</description>
</property>
<property>
<name>parser.timeout</name>
<value>-1</value>
</property>
<property>
<name>extract.prunetags</name>
<value>style,script</value>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>100</value>
<description></description>
</property>
<property>
<name>fetcher.timelimit.mins</name>
<value>-1</value>
<description>This is the number of minutes allocated to the fetching.
Once this value is reached, any remaining entry from the input URL list is skipped
and all active queues are emptied. The default value of -1 deactivates the time limit.
</description>
</property>
<property>
<name>fetcher.max.exceptions.per.queue</name>
<value>-1</value>
<description>The maximum number of protocol-level exceptions (e.g. timeouts) per
host (or IP) queue. Once this value is reached, any remaining entries from this
queue are purged, effectively stopping the fetching from this host/IP. The default
value of -1 deactivates this limit.
</description>
</property>
<!-- Added based on the suggestion from nutch mailing list -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-optic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<!-- URL normalizer properties -->
<property>
<name>urlnormalizer.order</name>
<value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
<description>Order in which normalizers will run. If any of these isn't
activated it will be silently skipped. If other normalizers not on the
list are activated, they will run in random order after the ones
specified here are run.
</description>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
<description>Name of the config file used by the RegexUrlNormalizer class.
</description>
</property>
<property>
<name>urlnormalizer.loop.count</name>
<value>1</value>
<description>Optionally loop through normalizers several times, to make
sure that all transformations have been performed.
</description>
</property>
</configuration>
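Configuration slips like the "scoring-optic" typo flagged in the thread are easy to catch mechanically. A minimal sketch that loads a nutch-site.xml into a dict with Python's stdlib ElementTree and sanity-checks a couple of values; the inline snippet is a trimmed, hypothetical stand-in for the attached file:

```python
# Sketch: read <property> name/value pairs from a Nutch config file and
# flag a suspicious plugin.includes entry. The snippet below is a
# hypothetical excerpt, not the full attached nutch-site.xml.
import xml.etree.ElementTree as ET

def load_properties(xml_text):
    """Return {name: value} for every <property> in a Nutch config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        props[name] = value
    return props

snippet = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|scoring-optic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>"""

props = load_properties(snippet)
if "scoring-optic" in props["plugin.includes"]:
    print("warning: 'scoring-optic' should probably be 'scoring-opic'")
print("ignore external links:", props["db.ignore.external.links"])
```

For a real file, replace the inline snippet with `open("nutch-site.xml").read()`; the same dict makes it easy to diff overrides against nutch-default.xml.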