Hello All,

I am facing this problem with the URL
http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 . This URL has
many internal links present in the page and also has many external links
to other domains; I am only interested in the internal links.

However when this page is crawled the internal links in it are not added
for fetching in the next round of fetching ( I have given a depth of 100).
I have already set db.ignore.internal.links to false, but for some
reason the internal links are not getting added to the next round's fetch
list.


On the other hand if I set the db.ignore.external.links as false, it correctly
picks up all the external links from the page.

This problem is not present with any other domains; can someone tell me what
is special about this particular page?

I have also attached the nutch-site.xml that I am using for your review;
please advise.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>


  <property>
    <name>http.agent.name</name>
    <value>Test-Crawler</value>
    <description>Test-Crawler</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test-Crawler</value>
    <description></description>
  </property>
 <property>
  <name>http.robots.agents</name>
  <value>Test-Crawler</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
  
  <property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.</description>
</property>

   <property>
    <name>hadoop.tmp.dir</name>
    <value>temp</value>
    <description></description>
  </property> 
  
  <property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

  <!-- web db properties -->

<property>
  <name>db.fetch.interval.default</name>
  <value>5</value>
  <description>The default number of seconds between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>5</value>
  <description>The maximum number of seconds between re-fetches of a page.
  After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>
<property>
  <name>http.redirect.max</name>
  <value>4</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<!-- <property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.CustomFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property> -->


<property>
  <name>http.timeout</name>
    <value>50000</value>
      <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>1000</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>


<property>
  <name>plugin.folders</name>
  <value>/home/general/workspace/nutch/src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>




<property>
    	<name>fetcher.threads.per.host.by.ip</name>
    	<value>false</value>
    	<description></description>
</property>


<property>
  <name>db.max.outlinks.per.page</name>
  <value>30000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description>NOTE: at the moment this works only for protocol-httpclient.
  If true, use HTTP 1.1, if false use HTTP 1.0 .
  </description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>50</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>30</value>
  <description>The number of FetcherThreads the fetcher should use.
    This is also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>5</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>


<!-- solr index properties -->
<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory.
  </description>
</property> 

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property> 

<property>
  <name>extract.prunetags</name>
  <value>style,script</value>
</property> 

<property>
<name>fetcher.threads.per.queue</name>
   <value>100</value>
   <description></description>
</property>

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped 
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
  host (or IP) queue. Once this value is reached, any remaining entries from this
  queue are purged, effectively stopping the fetching from this host/IP. The default
  value of -1 deactivates this limit.
  </description>
</property>

<!--  Added based on the suggestion from nutch mailing list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>parse.plugin.file</name>
  <value>parse-plugins.xml</value>
  <description>The name of the file that defines the associations between
  content-types and parsers.</description>
</property>

<!-- URL normalizer properties -->

<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.
  </description>
</property>


<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.
  </description>
</property>



<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.
  </description>
</property>

    
</configuration>

Reply via email to