Hi Jigal,

> One site is indexed by Nutch. Now it should be limited to the pages that
> are linked in the seed URL (no further crawling necessary).
Have a look at the plugin "scoring-depth" and add the following to your
nutch-site.xml (cf. conf/nutch-default.xml):


<!-- scoring-depth properties
 Add 'scoring-depth' to the list of active plugins
 in the parameter 'plugin.includes' in order to use it.
 -->

<property>
  <name>scoring.depth.max</name>
  <value>2</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>
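
Activating the plugin could look roughly like the following. The only part
that matters is the added "scoring-depth" entry; the rest of the list is just
an assumption about a typical setup, so keep whatever your own
'plugin.includes' already contains:

<!-- sketch only: every entry except scoring-depth is assumed, not required -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-(opic|depth)|urlnormalizer-(pass|regex|basic)</value>
</property>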

> Furthermore all
> pages must be revisited daily (and new pages must be indexed daily too).

See the property "db.fetch.interval.default".
Also take the time to check the other
  db.fetch.interval.*
  db.fetch.schedule.*
properties.
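
As a starting point, a daily revisit might be configured as below in
nutch-site.xml. The value is in seconds, and 86400 (one day) is simply what
"daily" works out to, not a tested recommendation. It also assumes that the
crawl itself is run at least once a day:

<!-- sketch only: re-fetch interval of one day, given in seconds -->
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>Default number of seconds between re-fetches of a page
  (86400 = 1 day).
  </description>
</property>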

> Another wish is to exclude pages with certain content on them. Currently we
> do this by a delete query after Nutch finishes. We can keep it this way,
> but I wondered if there was a smarter option.

How is such content identified?

Cheers,
Sebastian

On 04/06/2016 11:38 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi,
> 
> Probably not too complex for those who are used to fiddling with the
> configuration, but I could use some pointer on how to achieve the following.
> 
> One site is indexed by Nutch. Now it should be limited to the pages that
> are linked in the seed URL (no further crawling necessary). Furthermore all
> pages must be revisited daily (and new pages must be indexed daily too).
> 
> Another wish is to exclude pages with certain content on them. Currently we
> do this by a delete query after Nutch finishes. We can keep it this way,
> but I wondered if there was a smarter option.
> 
> Thanks in advance for pointing me in the right direction.
> 
