I've lost the thread, but someone here recently asked for our Nutch XML configuration file. Our developer's back from holidays, so I've got the info now. Note that some of the configuration variables are not in the default file, as we've made modifications. On our server (dual Xeon, 8 GB of RAM, SCSI RAID 0) this config will fill about a 10 Mbps line. If the number of threads is increased to about 50, it'll fill a 40 Mbps pipe while crawling.
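
A minimal sketch of that change, assuming the thread count in question is the fetcher.threads.fetch property (20 in the file below):

```xml
<!-- Assumed override: raise the total number of fetcher threads from 20 to about 50 -->
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
</property>
```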

We also exclude quite a number of file types that Nutch would crawl by default (some rather obscure program files and the like). That helped us initially, as did cutting down the maximum page size we download. There are a lot of 3, 5 and 20 MB PDFs and Word documents out there that'll really slow things down.
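
The page-size cut is presumably the http.content.limit setting in the file below. As a sketch under that assumption, a value of 65536 bytes truncates each download at 64 KB:

```xml
<!-- Assumed to be the setting behind the smaller page sizes: truncate content at 65536 bytes (64 KB) -->
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>
```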

Without further ado, here's our current config file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
<name>address.ip.file</name>
<value>ip-address.txt</value>
<description>Name of file on CLASSPATH containing ip addresses used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of times a url that has encountered recoverable
errors is generated for fetch.</description>
</property>

<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from a different
host are ignored. (Keren added) </description>
</property>

<property>
<name>db.ignore.internal.links</name>
<value>true</value>
<description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links. </description>
</property>

<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page. </description>
</property>

<property>
<name>db.score.injected</name>
<value>1.0</value>
<description>The score of new pages added by the injector.</description>
</property>

<property>
<name>db.score.link.external</name>
<value>1.0</value>
<description>The score factor for new pages added due to a link from another host
relative to the referencing page's score.
</description>
</property>

<property>
<name>db.score.link.internal</name>
<value>1.0</value>
<description>The score factor for pages added due to a link from the same host,
relative to the referencing page's score.
</description>
</property>

<property>
<name>dropped.url.file</name>
<value>/home/xxx/xxxx/nutch/dropped_urls.out</value>
<description>Name of file containing dropped urls used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between successive requests
to the same server.</description>
</property>

<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
<description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread
handles one connection).</description>
</property>

<property>
<name>fetcher.threads.per.host</name>
<value>3</value>
<description>This number is the maximum number of threads that should be allowed
to access a host at one time.</description>
</property>

<property>
<name>http.agent.email</name>
<value>xxxxxxxxx</value>
<description>An email address to advertise in the HTTP 'From' request header and
User-Agent header.</description>
</property>

<property>
<name>http.agent.url</name>
<value>xxxxxxxxxxxxxxxxxxx</value>
<description>A URL to advertise in the User-Agent header. This will appear in parentheses
after the agent name. </description>
</property>

<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. </description>
</property>

<property>
<name>http.max.delays</name>
<value>3</value>
<description>The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now.</description>
</property>

<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when trying
to fetch a page.</description>
</property>

<property>
<name>indexer.boost.by.link.count</name>
<value>true</value>
<description>When true, scores for a page are multiplied by the log of the number
of incoming links to the page.</description>
</property>

<property>
<name>indexer.boost.link.count.weight</name>
<value>100.0</value>
<description>Scores for a page are multiplied by the log of (the number of incoming links to the page * this parameter). (Keren added)</description>
</property>

<property>
<name>indexer.score.power</name>
<value>0.5</value>
<description>Determines the power of link analysis scores. Each page's boost is set to <I>score<SUP>scorePower</SUP></I> where <I>score</I> is its link analysis score and <I>scorePower</I> is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect.</description>
</property>

<property>
<name>plugin.includes</name>

<value>nutch-extensionpoints|protocol-httpclient|urlfilter-ip|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need to at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. </description>
</property>

<property>
<name>urlfilter.ip.file</name>
<value>ip-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>
</nutch-conf>
