nutch-xml.conf

Insurance Squared Inc. Sun, 13 Aug 2006 09:28:57 -0700

I've lost the thread, but someone here recently asked for our Nutch XML configuration file. Our developer's back from holidays, so I've got the info now. Note that some of the configuration variables are not in the default file, as we've made modifications. On our server (dual Xeon, 8 GB of RAM, SCSI RAID 0) this config will fill about a 10 Mbps line. If the number of threads is increased to about 50, it'll fill a 40 Mbps pipe while crawling.
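
As a sketch, the 50-thread variant mentioned above would be the same fetcher.threads.fetch property that appears in the config file further down, with only the value changed. This assumes no other setting is adjusted alongside it, which may not hold on other hardware or bandwidth:

```xml
<!-- Hypothetical override: only the value differs from the config further down. -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- "about 50" per the note above; the posted file uses 20 -->
  <value>50</value>
</property>
```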

We also exclude quite a number of file types that Nutch would crawl by default (some rather obscure program files and the like). That helped us initially as well, as did cutting down the maximum page size we download. There are a lot of 3, 5, and 20 MB PDFs and Word documents out there that'll really slow things down.


Without further ado, here's our current config file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
<name>address.ip.file</name>
<value>ip-address.txt</value>

<description>Name of file on CLASSPATH containing ip addresses used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
<name>db.fetch.retry.max</name>
<value>3</value>

<description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description>
</property>

<property>
<name>db.ignore.external.links</name>
<value>false</value>

<description>If true, when adding new links to a page, links from a different host are ignored. (Keren added)</description>
</property>

<property>
<name>db.ignore.internal.links</name>
<value>true</value>

<description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description>
</property>

<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>

<description>The maximum number of outlinks that we'll process for a page.</description>
</property>

<property>
<name>db.score.injected</name>
<value>1.0</value>
<description>The score of new pages added by the injector.</description>
</property>

<property>
<name>db.score.link.external</name>
<value>1.0</value>

<description>The score factor for new pages added due to a link from another host, relative to the referencing page's score.</description>
</property>

<property>
<name>db.score.link.internal</name>
<value>1.0</value>

<description>The score factor for pages added due to a link from the same host, relative to the referencing page's score.</description>
</property>

<property>
<name>dropped.url.file</name>
<value>/home/xxx/xxxx/nutch/dropped_urls.out</value>

<description>Name of file containing dropped urls used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>

<property>
<name>fetcher.server.delay</name>
<value>5.0</value>

<description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
<name>fetcher.threads.fetch</name>
<value>20</value>

<description>The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
<name>fetcher.threads.per.host</name>
<value>3</value>

<description>This number is the maximum number of threads that should be allowed to access a host at one time.</description>
</property>

<property>
<name>http.agent.email</name>
<value>xxxxxxxxx</value>

<description>An email address to advertise in the HTTP 'From' request header and User-Agent header.</description>
</property>

<property>
<name>http.agent.url</name>
<value>xxxxxxxxxxxxxxxxxxx</value>

<description>A URL to advertise in the User-Agent header. This will appear in parentheses after the agent name.</description>
</property>

<property>
<name>http.content.limit</name>
<value>65536</value>

<description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>

<property>
<name>http.max.delays</name>
<value>3</value>

<description>The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now.</description>

</property>

<property>
<name>http.redirect.max</name>
<value>0</value>

<description>The maximum number of redirects the fetcher will follow when trying to fetch a page.</description>
</property>

<property>
<name>indexer.boost.by.link.count</name>
<value>true</value>

<description>When true, scores for a page are multiplied by the log of the number of incoming links to the page.</description>
</property>

<property>
<name>indexer.boost.link.count.weight</name>
<value>100.0</value>

<description>Scores for a page are multiplied by the log of (the number of incoming links to the page * this parameter). (Keren added)</description>
</property>

<property>
<name>indexer.score.power</name>
<value>0.5</value>

<description>Determines the power of link analysis scores. Each page's boost is set to score^scorePower, where score is its link analysis score and scorePower is the value of this parameter. This is compiled into indexes, so when this is changed, pages must be re-indexed for it to take effect.</description>
</property>

<property>
<name>plugin.includes</name>

<value>…df|msword)|index-basic|query-(basic|site|url)</value>

<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need to at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>

</property>

<property>
<name>urlfilter.ip.file</name>
<value>ip-urlfilter.txt</value>

<description>Name of file on CLASSPATH containing regular expressions used by urlfilter-ip (IPURLFilter) plugin. (Keren added)</description>
</property>
</nutch-conf>
