[Nutch Wiki] Update of "nutch-0.8-dev/bin/nutch mergelinkdb" by LukasVlcek

Apache Wiki Wed, 10 May 2006 10:12:11 -0700

The following page has been changed by LukasVlcek:
http://wiki.apache.org/nutch/nutch-0%2e8-dev/bin/nutch_mergelinkdb

New page:
= "mergelinkdb" is an alias for "org.apache.nutch.crawl.LinkDbMerger" =

== Merges several LinkDb(s) together. URLFilters can optionally be used to filter out specific content. ==

This tool is useful if you have built partial !LinkDb(s) from groups of
segments and need to integrate them into one (e.g. for indexing or for
searching).

It can also be used purely for filtering, to remove unwanted URLs and
links - in that case only one !LinkDb should be specified in the arguments.

If more than one !LinkDb contains information about the same URL,
the inlinks are accumulated, but at most ''db.max.inlinks'' inlinks
are kept for that URL.
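
The accumulation rule above can be illustrated with a small Python sketch. It is not Nutch code: the function name merge_linkdbs, the dictionary layout and the example URLs are hypothetical, and skipping duplicate inlinks is an assumption that the text above does not state.

```python
def merge_linkdbs(linkdbs, max_inlinks):
    """Merge several {target_url: [inlink_url, ...]} mappings into one.

    Hypothetical model of a LinkDb; max_inlinks plays the role of
    db.max.inlinks.
    """
    merged = {}
    for db in linkdbs:
        for target, inlinks in db.items():
            kept = merged.setdefault(target, [])
            for inlink in inlinks:
                if len(kept) >= max_inlinks:
                    break  # cap reached, ignore the remaining inlinks
                if inlink not in kept:  # assumption: duplicates are skipped
                    kept.append(inlink)
    return merged


db1 = {"http://example.org/": ["http://a.example/", "http://b.example/"]}
db2 = {
    "http://example.org/": [
        "http://b.example/",
        "http://c.example/",
        "http://d.example/",
    ],
    "http://other.example/": ["http://a.example/"],
}

merged = merge_linkdbs([db1, db2], max_inlinks=3)
for target, inlinks in merged.items():
    print(target, inlinks)
```

With a cap of 3, the inlink from d.example is dropped for http://example.org/ because three inlinks have already been collected for that target.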

If the -filter option is given, URLFilters are applied to both the target
URLs and the URLs of their incoming links. If a target URL is prohibited,
the target and all inlinks to it are removed. If only some of the incoming
links are prohibited, only those are removed, and they do not count
towards the ''db.max.inlinks'' limit.
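
The filtering rules above can be sketched in Python as follows. This is an illustration only, not the LinkDbMerger implementation: filter_and_cap, is_allowed and the example URLs are hypothetical, and dropping a target that has no inlinks left is an assumption.

```python
def filter_and_cap(linkdb, is_allowed, max_inlinks):
    """Apply a URL filter to targets and inlinks, then enforce the cap.

    linkdb is a hypothetical {target_url: [inlink_url, ...]} mapping and
    is_allowed stands in for the configured URLFilters.
    """
    result = {}
    for target, inlinks in linkdb.items():
        if not is_allowed(target):
            continue  # prohibited target: dropped together with its inlinks
        # Prohibited inlinks are removed first, so they never count
        # towards the limit.
        kept = [url for url in inlinks if is_allowed(url)][:max_inlinks]
        if kept:  # assumption: targets left without inlinks are dropped
            result[target] = kept
    return result


def is_allowed(url):
    # Hypothetical filter rule, for illustration only.
    return "spam" not in url


linkdb = {
    "http://example.org/": [
        "http://spam.example/x",
        "http://a.example/",
        "http://b.example/",
        "http://c.example/",
    ],
    "http://spam.example/": ["http://a.example/"],
}

filtered = filter_and_cap(linkdb, is_allowed, max_inlinks=2)
print(filtered)
```

Here the prohibited inlink from spam.example is removed before counting, so the two inlinks kept for http://example.org/ are those from a.example and b.example, and the prohibited target http://spam.example/ disappears entirely.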

=== Usage ===
 nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.!LinkDbMerger output_linkdb linkdb1 [linkdb2 linkdb3 ...] [-filter]

  '''output_linkdb:''' Output !LinkDb.[[BR]]
  '''linkdb1 [linkdb2 linkdb3 ...]:''' One or many input !LinkDb(s).[[BR]]
  '''-filter:''' Apply the currently configured URLFilters to URLs and links in the !LinkDb(s).[[BR]]

=== Configuration Files ===
 hadoop-default.xml[[BR]]
 hadoop-site.xml[[BR]]
 nutch-default.xml[[BR]]
 nutch-site.xml[[BR]]

=== Other Files ===
 None.

=== Caveats and Notes ===
 index.done file is not created.

DevelopmentCommandLineOptions
