Hi Kamil,

> I was wondering if this script is advisable to use?

I haven't tried the script itself, but I have used some of the
underlying commands (mergedb, etc.).
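
For reference, this is roughly how the remaining merge tools are
invoked (an untested sketch; paths and segment names below are just
placeholders, and the exact options can differ between Nutch versions):

  # merge several CrawlDbs into a new one (CrawlDbMerger)
  $nutch_dir/nutch mergedb crawl_merged/crawldb \
      crawl_a/crawldb crawl_b/crawldb

  # merge individual segments into a new segment (SegmentMerger)
  $nutch_dir/nutch mergesegs crawl_merged/segments \
      crawl_a/segments/20230201000000 crawl_b/segments/20230202000000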

> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)

Of course, some of the commands are obsolete. A long time ago, Nutch
used Lucene index shards directly. Now the management of indexes
(including merging of shards) is delegated to Solr or Elasticsearch.


> I plan to use it for crawls of non-overlapping urls.

... just a few thoughts about this particular use case:

Why do you want to merge the data structures at all?

- if they're disjoint there is no need for it
- all operations (CrawlDb: generate, update, etc.)
  are much faster on smaller structures

If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
However, some of the command-line tools may still expect only a single
CrawlDb or segment. In that case:
- we could extend the command-line parameters
- or just copy the sequence files into one single path
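
For instance, updatedb already accepts several segments, either listed
explicitly or via -dir; and since segments are independent
subdirectories, they can also simply be collected under one parent
directory. An untested sketch, all paths are placeholders:

  # update one CrawlDb from several segments in one pass
  $nutch_dir/nutch updatedb crawl/crawldb -dir crawl/segments

  # or list the segments explicitly
  $nutch_dir/nutch updatedb crawl/crawldb \
      crawl/segments/20230201000000 crawl/segments/20230202000000

  # collect the segments of several crawls under one parent directory
  # (assuming the timestamped names do not collide; plain cp works in
  #  local mode)
  hadoop fs -cp crawl_b/segments/* crawl_a/segments/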

~Sebastian

On 2/2/23 01:54, Kamil Mroczek wrote:
> Hi,
>
> I am testing how merging crawls works and found this script
> https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.
>
> I was wondering if this script is advisable to use? I plan to use it for
> crawls of non-overlapping urls.
>
> I am wary of using it since it is located under "Archive & Legacy" on the
> wiki. But after running some tests it seems to function correctly. I only
> had to remove the merge command ($nutch_dir/nutch merge $index_dir
> $new_indexes) since that is not a command anymore.
>
> I am not necessarily looking for a list of potential issues (if the list is
> long), just trying to understand why it might be under the archive.
>
> Kamil
