Thanks for the info, Sebastian.

Re: Why do you want to merge the data structures?

To help inform my crawl strategy, I am trying to see what is possible,
and it feels like the ability to run concurrent crawls might get around
any limitations in the software. I am currently seeding a set of domains
to act as a foundation for my crawling, and I am performing more targeted
crawls (by domain). As I discover more domains I want to crawl, I want to
see whether I can kick off a new crawler while another one is in progress
and then merge the two later on. I expect that once I have a solid
foundation, I will probably only have a single crawler running on a
single DB.
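
For reference, the merge step I have in mind would look roughly like the
following (a sketch only; the crawl directory names are hypothetical and
I haven't verified the optional flags):

  # Merge two disjoint CrawlDbs from concurrent crawls into a new one;
  # the input CrawlDbs are left untouched.
  bin/nutch mergedb crawl_merged/crawldb crawl_a/crawldb crawl_b/crawldb

  # Merge the segments of both crawls into a single output segment.
  bin/nutch mergesegs crawl_merged/segments -dir crawl_a/segments -dir crawl_b/segments

  # Merge the LinkDbs as well, if I end up maintaining one.
  bin/nutch mergelinkdb crawl_merged/linkdb crawl_a/linkdb crawl_b/linkdb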



On Thu, Feb 2, 2023 at 4:09 AM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Kamil,
>
>  > I was wondering if this script is advisable to use?
>
> I haven't tried the script itself but some of the underlying commands
> - mergedb, etc.
>
>  > merge command ($nutch_dir/nutch merge $index_dir $new_indexes)
>
> Of course, some of the commands are obsolete. A long time ago, Nutch
> used Lucene index shards directly. Now the management of indexes
> (including merging of shards) is delegated to Solr or Elasticsearch.
>
>
>  > I plan to use it for crawls of non-overlapping urls.
>
> ... just a few thoughts about this particular use case:
>
> Why do you want to merge the data structures?
>
> - if they're disjoint there is no need for it
> - all operations (CrawlDb: generate, update, etc.)
>    are much faster on smaller structures
>
> If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
> However, it might be that the command-line tool expects only a single
> CrawlDb or segment.
> - we could extend the command-line params
> - or just copy the sequence files into one single path
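>
> For example, merging two disjoint CrawlDbs by copying their part files
> into one directory might look roughly like this (hypothetical paths;
> the copied part files need distinct names so they don't collide):
>
>    hadoop fs -mkdir -p crawldb_merged/current
>    hadoop fs -cp crawldb_a/current/part-r-00000 crawldb_merged/current/part-r-00000
>    hadoop fs -cp crawldb_b/current/part-r-00000 crawldb_merged/current/part-r-00001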
>
> ~Sebastian
>
> On 2/2/23 01:54, Kamil Mroczek wrote:
> > Hi,
> >
> > I am testing how merging crawls works and found this script
> > https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.
> >
> > I was wondering if this script is advisable to use? I plan to use it for
> > crawls of non-overlapping urls.
> >
> > I am wary of using it since it is located under "Archive & Legacy" on the
> > wiki. But after running some tests it seems to function correctly. I only
> > had to remove the merge command ($nutch_dir/nutch merge $index_dir
> > $new_indexes) since that is no longer a command.
> >
> > I am not necessarily looking for a list of potential issues (if the
> > list is long), just trying to understand why it might be under the
> > archive.
> >
> > Kamil
> >
>
