Thanks for the info, Sebastian.

Re: Why do you want to merge the data structures?
To help inform my crawl strategy, I am trying to see what is possible, and it feels like having the ability to run concurrent crawls might get around any limitations in the software. I am currently seeding a set of domains to act as a foundation for my crawling, and I am performing more targeted crawls (by domain). As I discover more domains I want to crawl, I want to see if I can kick off a new crawler while another one is in progress and then merge the two later on. I expect that once I have a solid foundation I will probably only have a single crawler running on a single DB.

On Thu, Feb 2, 2023 at 4:09 AM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

> Hi Kamil,
>
> > I was wondering if this script is advisable to use?
>
> I haven't tried the script itself, but some of the underlying commands
> - mergedb, etc.
>
> > merge command ($nutch_dir/nutch merge $index_dir $new_indexes)
>
> Of course, some of the commands are obsolete. A long time ago, Nutch
> used Lucene index shards directly. Now the management of indexes
> (including merging of shards) is delegated to Solr or Elasticsearch.
>
> > I plan to use it for crawls of non-overlapping urls.
>
> ... just a few thoughts about this particular use case:
>
> Why do you want to merge the data structures?
>
> - if they're disjoint, there is no need for it
> - all operations (CrawlDb: generate, update, etc.)
>   are much faster on smaller structures
>
> If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
> However, it might be that the command-line tool expects only a single
> CrawlDb or segment.
> - we could extend the command-line params
> - or just copy the sequence files into one single path
>
> ~Sebastian
>
> On 2/2/23 01:54, Kamil Mroczek wrote:
> > Hi,
> >
> > I am testing how merging crawls works and found this script:
> > https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.
> >
> > I was wondering if this script is advisable to use? I plan to use it
> > for crawls of non-overlapping urls.
> >
> > I am wary of using it since it is located under "Archive & Legacy" on
> > the wiki. But after running some tests it seems to function correctly.
> > I only had to remove the merge command ($nutch_dir/nutch merge
> > $index_dir $new_indexes) since that is not a command anymore.
> >
> > I am not necessarily looking for a list of potential issues (if the
> > list is long), just trying to understand why it might be under the
> > archive.
> >
> > Kamil
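For concreteness, here is a rough sketch of how I picture combining two disjoint crawls later on. The paths (crawl-a/, crawl-b/, crawl-merged/) are made up, and the mkdir lines just stand in for the output of two real crawl cycles:

```shell
#!/usr/bin/env bash
set -e

# Stand-ins for two finished, disjoint crawls (in reality these would be
# produced by generate/fetch/parse cycles run by two independent crawlers).
mkdir -p crawl-a/segments/20230201000000 crawl-b/segments/20230202000000

# Segments are self-contained, timestamp-named directories, so for disjoint
# crawls they can simply be collected under one segments/ path, and the later
# jobs (updatedb, invertlinks, index) pointed at that single path:
mkdir -p crawl-merged/segments
cp -r crawl-a/segments/. crawl-merged/segments/
cp -r crawl-b/segments/. crawl-merged/segments/
ls crawl-merged/segments

# For the CrawlDbs themselves a plain copy would not work (the part files of
# two CrawlDbs share the same names), so the dedicated merger would be used:
#   bin/nutch mergedb crawl-merged/crawldb crawl-a/crawldb crawl-b/crawldb
```

I understand this replaces the obsolete index-merge step from the wiki script, with index management itself left to Solr/Elasticsearch as you said.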