On your first point, I would ask: why are you forcing yourself to use a single CDX file? For quite some time, OWB has supported a wildcard-like syntax to load one or more CDX files for each collection/endpoint. It is certainly helpful to have fewer, bigger CDX files rather than a lot of small ones. However, when file system or other limitations arise, there is no harm in keeping more than one relatively big CDX file in a directory and loading them all for lookup.

Incremental merging is a fairly fast and efficient operation, linear at O(N+M), provided the incremental file is sorted before merging and the -m flag is passed to the sort command to tell it that the input files are already sorted.

I am not too sure about ZipNumCluster, but I have a vague idea that it can be used in cases where CDX files grow beyond some limit.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529

On Mon, Dec 12, 2016 at 1:53 PM, Darren Hardy <[email protected]> wrote:

> We have a ~20TB (and growing) installation of cdx-server here at Stanford
> Library. We're running into some scaling problems that we'd like some
> feedback on.
>
> 1. What is the best configuration for large (>100GB) CDX files? We're
>    currently using a single CDX file for our instance and each time we want to
>    add more content, we have to sort/merge the whole thing again. Is there
>    another configuration that supports incremental indexing, like
>    WatchedCDXSource?
> 2. Does anyone have some rough performance characteristics for the CDX
>    generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
> 3. What are other institutions using for their filesystem storage of WARC
>    files? And, how are you able to grow that over time? We are limited in our
>    options since our NetApp storage is shared by many stakeholders here. So,
>    we're looking at having to deal with multiple NFS mounts.
>
> --
> You received this message because you are subscribed to the Google Groups
> "openwayback-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
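
P.S. To make the incremental merge mentioned above concrete, here is a minimal shell sketch. The file names and the two-field records are made up for illustration (real CDX lines carry more fields), and setting LC_ALL=C assumes the index is expected in plain byte order. Treat it as a rough outline, not a description of anyone's production setup.

```shell
#!/bin/sh
set -eu

# Assumption: the CDX files are kept in plain byte order, so force the C locale.
export LC_ALL=C

# Scratch directory with hypothetical sample data.
workdir=$(mktemp -d)

# Existing index, already sorted (simplified records: urlkey and timestamp only).
printf '%s\n' \
  'com,example)/ 20150101000000' \
  'com,example)/ 20160101000000' \
  'org,example)/ 20150601000000' > "$workdir/existing.cdx"

# Freshly generated increment, not yet sorted.
printf '%s\n' \
  'org,example)/ 20161201000000' \
  'com,example)/ 20151001000000' > "$workdir/increment.unsorted"

# Step 1: sort only the (small) increment.
sort "$workdir/increment.unsorted" > "$workdir/increment.cdx"

# Step 2: merge the two sorted files. With -m, sort does not re-sort its
# inputs; it only interleaves them in a single pass.
sort -m "$workdir/existing.cdx" "$workdir/increment.cdx" > "$workdir/merged.cdx"

# Sanity check: sort -c exits non-zero if the merged file is out of order.
sort -c "$workdir/merged.cdx"

cat "$workdir/merged.cdx"
```

Only the increment is ever fully sorted here; the large existing file is just read through once during the merge, which is where the linear cost comes from.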
