So you recommend using WatchedCDXSource with a collection of CDX files rather than a single CDX file. Is there a practical limit to the number of CDX files the server can handle? We have dozens of collections. We're also concerned about how well the server's reading of these CDX files scales: if we have a hundred files, is that too many? My understanding, from looking at the FlatFile class, is that the server does a binary search of the CDX file to locate the information it needs.

Thanks,
-Darren

On Monday, December 12, 2016 at 11:44:12 AM UTC-8, Sawood Alam wrote:
> As far as point number one is concerned, I would ask: why are you forcing yourself to a single CDX file? For quite some time OWB has supported a wildcard-like syntax to load one or more CDX files for each collection/endpoint. It is certainly helpful to have fewer, bigger CDX files rather than a lot of small CDX files. However, when file system or other limitations arise, there is no harm in keeping more than one relatively big CDX file in a directory and loading them all for lookup.
>
> Incremental merging is a fairly fast and efficient [linear, O(N+M)] operation if the incremental file is also sorted before merging and the -m flag is passed to the sort command to tell it that the input files are already sorted.
>
> I am not too sure about ZipNumCluster, but I have a vague idea that it can be used in cases where CDX files grow beyond some limit.
>
> Best,
>
> --
> Sawood Alam
> Department of Computer Science
> Old Dominion University
> Norfolk VA 23529
>
> On Mon, Dec 12, 2016 at 1:53 PM, Darren Hardy <[email protected]> wrote:
>
>> We have a ~20TB (and growing) installation of cdx-server here at Stanford Library. We're running into some scaling problems that we'd like some feedback on.
>>
>> 1. What is the best configuration for large (>100GB) CDX files? We're currently using a single CDX file for our instance and each time we want to add more content, we have to sort/merge the whole thing again. Is there another configuration that supports incremental indexing, like WatchedCDXSource?
>>
>> 2. Does anyone have some rough performance characteristics for the CDX generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>>
>> 3. What are other institutions using for their filesystem storage of WARC files? And, how are you able to grow that over time? We are limited in our options since our NetApp storage is shared by many stakeholders here. So, we're looking at having to deal with multiple NFS mounts.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
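
On the incremental merge point in the quoted reply (a linear O(N+M) merge when both inputs are already sorted, which is what the -m flag to sort relies on), here is a minimal sketch of the same idea in Python. It is not OpenWayback code. The function name and the two-field sample lines are made up for illustration, and it assumes plain ASCII lines, so that Python's string ordering matches a byte-wise sort.

```python
import heapq


def merge_sorted_cdx(existing_lines, new_lines):
    """Merge two already-sorted iterables of CDX lines in a single pass.

    heapq.merge trusts that each input is sorted and yields results lazily,
    so nothing is re-sorted: the cost is linear in the total number of lines.
    """
    return heapq.merge(existing_lines, new_lines)


# Hypothetical, heavily simplified CDX lines (URL key and timestamp only).
existing = [
    "com,example)/ 20160101000000",
    "org,archive)/ 20150101000000",
]
incremental = [
    "com,example)/about 20160202000000",
    "org,stanford)/ 20161212000000",
]

merged = list(merge_sorted_cdx(existing, incremental))
for line in merged:
    print(line)
```

With real files the same function can be given two open file objects instead of lists, so neither file has to fit in memory. Only the new, smaller file needs a full sort beforehand.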

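
On the binary search question at the top of the thread, the sketch below shows roughly how a lookup over a sorted flat file can work by bisecting on byte offsets. This is an illustration under stated assumptions, not the actual FlatFile implementation: the function name, the sample lines, and the prefix-match behavior are all hypothetical, and the file is assumed to be sorted byte-wise with newline-terminated lines.

```python
import os
import tempfile


def find_first_with_prefix(path, prefix):
    """Return the first line of a byte-sorted file that starts with prefix, or None."""
    key = prefix.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()          # skip the (possibly partial) line containing mid
            line = f.readline()   # first line that starts after offset mid
            if line and line < key:
                lo = mid + 1      # the match, if any, lies further into the file
            else:
                hi = mid
        f.seek(lo)
        if lo > 0:
            f.readline()          # the line starting at lo sorts before the key
        # Short forward scan to the first line that is >= key.
        for line in iter(f.readline, b""):
            if line >= key:
                if line.startswith(key):
                    return line.decode("utf-8").rstrip("\n")
                return None
    return None


# Hypothetical sample data written to a temporary file for demonstration.
sample_lines = [
    "com,example)/ 20160101000000",
    "com,example)/about 20160202000000",
    "org,archive)/ 20150101000000",
    "org,stanford)/ 20161212000000",
]
with tempfile.NamedTemporaryFile("w", suffix=".cdx", delete=False) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

print(find_first_with_prefix(sample_path, "org,archive)/"))
```

Each lookup touches only on the order of log2(file size) positions in the file, so the cost per file grows slowly with file size. With many CDX files, the per-query cost in a scheme like this would be roughly that amount multiplied by the number of files searched. How OpenWayback actually behaves with a hundred files is something the list would need to confirm.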