We have a ~20TB (and growing) installation of cdx-server here at Stanford 
Library. We're running into some scaling problems that we'd like some 
feedback on.

   1. What is the best configuration for large (>100GB) CDX files? We're 
      currently using a single CDX file for our instance and each time we 
      want to add more content, we have to sort/merge the whole thing again. 
      Is there another configuration that supports incremental indexing, 
      like WatchedCDXSource?

   2. Does anyone have some rough performance characteristics for the CDX 
      generation code (bin/cdx-indexer)? Is it CPU or IO intensive?

   3. What are other institutions using for their filesystem storage of WARC 
      files? And, how are you able to grow that over time? We are limited in 
      our options since our NetApp storage is shared by many stakeholders 
      here. So, we're looking at having to deal with multiple NFS mounts.
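
To make the cost in question 1 concrete: the cheapest version of our update step is a streaming merge of the existing sorted CDX with a small, already-sorted CDX for the new WARCs, rather than a full re-sort. Below is a minimal sketch of that idea, assuming both inputs are sorted in byte order (as with LC_ALL=C) and that any " CDX ..." header sits on the first line. The function names are hypothetical and this is not OpenWayback code. It still rewrites the whole output file on every update, which is why we are asking about a truly incremental configuration.

```python
import heapq
import itertools
from typing import Iterable, Iterator, Optional, Tuple


def _split_header(lines: Iterable[bytes]) -> Tuple[Optional[bytes], Iterator[bytes]]:
    """Separate an optional leading ' CDX ...' header from the data lines."""
    it = iter(lines)
    first = next(it, None)
    if first is None:
        return None, it
    if first.startswith(b" CDX"):
        return first, it
    # No header: put the first data line back in front of the rest.
    return None, itertools.chain([first], it)


def merge_sorted_cdx(*sources: Iterable[bytes]) -> Iterator[bytes]:
    """Merge already-sorted CDX line streams in a single pass.

    Lines are compared as raw bytes, so the inputs must be byte-order sorted.
    The first header found is emitted once; other headers are dropped.
    """
    header = None
    bodies = []
    for src in sources:
        found, body = _split_header(src)
        if header is None:
            header = found
        bodies.append(body)
    if header is not None:
        yield header
    # heapq.merge reads its inputs lazily, so memory use stays small.
    yield from heapq.merge(*bodies)


def merge_cdx_files(existing_path: str, new_path: str, out_path: str) -> None:
    """Hypothetical helper: merge two sorted CDX files into a third."""
    with open(existing_path, "rb") as old, open(new_path, "rb") as new, \
            open(out_path, "wb") as out:
        out.writelines(merge_sorted_cdx(old, new))
```

Binary mode matters here: comparing bytes keeps the ordering consistent with a byte-order sort, which a locale-aware text comparison might not.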
   
