Hi Darren,

On Tuesday, December 13, 2016 at 5:53:28 AM UTC+11, Darren Hardy wrote:
>
> 1. What is the best configuration for large (>100GB) CDX files? We're
> currently using a single CDX file for our instance and each time we
> want to add more content, we have to sort/merge the whole thing again.
> Is there another configuration that supports incremental indexing,
> like WatchedCDXSource?
>

One strategy is the following. Keep several CDX files of increasing size. Perhaps:
Layer 0 - today's index
Layer 1 - this month's index
Layer 2 - this year's index
Layer 3 - the rest of the index

At the end of the day, rewrite this month's index by merging today's index into it. At the end of the month, rewrite this year's index by merging last month's in. At the end of the year, merge the year's index into the full index. This amortises the amount of updating that has to be done, at the expense of needing to do 4 lookups for every query. To better cope with sudden spikes in input you could base the layers on file size rather than time.

The data structure I've just described is similar to a log-structured merge (LSM) tree. There are a number of general-purpose databases which implement this sort of structure. We (National Library of Australia) use RocksDB for this purpose and our CDX server is here:

https://github.com/nla/outbackcdx

Another option that occurred to me more recently, and which may be better than OutbackCDX when indexes are too large for a single machine, is Cassandra. It has some similarities with RocksDB (including LSM and compression support) but is more focused on being a distributed database with sharding etc. I've heard about people experimenting with storing their CDX index in Solr too, which uses a somewhat similar merging process for its segment files.

> 2. Does anyone have some rough performance characteristics for the CDX
> generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>

It's both CPU and IO intensive for large gzipped collections. I don't have any figures, but in my experience the CPU bottleneck is typically in the gzip code decompressing the WARC files. But if you throw say 32 cores at it (by doing multiple WARCs in parallel), you might find IO becomes the bottleneck, depending on what your storage setup is. For example, we routinely saturate a 10gbit network link when reading WARCs for indexing.

> 3. What are other institutions using for their filesystem storage of
> WARC files?
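To make the layered scheme above concrete: since CDX files are plain text sorted line-by-line, each periodic rewrite is just a streaming k-way merge of already-sorted files. Here's a minimal Python sketch of that merge step (it assumes plain sorted CDX files and ignores any CDX header lines for brevity; the filenames are only illustrative):

```python
import heapq


def merge_cdx(input_paths, output_path):
    """Merge several sorted CDX files into one, preserving sort order.

    heapq.merge lazily combines already-sorted iterables, so nothing
    needs to be held in memory beyond one line per input file.
    """
    files = [open(path, "r", encoding="utf-8") for path in input_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            for line in heapq.merge(*files):
                out.write(line)
    finally:
        for f in files:
            f.close()
```

At the end of the day you would run something like `merge_cdx(["month.cdx", "today.cdx"], "month.new.cdx")` and then atomically rename the result into place, so lookups never see a half-written index.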
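As for doing multiple WARCs in parallel, one simple approach is to run one indexer process per WARC with a bounded worker pool, and tune the pool size until you hit your CPU or IO ceiling. A rough sketch (the indexer command is passed in; using `["bin/cdx-indexer"]` as the prefix is an assumption about your particular setup):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def index_warcs(warc_paths, indexer_cmd, workers=8):
    """Run one indexer subprocess per WARC, `workers` at a time.

    indexer_cmd is a command prefix (e.g. ["bin/cdx-indexer"]); each
    WARC path is appended as the final argument. Threads suffice here
    because the real work happens in the child processes. Returns the
    list of WARC paths whose indexer exited non-zero.
    """
    def run_one(path):
        return subprocess.run(indexer_cmd + [path]).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run_one, warc_paths))
    return [p for p, rc in zip(warc_paths, codes) if rc != 0]
```

The per-WARC CDX outputs can then be combined with the same sort/merge step as above.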
> And, how are you able to grow that over time? We are limited in our
> options since our NetApp storage is shared by many stakeholders here.
> So, we're looking at having to deal with multiple NFS mounts.
>

Hardware-wise, for bulk WARC storage we currently use an Isilon NAS, which is used across the institution for all sorts of bulk storage purposes. It can conveniently expose petabytes of filesystem as a single NFS mount. Just about any kind of spinning-disk storage should be workable, though. I know others are using local storage on a large cluster of servers, accessed over HTTP or HDFS.

We have two filesystems: a 'working' filesystem for new data and active crawls, and a 'preservation' filesystem for permanent bulk storage. We store the CDX index on SSDs installed directly in a server.

We keep track of the current location of each WARC in an SQL database, and Wayback accesses the WARC files over HTTP rather than reading the filesystem directly. This allows us to move things from filesystem to filesystem, or to different types of storage, without interrupting service and without Wayback having to worry about the details of where anything is stored.

Hope that helps,
Alex

--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
