Hi Darren,

On Tuesday, December 13, 2016 at 5:53:28 AM UTC+11, Darren Hardy wrote:
>
>
>    1. 
>    
>    What is the best configuration for large (>100GB) CDX files? We're 
>    currently using a single CDX file for our instance and each time we want 
> to 
>    add more content, we have to sort/merge the whole thing again. Is there 
>    another configuration that supports incremental indexing, like 
>    WatchedCDXSource?
>    
>
One strategy is the following. Keep several CDX files of increasing size. 
Perhaps:

Layer 0 - today's index
Layer 1 - this month's index
Layer 2 - this year's index
Layer 3 - the rest of the index

At the end of the day, rewrite this month's index by merging today's index 
into it. At the end of the month, rewrite this year's index by merging last 
month's index into it. At the end of the year, merge the year's index into 
the full index.

This amortizes the amount of updating that has to be done at the expense of 
needing to do 4 lookups for every query. To better cope with spikes of 
suddenly increased input you could base the layers on file size rather than 
time.
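
A minimal sketch of the layered scheme in Python (the file names and the 
linear scan are illustrative; a real CDX server would binary-search each 
sorted file rather than reading it whole):

```python
import heapq
from pathlib import Path

# Illustrative layer files (layer 0 = today, layer 3 = the rest).
LAYERS = [Path(f"layer{i}.cdx") for i in range(4)]

def lookup(url_key):
    """Query every layer and merge the results. A linear scan keeps the
    sketch short; a real server would binary-search each sorted file."""
    matches = []
    for layer in LAYERS:
        if layer.exists():
            with layer.open() as f:
                matches.extend(l for l in f if l.startswith(url_key + " "))
    return sorted(matches)  # CDX lines sort by urlkey then timestamp

def merge_into(src, dst):
    """End-of-period rewrite: fold a smaller layer into the next one.
    Only these two files are rewritten, never the whole index."""
    with src.open() as a, dst.open() as b:
        merged = list(heapq.merge(a, b))  # both inputs are already sorted
    dst.write_text("".join(merged))
    src.write_text("")  # the small layer starts over empty
```

Note that each lookup hits every layer, which is the amortization 
trade-off described above.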

The data structure I've just described is similar to a log-structured merge 
(LSM) tree. There are a number of general purpose databases which implement 
this sort of structure. We (National Library of Australia) use RocksDB for 
this purpose and our CDX server is here: https://github.com/nla/outbackcdx

Another option that occurred to me more recently, which may be better than 
OutbackCDX when indexes grow too large for a single machine, is Cassandra. 
It has some similarities with RocksDB (including LSM storage and 
compression support) but is more focused on being a distributed database 
with sharding etc.

I've also heard of people experimenting with storing their CDX index in 
Solr, which uses a somewhat similar merging process for its segment files.


>    1. 
>    
>    Does anyone have some rough performance characteristics for the CDX 
>    generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>    
>
It's both CPU and IO intensive for large gzipped collections. I don't have 
any figures but in my experience typically the CPU bottleneck is in the 
gzip code decompressing the WARC files. But if you throw say 32 cores at it 
(by doing multiple WARCs in parallel), you might find IO becomes the 
bottleneck, depending on what your storage setup is. For example we 
routinely saturate a 10gbit network link when reading WARCs for indexing.
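
As a rough sketch, parallelizing over WARCs could look like the following 
Python wrapper (the `bin/cdx-indexer` path, the output naming convention, 
and the worker count are assumptions to adapt to your own setup):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

INDEXER = "bin/cdx-indexer"  # assumed path to the OpenWayback indexer
WORKERS = 32                 # tune until IO, not CPU, is the bottleneck

def index_one(warc, indexer=INDEXER):
    """Index a single WARC: foo.warc.gz -> foo.cdx (naming is illustrative)."""
    out = Path(warc).with_suffix("").with_suffix(".cdx")
    with out.open("w") as f:
        subprocess.run([indexer, str(warc)], stdout=f, check=True)
    return out

def index_all(warcs, workers=WORKERS):
    """One indexer process per WARC, so each core runs its own gzip
    decompression; results come back as a list of .cdx paths."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_one, warcs))
```

The per-WARC outputs would then be sorted and merged into whichever index 
layer is appropriate.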


>    1. 
>    
>    What are other institutions using for their filesystem storage of WARC 
>    files? And, how are you able to grow that over time? We are limited in our 
>    options since our NetApp storage is shared by many stakeholders here. So, 
>    we're looking at having to deal with multiple NFS mounts.
>    
>
Hardware-wise for bulk WARC storage we currently use an Isilon NAS which is 
used across the institution for all sorts of bulk storage purposes. This 
can conveniently expose petabytes of filesystem as a single NFS mount. Just 
about any kind of spinning-disk storage should be workable though. I know 
others use local disks spread across a large cluster of servers, accessed 
over HTTP or HDFS.

We have two filesystems: a 'working' filesystem for new data and active 
crawls, and a 'preservation' filesystem for permanent bulk storage. We 
store the CDX index on SSD directly installed in a server. We keep track of 
the current location of each WARC in an SQL database, and Wayback accesses 
the WARC files over HTTP rather than reading the filesystem directly. This 
lets us move content from filesystem to filesystem, or to different types 
of storage, without interrupting service and without Wayback having to 
worry about where anything is stored.
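
As a rough illustration (this is not our actual schema; table and host 
names are made up), the location-tracking idea amounts to:

```python
import sqlite3

# Hypothetical table mapping each WARC filename to the HTTP base URL of
# whichever storage system currently holds it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE warc_location "
           "(filename TEXT PRIMARY KEY, base_url TEXT)")

def record_location(filename, base_url):
    db.execute("INSERT OR REPLACE INTO warc_location VALUES (?, ?)",
               (filename, base_url))

def resolve(filename):
    """Wayback asks where a WARC lives right now; migrating a WARC between
    filesystems is just an UPDATE here, with no service interruption."""
    row = db.execute("SELECT base_url FROM warc_location WHERE filename = ?",
                     (filename,)).fetchone()
    if row is None:
        raise KeyError(filename)
    return row[0] + "/" + filename

record_location("example-2016-001.warc.gz", "http://working-store.example/warcs")
# later, after migration to preservation storage:
record_location("example-2016-001.warc.gz", "http://preservation.example/warcs")
```

The point is the indirection: Wayback only ever sees the URL that 
`resolve` hands back.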

Hope that helps,

Alex

-- 
You received this message because you are subscribed to the Google Groups 
"openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
