Thanks for the feedback. I like the layered index approach.

So, you're using an SQL database rather than the flat file `path-index.txt` 
file format for mapping to WARC file locations? What's the scale in 
number/size of WARC files that you're dealing with? We have ~20k files 
totalling ~20TB right now.
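For reference, our flat-file mapping is just a tab-separated WARC-filename-to-path list; a minimal sketch of the lookup (the two-column layout matches our path-index.txt, the sample paths are made up):

```python
def load_path_index(lines):
    """Parse path-index.txt lines (WARC filename <TAB> path) into a dict."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, path = line.split("\t", 1)
        index[name] = path
    return index

# Hypothetical entries, one per WARC; ours spans ~20k files across NFS mounts.
sample = [
    "WEB-20161201-00001.warc.gz\t/mnt/nfs1/warcs/WEB-20161201-00001.warc.gz",
    "WEB-20161202-00002.warc.gz\t/mnt/nfs2/warcs/WEB-20161202-00002.warc.gz",
]
index = load_path_index(sample)
```

At ~20k files this dict fits trivially in memory, which is why I'm curious what the SQL database buys you.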

I'm a little confused about the server ecosystem. We're just running the 
OpenWayback `cdx-server.war`. Are people generally running this 
cdx-server.war in production settings? Or for scaling purposes, should we 
consider switching to another implementation that uses a database for the 
index?

We run our service from VMs and we haven't added SSD storage to our NetApp 
yet (it's relatively expensive). So we're stuck with relatively slow NFS 
mounts in the short term.

Thanks,
-Darren

On Monday, December 12, 2016 at 4:01:39 PM UTC-8, Alex Osborne wrote:
>
> Hi Darren,
>
> On Tuesday, December 13, 2016 at 5:53:28 AM UTC+11, Darren Hardy wrote:
>>
>>
>> 1. What is the best configuration for large (>100GB) CDX files? We're 
>>    currently using a single CDX file for our instance and each time we 
>>    want to add more content, we have to sort/merge the whole thing again. 
>>    Is there another configuration that supports incremental indexing, 
>>    like WatchedCDXSource?
>>
> One strategy is the following. Keep several CDX files of increasing size. 
> Perhaps:
>
> Layer 0 - today's index
> Layer 1 - this month's index
> Layer 2 - this year's index
> Layer 3 - the rest of the index
>
> At the end of the day rewrite this month's index by merging today's index 
> into it. At the end of the month rewrite this year's index by merging last 
> month's. At the end of the year merge the year's index into the full index.
>
> This amortizes the amount of updating that has to be done at the expense 
> of needing to do 4 lookups for every query. To better cope with spikes of 
> suddenly increased input you could base the layers on file size rather than 
> time.
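A minimal sketch of the promotion step described above (pure Python; the CDX lines and layer contents are made up): because every layer is kept sorted, merging today's index into this month's is a streaming merge, and the same step promotes month into year and year into the full index.

```python
import heapq

def merge_layers(day_lines, month_lines):
    """Merge today's sorted CDX lines into this month's layer.

    heapq.merge streams two already-sorted inputs into one sorted output,
    so the rewrite never needs to hold more than a line per input in memory.
    A query still has to consult all four layers.
    """
    return list(heapq.merge(day_lines, month_lines))

# Hypothetical SURT-sorted CDX lines (fields elided).
day = ["com,example)/a 20161212 ...", "org,site)/x 20161212 ..."]
month = ["com,example)/b 20161201 ...", "org,site)/w 20161205 ..."]
merged = merge_layers(day, month)
```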
>
> The data structure I've just described is similar to a log-structured 
> merge (LSM) tree. There are a number of general purpose databases which 
> implement this sort of structure. We (National Library of Australia) use 
> RocksDB for this purpose and our CDX server is here: 
> https://github.com/nla/outbackcdx
>
> Another option that occurred to me more recently, which may be better than 
> OutbackCDX when indexes are too large for a single machine, is Cassandra. 
> It has some similarities with RocksDB (including LSM and compression 
> support) but is more focused on being a distributed database with sharding 
> etc.
>
> I've heard of people experimenting with storing their CDX index in Solr 
> too, which uses a somewhat similar merging process for its segment files.
>
>
>> 2. Does anyone have some rough performance characteristics for the CDX 
>>    generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>>
> It's both CPU and IO intensive for large gzipped collections. I don't 
> have any figures but in my experience typically the CPU bottleneck is in 
> the gzip code decompressing the WARC files. But if you throw say 32 cores 
> at it (by doing multiple WARCs in parallel), you might find IO becomes the 
> bottleneck, depending on what your storage setup is. For example we 
> routinely saturate a 10gbit network link when reading WARCs for indexing.
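Parallelising over WARCs is straightforward with a process pool; a sketch under the assumption that OpenWayback's bin/cdx-indexer is on the PATH and takes a WARC path plus an output CDX path (paths here are hypothetical):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def indexer_command(warc_path):
    """Build one cdx-indexer invocation (output file name is an assumption)."""
    return ["cdx-indexer", warc_path, warc_path + ".cdx"]

def index_warc(warc_path):
    """Index a single WARC; gzip decompression makes each run CPU-heavy."""
    subprocess.run(indexer_command(warc_path), check=True)
    return warc_path + ".cdx"

def index_all(warc_paths, workers=32):
    """One process per WARC, up to `workers` at a time, until IO saturates."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_warc, warc_paths))
```

With enough workers the bottleneck shifts from gzip CPU to storage bandwidth, as described above.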
>
>
>> 3. What are other institutions using for their filesystem storage of 
>>    WARC files? And how are you able to grow that over time? We are 
>>    limited in our options since our NetApp storage is shared by many 
>>    stakeholders here, so we're looking at having to deal with multiple 
>>    NFS mounts.
>>
> Hardware-wise for bulk WARC storage we currently use an Isilon NAS which 
> is used across the institution for all sorts of bulk storage purposes. This 
> can conveniently expose petabytes of filesystem as a single NFS mount. Just 
> about any kind of spinning disk storage should be workable though. I know 
> others are using local storage spread across a large cluster of servers, 
> accessed over HTTP or HDFS.
>
> We have two filesystems: a 'working' filesystem for new data and active 
> crawls, and a 'preservation' filesystem for permanent bulk storage. We 
> store the CDX index on SSD installed directly in a server. We keep track of 
> the current location of each WARC in an SQL database, and Wayback fetches 
> the WARC files over HTTP rather than accessing the filesystem directly. 
> This lets us move things from filesystem to filesystem, or to different 
> types of storage, without interrupting service and without Wayback having 
> to worry about where anything is stored.
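A minimal sketch of that indirection (table layout, names, and URLs are all hypothetical): the CDX index stores only WARC filenames, and a resolver maps each filename to its current HTTP location, so moving a WARC between storage tiers is a one-row update.

```python
import sqlite3

# Hypothetical schema: one row per WARC giving its current storage location.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warc_location (filename TEXT PRIMARY KEY, base_url TEXT)")
conn.execute("INSERT INTO warc_location VALUES (?, ?)",
             ("WEB-20161201-00001.warc.gz", "http://working-store.example.org/warcs"))

def resolve(filename):
    """Map a WARC filename from the CDX index to its current HTTP URL."""
    row = conn.execute(
        "SELECT base_url FROM warc_location WHERE filename = ?", (filename,)
    ).fetchone()
    if row is None:
        raise KeyError(filename)
    return row[0] + "/" + filename

# Moving the WARC to preservation storage would be a single UPDATE of
# base_url; Wayback keeps fetching over HTTP and never sees the move.
url = resolve("WEB-20161201-00001.warc.gz")
```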
>
> Hope that helps,
>
> Alex
>

-- 
You received this message because you are subscribed to the Google Groups 
"openwayback-dev" group.