As far as point number one is concerned, I would ask: why are you restricting
yourself to a single CDX file? For quite some time, OWB has supported a
wildcard-like syntax to load one or more CDX files for each
collection/endpoint. It is certainly helpful to have fewer, bigger CDX
files rather than a lot of small ones. However, when file system or other
limitations arise, there is no harm in keeping more than one relatively
big CDX file in a directory and loading them all for lookup.

Incremental merging is a fairly fast and efficient [linear, O(N+M)] operation
if the incremental file is also sorted before merging and the -m flag is passed
to the sort command to tell it that the input files are already sorted.
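
To illustrate why the merge step is linear, here is a minimal Python sketch of
what a merge of two already-sorted inputs does. It is not OWB code and not the
sort command itself: the function name merge_sorted_cdx and the shortened CDX
lines are made up for illustration, and plain string comparison stands in for
the byte-wise ordering normally used for CDX files.

```python
import heapq


def merge_sorted_cdx(existing_lines, incremental_lines):
    """Merge two already-sorted sequences of CDX lines in one pass.

    heapq.merge walks both inputs once, so the cost is O(N+M); it never
    re-sorts the large existing index. (Hypothetical helper, for
    illustration only.)
    """
    return list(heapq.merge(existing_lines, incremental_lines))


# Simplified, made-up CDX lines: only the URL key and timestamp are shown.
existing = [
    "com,example)/ 20150101000000",
    "com,example)/about 20150301000000",
    "org,example)/ 20150201000000",
]

# The new (small) batch is sorted on its own first; that is the cheap part.
incremental = sorted([
    "net,example)/ 20161205000000",
    "com,example)/ 20161201000000",
])

merged = merge_sorted_cdx(existing, incremental)
for line in merged:
    print(line)
```

Only the small incremental batch is ever sorted from scratch; the big index is
just read through once while the new lines are interleaved.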

I am not too sure about ZipNumCluster, but I have a vague idea that
it can be used in cases where CDX files grow beyond some limit.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529


On Mon, Dec 12, 2016 at 1:53 PM, Darren Hardy <[email protected]>
wrote:

> We have a ~20TB (and growing) installation of cdx-server here at Stanford
> Library. We're running into some scaling problems that we'd like some
> feedback on.
>
> 1. What is the best configuration for large (>100GB) CDX files? We're
>    currently using a single CDX file for our instance and each time we want to
>    add more content, we have to sort/merge the whole thing again. Is there
>    another configuration that supports incremental indexing, like
>    WatchedCDXSource?
>
> 2. Does anyone have some rough performance characteristics for the CDX
>    generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>
> 3. What are other institutions using for their filesystem storage of WARC
>    files? And, how are you able to grow that over time? We are limited in our
>    options since our NetApp storage is shared by many stakeholders here. So,
>    we're looking at having to deal with multiple NFS mounts.
>
> --
> You received this message because you are subscribed to the Google Groups
> "openwayback-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
