Based on my understanding, records are looked up in each CDX file using
binary search. However, if you have a lot of CDX files to perform the
lookup from, the file list is iterated over linearly and a binary search
is performed in each file. This is why fewer, larger CDX files are
preferred over many small ones, but being religious about one gigantic
CDX file per collection would be overkill in my opinion.
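
To make the cost model above concrete, here is a minimal Python sketch of
the pattern: a linear loop over the files with a binary search inside each
one. It is only an illustration, not OpenWayback's actual code (which is
Java). The function names, the prefix-match behaviour, and the assumption
that each file is newline-terminated and sorted in byte order are all mine.

```python
import os


def line_at_or_after(f, pos):
    """Return the first full line starting at byte offset >= pos, or None at EOF."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume up to and including the newline before our line
    line = f.readline()
    return line.rstrip(b"\n") if line else None


def search_file(path, key):
    """Binary-search one sorted file; return every line that starts with key (bytes)."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Find the smallest offset whose "line at or after" is >= key.
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_at_or_after(f, mid)
            if line is not None and line < key:
                lo = mid + 1
            else:
                hi = mid
        # Lines sharing the key prefix are contiguous in a sorted file.
        matches = []
        line = line_at_or_after(f, lo)
        while line is not None and line.startswith(key):
            matches.append(line.decode("utf-8"))
            raw = f.readline()
            line = raw.rstrip(b"\n") if raw else None
        return matches


def lookup(paths, key):
    """Hypothetical multi-file lookup: O(number of files) binary searches."""
    results = []
    for path in paths:  # linear pass over the file list
        results.extend(search_file(path, key.encode("utf-8")))
    return results
```

With F files of roughly n lines each, one lookup costs on the order of
F * log(n) line reads, so going from one file to a hundred multiplies the
per-lookup work by about a hundred, while growing a single file a
hundredfold adds only a few extra steps to its search.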

Another reason why it is not recommended to read from a lot of small CDX
files is that the server needs to keep references to all those CDX files
in memory. This might not be an issue initially, but if the number of CDX
files is really large then a fair amount of memory would be needed just to
keep the file references.
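
For the incremental merge mentioned earlier in the thread (quoted below),
the following Python sketch shows why merging already-sorted inputs is
linear, O(N+M): each input is read once, front to back, and nothing is
re-sorted. It is roughly what `sort -m` does, not a replacement for it.
The function name is hypothetical, and the sketch assumes every input is
newline-terminated and sorted in plain byte order.

```python
import heapq


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted files into one sorted file in a single pass."""
    files = [open(p, "rb") for p in input_paths]
    try:
        with open(output_path, "wb") as out:
            # heapq.merge lazily yields the smallest pending line each time,
            # so memory use stays small even for very large inputs.
            merged = heapq.merge(*files, key=lambda line: line.rstrip(b"\n"))
            out.writelines(merged)
    finally:
        for f in files:
            f.close()
```

For example, `merge_sorted_files(["big.cdx", "increment.cdx"], "merged.cdx")`
would combine an existing index with a freshly sorted increment; those file
names are made up for illustration.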

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529


On Mon, Dec 12, 2016 at 4:13 PM, Darren Hardy <[email protected]>
wrote:

> So, you recommend we use the WatchedCDXSource for a collection of CDX
> files, rather than a single CDX file. Is there a practical limit to the
> number of CDX files the server can handle? We have dozens of collections.
> Also, we're concerned about the scalability of how the server reads these
> CDX files -- if we have a hundred, is that too much? My understanding is
> that the server does a binary search of the CDX file to locate the
> information it needs -- based on looking at the FlatFile class.
>
> Thanks,
> -Darren
>
>
>
> On Monday, December 12, 2016 at 11:44:12 AM UTC-8, Sawood Alam wrote:
>>
>> As far as point number one is concerned, I would ask, why are you forcing
>> yourself to a single CDX file? For quite some time OWB has supported a
>> wildcard-like syntax to load one or more CDX files for each
>> collection/endpoint. It is certainly helpful to have fewer, bigger CDX
>> files than a lot of small CDX files. However, when file system or other
>> limitations arise, there is no harm in keeping more than one relatively
>> big CDX file in a directory and loading them all for lookup.
>>
>> Incremental merging is a fairly fast and efficient [linear O(N+M)]
>> operation if the incremental file is also sorted before merging and the -m
>> flag is passed to the sort command to tell it that the input files are
>> already sorted.
>>
>> I am not too sure about the ZipNumCluster, but I have some vague idea
>> that it can be used in cases where CDX files grow beyond some limits.
>>
>> Best,
>>
>> --
>> Sawood Alam
>> Department of Computer Science
>> Old Dominion University
>> Norfolk VA 23529
>>
>>
>> On Mon, Dec 12, 2016 at 1:53 PM, Darren Hardy <[email protected]>
>> wrote:
>>
>>> We have a ~20TB (and growing) installation of cdx-server here at
>>> Stanford Library. We're running into some scaling problems that we'd like
>>> some feedback on.
>>>
>>>    1. What is the best configuration for large (>100GB) CDX files? We're
>>>       currently using a single CDX file for our instance and each time we
>>>       want to add more content, we have to sort/merge the whole thing
>>>       again. Is there another configuration that supports incremental
>>>       indexing, like WatchedCDXSource?
>>>
>>>    2. Does anyone have some rough performance characteristics for the CDX
>>>       generation code (bin/cdx-indexer)? Is it CPU or IO intensive?
>>>
>>>    3. What are other institutions using for their filesystem storage of
>>>       WARC files? And, how are you able to grow that over time? We are
>>>       limited in our options since our NetApp storage is shared by many
>>>       stakeholders here. So, we're looking at having to deal with multiple
>>>       NFS mounts.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "openwayback-dev" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
