Disclaimer:  I'm just an Xindice user, not a developer.

Perhaps it would be helpful if you looked in more detail at the reasons why
Xindice is best suited to large numbers of relatively small documents.

Essentially, the issue is indexing for queries.  Indexes (at least when I
last looked at the source code) were on collections (not on the documents),
and mapped from values of elements or attributes to the subset of docs which
contained those elements or attributes.  If you issue a query for certain
subtrees of the documents satisfying certain conditions, first the
appropriate subset of documents is found using whatever indexes you have
provided, and then XPath is used to extract the subtrees from the selected
documents.  The index, being some kind of B-tree or hash, is reasonably
efficient for large collections.  XPath is reasonably efficient for small
documents, unless the expression forces a scan of the entire document, e.g.
//x (all the elements named x, at whatever position in the document).  A
query that retrieves an entire document by id will be quite fast, regardless
of how large the collection is (probably logarithmic in collection size at
worst).
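If it helps to make that concrete, here is a rough sketch of the two access
paths as seen through the XML:DB API, which is the usual way to talk to
Xindice from Java.  The host, port, collection name, document key, and the
assumption that an index exists on the customer attribute are all invented
for the example, and the exact URI form has varied between Xindice releases:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.Resource;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

public class XindiceQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the Xindice driver with the XML:DB DatabaseManager.
        Database db = (Database) Class.forName(
                "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(db);

        // Open a collection; host, port, and collection name are made up.
        Collection col = DatabaseManager.getCollection(
                "xmldb:xindice://localhost:4080/db/mydocs");
        try {
            XPathQueryService service =
                    (XPathQueryService) col.getService("XPathQueryService", "1.0");

            // A selective query: assuming an index on the customer
            // attribute, the engine can narrow the collection to the
            // matching documents via the index before XPath ever touches
            // document content.
            ResourceSet hits = service.query("/invoice[@customer = 'ACME']");
            for (ResourceIterator it = hits.getIterator();
                 it.hasMoreResources(); ) {
                Resource r = it.nextResource();
                System.out.println(r.getContent());
            }

            // By contrast, an unanchored pattern like //x gives the index
            // nothing to work with and forces a scan of every element in
            // every candidate document.

            // Fetching a whole document by its key skips XPath entirely
            // and should stay fast as the collection grows.
            Resource doc = col.getResource("invoice-12345");
            if (doc != null) {
                System.out.println(doc.getContent());
            }
        } finally {
            col.close();
        }
    }
}

The point is just that the first query is an index probe plus XPath over a
handful of small documents, while the key lookup never runs XPath at all.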

I don't believe there is any inherent limit on the size of a collection,
except that the internal compressed form of the collection and its indexes
must fit in a single file (so you'll need a big disk and a file system that
can handle such a file), at least in the current implementation (I think).

All that being said, 400 million non-trivial docs is likely a larger
collection than Xindice has ever been used for before.  If you were to
attempt to use it for this project, you would probably hit some problems and
end up driving improvements in the software.  If I were thinking of
embarking on such an experiment, I'd want a good picture of the pace of
Xindice development, the responsiveness of the developers to reported
problems (since you're likely to hit some showstoppers), and the plan for
future development (to see whether its aims overlap largely with yours, or
whether there is a different focus).

Jeff
