"Ogren, Philip V." wrote: > We are indexing a large corpus of XML documents (~10M). One thing that > Verity does with XML notes is that it indexes each XML tag as a zone.* > What's cool about it is that the zones are nested so that it mirrors the > schema of your XML document. You can limit your search to any part of the > document by searching on specific zones. A Verity zone is analogous to a > Lucene field. Verity also has 'field' indexes - but these are a different > kind of index that Lucene does not have. Verity fields allow you to index > various numeric types, date types etc. side-by-side with your textual index. > > The edge that Verity zones have over Lucene fields is that they are nested. > However, nested fields can be simulated quite easily in Lucene by doing > redundant indexing. I have a hunch this is what Verity does anyways because > their indexes are HUGE.
The XML indexing scheme we developed for Lucene here at ISOGEN (and posted about late last year) provides more complete XML indexing than Verity can provide because it is not limited by some of the constraints inherent in Verity's zone mechanism. Our indexing approach is also infinitely more flexible than Verity's (or any of other commercial systems) because relatively simple Java code can be used to extend the default indexing to optimize for specific DTDs or types of queries. Also, Verity is, as far as I know, unable to index elements are attributes that have "." (period) in their names because their indexers always treat "." as a word separator. Doh. Of the commercial full-text indexers that do XML indexing, my analysis is that Verity does the best job, but it is still, in my opinion, not sufficiently complete or flexible to be useful in production. Otherwise, Verity is a full-text fine indexing system. Cheers, Eliot Kimber ISOGEN International, LLC -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>