"Ogren, Philip V." wrote:
> We are indexing a large corpus of XML documents (~10M).  One thing that
> Verity does with XML notes is that it indexes each XML tag as a zone.*
> What's cool about it is that the zones are nested so that it mirrors the
> schema of your XML document.  You can limit your search to any part of the
> document by searching on specific zones.  A Verity zone is analogous to a
> Lucene field.  Verity also has 'field' indexes - but these are a different
> kind of index that Lucene does not have.  Verity fields allow you to index
> various numeric types, date types etc. side-by-side with your textual index.
> 
> The edge that Verity zones have over Lucene fields is that they are nested.
> However, nested fields can be simulated quite easily in Lucene by doing
> redundant indexing.  I have a hunch this is what Verity does anyways because
> their indexes are HUGE.
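
As a rough illustration of the redundant indexing mentioned above, the
sketch below computes the field names under which one piece of element
text could be indexed so that a search limited to any enclosing element
finds it. This is only an assumption about how nesting might be
simulated; it is not the ISOGEN scheme or Verity's actual mechanism, and
the class name, method name, and "/"-separated path convention are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NestedFieldNames {

    /**
     * Hypothetical helper: given the element path from the root down to the
     * element that directly contains some text, return every field name the
     * text would be indexed under. Each enclosing element contributes its
     * bare name (a zone-like field) and its full path from the root.
     */
    public static List<String> fieldNamesFor(List<String> path) {
        // LinkedHashSet keeps insertion order and drops duplicates, which
        // occur when an element is nested inside one of the same name.
        Set<String> names = new LinkedHashSet<>();
        StringBuilder prefix = new StringBuilder();
        for (String element : path) {
            prefix.append('/').append(element);
            names.add(element);            // e.g. "chapter"
            names.add(prefix.toString());  // e.g. "/book/chapter"
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        // Text found at /book/chapter/title is indexed once per name.
        for (String name : fieldNamesFor(Arrays.asList("book", "chapter", "title"))) {
            System.out.println(name);
        }
    }
}
```

In a Lucene indexer, each returned name would be used as the field name
when the same text is added to the document. The text is stored once per
enclosing element, which is where the extra index size comes from.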

The XML indexing scheme we developed for Lucene here at ISOGEN (and
posted about late last year) provides more complete XML indexing than
Verity can provide because it is not limited by some of the constraints
inherent in Verity's zone mechanism. Our indexing approach is also
infinitely more flexible than Verity's (or any of the other commercial
systems) because relatively simple Java code can be used to extend the
default indexing to optimize for specific DTDs or types of queries.

Also, Verity is, as far as I know, unable to index elements or
attributes that have "." (period) in their names because their indexers
always treat "." as a word separator. Doh.

Of the commercial full-text indexers that do XML indexing, my analysis
is that Verity does the best job, but it is still, in my opinion, not
sufficiently complete or flexible to be useful in production. Otherwise,
Verity is a fine full-text indexing system.

Cheers,

Eliot Kimber
ISOGEN International, LLC
