Doug wrote:

> I'm having trouble getting a clear picture of your indexing scheme.

     I've been doing a lot of thinking about this same problem, so I
may be a little more in tune with what Elliot's saying.  By the way,
Elliot, I'm very interested in your results.  I considered the basic
approach you're using, but I thought it was a bit extreme in terms of
having zillions of tiny Lucene Documents.  I'm working on a quick
kludge that may serve my immediate purposes (if it does, I'm planning
to post the details here).
 
> Could you provide some simple examples, e.g., for the xml:

>   <tag1>this is some text
>     <tag2>and some other text</tag2>
>   </tag1>
> would you have something like the following?
>   doc1
>     node_type: tag1
>     contents: this is some text
>   doc2
>     node_type: tag2
>     contents: and some other text
>   doc3
>     node_type: all_contents
>     contents: this is some text and some other text

     I think that's exactly what Elliot is intending.
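
     To make that layout concrete, here is a rough sketch in Python.
It is only an illustration of the one-record-per-element idea: plain
dicts stand in for Lucene Documents, and the function name
per_element_records is made up for this example, not taken from
Elliot's code.

```python
import xml.etree.ElementTree as ET

def per_element_records(xml_text):
    """Return one dict per element, plus one 'all_contents' record.

    Plain dicts stand in for Lucene Documents (an assumption made
    for illustration only).
    """
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter():
        # elem.text holds only the text before the first child element;
        # text that follows a child (its "tail") is not captured here.
        text = " ".join((elem.text or "").split())
        records.append({"node_type": elem.tag, "contents": text})
    # itertext() walks all nested text, tails included, in document order.
    all_text = " ".join("".join(root.itertext()).split())
    records.append({"node_type": "all_contents", "contents": all_text})
    return records

SAMPLE = "<tag1>this is some text\n  <tag2>and some other text</tag2>\n</tag1>"
records = per_element_records(SAMPLE)
```

     For the sample above this yields three records, matching doc1,
doc2 and doc3 in Doug's example.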
 

> My first instinct would be to have something like:
>   doc1
>     tag1: this is some text
>     tag2: and some other text
>     all-tags: this is some text and some other text
> What do you need that that does not achieve?

     Name collision: you can have multiple elements with the same
name at different levels, and you may have attributes and elements
that share a name.  Obviously one way around this is "Don't do that",
but that could get really tiresome, quickly.

     If you just conflate the elements and attributes under the same
name (i.e. field "blah" contains a concatenated set of values from all
occurrences of both elements and attributes) then your searches become
much more limited in what you can specify.  This is, by the way, the
approach I'm trying out, with a second stage to refine the results and
drop out false positives.  But I'll have to wait on saying any more
about that.
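
     In the meantime, here is a guess at the general shape of such a
two-stage search, sketched in Python.  It is not the actual kludge:
the function names, the sample documents and the substring matching
are all assumptions made for illustration, and dicts stand in for a
Lucene index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def conflated_fields(xml_text):
    """Map each name to the values of all elements AND attributes so named."""
    fields = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = " ".join((elem.text or "").split())
        if text:
            fields[elem.tag].append(text)
        for attr_name, attr_value in elem.attrib.items():
            fields[attr_name].append(attr_value)
    return dict(fields)

def coarse_search(docs, name, term):
    """Stage 1: ids of docs whose conflated field `name` mentions `term`."""
    return [
        doc_id
        for doc_id, xml_text in docs.items()
        if any(term in value
               for value in conflated_fields(xml_text).get(name, []))
    ]

def refine(docs, candidates, path, term):
    """Stage 2: keep candidates where an element at `path` contains `term`."""
    kept = []
    for doc_id in candidates:
        root = ET.fromstring(docs[doc_id])
        if any(term in (e.text or "") for e in root.findall(path)):
            kept.append(doc_id)
    return kept

# Hypothetical documents: "title" is an element in one, an attribute in the other.
DOCS = {
    "a": "<book><title>lucene notes</title></book>",
    "b": "<book><chapter title='lucene internals'>body text</chapter></book>",
}
candidates = coarse_search(DOCS, "title", "lucene")     # both documents match
hits = refine(DOCS, candidates, "./title", "lucene")    # only "a" survives
```

     Document "b" is the false positive: its "title" is an attribute
of a chapter, which the conflated field cannot tell apart from a title
element, so it has to be dropped in the second pass.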

     All of this, of course, is in the context of having arbitrary XML
documents.  If you have predefined XML schemas then you can hand-code
the mappings from elements to Lucene document fields.  But then you
trade a heck of a lot of flexibility for a lot of maintenance.
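
     A hand-coded mapping might look something like the following
Python sketch.  The schema, the FIELD_MAP table and the field names
are all invented for illustration, and a dict stands in for the
Lucene document.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema mapping: (element path, attribute or None, field name).
# Every schema change means editing this table by hand.
FIELD_MAP = [
    ("./title", None, "book_title"),
    ("./chapter", "title", "chapter_title"),
    ("./chapter", None, "chapter_text"),
]

def map_fields(xml_text):
    """Build a field-name -> values dict using the hand-written FIELD_MAP."""
    root = ET.fromstring(xml_text)
    fields = {}
    for path, attr, field in FIELD_MAP:
        for elem in root.findall(path):
            # Take the attribute value if one is named, else the element text.
            value = elem.get(attr) if attr else (elem.text or "").strip()
            if value:
                fields.setdefault(field, []).append(value)
    return fields

fields = map_fields(
    "<book><title>lucene notes</title>"
    "<chapter title='internals'>body text</chapter></book>"
)
```

     Here the element "title" and the attribute "title" land in
separate fields, so there is no collision, but anything not listed in
the table is simply not indexed.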

Steven J. Owens
[EMAIL PROTECTED]

