I'm planning to use Lucene to index scads of XML files whose data model
includes replicated blocks of tags.  Translation: a novice question follows.

My files have a common XML pattern (for illustrative purposes):

<blocks>
   <block id="123">some text 1</block>
   <block id="456">some text 2</block>
   <block id="789">some text 3</block>
</blocks>

Each block has a unique id, but the tagname is identical.  The actual data
model has nested tags within these blocks - ie: metadata with the same
tagnames within each block.  So, in the real data model, there are multiple
identical tagnames that are associated with a specific parent.  Something
more like this:

<blocks>
   <block id="123">
      <author>Joe Blow</author>
      <job>hack</job>
   </block>
   <block id="456">
      <author>Jane Doe</author>
      <job>President</job>
   </block>
</blocks>

In latter case, I need to be able to search by author or job, for example,
and get the tag's text contents as well as the parent block id.

Adding a field name of "block" or "author" or "job" multiple times to the
same Lucene Document, according to the Lucene javadoc, has the effect of
appending the text for search purposes.  I take that to mean, in order to
use a 'hit' I would need to somehow uniquely identify the field from which
the content came even though the content was appended for search purposes.

If I searched an 'author' field name and got a hit, I would not be able to
disambiguate which block id the actual hit belonged to.  Or if I searched on
"job", how would I know a hit belonged to block id 456 instead of block id
123 parent?

What is the Lucene approach for indexing a single document that has the same
field name appearing in multiple places and then using the hit to find the
exact association of block id in the above example?

Hope this question makes sense.  I'm sure I'm missing something
obvious/simple in how the API would work in this case.  Thanks,

Landon Cox


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to