On Sunday 13 October 2002 04:18, you wrote: > What is the cleanest way in Lucene to add documents to > an index, if the entire document is not readily > available at one time? > > E.g., I want to index the text as well as the > anchor-text of a stream of html pages, where the > anchor-text terms get associated with the page _being > pointed to_. For a document d_i, I don't know all the > terms that should be added to its "anchor" field, > until I've seen all documents d_j that link to d_i.
Mr. Codd would normalize the anchor texts as an attribute of the link from the pointing page to the pointed page. So a clean way is to store these links in a separate (lucene) db, putting the anchor text in a stored field so it can be retrieved when needed. Since lucene is a free format database, you can store these links in any lucene db. It depends on the circumstances (ie. what operation is most time critical) what the best place is. This also depends on how you want to search your anchor fields: should they be in the same lucene document as the pointed to page, or could you just allow searching for anchors in a separate 'links' db? Have fun, Ype -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>