Doug Cutting wrote:
>>From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>>
>>Doug, thanks for posting these. I may end up going in this direction
>>in the next few days and will use this as a blueprint. Maybe I'll end
>>up putting in the first pass implementation and then you can later
>>further tune it when you get to it.
>>
>
>Great! One implementation tip: when merging terms from segments, build an
>array of ints for each segment, indexed by term number. These map from old
>segment term numbers to new term numbers in the merged index. Then merging
>vectors is really easy: just re-number them using the array for their
>segment. Vectors can be merged in a single pass through the vector file for
>each segment, writing the new vector file in a single pass.
>
Ok, got it.
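Just to check my reading of the tip, here is roughly how I picture the
renumbering (plain Java sketch; the names and in-memory arrays are made up,
this is not actual Lucene code):

/**
 * oldToNew[i] is the merged-index term number for the term that had
 * number i in this segment; the array is filled in while the term
 * dictionaries are merged.
 */
static int[] renumberVector(int[] oldTermNumbers, int[] oldToNew) {
  int[] newTermNumbers = new int[oldTermNumbers.length];
  for (int i = 0; i < oldTermNumbers.length; i++)
    newTermNumbers[i] = oldToNew[oldTermNumbers[i]];
  return newTermNumbers;
}

// Merging a segment's vectors is then one pass over its vector file:
// read each document's vector, renumber it, append it to the new file.

So the merge stays linear in the size of the vector files, which sounds right.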
>
>>Question on term numbers though: what would be an approach for
>>merging these across multiple IndexReaders for the purposes of
>>MultiSearcher?
>>
>
>As you imply, it is possible to seek a SegmentTermEnum to a term number, but
>not a SegmentsTermEnum.
>
Did I imply that? :)
I was just thinking about numbering, but the tip above suggests that the
terms will be fully renumbered when looking at them from the
MultiSearcher. I think that is ok. Documents are assigned ranges
instead, and we could do this for Terms since the term numbers probably
do not need to be ordered the same way as the terms.
>This could be fixed in a number of ways. The
>simplest and fastest would be to declare that term numbers are unavailable
>for unoptimized indexes and throw an exception. A slower, kinder approach
>would be to, the first time this method is called, iterate through all of
>the terms. One could either save all of the terms in an array, which would
>be fastest, but use a lot of memory, or one could save every, say, 128th
>term in an array. Then, to find the nth term, do a binary search of this
>array for the term before it. Then you can seek all of the sub-enums to
>that term and then merge them up to the desired term, counting as you go.
>That's probably the best compromise: it's probably fast enough, and it
>doesn't use too much memory.
>
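If I follow, the lookup would look something like this (all of the types
below are made-up stand-ins, not Lucene classes, just to show the shape;
in practice the seek would go through each sub-enum as you describe):

import java.io.IOException;

// Stand-in for a term enumeration merged across segments.
interface MergedTermEnumLike {
  void seekTo(String term) throws IOException;  // seeks every sub-enum, re-merges
  boolean next() throws IOException;
  String term();
}

class SampledTermIndex {
  static final int INTERVAL = 128;
  String[] sampledTerms;                  // every 128th merged term, in order

  /** Return the nth term of the merged enumeration. */
  String termAt(int n, MergedTermEnumLike terms) throws IOException {
    int slot = n / INTERVAL;              // nearest sampled term at or before n
    terms.seekTo(sampledTerms[slot]);
    for (int i = slot * INTERVAL; i < n; i++)
      terms.next();                       // count forward to the nth term
    return terms.term();
  }
}

(And going the other direction, from a Term to its number, the same sampled
array could be binary searched to get the starting point.)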
>Note that, for good performance, clustering algorithms etc. should operate
>only on document and term numbers. These integers should only be mapped to
>Term and Document objects when they are displayed to the user. Thus the
>performance requirements for that mapping are not extreme. Lucene uses a
>similar strategy to keep search fast: internally documents are referred to
>by number: only when a Hit is displayed is it converted to a Document
>object.
>
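That pattern would look roughly like this, I think (clusterByNumber() is a
made-up placeholder for whatever algorithm is used; the Hits calls are the
existing API as I understand it):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class ClusterOnNumbers {
  // Cluster over int document numbers only; fetch Document objects
  // just for the handful of hits actually displayed.
  public static void showClusters(Hits hits) throws Exception {
    int[] docNumbers = new int[hits.length()];
    for (int i = 0; i < hits.length(); i++)
      docNumbers[i] = hits.id(i);

    int[] clusters = clusterByNumber(docNumbers); // stand-in for the real algorithm

    for (int i = 0; i < Math.min(10, hits.length()); i++) {
      Document doc = hits.doc(i);                 // mapped only for display
      System.out.println(clusters[i] + "  " + doc.get("title"));
    }
  }

  static int[] clusterByNumber(int[] docNumbers) {
    return new int[docNumbers.length];            // placeholder
  }
}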
Yes, I see that. One additional problem that I need to solve for my
application is that I need to map from stemmed forms of the terms to at
least one un-stemmed form. Ideally it would be all un-stemmed forms, but
I can live with the first one. I realize that Lucene does not easily
support this because of the separation of church and state (I mean the
term filtering prior to indexing and querying), but I still need this
functionality... So, the question is, is this going to be common enough
to add a concept of a TermDictionary to Lucene and provide methods to
access it on the IndexReader and IndexWriter? If not, I could implement
this externally, but then I would not be able to use the IO framework
and whole concept of directories. Also, since the Term numbers are going
to be ephemeral just like doc numbers, externally I would have to refer
to them by text, slowing down the translation process, etc., etc., etc.
It's not yet clear enough in my mind to put an API together. Maybe the
way to do this is to create an Analyzer that outputs a subclass of Term
that has additional data, namely: String original_text, and int data.
The data int is to keep application-specific flags such as term
classification. Then the indexing code can be extended to support these
extra fields and maintain the TermDictionary with them. The first entry
for a given term wins in terms of the original_text and the data int.
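Roughly what I am picturing (none of these classes exist in Lucene; the
names are just to show the shape of the data):

import java.util.HashMap;
import java.util.Map;

// What the Analyzer would attach to each emitted term.
class AnnotatedTerm {
  String text;           // stemmed form, as it goes into the index
  String originalText;   // an un-stemmed surface form
  int data;              // application-specific flags, e.g. term classification

  AnnotatedTerm(String text, String originalText, int data) {
    this.text = text;
    this.originalText = originalText;
    this.data = data;
  }
}

// Maintained by the indexing code: stemmed text -> entry,
// where the first entry seen for a given term wins.
class TermDictionary {
  private Map entries = new HashMap();

  void add(AnnotatedTerm term) {
    if (!entries.containsKey(term.text))
      entries.put(term.text, term);
  }

  AnnotatedTerm lookup(String stemmedText) {
    return (AnnotatedTerm) entries.get(stemmedText);
  }
}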
Any ideas to make this less of a hack?
Dmitry.
>
>
>Doug
>