Thanks ~ Yes, it seems this would be quite difficult to achieve with Lucene. Never mind, I'll try to figure out a workaround for it.
Thanks for helping =)

Cedric

On Feb 16, 2008 5:30 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:
> Hi Cedric,
>
> I think I'm beginning to get the point of the [10/5/2],
> and why you called that requirement a bit strange, see below.
>
> To use both normal position info and paragraph position info
> you'll need two separate fields, one normal, and one for the paragraphs.
>
> To use the normal field to determine the matches, and the
> paragraph field to determine the weightings of these matches,
> the TermPositions of both fields will have to be advanced
> completely in sync. That is possible, but not really nice to do.
> If Lucene had multiple positions for an indexed term, it
> would be straightforward.
> But as long as that is not the case, you'll either have to advance
> the two TermPositions in sync, or use payloads with the
> paragraph numbers.
>
> Or you could relax the paragraph numbering requirement
> into a positional requirement, and use the modified SpanFirstQuery.
> That could be done by using an average paragraph length to
> determine the weight at the matching position.
> As this is easy to implement, I'd first implement this and try to sell
> it to the users :)
>
> At that marketing moment you might as well ask the users
> what they think of matches that cross paragraph borders.
> Do you already have a firm requirement for that case?
>
> SpanNotQuery can be used to prevent matches over paragraph
> borders when these are indexed as such, but I would not expect
> that you would need those, given the fuzziness of the [10/5/2].
>
> Regards,
> Paul Elschot
>
>
> On Friday 15 February 2008 09:45:58, Cedric Ho wrote:
> > Hi Paul,
> >
> > Do you mean the following?
> >
> > e.g.
> > to index this: "first second third <paragraphBorder> forth fifth six"
> >
> > originally it would be indexed as:
> > (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5)
> >
> > now it will be:
> > (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1)
> >
> > Then those Query classes that depend on the positional information
> > (PhraseQuery, SpanQueries) won't work? Unfortunately I'll need
> > those Query classes as well.
> >
> > Cedric
> >
> > > For each word in the input stream make sure that the position
> > > at which it is indexed in an extra field is the same as the paragraph
> > > number. That will involve only allowing a position increment at
> > > a paragraph border during indexing.
> > > Call this extra field the paragraph field if you will.
> > >
> > > Then, during search, search for a Term in the paragraph field, and
> > > use the position from that field, i.e. the paragraph number,
> > > to find a weight for the found term.
> > > Have a look at PhraseQuery on how to use term positions during
> > > search. It computes relative positions, but it works on the absolute
> > > positions that it gets from the index.
> > >
> > > SpanFirstQuery also allows that. It's a bit more involved, but
> > > in the end it works from the same absolute positions from the index.
> > > The version at the jira issue will even allow using the length of the
> > > matching spans as the absolute paragraph number, which, in turn,
> > > allows the use of a Similarity for the paragraph weights [10/5/2].
> > >
> > > There is nothing special about indexed term positions; any term can
> > > be indexed at any position in a field. Lucene will take advantage of
> > > the incremental nature of positions by storing only compressed
> > > differences of positions in the index, but during search the original
> > > positions are directly available. You can do the same with payloads,
> > > but why reimplement something that is already available?
> > >
> > > Payloads have better uses than positional info, for one they are
> > > great to avoid disjunctions. For example for verbs, one could
> > > index only the stem and use a payload for the actual inflected
> > > form (singular/plural, past/present, first/second/third person, etc).
> > >
> > > Regards,
> > > Paul Elschot
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
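
For reference, the two position schemes discussed in this thread can be sketched in plain Java, without touching the Lucene API. This is only an illustration under assumptions: the class and method names are hypothetical, [10/5/2] is read as "weight 10 for the first paragraph, 5 for the second, 2 for the third", and reusing the last weight for later paragraphs is a guess, not something stated in the thread. The last method shows the relaxed variant, where the paragraph number is estimated from the normal position and an average paragraph length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only mimics how positions
// would be assigned in the normal field and in the extra paragraph field.
final class ParagraphPositions {
    // Marker token as written in the thread's example.
    static final String BORDER = "<paragraphBorder>";

    /** Normal field: every indexed word advances the position by one. */
    static List<Integer> normalPositions(String[] tokens) {
        List<Integer> positions = new ArrayList<>();
        int position = 0;
        for (String token : tokens) {
            if (token.equals(BORDER)) {
                continue; // the border itself is not indexed
            }
            positions.add(position++);
        }
        return positions;
    }

    /** Paragraph field: the position only advances at a paragraph border. */
    static List<Integer> paragraphPositions(String[] tokens) {
        List<Integer> positions = new ArrayList<>();
        int paragraph = 0;
        for (String token : tokens) {
            if (token.equals(BORDER)) {
                paragraph++;
                continue;
            }
            positions.add(paragraph);
        }
        return positions;
    }

    /** Assumed reading of [10/5/2]: look the weight up by paragraph number. */
    static int weightForParagraph(int paragraph, int[] weights) {
        // Assumption: paragraphs beyond the table reuse the last weight.
        return weights[Math.min(paragraph, weights.length - 1)];
    }

    /** Relaxed variant: estimate the paragraph from the normal position. */
    static int estimatedParagraph(int position, int averageParagraphLength) {
        if (averageParagraphLength <= 0) {
            throw new IllegalArgumentException("average paragraph length must be positive");
        }
        return position / averageParagraphLength;
    }

    public static void main(String[] args) {
        String[] tokens = "first second third <paragraphBorder> forth fifth six".split(" ");
        System.out.println(normalPositions(tokens));    // [0, 1, 2, 3, 4, 5]
        System.out.println(paragraphPositions(tokens)); // [0, 0, 0, 1, 1, 1]

        int[] weights = {10, 5, 2};
        int paragraphOfFifth = paragraphPositions(tokens).get(4);
        System.out.println(weightForParagraph(paragraphOfFifth, weights)); // 5
    }
}
```

With an average paragraph length of 3, estimatedParagraph(4, 3) also places "fifth" in paragraph 1, so both variants agree on this small example. On real text they will differ wherever paragraph lengths vary, which is the fuzziness mentioned above.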