See below...

On Tue, Feb 12, 2008 at 10:08 PM, André Warnier <[EMAIL PROTECTED]> wrote:
> Doron Cohen wrote:
> > On Thu, Feb 7, 2008 at 6:03 PM, André Warnier <[EMAIL PROTECTED]> wrote:
> >
> >> ...
> >> Does anyone have an example of how this works ?
> >> (or an explanation in plain French-speaker-friendly tutorial-like
> >> English ?)
> >>
> >
> > Do you mean "how to make it work for you" or "how does it work inside"?
> > The first option is easier to explain (though I know no French :))
> > When you create an IndexWriter you provide it an Analyzer.
> > That analyzer is used when a document is added to the index.
> > The analyzer.getPositionIncrementGap() specifies the position
> > gap between separate additions of the same field. By default it
> > returns 0 (which is not working well in your example). To modify this
> > you can override this method in "your" analyzer to return a nonzero gap,
> > for example 5. This is easy when subclassing any existing analyzer.
> >
> > Doron
> >
>
> Now I may be starting to get it (although we French-speaking guys are
> slow (but thorough)). Do you mean the following (add question mark at
> end) :
> - imagine that I would create a field "descriptors" for each of my
> documents
> - prior to adding a "phrase" to the "descriptors" field, I pass it
> through an Analyser, the Analyser breaks it down into words, and notes
> for each word the position in the phrase...

This is true. Just note that (1) "passing-through-the-analyzer" is usually
done for you by the IndexWriter, and (2) you are adding text (rather than
a phrase), and that text - depending on the field properties - is analyzed
into tokens.

> - then the Analyser feeds it into the index, where the individual words
> are stored, together with their relative position in the "phrase"...
> - so that, for instance (ignoring any stripping of stopwords), the
> phrase "the white cat jumped over the sleeping dog" is now stored in
> the "descriptors" index as "1:the 2:white 3:cat 4:jumped 5:over 6:the
> 7:sleeping 8:dog", the "n:" prefixes (so to speak) being the positions
> in the phrase/field..

Yes, though usually starting at position 0.

> - so that, if I later search for "white cat"~1 in "descriptors", it will
> find this document, because the "distance" between "white" and "cat" is
> 1 (or 0, depending how one counts) ..

Yes, though the default slop is 0, so "white jumped" would not match but
"white jumped"~1 will match.

> - now, if I (forcefully) specify a "PositionIncrementGap" of 10 to my
> Analyser, then for the second addition to the same "descriptors" field,
> it will start the numbering at 19 (?).

Yes

> - thus if for instance the second instance of "descriptors" is the
> phrase "the cow bit the cat", this will be indexed as "19:the 20:cow
> 21:bit 22:the 23:cat".
> - and when searching for "dog cow"~5, it would not find this document,
> because the gap between "8:dog" and "20:cow" is greater than 5 ?
>
> Is it something like that, or have I not got it at all ?

Yes it is.

> To generalise my question, what I would like to know is this : assuming
> I have two "descriptors" for the same document : "Electrical and
> Electronic Engineering" and "Engineering Studies".
> Is there a way to index this document (among others), and to later do a
> search which will find the documents which have a "descriptors"
> containing both "Electronic" and "Studies" in the same instance of
> "descriptors", thus not finding this one ?

Yes, you can do this by specifying a large enough gap, using either a
sloppy phrase query (as above) or span-near queries.

Luke is a tool that lets you search and inspect a Lucene index.
I think you will find it useful.

- Doron
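
P.S. To make the position arithmetic above concrete, here is a small sketch in plain Java. It is only a toy model, not Lucene code: the class PositionGapSketch and its methods index() and near() are made-up names for illustration, tokens are split on whitespace only, positions start at 0, and near() checks two terms in order only, which is a simplification of what a real sloppy phrase query does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of token positions in a multi-valued field (hypothetical, not Lucene). */
class PositionGapSketch {

    /**
     * Assigns 0-based positions to the whitespace-separated tokens of each value.
     * Before every value except the first, `gap` extra positions are skipped,
     * mimicking what a position increment gap does between field instances.
     */
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int next = 0; // position the next token will receive
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap; // leave a hole between two instances of the field
            }
            for (String token : values.get(i).toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(next);
                next++;
            }
        }
        return postings;
    }

    /**
     * Returns true if `second` occurs after `first` with at most `slop`
     * other positions in between (slop 0 means directly adjacent).
     */
    static boolean near(Map<String, List<Integer>> postings,
                        String first, String second, int slop) {
        List<Integer> firstPositions = postings.getOrDefault(first, Collections.emptyList());
        List<Integer> secondPositions = postings.getOrDefault(second, Collections.emptyList());
        for (int p : firstPositions) {
            for (int q : secondPositions) {
                if (q > p && q - p - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the two example values from this thread and a gap of 10, near(postings, "dog", "cow", 5) is false in this model, while with a gap of 0 it is true. The same holds for "electronic" and "studies" once the gap is larger than the slop you search with.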