On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
> > From: Erik Hatcher <[EMAIL PROTECTED]>
> > Date: February 12, 2005 3:09:15 PM MST
> > To: "Lucene Users List" <[email protected]>
> > Subject: Re: Multiple Keywords/Keyphrases fields
> >
> >
> > The real question to answer is what types of queries you're planning
> > on making. Rather than look at it from indexing forward, consider it
> > from searching backwards.
> >
> > How will users query using those keyword phrases?
>
> Hi Erik. Good point.
>
> There are two uses we are making of the keyphrases:
>
> - Graphical Navigation: A Flash graphical browser will allow users to
> fly around in a space of documents, choosing what to be viewing:
> Authors, Keyphrases and Textual terms. In any of these cases, the
> "closeness" of any of the fields will govern how close they will appear
> graphically. In the case of authors, we will weight collaboration ..
> how often the authors work together. In the case of Keyphrases, we
> will want to use something like distance vectors like you show in the
> book using the cosine measure. Thus the keyphrases need to be separate
> entities within the document .. it would be a bug for us if the terms
> leaked across the separate kephrases within the document.
>
> - Textual Search: In this case, we will have two ways to search the
> keyphrases. The first would be like the graphical navigation above
> where searching for "complex system" should require the terms to be in
> a single keyphrase. The second way will be looser, where we may simply
> pool the keyphrases with titles and abstract, and allow them all to be
> searched together within the document.
>
> Does this make sense? So the question from the search standpoint is:
> do multiple instances of a field act like there are barriers across the
> instances, or are they somehow treated as a single instance somehow.
Multiple field instances with the same name in a document are concatenated in
the index in the order in which they where added to the document.
For each instance of a field in the document, even when it has the same name,
the analyzer is asked to provide a new tokenstream.
This happens in org.apache.lucene.index.DocumentWriter.invertDocument(),
The last position offset in the field as indexed is maintained for this
purpose.
> In terms of the closeness calculation, for example, can we get separate
> term vectors for each instance of the keyphrase field, or will we get a
> single vector combining all the keyphrase terms within a single
> document?
The positions in the TermVectors are treated in the same way.
To put a barrier between field instances with the same name
one can put a gap in the indexed term positions. This gap needs a larger
query proximity to match. AND like queries will match in the indexed field.
A gap is implemented by providing the a tokenstream from the analyzer
that has a position increment that equals the gap for the first token in the
stream.
For the first field instance with same name the gap is not needed.
Regards,
Paul Elschot
>
> I hope this is clear! Kinda hard to articulate.
>
> Owen
>
> > Erik
> >
> > On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
> >
> >> I'm getting a bit more serious about the final form of our lucene
> >> index. Each document has DocNumber, Authors, Title, Abstract, and
> >> Keywords. By Keywords, I mean a comma separated list, each entry
> >> having possibly many terms in a phrase like:
> >> temporal infomax, finite state automata, Markov chains,
> >> conditional entropy, neural information processing
> >>
> >> I presume I should be using a field "Keywords" which have many
> >> "entries" or "instances" per document (one per comma separated
> >> phrase). But I'm not sure the right way to handle all this. My
> >> assumption is that I should analyze them individually, just as we do
> >> for free text (the Abstract, for example), thus in the example above
> >> having 5 entries of the nature
> >> doc.add(Field.Text("Keywords", "finite state automata"));
> >> etc, analyzing them because these are author-supplied strings with no
> >> canonical form.
> >>
> >> For guidance, I looked in the archive and found the attached email,
> >> but I didn't see the answer. (I'm not concerned about the dups, I
> >> presume that is equivalent to a boos of some sort) Does this seem
> >> right?
> >>
> >> Thanks once again.
> >>
> >> Owen
> >>
> >>> From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> >>> Subject: Multiple equal Fields?
> >>> Date: Tue, 17 Feb 2004 12:47:58 +0100
> >>>
> >>> Hi!
> >>> What happens if I do this:
> >>>
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "blah"));
> >>>
> >>> Is there a field "foo" with value "blah" or are there two "foo"s
> >>> (actually not
> >>> possible) or is there one "foo" with the values "bar" and "blah"?
> >>>
> >>> And what does happen in this case:
> >>>
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "bar"));
> >>> doc.add(Field.Text("foo", "bar"));
> >>>
> >>> Does lucene store this only once?
> >>>
> >>> Timo
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]