Re: Multiple Keywords/Keyphrases fields

2005-02-16 Thread Paul Elschot
On Wednesday 16 February 2005 06:49, Owen Densmore wrote:
> > From: Erik Hatcher <[EMAIL PROTECTED]>
> > Date: February 12, 2005 3:09:15 PM MST
> > To: "Lucene Users List" 
> > Subject: Re: Multiple Keywords/Keyphrases fields
> >
> >
> > The real question to answer is what types of queries you're planning 
> > on making.  Rather than look at it from indexing forward, consider it 
> > from searching backwards.
> >
> > How will users query using those keyword phrases?
> 
> Hi Erik.  Good point.
> 
> There are two uses we are making of the keyphrases:
> 
>   - Graphical Navigation: A Flash graphical browser will allow users to 
> fly around in a space of documents, choosing what to be viewing: 
> Authors, Keyphrases and Textual terms.  In any of these cases, the 
> "closeness" of any of the fields will govern how close they will appear 
> graphically.  In the case of authors, we will weight collaboration .. 
> how often the authors work together.  In the case of Keyphrases, we 
> will want to use something like distance vectors like you show in the 
> book using the cosine measure.  Thus the keyphrases need to be separate 
> entities within the document .. it would be a bug for us if the terms 
> leaked across the separate keyphrases within the document.
> 
>   - Textual Search: In this case, we will have two ways to search the 
> keyphrases.  The first would be like the graphical navigation above 
> where searching for "complex system" should require the terms to be in 
> a single keyphrase.  The second way will be looser, where we may simply 
> pool the keyphrases with titles and abstract, and allow them all to be 
> searched together within the document.
> 
> Does this make sense?  So the question from the search standpoint is: 
> do multiple instances of a field act as if there were barriers between 
> the instances, or are they treated as a single instance?

Multiple field instances with the same name in a document are concatenated in
the index in the order in which they were added to the document.
For each instance of a field in the document, even when it has the same name,
the analyzer is asked to provide a new token stream.

This happens in org.apache.lucene.index.DocumentWriter.invertDocument().
The last indexed position in the field is maintained for this purpose, so
the positions of each new instance continue where the previous one ended.
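
As a rough illustration of the concatenation described above, here is a
minimal sketch in plain Java. It does not use Lucene: the class name
ConcatenatedPositionsSketch, the index() helper, and the lowercase/whitespace
tokenization are assumptions made for the example, standing in for whatever
analyzer is actually configured.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model, not Lucene code: shows how term positions continue
// across several instances of the same field within one document.
final class ConcatenatedPositionsSketch {

    // Returns term -> positions for one field, given its instances in the
    // order in which they were added to the document.
    static Map<String, List<Integer>> index(List<String> instances) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int next = 0; // position state carried from one instance to the next
        for (String value : instances) {
            // Assumed tokenization: lowercase, then split on whitespace.
            for (String term : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // a blank instance contributes no tokens
                }
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(next);
                next++;
            }
        }
        return positions;
    }
}
```

With the instances "finite state automata" and "Markov chains", this model
puts automata at position 2 and markov at position 3, so the two keyphrases
look adjacent to any position-based query.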

> In terms of the closeness calculation, for example, can we get separate 
> term vectors for each instance of the keyphrase field, or will we get a 
> single vector combining all the keyphrase terms within a single 
> document?

The positions in the TermVectors are treated in the same way, so you get a
single term vector for the field per document, not one per instance.

To put a barrier between field instances with the same name,
one can put a gap in the indexed term positions. A phrase or proximity query
then needs a correspondingly larger proximity (slop) to match across the gap.
AND-like queries will still match anywhere in the indexed field.

A gap is implemented by having the analyzer provide a token stream whose
first token has a position increment equal to the gap.
For the first field instance with a given name the gap is not needed.
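
To make the effect of such a gap concrete, here is a small sketch in plain
Java. It is a model only and does not call Lucene (where the increment is
set on the token itself, via Token.setPositionIncrement). The class
PositionGapSketch, its index() and matchesInOrder() helpers, the
lowercase/whitespace tokenization, and the simplified in-order two-term
proximity test are all assumptions made for illustration, not Lucene's
actual phrase matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model, not Lucene code: a position gap between instances of
// the same field, and its effect on a simplified proximity match.
final class PositionGapSketch {

    // Indexes the instances of one field; the first token of every instance
    // after the first gets a position increment of `gap` instead of 1.
    static Map<String, List<Integer>> index(List<String> instances, int gap) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int next = 0; // position the next token would get with increment 1
        for (String value : instances) {
            boolean firstToken = true;
            for (String term : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                // No gap is needed before anything has been indexed.
                int increment = (firstToken && next > 0) ? gap : 1;
                next += increment - 1;
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(next);
                next++;
                firstToken = false;
            }
        }
        return positions;
    }

    // Simplified proximity test: `second` occurs after `first` with at most
    // `slop` positions in between (slop 0 means an exact two-term phrase).
    static boolean matchesInOrder(Map<String, List<Integer>> positions,
                                  String first, String second, int slop) {
        List<Integer> a = positions.getOrDefault(first, Collections.<Integer>emptyList());
        List<Integer> b = positions.getOrDefault(second, Collections.<Integer>emptyList());
        for (int pa : a) {
            for (int pb : b) {
                if (pb > pa && pb - pa - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the instances "adaptive complex" and "system dynamics", a gap of 1 (no
gap) lets the exact phrase "complex system" match across the two instances.
With a gap of 100, complex stays at position 1 and system moves to 101, so
the exact phrase no longer matches and only a slop of 99 or more would
bridge it in this model. Both terms are still present in the field, which is
why an AND-like query keeps matching.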

Regards,
Paul Elschot

> 
> I hope this is clear!  Kinda hard to articulate.
> 
> Owen
> 
> > Erik
> >
> > On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
> >
> >> I'm getting a bit more serious about the final form of our lucene 
> >> index.  Each document has DocNumber, Authors, Title, Abstract, and 
> >> Keywords.  By Keywords, I mean a comma separated list, each entry 
> >> having possibly many terms in a phrase like:
> >>     temporal infomax, finite state automata, Markov chains,
> >>     conditional entropy, neural information processing
> >>
> >> I presume I should be using a field "Keywords" which have many 
> >> "entries" or "instances" per document (one per comma separated 
> >> phrase).  But I'm not sure the right way to handle all this.  My 
> >> assumption is that I should analyze them individually, just as we do 
> >> for free text (the Abstract, for example), thus in the example above 
> >> having 5 entries of the nature
> >>     doc.add(Field.Text("Keywords", "finite state automata"));
> >> etc, analyzing them because these are author-supplied strings with no 
> >> canonical form.
> >>
> >> For guidance, I looked in the archive and found the attached email, 
> >> but I didn't see the answer.  (I'm not concerned about the dups, I 
> >> presume that is equivalent to a boost of some sort) Does this seem 
> >> right?
> >>
> >> Thanks once again.
> >>
> >> O

Re: Multiple Keywords/Keyphrases fields

2005-02-15 Thread Owen Densmore
> From: Erik Hatcher <[EMAIL PROTECTED]>
> Date: February 12, 2005 3:09:15 PM MST
> To: "Lucene Users List" 
> Subject: Re: Multiple Keywords/Keyphrases fields
>
> The real question to answer is what types of queries you're planning 
> on making.  Rather than look at it from indexing forward, consider it 
> from searching backwards.
>
> How will users query using those keyword phrases?

Hi Erik.  Good point.

There are two uses we are making of the keyphrases:
	- Graphical Navigation: A Flash graphical browser will allow users to 
fly around in a space of documents, choosing what to be viewing: 
Authors, Keyphrases and Textual terms.  In any of these cases, the 
"closeness" of any of the fields will govern how close they will appear 
graphically.  In the case of authors, we will weight collaboration .. 
how often the authors work together.  In the case of Keyphrases, we 
will want to use something like distance vectors like you show in the 
book using the cosine measure.  Thus the keyphrases need to be separate 
entities within the document .. it would be a bug for us if the terms 
leaked across the separate keyphrases within the document.

	- Textual Search: In this case, we will have two ways to search the 
keyphrases.  The first would be like the graphical navigation above 
where searching for "complex system" should require the terms to be in 
a single keyphrase.  The second way will be looser, where we may simply 
pool the keyphrases with titles and abstract, and allow them all to be 
searched together within the document.

Does this make sense?  So the question from the search standpoint is: 
do multiple instances of a field act as if there were barriers between 
the instances, or are they treated as a single instance?
In terms of the closeness calculation, for example, can we get separate 
term vectors for each instance of the keyphrase field, or will we get a 
single vector combining all the keyphrase terms within a single 
document?

I hope this is clear!  Kinda hard to articulate.

Owen

> Erik
>
> On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
>
>> I'm getting a bit more serious about the final form of our lucene 
>> index.  Each document has DocNumber, Authors, Title, Abstract, and 
>> Keywords.  By Keywords, I mean a comma separated list, each entry 
>> having possibly many terms in a phrase like:
>>     temporal infomax, finite state automata, Markov chains,
>>     conditional entropy, neural information processing
>>
>> I presume I should be using a field "Keywords" which have many 
>> "entries" or "instances" per document (one per comma separated 
>> phrase).  But I'm not sure the right way to handle all this.  My 
>> assumption is that I should analyze them individually, just as we do 
>> for free text (the Abstract, for example), thus in the example above 
>> having 5 entries of the nature
>>     doc.add(Field.Text("Keywords", "finite state automata"));
>> etc, analyzing them because these are author-supplied strings with no 
>> canonical form.
>>
>> For guidance, I looked in the archive and found the attached email, 
>> but I didn't see the answer.  (I'm not concerned about the dups, I 
>> presume that is equivalent to a boost of some sort) Does this seem 
>> right?
>>
>> Thanks once again.
>> Owen
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah" or are there two "foo"s 
(actually not possible) or is there one "foo" with the values "bar" and 
"blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does lucene store this only once?
Timo


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Multiple Keywords/Keyphrases fields

2005-02-12 Thread Erik Hatcher
The real question to answer is what types of queries you're planning on 
making.  Rather than look at it from indexing forward, consider it from 
searching backwards.

How will users query using those keyword phrases?

Erik

On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
I'm getting a bit more serious about the final form of our lucene 
index.  Each document has DocNumber, Authors, Title, Abstract, and 
Keywords.  By Keywords, I mean a comma separated list, each entry 
having possibly many terms in a phrase like:
	temporal infomax, finite state automata, Markov chains,
	conditional entropy, neural information processing

I presume I should be using a field "Keywords" which have many 
"entries" or "instances" per document (one per comma separated 
phrase).  But I'm not sure the right way to handle all this.  My 
assumption is that I should analyze them individually, just as we do 
for free text (the Abstract, for example), thus in the example above 
having 5 entries of the nature
	doc.add(Field.Text("Keywords", "finite state automata"));
etc, analyzing them because these are author-supplied strings with no 
canonical form.

For guidance, I looked in the archive and found the attached email, 
but I didn't see the answer.  (I'm not concerned about the dups, I 
presume that is equivalent to a boost of some sort) Does this seem 
right?

Thanks once again.
Owen
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah" or are there two "foo"s 
(actually not possible) or is there one "foo" with the values "bar" and 
"blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does lucene store this only once?
Timo



Multiple Keywords/Keyphrases fields

2005-02-12 Thread Owen Densmore
I'm getting a bit more serious about the final form of our lucene 
index.  Each document has DocNumber, Authors, Title, Abstract, and 
Keywords.  By Keywords, I mean a comma separated list, each entry 
having possibly many terms in a phrase like:
	temporal infomax, finite state automata, Markov chains,
	conditional entropy, neural information processing

I presume I should be using a field "Keywords" which have many 
"entries" or "instances" per document (one per comma separated phrase). 
 But I'm not sure the right way to handle all this.  My assumption is 
that I should analyze them individually, just as we do for free text 
(the Abstract, for example), thus in the example above having 5 entries 
of the nature
	doc.add(Field.Text("Keywords", "finite state automata"));
etc, analyzing them because these are author-supplied strings with no 
canonical form.

For guidance, I looked in the archive and found the attached email, but 
I didn't see the answer.  (I'm not concerned about the dups, I presume 
that is equivalent to a boost of some sort) Does this seem right?

Thanks once again.
Owen
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah" or are there two "foo"s 
(actually not possible) or is there one "foo" with the values "bar" and 
"blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does lucene store this only once?
Timo
