Re: Vector Space Model in Lucene?

2003-11-13 Thread Otis Gospodnetic
Lucene does not implement the vector space model.

Otis

--- [EMAIL PROTECTED] wrote:
> Hi,
> 
> does Lucene implement a Vector Space Model? If yes, does anybody have an
> example of how to use it?
> 
> Cheers,
> Ralf
> 





Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
Really? And what model is used/implemented by Lucene?

THX
Leo
Otis Gospodnetic wrote:

Lucene does not implement vector space model.

Otis

--- [EMAIL PROTECTED] wrote:
 

Hi,

does Lucene implement a Vector Space Model? If yes, does anybody have
an
example of how using it?
Cheers,
Ralf


RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
does it matter? vector space is only one of several important ones.

Herb

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?


Really? And what model is used/implemented by Lucene?

THX
Leo




Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
The model determines the quality, so it does matter.

Regarding the "several important models": are any of them implemented in Lucene?

Chong, Herb wrote:

does it matter? vector space is only one of several important ones.

Herb

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 4:00 AM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Really? And what model is used/implemented by Lucene?

THX
Leo


RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
there are at least two or three others, including purely probabilistic ones. most of the 
ones i designed were probabilistic, not vector space, and more general: 
vector space is a special case of a much more general probabilistic framework for 
document representation.

Herb

-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:08 AM
To: Lucene Users List
Subject: AW: Vector Space Model in Lucene?

what are these several other important ones?




RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
to me, vector space implies thinking inside the box.

Herb...




RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
like all vector space models i have come across, Lucene ignores interterm correlation.

Herb




Re: Vector Space Model in Lucene?

2003-11-14 Thread Andrzej Bialecki
Chong, Herb wrote:

like all vector space models i have come across, Lucene ignores interterm correlation.

Herb
Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
curious because one of my projects (already using Lucene) could benefit 
from such a feature. Right now I'm using a bastardized version of Markov 
chains, but it's more of a hack...

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
i don't know of any open source search engine that incorporates interterm correlation. 
i have been looking into how to do this in Lucene and so far, it's not been promising. 
the indexing engine and file format need to be changed. there are very few search 
engines that incorporate interterm correlation in any mathematically and 
linguistically rigorous manner. i designed a couple, but they were all research 
experiments.

if you are familiar with the TREC automatic ad hoc track: my experiments with the 
TREC-5 to TREC-7 questions produced about a 0.05 to 0.10 improvement in average 
precision through proper use of interterm correlation. my project at the time was cancelled 
after TREC-7, so there haven't been any new developments.

Herb

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?

Herb

Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
curious because one of my projects (already using Lucene) could benefit 
from such feature. Right now I'm using a bastardized version of Markov 
chains, but it's more of a hack...

-- 
Best regards,
Andrzej Bialecki




Re: Vector Space Model in Lucene?

2003-11-14 Thread Dror Matalon
Hi,

I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies" resource, also known as an executive summary, about
the theoretical background of all of this? I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ...

TIA,

Dror

On Fri, Nov 14, 2003 at 12:52:15PM -0500, Chong, Herb wrote:
> i don't know of any open source search engine that incorporates interterm 
> correlation. i have been looking into how to do this in Lucene and so far, it's not 
> been promising. the indexing engine and file format needs to be changed. there are 
> very few search engines that incorporate interterm correlation in any mathematically 
> and linguistically rigorous manner. i designed a couple, but they were all research 
> experiments.
> 
> if you are familiar with the TREC automatic adhoc track? my experiments with the 
> TREC-5 to TREC-7 questions produced about 0.05 to 0.10 improvement in average 
> precision by proper use of interterm correlation. my project at the time was 
> cancelled after TREC-7 and so there haven't been any new developments.
> 
> Herb
> 
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 12:39 PM
> To: Lucene Users List
> Subject: Re: Vector Space Model in Lucene?
> 
> Herb
> 
> Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
> curious because one of my projects (already using Lucene) could benefit 
> from such feature. Right now I'm using a bastardized version of Markov 
> chains, but it's more of a hack...
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:27, Dror Matalon wrote:

I might be the only person on the list who's having a hard time
following this discussion.
Nope. I don't understand a word of what those guys are talking about 
either :)

 Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this. I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ...
That's good enough :)

PA.



RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
when people type in multiword queries, mostly they are interested in phrases in the 
linguistic sense. phrases don't cross sentence boundaries. you need certain features 
in the index and in the ranking algorithm to capture that distinction and rank 
documents truly having that phrase higher than documents that just happen to have the 
same words as the phrase. it also has to accommodate the human tendency to leave off 
words after mentioning the full form of the phrase once.

Herb

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 2:28 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?


Hi,

I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this. I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ...

TIA,

Dror




Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:32  PM, Chong, Herb wrote:
when people type in multiword queries, mostly they are interested in 
phrases in the linguistic sense. phrases don't cross sentence 
boundaries. you need certain features in the index and in the ranking 
algorithm to capture that distinction and rank documents truly having 
that phrase higher than documents that just happen to have the same 
words as the phrase. it also has to accommodate the human tendency to 
leave off words after mentioning the full form of the phrase once.

Herb
In the Lucene sense of things, it sounds like you're after one Document 
per sentence.  You then get your boundaries automatically as well as 
the "distance weighting" through the coord() Similarity function.  At 
least that seems like a close approximation of what Lucene offers.  
Thoughts?

	Erik
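A minimal sketch of the one-Document-per-sentence indexing Erik describes, assuming the 
Lucene 1.3-era API (IndexWriter(String, Analyzer, boolean) and the Field.Text/Field.Keyword 
factories); the class name, field names, and the naive regex splitter are illustrative only:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SentenceIndexer {

    /** Index each sentence of an article as its own Lucene Document. */
    public static void indexArticle(IndexWriter writer, String articleId, String text)
            throws IOException {
        // naive splitter: fine for a sketch, not for production text
        String[] sentences = text.split("(?<=[.!?])\\s+");
        for (int i = 0; i < sentences.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("articleId", articleId));         // stored, untokenized
            doc.add(Field.Keyword("sentenceNo", String.valueOf(i)));
            doc.add(Field.Text("contents", sentences[i]));          // tokenized and indexed
            writer.addDocument(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter("/tmp/sentence-index",
                                             new StandardAnalyzer(), true);
        indexArticle(writer, "doc-1",
            "Capital gains are taxed at a lower rate. The tax applies to long term holdings.");
        writer.close();
    }
}

Each sentence then scores on its own, which is also why a document-level rank would have to 
be composed from its sentence hits, the drawback raised in the next message.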



RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
it solves one part of the problem, but there are a lot of sentences in a typical 
document. you would then need to compose a document's rank from its constituent sentences. 
there are less drastic ways to solve the problem. the other problem is that 
Lucene doesn't consider the term order in the query unless the query is formulated as 
a phrase. a simple bag-of-words query doesn't make use of the term-ordering rules that 
apply in a given language.

Herb

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 2:49 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?



In the Lucene-sense of things, sounds like you're after one Document 
per sentence.  You then get your boundaries automatically as well as 
the "distance weighting" through the coord() Similarity function.  At 
least that seems like a close approximation of what Lucene offers.  
Thoughts?

Erik




Re: Vector Space Model in Lucene?

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:54  PM, Chong, Herb wrote:
it solves one part of the problem, but there are a lot of sentences in 
a typical document. you'll need to composite a rank of a document from 
its constituent sentences then. there are less drastic ways to solve 
the problem. the other problem is that Lucene doesn't consider the 
term order in the query unless the query is formulated as a phrase. a 
simple bag-of-words query doesn't make use of the ordering of terms 
that apply in a given language.
BooleanQuery _could_ take the order of terms into account for weighting 
and scale the weights accordingly, I believe.

I get the feeling you're looking for reasons that Lucene is inadequate. 
 This may be the case for the uses you're speaking of, but there is 
quite a bit of flexibility with Lucene in terms of Analysis, scoring, 
and custom Query implementations that all relate to what you've been 
speaking of.  And, of course, Lucene is a low-level component around which 
a higher-level piece could be built.

	Erik
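One possible (and certainly not the only) reading of that order-weighting idea, sketched 
against the Lucene 1.x BooleanQuery API (add(Query, boolean required, boolean prohibited)); 
the class name and the per-position decay factor are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class OrderWeightedQuery {

    /** Weight earlier query terms slightly higher than later ones. */
    public static Query build(String field, String[] words) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < words.length; i++) {
            TermQuery tq = new TermQuery(new Term(field, words[i]));
            tq.setBoost(1.0f + (words.length - 1 - i) * 0.25f); // decay is arbitrary
            query.add(tq, false, false);                        // optional clause
        }
        return query;
    }
}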



Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:54, Chong, Herb wrote:

it solves one part of the problem, but there are a lot of sentences in 
a typical document. you'll need to composite a rank of a document from 
its constituent sentences then. there are less drastic ways to solve 
the problem. the other problem is that Lucene doesn't consider the 
term order in the query unless the query is formulated as a phrase. a 
simple bag-of-words query doesn't make use of the ordering of terms 
that apply in a given language.
This all sounds wonderfully exotic, but, from all the different 
esoteric approaches you ever tried, what, if anything, made a concrete 
and noticeable impact on the quality of your search?

PA.



RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
it has its limitations. that is why i am looking at what it would take to solve some 
of them. parsing documents to recognize sentences and storing sentence boundaries in 
the index would solve the ones that are most limiting. superposing interterm 
correlation on top of Lucene isn't very useful because then you have to build a new 
almost-duplicate index. since when people type in more than one word as a query, they 
almost invariably mean phrases, a search engine that doesn't take advantage of that 
isn't making use of fundamental linguistic knowledge. what has to be stored isn't 
much, but implementing it can be.

Herb

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:08 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?


I get the feeling you're looking for reasons that Lucene is inadequate. 
  This may be the case for the uses you're speaking of, but there is 
quite a bit of flexibility with Lucene in terms of Analysis, scoring, 
and custom Query implementations that all relate to what you've been 
speaking of.  And, of course, Lucene is a low-level component of which 
a higher level piece could be built around.




RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
if you know what TREC is, you know what i meant earlier. this isn't exotic technology, 
this is close to 15 year old technology.

Herb

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:12 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?

This all sounds wonderfully exotic, but, from all the different 
esoteric approaches you ever tried, what, if anything, made a concrete 
and noticeable impact on the quality of your search?




Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:16, Chong, Herb wrote:

if you know what TREC is, you know what i meant earlier. this isn't 
exotic technology, this is close to 15 year old technology.
This is not really what I asked. What I would be interested to know is 
what approach you consider to provide the "biggest bang for your buck"?

PA.



RE: Vector Space Model in Lucene?

2003-11-14 Thread Chong, Herb
taking advantage of interterm correlations while respecting linguistic rules is easy to 
do compared to all the other schemes i know. it doesn't require genuine NLP, 
external knowledge bases, or commonsense reasoning. it does require smarter document 
parsers and an extended index to implement efficiently.

Herb...

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:20 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?

This is not really what I asked. What I would be interested to know is 
what approach you consider to provide the "biggest bang for you bucks"?

PA.




inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Incorporating inter-term correlation into Lucene isn't that hard; I've 
done it.  Nor is it incompatible with the vector-space model.  I'm not 
happy with the specific correlation metric that I picked, which is why 
I'm not eager to generally release the code I wrote, but I think that 
the basic mechanism that I came up with (query expansion via correlated 
terms, where the added terms were boosted according to the strength of 
the correlation) is fairly sound.  And I didn't need any changes to 
Lucene to do this.

You can get some details on the specific mechanism that I used here, if 
you're interested:

http://www.ics.uci.edu/~jmadden/research/index.html

(and go down to "Fuzzy Term Expansion and Document Reweighting", about 
halfway down.)

If you decide that my ideas are interesting enough that you want to 
have a look at my code, let me know, and perhaps we can work something 
out.

Regards,

Joshua O'Madadhain
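For the curious, here is a bare-bones sketch of that kind of query expansion, assuming the 
Lucene 1.x query API (BooleanQuery.add(Query, boolean required, boolean prohibited)). The 
correlation map, its weights, and the class name are placeholders for whatever correlation 
metric one actually computes; this is not Joshua's code:

import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CorrelationExpander {

    /**
     * Expand a bag-of-words query with correlated terms.  For each query word,
     * correlated.get(word) is assumed to be a Map from related words to a
     * correlation strength in (0, 1], used as the boost of the added clause.
     */
    public static Query expand(String field, String[] queryTerms, Map correlated) {
        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < queryTerms.length; i++) {
            // original terms: optional clauses at full weight
            query.add(new TermQuery(new Term(field, queryTerms[i])), false, false);
            Map related = (Map) correlated.get(queryTerms[i]);
            if (related == null) continue;
            for (Iterator it = related.entrySet().iterator(); it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                TermQuery expanded = new TermQuery(new Term(field, (String) e.getKey()));
                expanded.setBoost(((Float) e.getValue()).floatValue()); // weaker than originals
                query.add(expanded, false, false);                      // optional, never required
            }
        }
        return query;
    }
}

The expanded query never requires the added terms, so recall can only go up; how well 
precision holds depends entirely on the correlation metric, which is the part Joshua says 
he is not happy with.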

On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote:

i don't know of any open source search engine that incorporates 
interterm correlation. i have been looking into how to do this in 
Lucene and so far, it's not been promising. the indexing engine and 
file format needs to be changed. there are very few search engines 
that incorporate interterm correlation in any mathematically and 
linguistically rigorous manner. i designed a couple, but they were all 
research experiments.

if you are familiar with the TREC automatic adhoc track? my 
experiments with the TREC-5 to TREC-7 questions produced about 0.05 to 
0.10 improvement in average precision by proper use of interterm 
correlation. my project at the time was cancelled after TREC-7 and so 
there haven't been any new developments.

 [EMAIL PROTECTED] Per 
Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, 
Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill 
Watterson
My opinions are too rational and insightful to be those of any 
organization.



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
if you didn't have to change the index then you haven't got all the factors needed to 
do it well. terms can't cross sentence boundaries and the index doesn't store sentence 
boundaries.

Herb...

-Original Message-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:14 PM
To: Lucene Users List
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]


Incorporating inter-term correlation into Lucene isn't that hard; I've 
done it.  Nor is it incompatible with the vector-space model.  I'm not 
happy with the specific correlation metric that I picked, which is why 
I'm not eager to generally release the code I wrote, but I think that 
the basic mechanism that I came up with (query expansion via correlated 
terms, where the added terms were boosted according to the strength of 
the correlation) is fairly sound.  And I didn't need any changes to 
Lucene to do this.

You can get some details on the specific mechanism that I used here, if 
you're interested:

http://www.ics.uci.edu/~jmadden/research/index.html

(and go down to "Fuzzy Term Expansion and Document Reweighting", about 
halfway down.)

If you decide that my ideas are interesting enough that you want to 
have a look at my code, let me know, and perhaps we can work something 
out.

Regards,

Joshua O'Madadhain




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Not sure what you mean by "terms can't cross sentence boundaries".  If 
you're only using single-word terms, that's trivially true.  What is it 
that you're trying to achieve, exactly?  (Your comment makes it sound 
as though you simultaneously want and don't want sentence boundaries, 
so I'm confused.)

Joshua

On Friday, Nov 14, 2003, at 10:13 US/Pacific, Chong, Herb wrote:

if you didn't have to change the index then you haven't got all the 
factors needed to do it well. terms can't cross sentence boundaries 
and the index doesn't store sentence boundaries.

Herb...

-Original Message-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:14 PM
To: Lucene Users List
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]
Incorporating inter-term correlation into Lucene isn't that hard; I've
done it.  Nor is it incompatible with the vector-space model.  I'm not
happy with the specific correlation metric that I picked, which is why
I'm not eager to generally release the code I wrote, but I think that
the basic mechanism that I came up with (query expansion via correlated
terms, where the added terms were boosted according to the strength of
the correlation) is fairly sound.  And I didn't need any changes to
Lucene to do this.
You can get some details on the specific mechanism that I used here, if
you're interested:
http://www.ics.uci.edu/~jmadden/research/index.html

(and go down to "Fuzzy Term Expansion and Document Reweighting", about
halfway down.)
If you decide that my ideas are interesting enough that you want to
have a look at my code, let me know, and perhaps we can work something
out.
Regards,

Joshua O'Madadhain

 [EMAIL PROTECTED] Per 
Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, 
Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill 
Watterson
My opinions are too rational and insightful to be those of any 
organization.



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
if you are handling interterm correlation properly, then terms can't cross sentence 
boundaries. if you are not paying attention to sentence boundaries, then you are not 
following the rules of linguistics.

Herb...

-Original Message-
From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:53 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Not sure what you mean by "terms can't cross sentence boundaries".  If 
you're only using single-word terms, that's trivially true.  What is it 
that you're trying to achieve, exactly?  (Your comment makes it sound 
as though you simultaneously want and don't want sentence boundaries, 
so I'm confused.)

Joshua




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 01:13  PM, Chong, Herb wrote:
if you didn't have to change the index then you haven't got all the 
factors needed to do it well. terms can't cross sentence boundaries 
and the index doesn't store sentence boundaries.
You mean if you have text like this: "Hello Herb.  Have a nice day!", 
you want to prevent phrase queries for "herb have"?  You could prevent 
sentence boundary crossing with clever use of the token position I 
suspect.  Would that accomplish what you're after?

Could you give a really dumbed down simple example of what you mean by 
inter-term correlation?

	Erik



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
since i am working now on financial news, here is an example:

capital gains tax

if i just run this query against a million-document newswire index, i know i am going 
to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is 
overrestrictive. the fact that the three terms occur next to each other in the query 
means that documents with the three terms far apart should not get nearly as much 
weight in the ranking scheme. a sentence ending with the two terms "capital gains" 
followed by a sentence starting with the term "tax" should not be a highly ranked 
match. that means you need sentence boundaries in the index. the indexing and the 
query analysis schemes have to understand the linguistic concept of a phrase, and 
phrases do not cross sentence boundaries.

Herb

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:52 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

You mean if you have text like this: "Hello Herb.  Have a nice day!", 
you want to prevent phrase queries for "herb have"?  You could prevent 
sentence boundary crossing with clever use of the token position I 
suspect.  Would that accomplish what you're after?

Could you give a really dumbed down simple example of what you mean by 
inter-term correlation?

Erik




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 19:50, Chong, Herb wrote:

if you are handling inter correlation properly, then terms can't cross 
sentence boundaries.
Could you not break down your document along sentence boundaries? If you 
manage to figure out what a sentence is, that is.

if you are not paying attention to sentence boundaries, then you are 
not following rules of linguistics.
Rules of linguistics? Is there such a thing? :)

PA.





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Erik Hatcher
On Friday, November 14, 2003, at 02:02  PM, Chong, Herb wrote:
if i just run this query against a million document newswire index, i 
know i am going to get lots of hits. the phrase "capital gains tax" 
hits a lot fewer documents, but is overrestrictive. the fact that the 
three terms occur next to each other in the query means that documents 
with the three terms far apart should not get nearly as much weight in 
the ranking scheme. a sentence ending with two terms "capital gains" 
followed by a sentence starting with the term "tax" should not be a 
highly ranked match. that means you need sentence boundaries in the 
index. the indexing and the query analysis scheme has to understand 
the linguistic concept of a phrase, and phrases do not cross sentence 
boundaries.
With Lucene's analysis process, you can assign a position increment to 
tokens.  The default value is 1, meaning it's the next position.  Phrase 
queries default to a slop of 0, meaning they must be in successive 
positions.  When analyzing and you encounter a sentence boundary, you 
could set the position increment of the next word (the first word of 
the next sentence) to a high number (to account for users searching 
with potential slop, or just something greater than one if you never 
use sloppy phrase searches).

Does this get you closer to what you're after?

As for how to weight queries by the distance between terms, I'll have 
to think on that some, but I suspect something reasonable could be done 
with a custom Similarity or a custom type of Query.

	Erik
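Here is a rough sketch of such an analysis step, assuming the Lucene 1.3-era TokenStream/Token 
API (next() returning Token, setPositionIncrement()). The class name and the crude 
end-of-sentence test are mine, and it presumes a tokenizer (e.g. WhitespaceTokenizer) that 
leaves trailing punctuation on the tokens:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Adds a large position gap after each sentence-final token, so that
 * phrase and sloppy-phrase queries cannot match across sentences.
 */
public class SentenceGapFilter extends TokenStream {

    private final TokenStream in;   // wrapped stream
    private final int gap;          // e.g. 1000
    private boolean boundaryPending = false;

    public SentenceGapFilter(TokenStream in, int gap) {
        this.in = in;
        this.gap = gap;
    }

    public Token next() throws IOException {
        Token token = in.next();
        if (token == null) return null;
        if (boundaryPending) {
            token.setPositionIncrement(token.getPositionIncrement() + gap);
            boundaryPending = false;
        }
        // crude heuristic: a token ending in '.', '!' or '?' closes a sentence
        // (a real chain would also strip that punctuation afterwards)
        String text = token.termText();
        char last = text.charAt(text.length() - 1);
        if (last == '.' || last == '!' || last == '?') boundaryPending = true;
        return token;
    }

    public void close() throws IOException {
        in.close();
    }
}

An Analyzer would wrap its tokenizer with this filter; with a gap of, say, 1000, even a 
generous phrase slop stays inside one sentence.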



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
something in the index needs to mark sentence boundaries so that query words which are 
found close together in the text still get penalized when a boundary separates them. 
proximity isn't enough because in more complex queries, some of the 
intermediate words might be omitted. in other words, A near B near C works only when 
A, B, and C occur close together and fails when B is omitted, but in English, A or B 
sometimes do get omitted after the first mention of the phrase A B C.

as far as the ranking calculation goes, the distance itself isn't a direct factor. put 
it this way: imagine you have a sliding window on the query of some number of terms. 
suppose you have a 5-term query. an exact match of all 5 terms within the window ought to 
rank higher than 4 out of 5. similarly, matching 4 terms out of 5 with the 5th term 
nearby but in another sentence should not rank as high. 5, BTW, is the magic number 
for English and other languages that use the same rules for multiword term composition.

read this - http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf. read the section 
on Lexical Affinities. also find this paper: Y. Maarek and F. Smadja. Full text 
indexing based on lexical relations: An application: Software libraries. In 
Proceedings of the 12th International ACM SIGIR Conference on Research and Development 
in Information Retrieval, pages 198--206, 1989.

Herb
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 2:10 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


With Lucene's analysis process, you can assign a position increment to 
tokens.  The default value is 1, meaning its the next position.  Phrase 
queries default to a slop of 0, meaning they must be in successive 
positions.  When analyzing and you encounter a sentence boundary, you 
could set the position increment of the next word (the first word of 
the next sentence) to a high number (to account for users searching 
with potential slop, or just something greater than one if you never 
use sloppy phrase searches).

Does this get you closer to what you're after?

As for how to weight queries by the distance from terms I'll have 
to think on that some, but I suspect something reasonable could be done 
with a custom Similarity or a custom type of Query.

Erik





RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Philippe Laflamme
> Rules of linguistics? Is there such a thing? :)

Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject, but a lot has come out of it.

More specifically, rule-based taggers have become very popular since Eric
Brill published his work on trainable rule-based tagging.

Essentially, it comes down to analysing sentences to determine the role
(noun, verb, etc.) of each word. It's very helpful for extracting noun phrases
such as "cardiovascular disease" or "magnetic resonance imaging" from
documents.

So, yep... you can definitely derive rules to analyse natural language...

I'm sure you already know about all of this... just thought it might be
interesting for some...

Phil

> -Original Message-
> From: petite_abeille [mailto:[EMAIL PROTECTED]
> Sent: November 14, 2003 14:04
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in
> Lucene?]
>
>
>
> On Nov 14, 2003, at 19:50, Chong, Herb wrote:
>
> > if you are handling inter correlation properly, then terms can't cross
> > sentence boundaries.
>
> Could you not break down your document along sentences boundary? If you
> manage to figure out what a sentence is, that is.
>
> > if you are not paying attention to sentence boundaries, then you are
> > not following rules of linguistics.
>
> Rules of linguistics? Is there such a thing? :)
>
> PA.
>
>
>
>





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:

Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very 
broad
research subject but a lot has come out of it.
A lot of what? "If" statements? :)

More specifically, Rule-based taggers have become very popular since 
Eric
Brill published his works on trainable rule-based tagging.

Essentially, it comes to down analysing sentences to determine the role
(noun, verb, etc.) of each words. It's very helpful to extract 
noun-phrases
such has "cardiovascular disease" or "magnetic resonance imaging" from
documents.
I would agree with that. But it's easier said than done. And the results 
are never, er, clear cut.

So, yep... you can definitely derive rules to analyse natural 
language...
Well... beyond the jargon and the impressive math... this all boils 
down to fuzzy heuristics and judgment calls... but perhaps this is just 
me :)

I'm sure you already know about all of this...
Not really. I'm more of a dilettante than a "NLP expert".

just thought it might be
interesting for some...
Sure. But my take on this, is that pigs will fly before NLP turns into 
a predictable "science" :)

PA.



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Philippe Laflamme
> >> Rules of linguistics? Is there such a thing? :)
> >
> > Actually, yes there is. Natural Language Processing (NLP) is a very
> > broad
> > research subject but a lot has come out of it.
>
> A lot of what? "If" statements? :)

Yes... just like every software boils down to branching and while loops for
the processor... ;o)

> I would agree with that. But it's easier said than done.

Yes, of course this is very complex. That's why NLP is a very popular field
of research: it's challenging!

> And the result are never, er, clear cut.

You're correct, results are not 100% perfect. But getting 95% is pretty
impressive when you're dealing with computer software. Don't forget, even
with many years (decades even) of experience with our own language, we
humans still manage to misunderstand certain sentences... can you really
expect software to be 100% correct all the time?

> Sure. But my take on this, is that pigs will fly before NLP turns into
> a predictable "science" :)

Maybe you're right, technologies derived from NLP may never be perfect. But
it doesn't make them useless. Quite the contrary I think.

I'm not a Lucene expert, but I'm sure it could benefit from using
NLP-derived methods for text analysis. Maybe someone out there has some experience
they might want to share with us?

Thanks,
Phil

> -Original Message-
> From: petite_abeille [mailto:[EMAIL PROTECTED]
> Sent: November 14, 2003 14:36
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in
> Lucene?]
>
>
>
> On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:
>
> >> Rules of linguistics? Is there such a thing? :)
> >
> > Actually, yes there is. Natural Language Processing (NLP) is a very
> > broad
> > research subject but a lot has come out of it.
>
> A lot of what? "If" statements? :)
>
> > More specifically, Rule-based taggers have become very popular since
> > Eric
> > Brill published his works on trainable rule-based tagging.
> >
> > Essentially, it comes to down analysing sentences to determine the role
> > (noun, verb, etc.) of each words. It's very helpful to extract
> > noun-phrases
> > such has "cardiovascular disease" or "magnetic resonance imaging" from
> > documents.
>
> I would agree with that. But it's easier said than done. And the result
> are never, er, clear cut.
>
> > So, yep... you can definitely derive rules to analyse natural
> > language...
>
> Well... beyond the jargon and the impressive math... this all boils
> down to fuzzy heuristics and judgment calls... but perhaps this is just
> me :)
>
> > I'm sure you already know about all of this...
>
> Not really. I'm more of a dilettante than a "NLP expert".
>
> > just thought it might be
> > interesting for some...
>
> Sure. But my take on this, is that pigs will fly before NLP turns into
> a predictable "science" :)
>
> PA.
>
>





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Chong, Herb wrote:
since i am working now on financial news, here is an example:

capital gains tax

if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact that the three terms occur next to each other in the query means that documents with the three terms far apart should not get nearly as much weight in the ranking scheme. a sentence ending with two terms "capital gains" followed by a sentence starting with the term "tax" should not be a highly ranked match. that means you need sentence boundaries in the index. the indexing and the query analysis scheme has to understand the linguistic concept of a phrase, and phrases do not cross sentence boundaries.
Have sentence boundaries actually proven to be that useful in this sort 
of thing?  For example, if the text were something like:

  "Such sales would be considered long term capital gains.  Tax on 
these is 20%."

Then penalizing for the sentence boundary wouldn't be valid.

Doug



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
then i wouldn't have typed capital gains tax. there is a psychology of query creation 
too, and that is one thing i am taking advantage of.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Have sentence boundaries actually proven to be that userful in this sort 
of thing.  For example, if the text were something like:

   "Such sales would be considered long term capital gains.  Tax on 
these is 20%."

Then penalizing for the sentence boundary wouldn't be valid.

Doug




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
I'm still confused what the issue here is.  If you're interested in 
stopping exact phrase matching from crossing sentence boundaries, that's 
easy to do with setPositionIncrement().  If you want to score an exact 
phrase match higher than a sloppy phrase match, and in turn score this 
higher than a non-phrasal match, this is all easy to do with Lucene. 
Depending on how much control you need, you could do it with a single 
sloppy phrase query (perhaps with a custom Similarity implementation), 
or perhaps you'd need to combine an OR query of the terms with an exact 
phrase match and with a sloppy phrase match, each with different boosts.

Certainly there are lots of scoring algorithms that one cannot easily 
implement with Lucene.  I'm just not yet clear on what you need to do 
that Lucene cannot support.

Cheers,

Doug
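A concrete sketch of the combination Doug describes, against the Lucene 1.x query API; 
the class name, the slop of 10, and the boost values are arbitrary placeholders that 
would need tuning:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PhraseBoostQuery {

    /**
     * OR of the bare terms, plus a sloppy and an exact phrase clause with
     * increasing boosts, so exact phrase > nearby terms > scattered terms.
     */
    public static Query build(String field, String[] words) {
        BooleanQuery query = new BooleanQuery();

        PhraseQuery exact = new PhraseQuery();   // slop 0: terms must be adjacent, in order
        PhraseQuery sloppy = new PhraseQuery();
        sloppy.setSlop(10);                      // terms near each other, possibly reordered

        for (int i = 0; i < words.length; i++) {
            Term term = new Term(field, words[i]);
            query.add(new TermQuery(term), false, false);  // optional bag-of-words clause
            exact.add(term);
            sloppy.add(term);
        }
        exact.setBoost(4.0f);
        sloppy.setBoost(2.0f);
        query.add(exact, false, false);
        query.add(sloppy, false, false);
        return query;
    }
}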

Chong, Herb wrote:
then i wouldn't have typed capital gains tax. there is psychology of query creation too and that is one thing i am taking advantage of.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Have sentence boundaries actually proven to be that userful in this sort 
of thing.  For example, if the text were something like:

   "Such sales would be considered long term capital gains.  Tax on 
these is 20%."

Then penalizing for the sentence boundary wouldn't be valid.

Doug





RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Chong, Herb
you're describing ad-hoc solutions to a problem, solutions that have an effect but not one that 
is easily predictable. one can concoct all sorts of combinations of the query 
operators that would have something of the effect i am describing. handling 
sentence-boundary crossing, however, can't be done without having some sentence boundaries as 
a reference. on top of this, there is a relatively simple concept which, if 
implemented, takes away all the ad-hocness of the solutions and replaces it with 
something that is both linguistically and mathematically sound, and which 
won't materially make the engine core more complicated. that concept is that multiword 
queries are mostly multiword terms, and multiword terms can't cross sentence boundaries according 
to the rules of English.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 3:33 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Certainly there are lots of scoring algorithms that one cannot easily 
implement with Lucene.  I'm just not yet clear on what you need to do 
that Lucene cannot support.




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Doug Cutting
Leo Galambos wrote:
There are other (more trivial) problems as well. One geek from UFAL (our 
NLP lab) reported, that it was a hard problem to find the boundaries, or 
rather, to say whether a dot is a dot or something else, i.e. "blah, 
i.e. blah" "i.b.m." "i.p. pavlov" "3.14" "28.10.2003" etc.

On the other hand, I would rather like to know the model which is 
implemented by Lucene. If it is not a vector model, what is it? ;-)
I would call it a vector space model.

The best description of how Lucene scores is:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

Doug
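In outline, and hedged because the exact tf, idf, and length-norm definitions are 
version-dependent, the score described on that page has the shape

  \mathrm{score}(q,d) \;\propto\; \mathrm{coord}(q,d)\cdot\sum_{t\in q}\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)

i.e. a TF-IDF weighted dot product between query and document with coordination and 
document-length normalization, which is why it is usually described as a vector space model.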



Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote:

Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very
broad
research subject but a lot has come out of it.
A lot of what? "If" statements? :)
Yes... just like every software boils down to branching and while 
loops for
the processor... ;o)
Hehe... ;) But NLP seems to suffer more from heuristics disguised in 
fancy jargon than other fields...


I would agree with that. But it's easier said than done.
Yes, of course this is very complex. That's why NLP is a very popular 
field
of research: it's challenging!
Indeed.


And the result are never, er, clear cut.
You're correct, results are not 100% perfect. But getting 95% is pretty
impressive when you're dealing with a computer software. Don't forget, 
even
with many years (decades even) of experience with our own language, we
humans still manage to misunderstand certain sentences... can you 
really
expect a software to be 100% correct all the time?
Nope. Therefore my "tongue in cheek" comments...


Sure. But my take on this, is that pigs will fly before NLP turns into
a predictable "science" :)
Maybe you're right, technologies derived from NLP may never be 
perfect. But
it doesn't make them useless. Quite the contrary I think.
Perhaps. I'm not saying it's utterly useless as a whole. But... NLP has 
a noted tendency to overpromise and underdeliver. Plus, it's marred 
by too much jargon, which is suspicious in and of itself :)

I'm not a Lucene expert, but I'm sure it could benefit from using 
derived
NLP methods for text analysis.
For "hardcore" text analysis, perhaps. But Lucene is an low level 
indexing library. You can build something much more, er, esoteric on 
top of it. But I don't think that the core library would benefit from 
any "bizarre" additions. Plus, the core elements of the library provide 
already more than enough room to play with whatever scheme you may have 
in mind.

Maybe someone out there has some experience
they might want to share with us?
Perhaps. But one way or another, and as far as Lucene is concerned, you 
will be better off building something exotic on top of Lucene than 
messing around with its internals.

PA.





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
PA,

 But Lucene is an low level indexing library. 
I'm sure most people here will agree that Lucene is much more than a 
_low level_ indexing library.

Maybe it is just a library, but it is definitely the *highest level* search 
technology available on the web for free.
You ride roughshod over the hard work of others.

You're just talking b***, guy! After 3 days I have around 90 (!!!) 
postings from you in my inbox from lucene and nutch.
All of them are at best a summary of the first 3 entries of a Google 
search, or just bla bla.

You definitely missed reading the mailing list guidelines.
You can find them here, and you should read them carefully!
http://jakarta.apache.org/site/mail.html This could be a useful link for 
you as well: http://www.dtcc.edu/cs/rfc1855.html

Since you don't wish to get personal mails, sorry, I have to post 
this here.
I'm sorry to everyone else as well, since this is normally a harmonious place 
for good IR conversation, and the web is open to everyone.
But that only works with etiquette, and this guy really missed reading it.
I'm sure I'm not the only one who sees it like this.

PA, can you do me a personal favour?! Install an IRC client and go to 
#rss! There is a nice "send" aqua button as well.

Thanks!
Stefan
P.S. I'm really really sorry for this mail!!!

PP.S. It is not the first mailing list that PA is spamming; see the first entry:

http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&q=petite_abeille%40mac.com&btnG=Google+Search




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Andrzej Bialecki
Well ... Sure, nothing can replace a human mind. But believe it or not, 
there are studies which show that even human experts can significantly 
differ in their opinions on what are key-phrases for a given text. So, 
the results are never clear cut with humans either...

So, in this sense a heuristic tool for sentence splitting and key-phrase 
detection can go a long way. For example, the application I mentioned 
uses quite a few heuristic rules (+ Markov chains as heavier 
ammunition :-), and it comes up with the following phrases for your 
email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language 
processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the 
text... which means I have a practical and tangible benefit from NLP. 
QED ;-)

Best regards,
Andrzej
petite_abeille wrote:
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:

Rules of linguistics? Is there such a thing? :)


Actually, yes there is. Natural Language Processing (NLP) is a very broad
research subject but a lot has come out of it.


A lot of what? "If" statements? :)

More specifically, Rule-based taggers have become very popular since Eric
Brill published his works on trainable rule-based tagging.
Essentially, it comes to down analysing sentences to determine the role
(noun, verb, etc.) of each words. It's very helpful to extract 
noun-phrases
such has "cardiovascular disease" or "magnetic resonance imaging" from
documents.


I would agree with that. But it's easier said than done. And the result 
are never, er, clear cut.

So, yep... you can definitely derive rules to analyse natural language...


Well... beyond the jargon and the impressive math... this all boils down 
to fuzzy heuristics and judgment calls... but perhaps this is just me :)

I'm sure you already know about all of this...


Not really. I'm more of a dilettante than a "NLP expert".

just thought it might be
interesting for some...


Sure. But my take on this, is that pigs will fly before NLP turns into a 
predictable "science" :)

PA.




Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Tatu Saloranta
On Friday 14 November 2003 11:50, Chong, Herb wrote:
> if you are handling inter correlation properly, then terms can't cross
> sentence boundaries. if you are not paying attention to sentence
> boundaries, then you are not following rules of linguistics.

Isn't that quite a strict interpretation, however? There are many cases where 
linguistically separate sentences do have strong dependencies; in the web world, 
simple things like list items may be very closely related. Put another way: 
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what is a boundary from the language viewpoint really is a hard boundary 
from the semantic perspective. And are there not varying levels of separation 
(sentences close to each other are often related, back references being 
common), not just one, between sentences?

As to storing boundaries in the index: am I naive to suggest just marker 
tokens that could easily be used to mark boundaries (sentence, paragraph, 
section)? Code that uses that information would obviously need to know the 
details of the marking used, but would it be infeasible to use such in-band 
information?

-+ Tatu +-





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Tatu Saloranta
On Friday 14 November 2003 13:39, Chong, Herb wrote:
> you're describing ad-hoc solutions to a problem that have an effect, but
> not one that is easily predictable. one can concoct all sorts of
> combinations of the query operators that would have something of the effect
> that i am describing. crossing sentence boundaries, however, can't be done

Hmmh? You implied that there are some useful distance heuristics (terms
five or more words apart correlate much less), and others have pointed out Lucene 
has many useful components.

Building a more complex system from small components is usually considered a 
Good Thing (tm), not an "ad hoc solution". In fact, I would guess most 
experienced people around here start with Lucene defaults, and build their 
own systems by gradually customizing more and more of the pieces.
It may be that there are actual fundamental problems with Lucene, regarding the 
approach you'd prefer, but I don't think it makes sense to brush off 
suggestions regarding distance & fuzzy/sloppy queries by claiming they are 
"just hacks".

> without having some sentence boundaries as a reference. on top of this,
> there is a relatively simple concept which, if implemented, takes away all
> the ad-hocness of the solutions and replaces it with a something that is
> both linguistically and mathematically sound and on top of which won't

As most people have pointed out, linguistics is nothing like an exact 
science, and comparing it to maths sounds like apples vs. oranges to me.
I'm not even convinced one can use general terms like "linguistically sound", 
especially as the content being indexed and searched is often a mixture of 
natural and programming languages (at least with the knowledge bases I work 
with).

Now, if you (or anyone else) could build a more advanced query mechanism, either 
on top of Lucene fundamentals or as a modified version, THAT would be 
useful. But it's more efficient to first consider suggestions, and especially 
WHAT WORKS, as opposed to arguing for what appears to be the most elegant solution.

> materially make the engine core more complicated. that concept is that
> multiword queries are mostly multiword terms and they can't cross sentence
> boundaries according to the rules of English.

Which brings us back to the problem of detecting boundaries. Punctuation can 
help; classifications of words can help; all are an inexact "science". Which 
just makes me wonder whether simply considering token distances might be 
plenty good enough.

-+ Tatu +-





Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Stefan Groschupf
Herb,

On Friday 14 November 2003 13:39, Chong, Herb wrote:
 

you're describing ad-hoc solutions to a problem that have an effect, but
not one that is easily predictable. one can concoct all sorts of
combinations of the query operators that would have something of the effect
that i am describing. crossing sentence boundaries, however, can't be done
   

I'm not sure I understand you right. You want to make language-based 
queries against an IR system?
If I'm right, maybe I can help you!
Here is a presentation from MIT that does something similar.
http://mitworld.mit.edu/stream/155/
That is a video over an hour long, but the first talk contains the part that is 
interesting for you.
Leslie Pack Kaelbling, MIT Professor of Electrical Engineering and 
Computer Science, gives it.
(The sequence with the two guys with a phone is the interesting one for you.)
They just have 2 components around the "nl ir query": speech to 
text and text to speech.

What you can do is use a POS tagger (e.g. one based on a maximum entropy model, 
or a Brill tagger if you just have English) and use a data mining 
algorithm to weight your terms.
Maybe you can use a hidden Markov model for that.

You can build this on top of Lucene; it shouldn't be that difficult.

But maybe I'm misunderstanding you...

Cheers
Stefan
--
open technology: www.media-style.com
open source: www.weta-group.net
open discussion: www.text-mining.org


AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-15 Thread Karsten Konrad

>>
Rules of linguistics? Is there such a thing? :)
>>

Yes there are. How can you expect communication (the goal of
the game that natural language is about) to work if the game 
has no rules? 

Anyway, Herb is right, sentence boundaries do carry a meaning and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (note, this is not trivial; 
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still within the 1000 word separation done by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!
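
A rough sketch of the indexing side, assuming a Lucene version whose Token exposes
getPositionIncrement()/setPositionIncrement() and whose TokenStream still uses the
Token-returning next() method (the class below is made up for illustration):

import java.io.IOException;
import java.text.BreakIterator;
import java.util.Locale;
import java.util.TreeSet;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch only: opens a large positional gap before the first token of each
// sentence, so sloppy phrase queries effectively stop at sentence boundaries.
public class SentenceGapFilter extends TokenFilter {

    private static final int SENTENCE_GAP = 1000;

    // Character offsets at which sentences 2..n start, in ascending order.
    private final TreeSet sentenceStarts = new TreeSet();

    public SentenceGapFilter(TokenStream in, String originalText) {
        super(in);
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(originalText);
        // setText() positions the iterator on the first boundary (offset 0),
        // so next() yields the starts of the second and later sentences.
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            sentenceStarts.add(new Integer(b));
        }
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) {
            return null;
        }
        boolean crossedBoundary = false;
        while (!sentenceStarts.isEmpty()
                && ((Integer) sentenceStarts.first()).intValue() <= token.startOffset()) {
            sentenceStarts.remove(sentenceStarts.first());
            crossedBoundary = true;
        }
        if (crossedBoundary) {
            token.setPositionIncrement(token.getPositionIncrement() + SENTENCE_GAP);
        }
        return token;
    }
}

With such a gap in place, "income tax"~100 can still match anywhere within a
sentence, but a match across sentences would need a slop of at least 1000.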

>>
Sure. But my take on this, is that pigs will fly before NLP turns into 
a predictable "science" :)
>>

You mean like physics (new models every 10 years), biology (same),
medicine (er.. cancer research anyone?), chemistry ("the result could be
verified in 8 of 10 experiments..."). What does predictability mean
to you? What sciences besides mathematics give you 100% certainty? 

But I guess you are in flame mode anyway now :)

Regards,

Karsten 


-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED] 
Sent: Friday, 14 November 2003 20:04
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]



On Nov 14, 2003, at 19:50, Chong, Herb wrote:

> if you are handling inter correlation properly, then terms can't cross
> sentence boundaries.

Could you not break down your document along sentences boundary? If you 
manage to figure out what a sentence is, that is.

> if you are not paying attention to sentence boundaries, then you are
> not following rules of linguistics.

Rules of linguistics? Is there such a thing? :)

PA.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i have a program written in Icon that does basic sentence splitting. with about 5 
heuristics and one small lookup table, i can get well over 90% accuracy doing sentence 
boundary detection on email. for well edited English text, like newswires, i can 
manage closer to 99%. this is all that is needed for significantly improving a search 
engine's performance when the query engine respects sentence boundaries. incidentally, 
the GATE Information Extraction framework cites some references that indicate that for 
named entity feature extraction, their system can exceed the ability of trained humans 
to detect and classify named entities if only one person does the detection. 
collaborating humans are still better, but no-one has the time in practical 
applications.
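
a rough java sketch of that style of heuristic splitter (the abbreviation table
below is a made-up placeholder, not the table from the Icon program):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Heuristic sentence splitter: a '.', '!' or '?' ends a sentence only when it
// is followed by whitespace and an upper-case letter, and a '.' is ignored
// when the preceding token is a known abbreviation or a bare number.
public class HeuristicSentenceSplitter {

    // Hypothetical abbreviation table; a real one would be larger.
    private static final Set ABBREVIATIONS = new HashSet(Arrays.asList(
            new String[] { "mr", "mrs", "dr", "prof", "vs", "etc", "e.g", "i.e" }));

    public static List split(String text) {
        List sentences = new ArrayList();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '!' || c == '?' || (c == '.' && !suppressBreak(text, i))) {
                int j = i + 1;
                while (j < text.length() && Character.isWhitespace(text.charAt(j))) j++;
                if (j > i + 1 && (j == text.length() || Character.isUpperCase(text.charAt(j)))) {
                    sentences.add(text.substring(start, i + 1).trim());
                    start = j;
                    i = j - 1;
                }
            }
        }
        if (start < text.length()) {
            sentences.add(text.substring(start).trim());
        }
        return sentences;
    }

    // A '.' does not end a sentence if the token before it is a known
    // abbreviation or consists only of digits ("15." in a date, for instance).
    private static boolean suppressBreak(String text, int dot) {
        int k = dot - 1;
        while (k >= 0 && !Character.isWhitespace(text.charAt(k))) k--;
        String token = text.substring(k + 1, dot).toLowerCase();
        if (token.length() == 0) return false;
        if (ABBREVIATIONS.contains(token)) return true;
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) return false;
        }
        return true;
    }
}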

you probably know, since you know about Markov chains, that within sentence term 
correlation, and hence the language model, is different than across sentences. 
linguists have known this for a very long time. it isn't hard to put this capability 
into a search engine, but it absolutely breaks down unless there is sentence boundary 
information stored for use at query time.

Herb

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Well ... Sure, nothing can replace a human mind. But believe it or not, 
there are studies which show that even human experts can significantly 
differ in their opinions on what are key-phrases for a given text. So, 
the results are never clear cut with humans either...

So, in this sense a heuristic tool for sentence splitting and key-phrase 
detection can go a long way. For example, the application I mentioned 
uses quite a few heuristic rules (+ Markov chains as a heavier 
ammunition :-), and it comes up with the following phrases for your 
email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language 
processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the 
text... which means I have a practical and tangible benefit from NLP. 
QED ;-)

Best regards,
Andrzej

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i am not implying rejection of a match across sentence boundaries, i am saying that it 
receives a lower score than a match within a sentence boundary.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:15 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Isn't that quite a strict interpretation, however? There are many cases where 
linguistically separate sentences do have strong dependencies; in the web world, 
simple things like list items may be very closely related. Put another way,
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what is a boundary from the language viewpoint really is a hard boundary 
from the semantic perspective. And are there not varying levels of separation 
between sentences (sentences close to each other are often related, back 
references being common), not just one?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you cannot layer sentence boundary detection on top of Lucene and post process the hit 
list without effectively building a completely new search engine index. if i am going 
to go to this trouble, there is no point to using Lucene at all.

Herb

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 8:30 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Hmmh? You implied that there are some useful distance heuristics (words that are
5 or more words apart correlate much less), and others have pointed out Lucene 
has many useful components.

Building a more complex system from small components is usually considered a 
Good Thing (tm), not an "ad hoc solution". In fact, I would guess most 
experienced people around here start with Lucene defaults and build their 
own systems, gradually customizing more and more of the pieces.
It may be that there are actual fundamental problems with Lucene regarding the 
approach you'd prefer, but I don't think it makes sense to brush off 
suggestions regarding distance & fuzzy/sloppy queries by claiming they are 
"just hacks".

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
respecting sentence boundaries and using them to affect a document's score in the 
ranking algorithm requires linguistic knowledge, not NLP knowledge. think about it.

Herb

-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 9:13 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


What you can do is use a POS tagger (e.g. one based on a maximum entropy model, 
or a Brill tagger if you only have English) and use a data mining 
algorithm to weight your terms.
Maybe you can use a hidden Markov model for that.

You can build this on top of Lucene; it shouldn't be that difficult.

But maybe I understand you wrong...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
now you're talking. this is one way of doing it. you need to work out a heuristic to 
increment the counter enough that a misrecognized long sentence won't trigger this. 
however, one can argue that a sentence that contains 1000 words can't possibly be 
about one topic.

Herb

-Original Message-
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in
Lucene?]

Anyway, Herb is right, sentence boundaries do carry a meaning and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (note, this is not trivial; 
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still within the 1000 word separation done by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Philippe Laflamme
There is already an implementation in the Java API for sentence boundary
detection. The BreakIterator in the java.text package has this to say about
sentence splitting:

"Sentence boundary analysis allows selection with correct interpretation of
periods within numbers and abbreviations, and trailing punctuation marks
such as quotation marks and parentheses."
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM:
http://oss.software.ibm.com/icu/index.html
It supports many languages.
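
For reference, a minimal usage sketch (the sample text is arbitrary; whether the
iterator splits after "Dr." is exactly the kind of thing worth checking):

import java.text.BreakIterator;
import java.util.Locale;

// Prints each region the JDK sentence BreakIterator considers a sentence.
public class SentenceBreakDemo {
    public static void main(String[] args) {
        String text = "Dr. Smith earned $1.5M in 2003. He said: \"Thanks!\" Then he left.";
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println("[" + text.substring(start, end).trim() + "]");
        }
    }
}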

I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn more about its efficiency.

Regards,
Phil

> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> i have a program written in Icon that does basic sentence
> splitting. with about 5 heuristics and one small lookup table, i
> can get well over 90% accuracy doing sentence boundary detection
> on email. for well edited English text, like newswires, i can
> manage closer to 99%. this is all that is needed for
> significantly improving a search engine's performance when the
> query engine respects sentence boundaries. incidentally, the GATE
> Information Extraction framework cites some references that
> indicate that for named entity feature extraction, their system
> can exceed the ability of trained humans to detect and classify
> named entities if only one person does the detection.
> collaborating humans are still better, but no-one has the time in
> practical applications.
>
> you probably know, since you know about Markov chains, that
> within sentence term correlation, and hence the language model,
> is different than across sentences. linguists have known this for
> a very long time. it isn't hard to put this capability into a
> search engine, but it absolutely breaks down unless there is
> sentence boundary information stored for use at query time.
>
> Herb
>
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> Well ... Sure, nothing can replace a human mind. But believe it or not,
> there are studies which show that even human experts can significantly
> differ in their opinions on what are key-phrases for a given text. So,
> the results are never clear cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and key-phrase
> detection can go long ways. For example, the application I mentioned,
> uses quite a few heuristic rules (+ Markov chains as a heavier
> ammunition :-), and it comes up with the following phrases for your
> email discussion (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the
> text... which means I have a practical and tangible benefit from NLP.
> QED ;-)
>
> Best regards,
> Andrzej
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
i don't know what the Java implementation is like but the C++ one is very fast.

Herb

-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:39 AM
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


I personally do not have any experience with the BreakIterator in Java. Has
anyone used it in any production environment? I'd be very interested to
learn more about its efficiency.

Regards,
Phil

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Karsten Konrad

Hi,

It actually is quite nice, and it can be used in production for the sorts of 
things that have been discussed lately in this group. 

If you want to play it safe: the iterator breaks at dots after numbers (e.g. "15. 
March"), so the precision of the algorithm can be increased if you never break 
after a number.
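
A small post-filter along those lines, applied to the boundary offsets the iterator
returns, could be as simple as this sketch:

// Sketch of the tip above: reject a boundary proposed by the sentence
// BreakIterator when the character just before the terminating '.' is a
// digit (as in "15. March"); everything else is accepted unchanged.
public class NumberSafeBoundaries {

    public static boolean acceptBoundary(String text, int boundary) {
        int i = boundary - 1;
        // Walk back over trailing whitespace and the '.' itself,
        // then look at the character that precedes the period.
        while (i >= 0 && Character.isWhitespace(text.charAt(i))) i--;
        if (i >= 0 && text.charAt(i) == '.') i--;
        return !(i >= 0 && Character.isDigit(text.charAt(i)));
    }
}

It would be called once per offset returned by BreakIterator.next(); boundaries it
rejects are simply skipped, so the two pieces stay one sentence.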

The implementation is fast.

Regards,

Karsten

With kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED] 
Sent: Monday, 17 November 2003 15:39
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


There is already an implementation in the Java API for sentence boundary detection. 
The BreakIterator in the java.text package has this to say about sentence splitting:

"Sentence boundary analysis allows selection with correct interpretation of periods 
within numbers and abbreviations, and trailing punctuation marks such as quotation 
marks and parentheses." 
http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html

The whole i18n Java API is based on the ICU framework from IBM: 
http://oss.software.ibm.com/icu/index.html
It supports many languages.

I personally do not have any experience with the BreakIterator in Java. Has anyone 
used it in any production environment? I'd be very interested to learn more about its 
efficiency.

Regards,
Phil

> -Original Message-
> From: Chong, Herb [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 08:53
> To: Lucene Users List
> Subject: RE: inter-term correlation [was Re: Vector Space Model in 
> Lucene?]
>
>
> i have a program written in Icon that does basic sentence splitting. 
> with about 5 heuristics and one small lookup table, i can get well 
> over 90% accuracy doing sentence boundary detection on email. for well 
> edited English text, like newswires, i can manage closer to 99%. this 
> is all that is needed for significantly improving a search engine's 
> performance when the query engine respects sentence boundaries. 
> incidentally, the GATE Information Extraction framework cites some 
> references that indicate that for named entity feature extraction, 
> their system can exceed the ability of trained humans to detect and 
> classify named entities if only one person does the detection.
> collaborating humans are still better, but no-one has the time in
> practical applications.
>
> you probably know, since you know about Markov chains, that within 
> sentence term correlation, and hence the language model, is different 
> than across sentences. linguists have known this for a very long time. 
> it isn't hard to put this capability into a search engine, but it 
> absolutely breaks down unless there is sentence boundary information 
> stored for use at query time.
>
> Herb
>
> -Original Message-
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 5:54 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in 
> Lucene?]
>
>
> Well ... Sure, nothing can replace a human mind. But believe it or 
> not, there are studies which show that even human experts can 
> significantly differ in their opinions on what are key-phrases for a 
> given text. So, the results are never clear cut with humans either...
>
> So, in this sense a heuristic tool for sentence splitting and 
> key-phrase detection can go long ways. For example, the application I 
> mentioned, uses quite a few heuristic rules (+ Markov chains as a 
> heavier ammunition :-), and it comes up with the following phrases for 
> your email discussion (the text quoted below):
>
> (lang=EN): NLP, trainable rule-based tagging, natural language 
> processing, apache, NLP expert
>
> Now, this set of key-phrases does reflect the main noun-phrases in the 
> text... which means I have a practical and tangible benefit from NLP. 
> QED ;-)
>
> Best regards,
> Andrzej
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Tatu Saloranta
On Monday 17 November 2003 07:40, Chong, Herb wrote:
> i don't know what the Java implementation is like but the C++ one is very
> fast.
...
>> I personally do not have any experience with the BreakIterator in Java. Has
>> anyone used it in any production environment? I'd be very interested to
>> learn more about it's efficiency.

Even if that implementation wasn't fast (which it should be), it should be 
fairly easy to implement one that is pretty much as efficient as any of the basic 
tokenizers; i.e. not much slower than full scanning speed over the text data plus 
the token creation overhead.

-+ Tatu +-



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-18 Thread Philippe Laflamme
> Even if that implementation wasn't fast (which it should be), it
> should be
> fairly easy to implement it to be pretty much as efficient as any
> of basic
> tokenizers; ie. not much slower than full scanning speed over
> text data and
> token creation overhead.

In terms of speed I would tend to agree with you.

My question regarding efficiency was directed more towards the quality of
the results it provides. Is the BreakIterator breaking on correct sentence
boundaries, or is it being confused by dots at the end of acronyms and such?

Karsten was mentioning that its results are of higher quality when you
prevent it from breaking after a number. Are there any other tips you can
provide?

Has anybody tested the implementation to estimate its precision?

Regards,
Phil

> -Original Message-
> From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
> Sent: November 17, 2003 22:00
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model
> in Lucene?]
>
>
> On Monday 17 November 2003 07:40, Chong, Herb wrote:
> > i don't know what the Java implementation is like but the C++
> one is very
> > fast.
> ...
> >> I personally do not have any experience with the BreakIterator
> in Java. Has
> >> anyone used it in any production environment? I'd be very interested to
> >> learn more about it's efficiency.
>
> Even if that implementation wasn't fast (which it should be), it
> should be
> fairly easy to implement it to be pretty much as efficient as any
> of basic
> tokenizers; ie. not much slower than full scanning speed over
> text data and
> token creation overhead.
>
> -+ Tatu +-
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-18 Thread Chong, Herb
i haven't tested, but i know that it is not too hard to change the tables that specify 
the break decisions.

Herb

-Original Message-
From: Philippe Laflamme [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 18, 2003 9:53 AM
To: Lucene Users List
Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]


In terms of speed I would tend to agree with you.

My question regarding efficiency was directed more towards the quality of
the results it provides. Is the BreakIterator breaking on correct sentence
boundaries, or is it being confused by dots at the end of acronyms and such?

Karsten was mentioning that its results are of higher quality when you
prevent it from breaking after a number. Are there any other tips you can
provide?

Has anybody tested the implementation to estimate its precision?

Regards,
Phil

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Karsten Konrad wrote:
I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (note, this is not trivial; 
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries
This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
any arbitrary number you pick will be broken by some document someone puts into the 
system.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Chong, Herb
you could use the negative of the actual value.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]


This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-17 Thread Doug Cutting
Ah, I see.  You have an absolute interpretation.  I am more relative.  I 
think we're talking about a heuristic, not a law.

Matches within a sentence are scored higher than those that are not. 
And the closer together the matching terms are, whether within the same sentence 
or not, the greater the score.  Given these two principles, at some 
point, as sentences get longer, a close match across sentence boundaries 
should probably score substantially higher than a very distant match 
within a sentence.  Thus missing some distant yet still within-sentence 
matches in very long sentences probably won't substantially alter the 
ranking.  Is 100 long enough?  Perhaps not.  But 1000 is certainly 
plenty long.

Doug

Chong, Herb wrote:
any arbitrary number you pick will be broken by some document someone puts into the system.

Herb

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]
This is exactly the sort of approach I was advocating in earlier 
messages.  (Although I think you'd only need to increase the position 
counter by 101 for the first word in each sentence.)  Herb Chong didn't 
seem to think this was appropriate, but I never understood why.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


understanding IR topics on this list [was: Re: Vector Space Model in Lucene?]

2003-11-15 Thread Gerret Apelt
Dror --

I just completed an introductory course in IR. I can recommend the 
textbook we used: "Managing Gigabytes: Compressing and Indexing 
Documents and Images". When I don't understand posts on this list I can 
typically look up the theory in that book, then come back to the list 
and have a better idea of what's going on. "Managing Gigabytes" appears 
to be getting good reviews from most readers, but I can't compare it to 
similar works as I haven't read any.

I've spent some time searching for websites that introduce advanced IR 
topics at a level that is less rigorous than academic papers. But I 
haven't really found anything I can recommend. Suggestions welcome :)

cheers,
Gerret
**
Dror Matalon wrote:
Hi,

I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this? I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ...
TIA,

Dror

On Fri, Nov 14, 2003 at 12:52:15PM -0500, Chong, Herb wrote:
 

i don't know of any open source search engine that incorporates interterm correlation. i have been looking into how to do this in Lucene and so far, it's not been promising. the indexing engine and file format need to be changed. there are very few search engines that incorporate interterm correlation in any mathematically and linguistically rigorous manner. i designed a couple, but they were all research experiments.

are you familiar with the TREC automatic adhoc track? my experiments with the TREC-5 to TREC-7 questions produced about 0.05 to 0.10 improvement in average precision by proper use of interterm correlation. my project at the time was cancelled after TREC-7 and so there haven't been any new developments.

Herb

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Herb

Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
curious because one of my projects (already using Lucene) could benefit 
from such feature. Right now I'm using a bastardized version of Markov 
chains, but it's more of a hack...

--
Best regards,
Andrzej Bialecki
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: understanding IR topics on this list [was: Re: Vector Space Model in Lucene?]

2003-11-16 Thread Magnus Johansson
I would also like to recommend "Modern Information Retrieval"
by Ricardo Baeza-Yates 

/magnus 

Gerret Apelt writes: 

Dror -- 

I just completed an introductory course in IR. I can recommend the 
textbook we used: "Managing Gigabytes: Compressing and Indexing Documents 
and Images". When I don't understand posts on this list I can typically 
look up the theory in that book, then come back to the list and have a 
better idea of whats going on. "Managing Gigabytes" appears to be getting 
good reviews from most readers, but I can't compare it to similar works as 
I haven't read any. 

I've spent some time searching for websites that introduce advanced IR 
topics at a level that is less rigorous than academic papers. But I 
haven't really found anything I can recommend. Suggestions welcome :) 

cheers,
Gerret
**
Dror Matalon wrote: 

Hi, 

I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this? I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ... 

TIA, 

Dror 

On Fri, Nov 14, 2003 at 12:52:15PM -0500, Chong, Herb wrote:
  

i don't know of any open source search engine that incorporates 
interterm correlation. i have been looking into how to do this in Lucene 
and so far, it's not been promising. the indexing engine and file format 
need to be changed. there are very few search engines that incorporate 
interterm correlation in any mathematically and linguistically rigorous 
manner. i designed a couple, but they were all research experiments. 

are you familiar with the TREC automatic adhoc track? my experiments 
with the TREC-5 to TREC-7 questions produced about 0.05 to 0.10 
improvement in average precision by proper use of interterm correlation. 
my project at the time was cancelled after TREC-7 and so there haven't 
been any new developments. 

Herb 

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene? 

Herb 

Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
curious because one of my projects (already using Lucene) could benefit 
from such feature. Right now I'm using a bastardized version of Markov 
chains, but it's more of a hack... 

--
Best regards,
Andrzej Bialecki 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED] 



  

 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED] 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-14 Thread Otis Gospodnetic
Hello Herb,

I don't approve of several teasing, mean, etc. emails I saw from a few
people.  This is a serious and polite email. :)

It sounds like you know about NLP and see places where Lucene could be
improved.  Lucene is open source and free, and could benefit from
knowledgeable people like you.  Are you interested in contributing some
computational linguistics smarts, either as improvement of Lucene core
(if improvements are such that they don't make Lucene use more
difficult and its code significantly more complex and harder to
maintain), or as an add-on module, or some kind of an extension, or
even just as application built on top of Lucene, all of which could and
would live outside of Lucene's core?

Otis



--- "Chong, Herb" <[EMAIL PROTECTED]> wrote:
> you're describing ad-hoc solutions to a problem that have an effect,
> but not one that is easily predictable. one can concoct all sorts of
> combinations of the query operators that would have something of the
> effect that i am describing. crossing sentence boundaries, however,
> can't be done without having some sentence boundaries as a reference.
> on top of this, there is a relatively simple concept which, if
> implemented, takes away all the ad-hocness of the solutions and
> replaces it with a something that is both linguistically and
> mathematically sound and on top of which won't materially make the
> engine core more complicated. that concept is that multiword queries
> are mostly multiword terms and they can't cross sentence boundaries
> according to the rules of English.
> 
> Herb
> 
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 14, 2003 3:33 PM
> To: Lucene Users List
> Subject: Re: inter-term correlation [was Re: Vector Space Model in
> Lucene?]
> 
> 
> Certainly there are lots of scoring algorithms that one cannot easily
> 
> implement with Lucene.  I'm just not yet clear on what you need to do
> 
> that Lucene cannot support.



__
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

2003-11-17 Thread Chong, Herb
i am stuck with company policy with respect to open source project participation. this 
is why i am dropping some fairly detailed hints of what has to be done instead of 
doing it myself. this policy may change in the next year, but by then, i will have to 
be working with a solution and not just looking for one.

Herb

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 6:45 PM
To: Lucene Users List
Subject: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space 
Model in Lucene?])


Hello Herb,

I don't approve of several teasing, mean, etc. emails I saw from a few
people.  This is a serious and polite email. :)

It sounds like you know about NLP and see places where Lucene could be
improved.  Lucene is open source and free, and could benefit from
knowledgeable people like you.  Are you interested in contributing some
computational linguistics smarts, either as improvement of Lucene core
(if improvements are such that they don't make Lucene use more
difficult and its code significantly more complex and harder to
maintain), or as an add-on module, or some kind of an extension, or
even just as application built on top of Lucene, all of which could and
would live outside of Lucene's core?

Otis

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]