Hello,

>>
There are many cases where 
linguistically separate sentences do have strong dependencies; in the web world, 
simple things like list items may be very closely related. Put another way: 
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what is a boundary from a language viewpoint really is a hard boundary 
from a semantic perspective. And are there not varying levels of separation 
between sentences, not just one (sentences close to each other are often 
related, back references being common)?
>>

There is a theory in computational linguistics that deals with such questions: 
Rhetorical Structure Theory (RST), see http://www.sil.org/~mannb/rst/. Basically, 
each text is seen as a hierarchical structure formed from a few rhetorical 
relations. Interestingly, some relations are not too hard to guess once your 
text is already semi-structured. For instance, the relation between a paragraph 
header and its paragraph is a rhetorical one, an HTML list is a sequence of 
sentences connected by the list relation, and so forth.

Applying such theories to Lucene would require quite a lot of work while
analysing the texts, but I see no reason why Lucene could not be made to work
on such structures and boost the relation between terms more strongly when they
appear within closer RST-structure connections.
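
As a minimal sketch of that idea (plain Python, no Lucene API involved): the
tree layout, the relation labels, the `Node` class, and the boost formula
1 / (1 + distance) below are all assumptions made for illustration, not part
of RST or of Lucene. Two text spans get a larger boost the fewer tree edges
separate them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node of a hypothetical RST-like tree; leaves carry text."""
    relation: str                       # e.g. "span", "list", "leaf"
    text: str = ""                      # only meaningful on leaves
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child and return it, so trees can be built inline."""
        child.parent = self
        self.children.append(child)
        return child


def path_to_root(node):
    """Return the list [node, parent, grandparent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_a = {id(n): i for i, n in enumerate(path_to_root(a))}
    for j, n in enumerate(path_to_root(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def proximity_boost(a, b):
    """Assumed boost: 1.0 for the same span, smaller for distant spans."""
    return 1.0 / (1.0 + tree_distance(a, b))


# A header followed by a two-item list, as in the HTML example above.
doc = Node("span")
header = doc.add(Node("leaf", "Fruit we sell"))
items = doc.add(Node("list"))
item1 = items.add(Node("leaf", "apples"))
item2 = items.add(Node("leaf", "pears"))

print(proximity_boost(item1, item2))   # two items of the same list
print(proximity_boost(header, item1))  # header vs. a list item
```

In this toy tree the two list items are two edges apart and the header is
three edges from either item, so terms in sibling list items would be boosted
more than a header term paired with an item term.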

Regards,

Karsten

Kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com


-----Original Message-----
From: Tatu Saloranta [mailto:[EMAIL PROTECTED] 
Sent: Saturday, 15 November 2003 02:15
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


On Friday 14 November 2003 11:50, Chong, Herb wrote:
> If you are handling inter-term correlation properly, then terms can't cross 
> sentence boundaries. If you are not paying attention to sentence 
> boundaries, then you are not following the rules of linguistics.

Isn't that quite a strict interpretation, however? There are many cases where 
linguistically separate sentences do have strong dependencies; in the web world, 
simple things like list items may be very closely related. Put another way: 
it may not be trivially easy to detect sentence boundaries, nor is it certain 
that what is a boundary from a language viewpoint really is a hard boundary 
from a semantic perspective. And are there not varying levels of separation 
between sentences, not just one (sentences close to each other are often 
related, back references being common)?

As to storing boundaries in the index: would it be naive to suggest simply 
using marker tokens to mark boundaries (sentence, paragraph, section)? Code 
that uses that information would obviously need to know the details of the 
marking used, but would it be infeasible to use such in-band information?
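
A minimal sketch of such in-band markers, in plain Python rather than against
any Lucene API: the marker strings `<s>` and `<p>`, the naive sentence
splitter, and the helper names are all assumptions chosen for illustration.
Two token positions count as being in the same unit if no marker of the chosen
level lies between them, which also gives the "varying levels of separation"
mentioned above.

```python
import re

# Hypothetical marker tokens. The word regex below can never produce them,
# so they cannot collide with real terms.
SENTENCE_MARK = "<s>"
PARAGRAPH_MARK = "<p>"


def tokenize_with_markers(text):
    """Lower-cased word tokens with a marker after each sentence/paragraph."""
    tokens = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        # Naive split: ., ! or ? followed by whitespace ends a sentence.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            words = re.findall(r"\w+", sentence.lower())
            if words:
                tokens.extend(words)
                tokens.append(SENTENCE_MARK)
        tokens.append(PARAGRAPH_MARK)
    return tokens


def same_unit(tokens, i, j, marks=(SENTENCE_MARK, PARAGRAPH_MARK)):
    """True if no boundary marker from `marks` lies between positions i and j."""
    lo, hi = sorted((i, j))
    return not any(t in marks for t in tokens[lo:hi])


text = "Lucene indexes terms. It scores documents.\n\nMarkers are in-band."
tokens = tokenize_with_markers(text)

lucene, terms = tokens.index("lucene"), tokens.index("terms")
scores, markers = tokens.index("scores"), tokens.index("markers")

print(same_unit(tokens, lucene, terms))                             # same sentence
print(same_unit(tokens, lucene, scores))                            # crosses a sentence mark
print(same_unit(tokens, lucene, scores, marks=(PARAGRAPH_MARK,)))   # same paragraph
print(same_unit(tokens, scores, markers, marks=(PARAGRAPH_MARK,)))  # crosses a paragraph mark
```

Passing only `PARAGRAPH_MARK` treats sentence boundaries as soft and paragraph
boundaries as hard; a section marker could be added the same way.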

-+ Tatu +-


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

