Hello,

I have been working on some robust algorithms for text summarization and
matching. A very approximate (and misleading, but that is not important now)
description is:

a) turn documents into lists of sentences
b) turn sentences into lists of words
c) estimate word statistics
d) turn sentences into lists of features, represented by numerical ids
e) compare sentences
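
To make (a) and (b) concrete, here is a minimal PicoLisp sketch (assuming,
only for illustration, that sentences end at periods and words are separated
by blanks; the real code is more careful):

   (de sentences (Str)
      # Split a document into sentences at periods
      (extract pack (split (chop Str) ".")) )

   (de words (Str)
      # Split a sentence into words at blanks
      (extract pack (split (chop Str) " ")) )

('extract' drops the empty pieces, because (pack NIL) is NIL.)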

For this purpose, I use dictionaries whose keys are strings (implemented as
tries). I have recently begun to study the database, and have been wondering
whether it would not be better to use it for that purpose. The idea would be:

a database containing documents, sentences, words, features, ...

It would then be possible to get, for example, the number of occurrences of a
word in all documents or in a single one, or the sentences which contain a
certain word or feature.
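
For illustration, the schema I imagine would look something like this (all
class and relation names are hypothetical; '+Joint' keeps both sides of each
link consistent):

   (class +Doc +Entity)
   (rel nm (+Key +String))                  # document name
   (rel sents (+List +Joint) doc (+Sent))   # its sentences

   (class +Sent +Entity)
   (rel doc (+Joint) sents (+Doc))          # back-link to the document
   (rel wrds (+List +Joint) sents (+Word))  # its words

   (class +Word +Entity)
   (rel txt (+Key +String))                 # the word itself, unique
   (rel cnt (+Number))                      # global occurrence count
   (rel sents (+List +Joint) wrds (+Sent))  # sentences containing it

With such a schema, the sentences containing a given word would simply be
(get Word 'sents).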

Does this sound reasonable?

Is the database fast enough? The basic statistic is word occurrence, which
means that each word encountered would have to be looked up among the external
symbols. Finding the value associated with a string in a trie remains fast
whatever the number of keys.
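
If I understand correctly, a '+Key' relation in the database is maintained in
a B-tree index, so the lookup would be something like

   (db 'txt '+Word "apple")   # the +Word object for "apple", or NIL

which should also stay fast as the number of keys grows, though I do not know
how it compares to a trie in practice.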

Is it possible to automatically propagate some information within the
database? For example, when a word is read, its occurrence count has to be
incremented, but so do the counts of its related features.
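
With the hypothetical schema above (plus a 'feats' list of links on +Word and
a 'cnt' number on each feature), the manual version I have in mind would be
something like:

   (de countWord (Txt)
      # Fetch or create the +Word object, then bump its own
      # count and the counts of all its related features
      (let W (or (db 'txt '+Word Txt) (new! '(+Word) 'txt Txt 'cnt 0))
         (put!> W 'cnt (inc (get W 'cnt)))
         (for F (get W 'feats)
            (put!> F 'cnt (inc (get F 'cnt))) ) ) )

but I would prefer the database to trigger this by itself.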

Numerical ids have been used only for speed reasons: determining the equality
of two numbers is much faster than that of two strings. Nevertheless, this
makes debugging unpleasant. What about comparing two external symbols instead?
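
My understanding is that external symbols are unique within a process, so
'==' (pointer equality) should be as cheap on them as on short numbers:

   (let (A (db 'txt '+Word "apple")  B (db 'txt '+Word "apple"))
      (== A B) )   # -> T: the same external symbol, one pointer comparison

If that holds, the external symbols themselves could replace the numerical
ids.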

Thanks

Denis



