Hello,

I have been working on some robust algorithms for text summarization and
matching. A very approximate (and misleading, but this is not important
now) description is:
a) turn documents into lists of sentences
b) turn sentences into lists of words
c) estimate word statistics
d) turn sentences into lists of features, represented by numerical ids
e) compare sentences

For this purpose, dictionaries whose keys are strings are used
(implemented with tries).

I have recently begun to study the database, and have been wondering
whether it would not be better to use it for this purpose. The idea
would be: a database containing documents, sentences, words, features,
... It would then be possible to get the number of occurrences of one
word in all documents or in a single one, or the sentences which
contain a certain word or feature, for example. (A rough sketch of such
a schema follows at the end of this mail.)

Does this sound reasonable? Is the database fast enough? The basic
statistic is word occurrence, which means that each word encountered
would have to be looked up among the external symbols; for comparison,
finding the value associated with a string in a trie remains fast
whatever the number of keys.

Is it possible to automatically propagate some information within the
database? For example, when a word is read, its occurrence count has to
be incremented, but so do the counts of its related features. (See the
last sketch below for what I have in mind.)

Numerical ids have been used only for speed reasons: determining the
equality of two numbers is much faster than that of two strings. This
nevertheless makes debugging unpleasant. What about comparing two
external symbols?

Thanks,
Denis
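
P.S. To make the idea concrete, here is a rough, untested sketch of the
kind of E/R schema I am imagining (all class and relation names are
placeholders I made up):

# Untested sketch - placeholder names
(class +Doc +Entity)                            # a document
(rel nm (+Need +Key +String))                   # unique document name

(class +Word +Entity)                           # a word
(rel str (+Need +Key +String))                  # the word itself, indexed
(rel cnt (+Number))                             # total occurrence count
(rel sents (+List +Joint) words (+Sentence))    # sentences containing it

(class +Sentence +Entity)                       # a sentence
(rel doc (+Ref +Link) NIL (+Doc))               # the owning document
(rel words (+List +Joint) sents (+Word))        # the words it contains

The '+Joint' relations are meant to keep both sides of the word/sentence
association consistent automatically.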
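
With such a schema, the lookups I described would become index accesses
and property reads; again only a hypothetical sketch:

(pool "words.db")                      # open (or create) a database file

# Fetch a word, creating it on first encounter
(de getWord (Str)
   (or (db 'str '+Word Str)
      (new! '(+Word) 'str Str 'cnt 0) ) )

# All sentences containing "cat"
(; (db 'str '+Word "cat") sents)

# Sentences containing "cat" within one given document
# ('==' tests symbol identity, which I hope is as cheap as comparing
# two numerical ids)
(let (D (db 'nm '+Doc "doc1")  W (db 'str '+Word "cat"))
   (filter '((S) (== D (; S doc))) (; W sents)) )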
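
And for the propagation question, I imagine something like an entity
method which increments its own counter and then those of its related
features ('features' and 'seen>' are again names I made up):

(class +Feature +Entity)                        # a feature
(rel nm (+Need +Key +String))
(rel cnt (+Number))                             # occurrence count
(rel words (+List +Joint) features (+Word))     # words carrying it

(extend +Word)
(rel features (+List +Joint) words (+Feature))  # features of this word

(dm seen> ()                  # called once per occurrence of the word
   (inc!> This 'cnt)          # increment the word's own counter
   (for F (: features)        # ... and propagate to its features
      (inc!> F 'cnt) ) )

Whether something like this would be fast enough is part of my question
above.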