Hi Denis,

> I have been working on some robust algorithms for text summarization and
> matching. A very approximate (and misleading, but this is not important now)
> description is:
>
> a) turn documents into lists of sentences
> b) turn sentences into lists of words
> c) estimate word statistics
> d) turn sentences into lists of features, represented by numerical ids
> e) compare sentences
>
> For this purpose, dictionaries whose keys are strings are used (implemented
> with tries). I have recently begun to study the database and have been
> wondering whether it would not be better to use it for that purpose. The
> idea would be:
>
> a database containing documents, sentences, words, features, ...
>
> And it would be possible to get the number of occurrences of one word in all
> or one document, or the sentences which contain a certain word or feature,
> for example.
>
> Does this sound reasonable?

Yes, it does. Using the database has two advantages: (1) you get persistence
of your data, and (2) it will automatically use B-tree indexes.

> Is the database fast enough?

Yes. The PicoLisp DB works in such a way that all objects, once fetched from
the DB files, are cached in memory, so that further operations run at full
speed.

> Is it possible to automatically propagate some information within the
> database? For example, when a word is read, its occurrence number has
> to be incremented, but also the occurrences of its related features.

Yes, this is what the entity/relation daemons in the database are all about.
For example, each class of objects maintains its private count, and each
index tree too. In addition, you can define an 'upd>' method for an entity
class which fires when an object is modified.

♪♫ Alex
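
To make the propagation idea concrete, here is a minimal sketch in Python rather than PicoLisp. It only models the concept of an update hook that fires when an object is modified; it is not the PicoLisp database API. All names in it (`Entity`, `Word`, `Feature`, `put`, `upd`, `read_words`) are hypothetical and chosen for illustration, with `upd` loosely named after the 'upd>' method mentioned above.

```python
class Entity:
    """Toy stand-in for a database entity with an update hook (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.count = 0

    def put(self, attr, value):
        # Store the new value, then notify the object about the change.
        old = getattr(self, attr)
        setattr(self, attr, value)
        self.upd(attr, old)

    def upd(self, attr, old):
        # Default hook: do nothing. Subclasses override it to propagate changes.
        pass


class Feature(Entity):
    pass


class Word(Entity):
    def __init__(self, name, features=()):
        super().__init__(name)
        self.features = list(features)

    def upd(self, attr, old):
        # When the word's count changes, apply the same delta to each feature.
        if attr == "count":
            delta = self.count - old
            for feat in self.features:
                feat.put("count", feat.count + delta)


def read_words(words, index):
    """Increment the count of every word read; features follow via the hook."""
    for w in words:
        obj = index[w]
        obj.put("count", obj.count + 1)


noun = Feature("noun")
index = {"cat": Word("cat", [noun]), "dog": Word("dog", [noun])}
read_words(["cat", "dog", "cat"], index)
print(index["cat"].count, index["dog"].count, noun.count)  # prints: 2 1 3
```

The caller only increments the word; the feature counts are kept in step by the hook, so the propagation rule lives in one place, on the class that owns the relation.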