Hi Denis,

> I have been working on some robust algorithms for text summarization and 
> matching. A very approximate (and misleading, but this is not important now) 
> description is:
> 
> a) turn documents into lists of sentences
> b) turn sentences into lists of words
> c) estimate word statistics
> d) turn sentences into lists of features, represented by numerical ids
> e) compare sentences
> 
> For this purpose, dictionaries whose keys are strings are used
> (implemented with tries). I have recently begun to study the database
> and have been wondering whether it would be better to use it for that
> purpose. The idea would be:
> 
> a database containing documents, sentences, words, features, ...
> 
> And it would be possible to get the number of occurrences of one word in all 
> or one document, or the sentences which contain a certain word or feature for 
> example.
> 
> Does this sound reasonable?

Yes, it does. Using the database has two advantages: (1) your data
becomes persistent, and (2) the database maintains and uses B-Tree
indexes automatically.
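
To make the queries from the question concrete, here is a minimal sketch
in Python. It is not PicoLisp: SQLite stands in only because it is also
persistent and keeps its indexes as B-trees. The table name, the column
names and the helper functions are all invented for this illustration.

```python
import sqlite3

# Hypothetical schema: one row per word occurrence, tagged with its
# document and sentence. ":memory:" keeps the demo self-contained;
# a file path instead would make the data persistent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (doc INTEGER, sentence INTEGER, word TEXT)")
# SQLite stores this index as a B-tree and uses it for lookups by word.
conn.execute("CREATE INDEX occurrence_word ON occurrence (word, doc)")

# Tiny made-up corpus: document id -> list of sentences.
documents = {
    1: ["the cat sat", "the dog ran"],
    2: ["a cat ran"],
}
for doc_id, sentences in documents.items():
    for sent_no, sentence in enumerate(sentences):
        conn.executemany(
            "INSERT INTO occurrence VALUES (?, ?, ?)",
            [(doc_id, sent_no, word) for word in sentence.split()],
        )


def count_word(word, doc=None):
    """Occurrences of `word` in all documents, or in one if `doc` is given."""
    if doc is None:
        query, args = "SELECT COUNT(*) FROM occurrence WHERE word = ?", (word,)
    else:
        query = "SELECT COUNT(*) FROM occurrence WHERE word = ? AND doc = ?"
        args = (word, doc)
    return conn.execute(query, args).fetchone()[0]


def sentences_with(word):
    """(doc, sentence) pairs of every sentence that contains `word`."""
    rows = conn.execute(
        "SELECT DISTINCT doc, sentence FROM occurrence "
        "WHERE word = ? ORDER BY doc, sentence",
        (word,),
    )
    return rows.fetchall()
```

With this toy corpus, `count_word("cat")` counts over both documents,
`count_word("cat", 1)` over document 1 only, and `sentences_with("ran")`
lists the sentences that contain "ran".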


> Is the database fast enough?

Yes. The PicoLisp DB caches every object in memory once it has been
fetched from the DB files, so further operations on that object run at
full speed.
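
The caching idea can be pictured with a small read-through cache. This
is a Python model of the general technique only, not the PicoLisp
implementation; the class `ObjectCache`, the `fake_disk` dictionary and
the key "w1" are made up for the example.

```python
class ObjectCache:
    """Read-through cache: load an object once, then serve it from memory."""

    def __init__(self, load):
        self._load = load      # function that reads one object from storage
        self._objects = {}     # key -> object already fetched
        self.loads = 0         # how often storage was actually touched

    def fetch(self, key):
        if key not in self._objects:
            self.loads += 1
            self._objects[key] = self._load(key)
        return self._objects[key]


# Stand-in for the database file (hypothetical content).
fake_disk = {"w1": {"text": "cat", "count": 3}}
cache = ObjectCache(lambda key: dict(fake_disk[key]))

first = cache.fetch("w1")    # goes to "disk"
second = cache.fetch("w1")   # served from memory, same object
```

Only the first `fetch` reaches the storage layer; every later access to
the same key is an ordinary in-memory lookup.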


> Is it possible to automatically propagate some information within the
> database? For example, when a word is read, its occurrence count has
> to be incremented, but also the occurrence counts of its related
> features.

Yes, this is what the entity/relation daemons in the database are all
about. For example, each class of objects maintains its own count, and
so does each index tree. In addition, you can define an 'upd>' method
for an entity class, which is called whenever an object of that class
is modified.
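
The propagation pattern itself, an update hook that pushes a change on
to related objects, might look roughly like this. The sketch is in
Python, and the names `Entity`, `put`, `updated`, `Word` and `Feature`
are invented here; in particular, `updated` is not meant to mirror the
real signature of 'upd>'.

```python
class Entity:
    """Base class: every modification goes through put(), which fires a hook."""

    def put(self, name, value):
        old = getattr(self, name, None)
        setattr(self, name, value)
        self.updated(name, old)   # hypothetical hook, called after the change

    def updated(self, name, old):
        pass                      # default: nothing to propagate


class Feature(Entity):
    def __init__(self, ident):
        self.ident = ident
        self.count = 0


class Word(Entity):
    def __init__(self, text, features):
        self.text = text
        self.features = features  # Feature objects related to this word
        self.count = 0

    def updated(self, name, old):
        # When the word's count changes, add the same delta to its features.
        if name == "count":
            delta = self.count - old
            for feature in self.features:
                feature.put("count", feature.count + delta)


noun, animal = Feature(1), Feature(2)
cat = Word("cat", [noun, animal])
dog = Word("dog", [animal])

cat.put("count", cat.count + 1)   # "cat" read once
cat.put("count", cat.count + 1)   # "cat" read again
dog.put("count", dog.count + 1)   # "dog" read once
```

The caller only increments the word; the feature counts follow because
the hook runs on every modification.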

♪♫ Alex