I would like the opinion of list members on the best way to approach the 
following task: performing boolean and proximity searches on a large 
body of text (say 400,000 words) and returning the paragraphs where the hits 
are located, each clickable to show the section containing that paragraph.

By boolean I mean "find where 'word a' AND 'word b' occur in the 
same paragraph", and likewise for OR.
By proximity I mean "find where 'word a' is within x words of 'word b'".
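
To pin down the proximity definition, here is a rough sketch in Python (not 
Revolution). The names tokenize and within are hypothetical, and the 
rule for what counts as a word (runs of letters, digits and apostrophes, 
case ignored) is an assumption, not something settled above.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into words.
    Assumption: a word is a run of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within(text, word_a, word_b, max_gap):
    """Return True if word_a occurs within max_gap words of word_b."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    # Compare every occurrence of word_a with every occurrence of word_b.
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

sample = "The quick brown fox jumps over the lazy dog"
print(within(sample, "quick", "fox", 2))  # positions 1 and 3, distance 2
print(within(sample, "quick", "dog", 2))  # positions 1 and 8, distance 7
```

This scans the text directly, which is fine for one paragraph but would be 
slow over the whole body of text on every search.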

Having thought about this for a while, and considered how to store the text 
and ways to search it, I have come to the conclusion that a database of the 
words contained in the text is the way to go. By this I mean effectively 
indexing every word and its position in the text, and then using database 
operations on this index file to produce a list of hits. The hits would give 
me either chunk expressions to display the relevant text blocks, or record IDs 
if I also store the paragraphs as individual records in a further database file.
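
The index idea could look something like the following Python sketch. It 
assumes paragraphs are separated by blank lines and keeps the index in memory 
as a dictionary rather than in a database file; build_index is a 
hypothetical name, and both paragraph numbers and word positions start at 0.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each word to a list of (paragraph_number, word_position) pairs.
    Assumption: paragraphs are separated by one or more blank lines.
    word_position counts words from the start of the whole text."""
    index = defaultdict(list)
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    position = 0
    for para_no, para in enumerate(paragraphs):
        for word in re.findall(r"[a-z0-9']+", para.lower()):
            index[word].append((para_no, position))
            position += 1
    return index, paragraphs

text = "The cat sat on the mat.\n\nThe dog chased the cat."
index, paragraphs = build_index(text)
print(index["cat"])  # one (paragraph, position) pair per occurrence
```

The paragraphs list plays the part of the "further database file": a 
paragraph number from the index is enough to fetch the text to display.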

For example, for a boolean AND, get a selection of 'word a' hits and a 
selection of 'word b' hits, and find where they intersect based on the 
paragraph numbers. This would result in a list of only those paragraphs that 
contain both words.
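
The intersection step might be sketched like this in Python, using sets of 
paragraph numbers. The small hand-written index and the function names 
(paragraphs_with, search_and, search_or) are hypothetical and only 
there to illustrate the idea; a real index would come from the text itself.

```python
def paragraphs_with(index, word):
    """Return the set of paragraph numbers in which word occurs."""
    return {para_no for para_no, _ in index.get(word.lower(), [])}

def search_and(index, word_a, word_b):
    """Paragraphs containing both words: set intersection."""
    return sorted(paragraphs_with(index, word_a) & paragraphs_with(index, word_b))

def search_or(index, word_a, word_b):
    """Paragraphs containing either word: set union."""
    return sorted(paragraphs_with(index, word_a) | paragraphs_with(index, word_b))

# Hypothetical index: word -> list of (paragraph_number, word_position).
index = {
    "cat": [(0, 1), (1, 10)],
    "mat": [(0, 5)],
    "dog": [(1, 7)],
}
print(search_and(index, "cat", "dog"))  # [1]
print(search_or(index, "mat", "dog"))   # [0, 1]
```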

Do members think this is overkill?
Has anybody else looked at this?

Any comments would be appreciated.



James




_______________________________________________
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
