I would like the opinion of list members on the best way to approach the following task: performing boolean and proximity searches on a large body of text (say 400,000 words) and returning the paragraphs where the hits are located, with each paragraph clickable to show the section that contains it.
By boolean I mean "find where 'word a' AND 'word b' occur in the same paragraph", and likewise for OR. By proximity I mean "find where 'word a' is within x words of 'word b'".

Having thought for a while about how to store the text and how to search it, I have come to the conclusion that a database of the words contained in the text is the way to go. By this I mean effectively indexing every word and its position in the text, and then using database operations on this index file to produce a list of hits. The hits would give me either chunk expressions to display the relevant text blocks, or record IDs if I also store the paragraphs as individual records in a further database file.

For example, for a boolean AND I would get a selection of 'word a' hits and a selection of 'word b' hits, and then find where they intersect based on paragraph number. This would result in a list of only those paragraphs that contain both words.

Do members think this is overkill? Has anybody else looked at this? Any comments would be appreciated.

James
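
P.S. To make the idea concrete, here is a minimal sketch of the word-position index described above. It is written in Python rather than Revolution purely for illustration, and it makes several assumptions that are not part of the original plan: paragraphs are separated by blank lines, words are lowercased runs of letters, digits, and apostrophes, the index lives in an in-memory dictionary instead of a database file, and proximity is only checked within a single paragraph. All function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(text):
    """Return (index, paragraphs).

    index maps each lowercased word to a list of
    (paragraph_number, word_position_in_paragraph) pairs.
    Assumption: paragraphs are separated by blank lines.
    """
    index = defaultdict(list)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for para_no, para in enumerate(paragraphs):
        words = re.findall(r"[a-z0-9']+", para.lower())
        for pos, word in enumerate(words):
            index[word].append((para_no, pos))
    return index, paragraphs


def paragraphs_with(index, word):
    """Set of paragraph numbers in which the word occurs."""
    return {para_no for para_no, _ in index.get(word.lower(), [])}


def search_and(index, a, b):
    """Paragraphs containing both words (intersection of the hit sets)."""
    return sorted(paragraphs_with(index, a) & paragraphs_with(index, b))


def search_or(index, a, b):
    """Paragraphs containing either word (union of the hit sets)."""
    return sorted(paragraphs_with(index, a) | paragraphs_with(index, b))


def search_near(index, a, b, x):
    """Paragraphs where word a occurs within x words of word b.

    Assumption: both words must be in the same paragraph.
    """
    b_positions = defaultdict(list)
    for para_no, pos in index.get(b.lower(), []):
        b_positions[para_no].append(pos)
    hits = set()
    for para_no, pos in index.get(a.lower(), []):
        if any(0 < abs(pos - q) <= x for q in b_positions.get(para_no, [])):
            hits.add(para_no)
    return sorted(hits)
```

Each search returns paragraph numbers, which can be used to look up the matching text in the `paragraphs` list (or, in the database version, as record IDs). Storing positions per paragraph keeps the proximity test simple, at the cost of missing pairs that straddle a paragraph boundary; storing a running position across the whole text would be the alternative if that matters.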