On Thu, 2009-03-05 at 12:54 -0800, Jeff FW wrote:
> Well, then, that is quite a strange use case :-) Nevermind my simple
> methods. Malcom's suggestion of an extension for postgres seems like
> a good idea--writing functions in various languages (like Python!) is
> _really_ easy in postgres.
>
> Just out of curiosity (for either of you,) what is a search like that
> used for? I've had a lot of strange requests from a lot of (generally
> strange) clients, but that's a pretty weird one.

It's not that weird at all. It simply depends on the domains you're working in. No idea how it might apply to article headlines, although finding "related matches" could well use something like this.

It's very common for finding overlaps in sequences of strings, though. The almost "standard" example is DNA sequences, where you're trying to find out whether one sequenced set of data (bases extracted from a genetic sample) corresponds to anything else already in the database. Since there can be damage at the extremities of extractions, or even in the middle (or mutations), finding the longest common substring is the standard approach. There's a whole related area of research in finding the longest palindromic sequences, too, for similar matching and folding purposes.

Plagiarism or even "similar article" testing is another case like this. Finding all "reasonably long" common sequences between a set of source documents and a candidate document is a start.

One case I built something for was a compressed storage and transmission system for PDF and ODF documents. That required doing, essentially, a context-aware diffing process, and pulling out any large chunks of commonality was the first step.

Finally, not quite the same problem, but highly related, is the issue of, say, quickly finding all tags or other keywords or phrases that appear in a collection of documents. Sometimes partial matching is an appropriate place for generating new phrases, so a modified Aho-Corasick search (just to give you a term to search on if you care) is a starting point.

This whole domain is a very interesting area for algorithms and implementation.

Regards,
Malcolm
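
For anyone who wants something concrete to play with: below is a minimal sketch of the longest-common-substring idea mentioned above, using the textbook dynamic-programming table. The function name longest_common_substring is only illustrative (it is not from Django, postgres, or any library), and the quadratic running time makes it suitable for experimenting, not for real sequence databases, where suffix-tree or suffix-array approaches are the usual choice.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of characters that occurs contiguously in both a and b.

    Illustrative helper (hypothetical name). Runs in O(len(a) * len(b)) time
    and keeps only two rows of the table in memory.
    """
    best_len = 0   # length of the longest common run found so far
    best_end = 0   # index in `a` just past the end of that run
    prev = [0] * (len(b) + 1)

    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2] and b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr

    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    # Two made-up base sequences that differ at both ends.
    print(longest_common_substring("GATTACAGG", "CCTTACATT"))  # TTACA
```

The strict ">" comparison means that, when several common runs tie for the longest, the one that appears first in the first argument is returned.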