Clearly, you get to work on cooler projects than I :-) I had thought of the keywords/phrases case, but the other ones are far more interesting. Thanks for the explanation!
-Jeff

On Mar 5, 7:02 pm, Malcolm Tredinnick <malc...@pointy-stick.com> wrote:
> On Thu, 2009-03-05 at 12:54 -0800, Jeff FW wrote:
> > Well, then, that is quite a strange use case :-) Never mind my simple
> > methods. Malcolm's suggestion of an extension for postgres seems like
> > a good idea--writing functions in various languages (like Python!) is
> > _really_ easy in postgres.
> >
> > Just out of curiosity (for either of you), what is a search like that
> > used for? I've had a lot of strange requests from a lot of (generally
> > strange) clients, but that's a pretty weird one.
>
> It's not that weird at all. It simply depends on the domains you're
> working in. No idea how it might apply to article headlines, although
> finding "related matches" could well use something like this.
>
> It's very common for finding overlaps in sequences of strings, though.
> The almost "standard" example is DNA sequences where you're trying to
> find if one sequenced set of data (bases extracted from a genetic
> sample) corresponds to anything else already in the database. Since
> there can be damage at the extremities of extractions, or even in the
> middle (or mutations), finding the longest common substring is the
> standard approach. There's a whole related area of research in finding
> the longest palindrome sequences, too, for similar matching and folding
> purposes.
>
> Plagiarism or even "similar article" testing is another case like this.
> Finding all "reasonably long" common sequences between a set of source
> documents and a candidate document is a start.
>
> One case I built something for was a compressed storage and
> transmission system for PDF and ODF documents. That required doing,
> essentially, a context-aware diff'ing process and pulling out any large
> chunks of commonality was the first step.
>
> Finally, not quite the same problem, but highly related, is the issue
> of, say, quickly finding all tags or other keywords or phrases that
> appear in a collection of documents. Sometimes partial matching is an
> appropriate place for generating new phrases, so a modified Aho-Corasick
> search (just to give you a term to search on if you care) is a starting
> point.
>
> This whole domain is a very interesting area for algorithms and
> implementation.
>
> Regards,
> Malcolm
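
For anyone finding this thread later: the "longest common substring" matching Malcolm describes for DNA sequences can be sketched in a few lines of Python. This is only an illustrative dynamic-programming version, not what Malcolm or any database extension actually uses; the function name longest_common_substring and the sample sequences are made up for the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters found contiguously in both a and b.

    Hypothetical helper for illustration only. If several runs tie for the
    longest length, the one that ends earliest in `a` is returned.
    """
    best_len = 0
    best_end = 0  # index in `a` just past the end of the best match so far

    # prev[j] = length of the longest common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr

    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    # Two made-up "reads" that overlap in the middle.
    print(longest_common_substring("GATTACAGGT", "TTTACAGCA"))
```

This takes time proportional to len(a) * len(b), which is fine for short strings but probably too slow for matching against a whole database of sequences; suffix trees or suffix automata are the usual next step there.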