Clearly, you get to work on cooler projects than I :-)  I had thought
of the keywords/phrases case, but the other ones are far more
interesting.  Thanks for the explanation!

-Jeff

On Mar 5, 7:02 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> On Thu, 2009-03-05 at 12:54 -0800, Jeff FW wrote:
> > Well, then, that is quite a strange use case :-)  Never mind my simple
> > methods.  Malcolm's suggestion of an extension for postgres seems like
> > a good idea--writing functions in various languages (like Python!) is
> > _really_ easy in postgres.
>
> > Just out of curiosity (for either of you), what is a search like that
> > used for?  I've had a lot of strange requests from a lot of (generally
> > strange) clients, but that's a pretty weird one.
>
> It's not that weird at all. It simply depends on the domains you're
> working in. No idea how it might apply to article headlines, although
> finding "related matches" could well use something like this.
>
> It's very common for finding overlaps in sequences of strings, though.
> The almost "standard" example is DNA sequences where you're trying to
> find if one sequenced set of data (bases extracted from a genetic
> sample) corresponds to anything else already in the database. Since there
> can be damage at the extremities of extractions, or even in the middle
> (or mutations), finding the longest common substring is the standard
> approach. There's a whole related area of research in finding the
> longest palindrome sequences, too, for similar matching and folding
> purposes.
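For anyone who wants to try the longest-common-substring idea, here is a minimal sketch in Python. It is the textbook dynamic-programming version, not whatever a real sequence database would use (those typically rely on suffix trees or similar structures), and the function name and sample strings are made up for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest substring shared by a and b.

    Textbook dynamic programming: O(len(a) * len(b)) time,
    O(len(b)) extra space. Returns "" if nothing is shared.
    """
    best_len = 0
    best_end = 0  # exclusive end index of the best match within a
    # prev[j] holds the length of the longest common suffix of
    # a[:i-1] and b[:j] from the previous row.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended one character earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical sample data: two short base sequences that differ at the ends.
print(longest_common_substring("GATTACAGGT", "CCTTACAGAA"))
```

The quadratic cost is fine for headlines or short strings, but would probably be too slow for genome-sized inputs.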
>
> Plagiarism or even "similar article" testing is another case like this.
> Finding all "reasonably long" common sequences between a set of source
> documents and a candidate document is a start.
>
> One case I built something for was a compressed storage and
> transmission system for PDF and ODF documents. That essentially
> required a context-aware diffing process, and pulling out any large
> chunks of commonality was the first step.
>
> Finally, not quite the same problem, but highly related, is the issue
> of, say, quickly finding all tags or other keywords or phrases that
> appear in a collection of documents. Sometimes partial matching is an
> appropriate place for generating new phrases, so a modified Aho-Corasick
> search (just to give you a term to search on if you care) is a starting
> point.
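As a starting point for the keyword case, here is a rough sketch of plain Aho-Corasick in Python. It is the standard algorithm, not the modified version Malcolm describes (the partial-matching changes are not shown), and the function names and sample keywords are my own assumptions for illustration.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto/fail/output tables for a set of non-empty keywords."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix
    out = [[]]    # out[node] -> keywords that end at this node
    for word in keywords:
        if not word:
            continue  # skip empty keywords
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass: depth-1 nodes keep fail link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix node.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start_index, keyword) for every occurrence in text."""
    goto, fail, out = build_automaton(keywords)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            matches.append((i - len(word) + 1, word))
    return matches


print(find_keywords("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned once regardless of how many keywords there are, which is the main reason to prefer this over looping over the keywords with `str.find`.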
>
> This whole domain is a very interesting area for algorithms and
> implementation.
>
> Regards,
> Malcolm