On Thu, 2009-03-05 at 12:54 -0800, Jeff FW wrote:
> Well, then, that is quite a strange use case :-)  Never mind my simple
> methods.  Malcolm's suggestion of an extension for postgres seems like
> a good idea--writing functions in various languages (like Python!) is
> _really_ easy in postgres.
> 
> Just out of curiosity (for either of you,) what is a search like that
> used for?  I've had a lot of strange requests from a lot of (generally
> strange) clients, but that's a pretty weird one.

It's not that weird at all. It simply depends on the domains you're
working in. No idea how it might apply to article headlines, although
finding "related matches" could well use something like this.

It's very common for finding overlaps in sequences of strings, though.
The almost "standard" example is DNA sequences, where you're trying to
find whether one sequenced set of data (bases extracted from a genetic
sample) corresponds to anything else already in the database. Since
there can be damage at the extremities of extractions, or even in the
middle (or mutations), finding the longest common substring is the
standard approach. There's a whole related area of research in finding
the longest palindromic sequences, too, for similar matching and folding
purposes.
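
To make "longest common substring" concrete, here is a minimal sketch
using the textbook dynamic-programming approach. The function name and
the sample strings are made up for illustration; real sequence-matching
tools typically use suffix trees or similar structures rather than this
quadratic-time loop.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest substring that occurs in both a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) in `a` of the best match so far

    # prev[j] holds the length of the longest common suffix of
    # a[:i-1] and b[:j]; only one previous row is needed at a time.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended one character earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("GATTACAGGT", "CCTTACAGAA"))
```

This runs in O(len(a) * len(b)) time, which is fine for short strings
but probably too slow for matching against a large database.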

Plagiarism or even "similar article" testing is another case like this.
Finding all "reasonably long" common sequences between a set of source
documents and a candidate document is a start.
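
As a rough illustration of that starting point, the sketch below indexes
every run of min_words consecutive words in each source document and
reports the runs that also appear in the candidate. The function name,
the min_words threshold and the whitespace-based tokenising are all
assumptions made for the example, not a description of any particular
plagiarism checker.

```python
def shared_word_runs(sources, candidate, min_words=5):
    """Map each source name to the word runs it shares with candidate.

    `sources` is a dict of {name: text}. A "run" here is any sequence of
    exactly `min_words` consecutive words (a word n-gram).
    """
    def runs(text):
        words = text.lower().split()
        # An empty set results when the text is shorter than min_words.
        return {
            tuple(words[i:i + min_words])
            for i in range(len(words) - min_words + 1)
        }

    candidate_runs = runs(candidate)
    hits = {}
    for name, text in sources.items():
        common = runs(text) & candidate_runs
        if common:
            hits[name] = sorted(" ".join(run) for run in common)
    return hits


sources = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "nothing in common here at all",
}
print(shared_word_runs(sources, "we saw the quick brown fox jumps yesterday",
                       min_words=4))
```

Overlapping runs could then be merged into longer passages, which is
where something like the longest common substring search comes back in.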

One case I built something for was a compressed storage and
transmission system for PDF and ODF documents. That required doing,
essentially, a context-aware diff'ing process, and pulling out any large
chunks of commonality was the first step.

Finally, not quite the same problem, but highly related, is the issue
of, say, quickly finding all tags or other keywords or phrases that
appear in a collection of documents. Sometimes partial matching is an
appropriate basis for generating new phrases, so a modified Aho-Corasick
search (just to give you a term to search on if you care) is a starting
point.
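
For anyone who does look the term up, here is a minimal sketch of the
plain (unmodified) Aho-Corasick search: build a trie of the keywords,
add failure links, then scan the text once and report every keyword
occurrence. The function names and the list-based node layout are
choices made for this example; the partial-matching modifications
mentioned above are not shown.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links and output lists for keywords."""
    goto = [{}]   # goto[node] maps a character to the child node
    fail = [0]    # fail[node] is the node to fall back to on a mismatch
    out = [[]]    # out[node] lists the keywords that end at this node

    for word in keywords:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes fail to the root; deeper nodes
    # follow their parent's failure chain until the character matches.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_keywords(text, keywords):
    """Return (start_index, keyword) for every occurrence in text."""
    goto, fail, out = build_automaton(keywords)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            matches.append((i - len(word) + 1, word))
    return matches


print(find_keywords("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned only once no matter how many keywords there are,
which is what makes this attractive for large tag or phrase lists.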

This whole domain is a very interesting area for algorithms and
implementation.

Regards,
Malcolm


