Re: Current state of work (2013-11-25)

Martin Desruisseaux Tue, 26 Nov 2013 16:19:29 -0800

Hello Adam

Thanks for the links, I was not aware of them. There is currently no probability value for matching string(s). The current heuristic rules are based on known practices, like ESRI adding the "D_" prefix for datum, spaces replaced by '_' and non-alphanumeric characters ignored. I have not yet found a need to match strings that are only similar. For now I have seen either exact matches under the above rules, or completely different names (e.g. "International 1924" and "Hayford 1909" are the same ellipsoid).
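
To make the rules above concrete, here is a minimal sketch in Python (not the actual SIS code). The names `normalize` and `heuristic_match` are hypothetical, and folding case is an assumption of this sketch, not something stated above. Completely different names for the same object are handled by comparing against every known name of that object, not by string similarity.

```python
def normalize(name: str, strip_datum_prefix: bool = False) -> str:
    """Reduce a name to a comparison key (hypothetical helper, not SIS API)."""
    # ESRI practice: datum names carry a "D_" prefix.
    if strip_datum_prefix and name.startswith("D_"):
        name = name[2:]
    # Spaces, '_' and all other non-alphanumeric characters are ignored.
    # Lower-casing is an assumption of this sketch.
    return "".join(ch.lower() for ch in name if ch.isalnum())


def heuristic_match(name: str, known_names: list[str], is_datum: bool = False) -> bool:
    """True if 'name' equals any known name or alias once both are normalized."""
    key = normalize(name, is_datum)
    return any(normalize(known, is_datum) == key for known in known_names)


# Aliases of one ellipsoid: matched through the alias list, not by similarity.
ellipsoid_names = ["International 1924", "Hayford 1909"]
print(heuristic_match("Hayford_1909", ellipsoid_names))
print(heuristic_match("D_North_American_1983", ["North American 1983"], is_datum=True))
```

The result is a plain yes/no answer, which is why no probability value comes out of it.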

Lucene of course has a role, and we actually do use it, but rather in some layers on top of metadata. I think it will come to SIS later, presumably in a separate module...


    Martin



On 26/11/13 18:49, Adam Estrada wrote:

Martin,

Is there a probability value that is returned for the matching
string(s)? I actually just came across a blog post[1] that does
something similar to what you are working towards. They use the
verbiage "best partial" for determining strings of noticeably
different lengths. This appears to be similar to using a Jaccard
index[2] for string comparison but on smaller bodies of text like the
titles of said aliases. Would this be an application for using a
Lucene index that already has all the info retrieval goodness built in
to it?
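
For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|. A rough sketch of how it could score two short titles, assuming whitespace-separated, lower-cased word tokens (the tokenization and the function name `jaccard` are illustrative assumptions, not taken from the linked posts):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the word sets of two strings (illustrative sketch)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        # Two empty strings are treated as identical by convention here.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("North American Datum 1983", "North American 1983"))  # 3 shared of 4 words: 0.75
print(jaccard("International 1924", "Hayford 1909"))                # no shared words: 0.0
```

Note that a score like this cannot tell that two names with no word in common refer to the same object.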

Adam

[1] http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
[2] http://en.wikipedia.org/wiki/Jaccard_index
