Hello Adam

Thanks for the links, I was not aware of them. There is currently no probability value for matching strings. The current heuristic rules are based on known practices, such as ESRI adding the "D_" prefix to datum names, spaces being replaced by '_', and non-alphanumeric characters being ignored. I have not yet found a need to match strings that are only similar. So far I have seen either exact matches under the above rules, or completely different names (e.g. "International 1924" and "Hayford 1909" are the same ellipsoid).
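
To make the rules above concrete, here is a minimal sketch in Python. It is not the SIS code: the function names, the case folding and the small alias table are assumptions made for illustration only. It shows an exact match after normalization, plus a lookup for names that are completely different but designate the same object.

```python
import re

# Hypothetical alias table: maps a normalized name to a canonical one.
# Only the pair mentioned in this message is listed.
ALIASES = {"hayford_1909": "international_1924"}


def normalize(name: str) -> str:
    """Reduce a name to a comparison key (illustrative helper, not the SIS API)."""
    # Spaces are treated as equivalent to '_'.
    key = re.sub(r"\s+", "_", name.strip())
    # ESRI convention: datum names carry a "D_" prefix.
    if key.startswith("D_"):
        key = key[2:]
    # Ignore every character that is neither alphanumeric nor '_'.
    key = "".join(c for c in key if c.isalnum() or c == "_")
    # Case folding is an assumption of this sketch.
    return key.lower()


def same_object(a: str, b: str) -> bool:
    """True if both names match exactly after normalization or through an alias."""
    ka, kb = normalize(a), normalize(b)
    return ALIASES.get(ka, ka) == ALIASES.get(kb, kb)


print(same_object("D_WGS_1984", "WGS 1984"))
print(same_object("International 1924", "Hayford 1909"))
```

The result is a plain yes/no answer, which is why no probability value is available: a name either matches under the rules or it does not.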

Lucene of course has a role, and we actually do use it, but rather in some layers on top of metadata. I think it will come to SIS later, presumably in a separate module...

    Martin



On 26/11/13 18:49, Adam Estrada wrote:
Martin,

Is there a probability value that is returned for the matching
string(s)? I actually just came across a blog post[1] that does
something similar to what you are working towards. They use the
verbiage "best partial" for determining strings of noticeably
different lengths. This appears to be similar to using a Jaccard
index[2] for string comparison but on smaller bodies of text like the
titles of said aliases. Would this be an application for using a
Lucene index that already has all the info retrieval goodness built in
to it?
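
For reference, a Jaccard index over the words of two short titles could be sketched as below. This is only an illustration of the measure from [2], assuming whitespace tokenization and case folding; it is neither the algorithm of the blog post in [1] nor anything taken from Lucene or SIS, and the function name is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the word sets of two strings, between 0.0 and 1.0."""
    # Assumption: words are separated by whitespace and compared case-insensitively.
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        # Two empty strings are treated as identical by convention.
        return 1.0
    # Size of the intersection divided by size of the union.
    return len(words_a & words_b) / len(words_a | words_b)


print(jaccard("International 1924", "International 1924 ellipsoid"))
print(jaccard("International 1924", "Hayford 1909"))
```

Aliases that share no words at all score 0.0 with such a measure, so a similarity score alone cannot recognize them as the same object.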

Adam

[1] http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
[2] http://en.wikipedia.org/wiki/Jaccard_index
