Hello Adam
Thanks for the links, I was not aware of them. There is currently no
probability value for matching string(s). The current heuristic rules
are based on known practices, like ESRI adding the "D_" prefix for
datum, spaces replaced by '_' and non-alphanumeric characters ignored. I
have not yet found a need to match strings that are only similar. So
far I have seen either an exact match under the above rules, or completely
different names (e.g. "International 1924" and "Hayford 1909" are the
same ellipsoid).
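As a rough illustration of the rules above, here is a minimal Python sketch. It is not the actual SIS code: the function names `normalize_name` and `same_name` are hypothetical, and the case folding is an assumption that the rules above do not mention. The sample names are only illustrative.

```python
def normalize_name(name: str) -> str:
    """Reduce a datum or ellipsoid name to a comparison key (hypothetical helper)."""
    # Treat spaces as '_' so "WGS 1984" and "WGS_1984" start out alike.
    key = name.strip().replace(" ", "_")
    # Drop an ESRI-style "D_" datum prefix if present.
    if key.upper().startswith("D_"):
        key = key[2:]
    # Ignore every non-alphanumeric character (this also removes the '_').
    # Assumption: case is folded as well; the rules above do not say so.
    return "".join(ch for ch in key if ch.isalnum()).lower()


def same_name(a: str, b: str) -> bool:
    """Exact match after normalization; no similarity score is computed."""
    return normalize_name(a) == normalize_name(b)


if __name__ == "__main__":
    print(same_name("D_WGS_1984", "WGS 1984"))
    print(same_name("International 1924", "Hayford 1909"))
```

The second pair compares as different, which is the point made above: names like these can only be related through an explicit alias list, not by string similarity.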
Lucene of course has a role, and we do actually use it, but rather in
some layers on top of metadata. I think it will come to SIS later,
presumably in a separate module...
Martin
On 26/11/13 18:49, Adam Estrada wrote:
Martin,
Is there a probability value that is returned for the matching
string(s)? I actually just came across a blog post[1] that does
something similar to what you are working towards. They use the
term "best partial" for comparing strings of noticeably
different lengths. This appears to be similar to using a Jaccard
index[2] for string comparison but on smaller bodies of text like the
titles of said aliases. Would this be an application for using a
Lucene index that already has all the information retrieval goodness
built into it?
Adam
[1] http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
[2] http://en.wikipedia.org/wiki/Jaccard_index
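For reference, the Jaccard index mentioned above is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows, assuming the sets are the lowercased, whitespace-separated words of each name; that tokenization is a choice made for this example, not something taken from the blog post or from SIS, and the function name `jaccard` is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the word sets of two names (hypothetical helper)."""
    # Assumed tokenization: lowercase, split on whitespace.
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    # Convention chosen here: two empty names count as identical.
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print(jaccard("WGS 1984", "WGS 84"))
    print(jaccard("International 1924", "Hayford 1909"))
```

A score of this kind would give a value between 0 and 1 for each candidate, but it scores "International 1924" against "Hayford 1909" as 0 because the two names share no words.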