On 06/17/2014 02:53 PM, Tom Lane wrote:
> Josh Berkus <j...@agliodbs.com> writes:
>> On 06/17/2014 02:36 PM, Tom Lane wrote:
>>> Another issue is whether to print only those having exactly the minimum
>>> observed Levenshtein distance, or to print everything less than some
>>> cutoff.  The former approach seems to me to be placing a great deal of
>>> faith in something that's only a heuristic.
> 
>> Well, that depends on what the cutoff is.  If it's high, like 0.5, that
>> could be a LOT of columns.  Like, I plan to test this feature with a
>> 3-table join that has a combined 300 columns.  I can completely imagine
>> coming up with a string which is within 0.5 or even 0.3 of 40 column names.
> 
> I think Levenshtein distances are integers, though that's just a minor
> point.

I was giving distance/length ratios.  That is, 0.5 would mean that up to
50% of the characters could be replaced/changed.  0.2 would mean that
only one character could be changed at lengths of five characters.  Etc.

The problem with these ratios is that they behave differently with long
strings than short ones.  I think realistically we'd need a double
threshold, i.e. ( distance <= 2 OR ratio <= 0.4 ).  Otherwise the
obvious case, getting two characters wrong in a 4-character column name
(or one in a two-character name), doesn't get a HINT.
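
To make the double threshold concrete, here is a rough sketch in Python
(purely illustrative, not the patch's C code).  It reads the absolute
cutoff as "at most 2 edits" and takes the ratio as distance divided by
the length of the candidate column name; both readings are assumptions.
The names levenshtein() and should_hint() and the defaults 2 and 0.4 are
made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete, substitute, each costing 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def should_hint(typed: str, column: str,
                max_distance: int = 2, max_ratio: float = 0.4) -> bool:
    """Hypothetical rule: hint if EITHER the absolute or the relative cutoff passes."""
    distance = levenshtein(typed, column)
    # Assumption: the ratio is taken against the candidate column's length.
    ratio = distance / max(len(column), 1)
    return distance <= max_distance or ratio <= max_ratio


if __name__ == "__main__":
    # Two wrong characters in a 4-character name: ratio 0.5, but distance 2.
    print(should_hint("nmae", "name"))
    # Three edits in a long name: distance 3, but ratio is only 3/21.
    print(should_hint("custmer_adres_line", "customer_address_line"))
    # Nothing alike: fails both cutoffs.
    print(should_hint("foo", "customer_address"))
```

The absolute cutoff rescues short names, where one or two typos already
blow past any sensible ratio, and the ratio cutoff rescues long names,
where three or four typos are still obviously the same identifier.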

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
