On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> Looking over the latest patch, I think we could simplify the code so
>> that you don't need multiple FuzzyAttrMatchState objects.  Instead of
>> creating a separate one for each RTE and then merging them, just have
>> one.  When you find an inexact-RTE name match, set a field inside the
>> FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
>> Levenshtein distance between the RTEs.  Then call scanRTEForColumn()
>> and pass down the same state object.  Now let
>> updateFuzzyAttrMatchState() work out what it needs to do based on the
>> observed inter-column distance and the currently-in-force RTE penalty.
>
> I'm afraid I don't follow. I think doing things that way makes things
> less clear. Merging is useful because it allows us to consider that an
> exact match might exist, which this searchRangeTableForCol() is
> already tasked with today. We now look for the best match
> exhaustively, or magically return immediately in the event of an exact
> match, without caring about the alias correctness or distance.
>
> Having a separate object makes this pattern apparent from the top
> level, within searchRangeTableForCol(). I feel that's better.
> updateFuzzyAttrMatchState() is the wrong place to put that, because
> that task rightfully belongs in searchRangeTableForCol(), where the
> high level diagnostic-report-generating control flow lives.
>
> To put it another way, creating a separate object obfuscates
> scanRTEForColumn(), since it's the only client of
> updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
> function, and right now I am only making it slightly less clear by
> tasking it with caring about distance of names on top of strict binary
> equality of attribute names. I don't want to push it any further.

I don't buy this.  What you're essentially doing is using the
FuzzyAttrMatchState object in two ways that are not entirely
compatible with each other - updateFuzzyAttrMatchState doesn't set the
RTE fields, so searchRangeTableForCol has to do it.  So there's an
unspoken contract that in some parts of the code, you can rely on
those fields being set, and in others, you can't.  That pretty much
defeats the whole point of making the state its own object, AFAICS.
Furthermore, you end up with two copies of the state-combining logic,
one in FuzzyAttrMatchState and a second in searchRangeTableForCol.
That's ugly and unnecessary.

I decided to rework this patch myself today; my version is attached.
I believe that this version is significantly easier to understand than
yours, both as to the code and the comments.  I put quite a bit of
work into both.  I also suspect it's more efficient, because it avoids
computing the Levenshtein distances for column names when we already
know that those column names can't possibly be sufficiently-good
matches for us to care about the details; and when it does compute the
Levenshtein distance it keeps the max-distance threshold as low as
possible.  That may not really matter much, but it can't hurt.  More
importantly, essentially all of the fuzzy-matching logic is now
isolated in FuzzyAttrMatchState(); the volume of change in
scanRTEForColumn is the same as in your version, but the volume of
change in searchRangeTableForCol is quite a bit less, so the patch is
smaller overall.

I'm prepared to commit this version if nobody finds a problem with it.
It passes the additional regression tests you wrote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: column-hint-rmh.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to