On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan <p...@heroku.com> wrote: > On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmh...@gmail.com> wrote: >> Looking over the latest patch, I think we could simplify the code so >> that you don't need multiple FuzzyAttrMatchState objects. Instead of >> creating a separate one for each RTE and then merging them, just have >> one. When you find an inexact-RTE name match, set a field inside the >> FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the >> Levenshtein distance between the RTEs. Then call scanRTEForColumn() >> and pass down the same state object. Now let >> updateFuzzyAttrMatchState() work out what it needs to do based on the >> observed inter-column distance and the currently-in-force RTE penalty. > > I'm afraid I don't follow. I think doing things that way makes things > less clear. Merging is useful because it allows us to consider that an > exact match might exist, which this searchRangeTableForCol() is > already tasked with today. We now look for the best match > exhaustively, or magically return immediately in the event of an exact > match, without caring about the alias correctness or distance. > > Having a separate object makes this pattern apparent from the top > level, within searchRangeTableForCol(). I feel that's better. > updateFuzzyAttrMatchState() is the wrong place to put that, because > that task rightfully belongs in searchRangeTableForCol(), where the > high level diagnostic-report-generating control flow lives. > > To put it another way, creating a separate object obfuscates > scanRTEForColumn(), since it's the only client of > updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important > function, and right now I am only making it slightly less clear by > tasking it with caring about distance of names on top of strict binary > equality of attribute names. I don't want to push it any further.
I don't buy this. What you're essentially doing is using the FuzzyAttrMatchState object in two ways that are not entirely compatible with each other - updateFuzzyAttrMatchState doesn't set the RTE fields, so searchRangeTableForCol has to do it. So there's an unspoken contract that in some parts of the code, you can rely on those fields being set, and in others, you can't. That pretty much defeats the whole point of making the state its own object, AFAICS. Furthermore, you end up with two copies of the state-combining logic, one in FuzzyAttrMatchState and a second in searchRangeTableForCol. That's ugly and unnecessary. I decided to rework this patch myself today; my version is attached. I believe that this version is significantly easier to understand than yours, both as to the code and the comments. I put quite a bit of work into both. I also suspect it's more efficient, because it avoids computing the Levenshtein distances for column names when we already know that those column names can't possibly be sufficiently-good matches for us to care about the details; and when it does compute the Levenshtein distance it keeps the max-distance threshold as low as possible. That may not really matter much, but it can't hurt. More importantly, essentially all of the fuzzy-matching logic is now isolated in FuzzyAttrMatchState(); the volume of change in scanRTEForColumn is the same as in your version, but the volume of change in searchRangeTableForCol is quite a bit less, so the patch is smaller overall. I'm prepared to commit this version if nobody finds a problem with it. It passes the additional regression tests you wrote. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
column-hint-rmh.patch
Description: binary/octet-stream
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers