> On Dec 21, 2010, at 11:37 , t...@fuzzy.cz wrote:
>> I doubt there is a way to make this decision with just dist(A), dist(B)
>> and dist(A,B) values. Well, we could go with a rule
>>
>>    if [dist(A) == dist(A,B)] then [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
>> with exact values of dist(?)), and then about data errors (e.g. a city
>> matched to an incorrect ZIP code or something like that).
>
> Huh? The whole point of the F(A,B) exercise is to avoid precisely this
> kind of fragility without penalizing the non-correlated case...
Yes, I understand the intention, but I'm not sure how exactly you want to
use the F(?,?) function to compute P(A,B) - which is the value we're
looking for.

If I understand it correctly, you proposed something like this

    IF (F(A,B) > F(B,A)) THEN
        P(A,B) := c * P(A);
    ELSE
        P(A,B) := d * P(B);
    END IF;

or something like that (I guess c = dist(A)/dist(A,B) and
d = dist(B)/dist(A,B)).

But what if F(A,B) = 0.6 and F(B,A) = 0.59? That can easily happen due to
data errors or an imprecise estimate, and it actually matters, because
P(A) and P(B) may be significantly different. So this rule would be very
vulnerable to slight changes in the estimates (see the sketch at the end
of this mail).

>> This is the reason why they choose to always combine the values (with
>> varying weights).
>
> There are no varying weights involved there. What they do is to express
> P(A=x,B=y) once as
>
> ...
>
>    P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>                = dist(A)*P(A=x)/(2*dist(A,B)) +
>                  dist(B)*P(B=y)/(2*dist(A,B))
>                = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>
> That averaging step adds *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the two
values with different weights (just as you explained).

regards
Tomas
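
PS: just to illustrate the fragility above, here's a small Python sketch
(the numbers, names and inputs below are made up for this example - it is
not code from the patch or from the paper):

    # Hypothetical inputs: selectivities P(A), P(B), the distinct-value
    # counts dist(A), dist(B), dist(A,B), and the two F values.
    # All concrete numbers are invented for illustration only.
    def p_threshold(p_a, p_b, dist_a, dist_b, dist_ab, f_ab, f_ba):
        # hard decision: pick one implication direction and trust it fully
        if f_ab > f_ba:
            return (dist_a / dist_ab) * p_a   # c * P(A)
        else:
            return (dist_b / dist_ab) * p_b   # d * P(B)

    def p_averaged(p_a, p_b, dist_a, dist_b, dist_ab):
        # the averaged formula quoted above - does not look at F at all
        return (dist_a * p_a + dist_b * p_b) / (2.0 * dist_ab)

    p_a, p_b = 0.001, 0.2                     # P(A) and P(B) differ a lot
    dist_a, dist_b, dist_ab = 1000, 50, 1100

    print(p_threshold(p_a, p_b, dist_a, dist_b, dist_ab, 0.60, 0.59))  # ~0.0009
    print(p_threshold(p_a, p_b, dist_a, dist_b, dist_ab, 0.59, 0.60))  # ~0.0091
    print(p_averaged(p_a, p_b, dist_a, dist_b, dist_ab))               # 0.005

Nudging F(B,A) from 0.59 to 0.60 changes the threshold estimate by an
order of magnitude, while the averaged estimate stays at 0.005 no matter
how the two F values compare.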