Re: [HACKERS] GiST support for inet datatypes

Andreas Karlsson Thu, 30 Jan 2014 17:59:00 -0800

On 01/23/2014 11:22 PM, Emre Hasegeli wrote:
> inet-gist
> ---------
> I realized that create extension btree_gist fails with the patch.


ERROR:  could not make operator class "gist_inet_ops" be default for type inet
DETAIL:  Operator class "inet_ops" already is the default.

I think making the new operator class default makes more sense. Name conflict
can be a bigger problem. Should I rename the new operator class?

I agree that the new operator class should be the default one, it ismore useful and also not buggy. No idea about the naming.

The only thing is that I think your index would do poorly on the case where
all values share a prefix before the netmask but have different values after
the netmask, since gist_union and gist_penalty does not care about the bits
after the netmask. Am I correct? Have you thought anything about this?


Yes, I have tried to construct the index with ip_maxbits() instead of ip_bits()
but I could not make it work with the cidr type. It can be done by adding
operator as a parameter for union, penalty and picksplit functions on the
GiST framework. I am not sure it worths the trouble. It would only help
the basic comparison operators and it would slow down other operators because
network length comparison before network address would not be possible for
the intermediate nodes. I mentioned this behavior of the operator class on
the paragraph added to the documentation.

I think this is fine since it does not seem like a very useful case tosupport to me. Would be worth doing only if it is easy to do.

A separate concern: we still have not come to a decision about operatorsfor inet here. I do not like the fact that the operators differ betweenranges and inet. But I feel this might be out of scope for this patch.


Does any third party want to chime in with an opinion?

Current inet/cidr
<<        is contained within
<<=       is contained within or equals
>>        contains
>>=       contains or equals

Range types
@>   contains range
<@   range is contained by
&&      overlap (have points in common)
<<        strictly left of
>>        strictly right of

inet-selfuncs
-------------

Here I have to honestly admit I am not sure if I can tell if your
selectivity function is correct. Your explanation sounds convincing and the
general idea of looking at the histogram is correct. I am just have no idea
if the details are right.


I tried to improve it in this version. I hope it is more readable now. I used
the word inclusion instead of overlap, made some changes on the algorithm
to make it more suitable to the other operators.

Using the histogram for inclusion which is originally for basic comparison,
is definitely not correct. It is an empirical approach. I have tried several
versions of it, tested them with different data sets and thought it is better
than nothing.

Thanks for the updates. The code in this version is cleaner and easierto follow.

I am not convinced of your approach to calculating the selectivity fromthe histogram. The thing I have the problem with is the clever trickeryinvolved with how you handle different operator types. I prefer theclearer code of the range types with how calc_hist_selectivity_scalar isused. Is there any reason for why that approach would not work here orresult in worse code?

I have attached a patch where I improved the language of your commentdescribing the workings of the selectivity estimation. Could you pleasecheck it so I did not change the meaning of any of it?

I see from the tests that you still are missing selectivity functionsfor operators, what is your plan for this?


--
Andreas Karlsson

diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
new file mode 100644
index 87a7390..3b99afe
*** a/src/backend/utils/adt/network_selfuncs.c
--- b/src/backend/utils/adt/network_selfuncs.c
*************** inet_opr_order(Oid operator)
*** 172,181 ****
   *
   * Calculates histogram selectivity for the subnet inclusion operators of
   * the inet type. In the normal case, the return value is between 0 and 1.
!  * It should be corrected with MVC selectivity and null fraction. If
   * the constant is less than the first element or greater than the last
   * element of the histogram the return value will be 0. If the histogram
!  * does not available, the return value will be -1.
   *
   * The histogram is originally for the basic comparison operators. Only
   * the common bits of the network part and the lenght of the network part
--- 172,181 ----
   *
   * Calculates histogram selectivity for the subnet inclusion operators of
   * the inet type. In the normal case, the return value is between 0 and 1.
!  * It should be corrected with the MVC selectivity and null fraction. If
   * the constant is less than the first element or greater than the last
   * element of the histogram the return value will be 0. If the histogram
!  * is not available, the return value will be -1.
   *
   * The histogram is originally for the basic comparison operators. Only
   * the common bits of the network part and the lenght of the network part
*************** inet_opr_order(Oid operator)
*** 186,218 ****
   * To avoid this problem, comparison with the left and the right side of the
   * buckets used together.
   *
!  * Histogram bucket matches calculated in 3 forms. If the constant matches
!  * the both sides, bucket considered as fully matched. If the constant
!  * matches only the right side, bucket does not considered as matched at all.
!  * In that case, the ratio for only 1 value in the column added to
!  * the selectivity.
   *
!  * The ratio for only 1 value, calculated with the ndistinct variable,
!  * if greater than 0. 0 can be given to it if this behavior does not desired.
!  * This ratio can be big enough not to disregard for addresses with small
!  * masklen's. See pg_statistic for more information about it.
   *
   * When the constant matches only the right side of the bucket, it will match
   * the next bucket, unless the bucket is the last one. If these buckets would
!  * considered as matched, it would lead unfair multiple matches for some
   * constants.
   *
!  * The third form is to match the bucket partially. Two dividers for both
!  * of the boundries tried to be calculated, in this case. If the address
!  * family of the boundry does not match the constant or comparison of
!  * the lenght of the network parts does not true by the operator, the divider
!  * for the boundry would not taken into account. If both of the dividers
!  * can be calculated, some kind of geometrical mean will be used.
   *
!  * For partial match with the buckets which have different address families
!  * on the left and right sides, only the boundry with the same address
!  * family, taken into consideration. This can cause more mistake for these
!  * buckets if their masklen's of their boundries are also disperate. It can
   * only be the case for one bucket, if there are addresses with different
   * families on the column. It seems as a better option than not considering
   * these buckets.
--- 186,218 ----
   * To avoid this problem, comparison with the left and the right side of the
   * buckets used together.
   *
!  * Histogram bucket matches are calculated in 3 forms. If the constant
!  * matches both sides the bucket is considered as fully matched. If the
!  * constant matches only the right side the bucket is not considered as
!  * matched at all. In that case the ratio for only one value in the column
!  * is added to the selectivity.
   *
!  * The ratio for only one value is calculated with the ndistinct variable
!  * if greater than 0. 0 can be given if this behavior is not desired.
!  * This ratio can be big enough to not disregard addresses with small
!  * masklens. See pg_statistic for more information about it.
   *
   * When the constant matches only the right side of the bucket, it will match
   * the next bucket, unless the bucket is the last one. If these buckets would
!  * be considered as matched it would lead to unfair multiple matches for some
   * constants.
   *
!  * The third form is to match the bucket partially. We try to calculate
!  * dividers for both of the boundaries. If the address family of the boundary
!  * does not match the constant or comparison of the lenght of the network
!  * parts is not true by the operator the divider for the boundary would not
!  * taken into account. If both of the dividers can be calculated some kind
!  * of geometrical mean of them will be used.
   *
!  * For partial match with buckets which have different address families
!  * on the left and right sides only the boundary with the same address
!  * family is taken into consideration. This can cause more mistakes for these
!  * buckets if the masklens of their boundaries are also disparate. It can
   * only be the case for one bucket, if there are addresses with different
   * families on the column. It seems as a better option than not considering
   * these buckets.
*************** inet_masklen_inclusion_cmp(inet *left, i
*** 375,383 ****
  /*
   * Inet histogram partial match divider calculation
   *
!  * First, the families and the lenghts of the network parts are compared
!  * using the subnet inclusion operator. If they are not -1 returned which
!  * means divider not available.
   *
   * The divider is imagined as the distance between the decisive bits and
   * the common bits of the addresses. The divider will be used as power of two
--- 375,383 ----
  /*
   * Inet histogram partial match divider calculation
   *
!  * First the families and the lenghts of the network parts are compared
!  * using the subnet inclusion operator. If they are not equal -1 returned is
!  * which means a divider not available.
   *
   * The divider is imagined as the distance between the decisive bits and
   * the common bits of the addresses. The divider will be used as power of two

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GiST support for inet datatypes

Reply via email to