Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> But why are we doing that CHAREQ?
>
> To avoid the cost of the recursive call, just like it says.
>
>> If it succeeds we'll
>> just do it again when we recurse, I think.
>
> If you move the other two cases then you could advance t and p before
> entering the recursion.

Yeah. Since I have removed the "_" case I believe it's now safe there to
use BYTEEQ/NextByte, and since they are sufficiently cheap it's not
worth worrying about.

Attached is a patch version that I think draws together all the threads
of discussion so far. It's in fact quite a lot simpler than the existing
code, with no special UTF8 case - this should improve LIKE/ILIKE
processing for all charsets.

More eyeballs please for nasty corner cases.

cheers

andrew

? src/backend/utils/adt/.deps
Index: src/backend/utils/adt/like.c
===
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	21 May 2007 16:50:48 -
***
*** 28,48
  #define LIKE_ABORT						(-1)

! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,48
  #define LIKE_ABORT						(-1)

! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	Generic_Text_IC_like(text *str, text *pat);
+
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***
*** 72,86
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
- *
- * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
- * Is it worth refactoring to avoid duplicated code? They might become
- * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 72,84
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)	(*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***
*** 89,122
  		while (__l-- > 0) \
  			*(dst)++ = *(src)++; \
  	} while (0)

! #define MatchText	MBMatchText
! #define MatchTextIC	MBMatchTextIC
  #define do_like_escape	MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

! /* And some support for BYTEA */
! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  /*
   * interface routines called by the function manage
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> But why are we doing that CHAREQ?

To avoid the cost of the recursive call, just like it says.

> If it succeeds we'll
> just do it again when we recurse, I think.

If you move the other two cases then you could advance t and p before
entering the recursion.

			regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
Re: [PATCHES] UTF8MatchText
[EMAIL PROTECTED] wrote:
>> Doh, you're right ... but on third thought, what happens with a pattern
>> containing "%_"? If % tries to advance bytewise then we'll be trying to
>> apply NextChar in the middle of a data character, and bad things ensue.
>
> Right, when you have '_' after a '%' you need to make sure the '%'
> advances full characters. In my suggestion the test whether '_' (or '\')
> comes after the '%' is done once, and it selects which of the two loops
> to use: the one that does byte stepping or the one with NextChar.
>
> It's difficult to know for sure that we have thought about all the
> corner cases. I hope the gain is worth the effort.. :-)

Yes, I came to the same conclusion about how to restructure the code.

The current code contains this:

	while (tlen > 0)
	{
		/*
		 * Optimization to prevent most recursion: don't recurse
		 * unless first pattern char might match this text char.
		 */
		if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
		{
			int		matched = MatchText(t, tlen, p, plen);

			if (matched != LIKE_FALSE)
				return matched;	/* TRUE or ABORT */
		}

		NextChar(t, tlen);
	}

The code appears to date from v 1.23 of like.c way back in 2001. I'm not
sure I agree with the comment, though. In the first place, the invariant
tests should not be in the loop, I think, and I'll hoist them out as
Dennis suggests. But why are we doing that CHAREQ? If it succeeds we'll
just do it again when we recurse, I think.

cheers

andrew
Re: [PATCHES] UTF8MatchText
> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"? If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.

Right, when you have '_' after a '%' you need to make sure the '%'
advances full characters. In my suggestion the test whether '_' (or '\')
comes after the '%' is done once, and it selects which of the two loops
to use: the one that does byte stepping or the one with NextChar.

It's difficult to know for sure that we have thought about all the
corner cases. I hope the gain is worth the effort.. :-)

/Dennis
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> On the strength of this analysis, shouldn't we drop the separate
>>> UTF8 match function and just use SB_MatchText for UTF8?
>
>> We still call NextChar() after "_", and I think we probably need to,
>> don't we? If so we can't just marry the cases.
>
> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"? If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.
>
> I think we need to go back to the scheme with SB_ and MB_ variants
> and no special case for UTF8.

My head is spinning with all these variants. I'll look at it tomorrow.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> On the strength of this analysis, shouldn't we drop the separate
>> UTF8 match function and just use SB_MatchText for UTF8?

> We still call NextChar() after "_", and I think we probably need to,
> don't we? If so we can't just marry the cases.

Doh, you're right ... but on third thought, what happens with a pattern
containing "%_"? If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.

I think we need to go back to the scheme with SB_ and MB_ variants
and no special case for UTF8.

			regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?

We still call NextChar() after "_", and I think we probably need to,
don't we? If so we can't just marry the cases.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>>> downcasing what will in most cases be the same pattern over and over
>>> again. Maybe we need to look at memoizing that somehow, or at least
>>> test to see if that would be a gain.
>>
>> Someone (Itagaki-san IIRC) suggested that we ought to convert
>> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
>> stage, definitely before constant-folding in the planner.
>
> Sounds like a TODO item. I'm already concerned a bit about scope creep
> for this item.

Agreed, I don't want to tackle this right now --- I'm just suggesting
it's probably a better answer than memoizing at runtime.

			regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> Andrew Dunstan <[EMAIL PROTECTED]> writes:
>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>> downcasing what will in most cases be the same pattern over and over
>> again. Maybe we need to look at memoizing that somehow, or at least
>> test to see if that would be a gain.
>
> Someone (Itagaki-san IIRC) suggested that we ought to convert
> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early stage,
> definitely before constant-folding in the planner. That would take
> care of that issue without any run-time mechanism, and would open
> opportunities for making use of an index on lower(x). I recall
> thinking at the time that there were some potential downsides, but
> right at the moment I'm darned if I can see any --- especially if
> we're going to make ILIKE do this uniformly at runtime anyway.

Sounds like a TODO item. I'm already concerned a bit about scope creep
for this item.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Yeah, quite possibly. I'm also wondering if we are wasting effort
> downcasing what will in most cases be the same pattern over and over
> again. Maybe we need to look at memoizing that somehow, or at least
> test to see if that would be a gain.

Someone (Itagaki-san IIRC) suggested that we ought to convert
"x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early stage,
definitely before constant-folding in the planner. That would take care
of that issue without any run-time mechanism, and would open
opportunities for making use of an index on lower(x). I recall thinking
at the time that there were some potential downsides, but right at the
moment I'm darned if I can see any --- especially if we're going to make
ILIKE do this uniformly at runtime anyway.

			regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?

Possibly - IIRC I looked at that and there was some reason I didn't, but
I'll look again.

> It strikes me that we may be overcomplicating matters in another way
> too. If you believe that the %-scan code is now the bottleneck, that
> is, the key loop is where we have pattern '%foo' and we are trying to
> match 'f' to each successive data position, then you should be
> bothered that SB_MatchTextIC is applying tolower() to 'f' again for
> each data character. Worst-case we could have O(N^2) applications of
> tolower() during a match.
>
> I think there's a fair case to be made that we should get rid of
> SB_MatchTextIC and implement *all* the case-insensitive variants by
> means of an initial lower() call. This would leave us with just two
> match functions and allow considerable unification of the setup logic.

Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would be a gain.

We're getting quite a long way from the original patch :-)

cheers

andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

> Here is a patch that implements this. Please analyse for possible
> breakage.

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?

It strikes me that we may be overcomplicating matters in another way
too. If you believe that the %-scan code is now the bottleneck, that is,
the key loop is where we have pattern '%foo' and we are trying to match
'f' to each successive data position, then you should be bothered that
SB_MatchTextIC is applying tolower() to 'f' again for each data
character. Worst-case we could have O(N^2) applications of tolower()
during a match.

I think there's a fair case to be made that we should get rid of
SB_MatchTextIC and implement *all* the case-insensitive variants by
means of an initial lower() call. This would leave us with just two
match functions and allow considerable unification of the setup logic.

			regards, tom lane
Re: [PATCHES] UTF8MatchText
oops. patch attached this time

Andrew Dunstan wrote:
> I wrote:
>> It is only when you have a pattern like '%_' when this is a problem
>> and we could detect this and do byte by byte when it's not. Now we
>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>> characters for '%', and we could do it once and have different loops
>> for the two cases.
>>
>> Other than this part that I think can be optimized I don't see
>> anything wrong with the idea behind the patch. To make the '%' case
>> fast might be an important optimization for a lot of use cases. It's
>> not uncommon that '%' matches a bigger part of the string than the
>> rest of the pattern.
>
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

Here is a patch that implements this. Please analyse for possible
breakage.

cheers

andrew

Index: src/backend/utils/adt/like.c
===
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	20 May 2007 14:16:22 -
***
*** 28,48
  #define LIKE_ABORT						(-1)

! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,51
  #define LIKE_ABORT						(-1)

! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static int	SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
+ static int	UTF8_MatchText(char *t, int tlen, char *p, int plen);
+
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	mbtexticlike(text *str, text *pat);
+
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***
*** 72,86
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
- *
- * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
- * Is it worth refactoring to avoid duplicated code? They might become
- * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 75,87
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)	(*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***
*** 89,122
  		while (__l-- > 0) \
  			*(dst)++ = *(src)++; \
  	} while (0)

! #define MatchText	MBMatchText
! #define MatchTextIC	MBMatchTextIC
  #define do_like_escape	MB_do_like_escape

  #i
Re: [PATCHES] UTF8MatchText
I wrote:
>> It is only when you have a pattern like '%_' when this is a problem
>> and we could detect this and do byte by byte when it's not. Now we
>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>> characters for '%', and we could do it once and have different loops
>> for the two cases.
>>
>> Other than this part that I think can be optimized I don't see
>> anything wrong with the idea behind the patch. To make the '%' case
>> fast might be an important optimization for a lot of use cases. It's
>> not uncommon that '%' matches a bigger part of the string than the
>> rest of the pattern.
>
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

Here is a patch that implements this. Please analyse for possible
breakage.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Dennis Bjorklund wrote:
> Tom Lane skrev:
>> You could imagine trying to do % a byte at a time (and indeed that's
>> what I'd been thinking it did) but that gets you out of sync which
>> breaks the _ case.
>
> It is only when you have a pattern like '%_' when this is a problem
> and we could detect this and do byte by byte when it's not. Now we
> check (*p == '\\') || (*p == '_') in each iteration when we scan over
> characters for '%', and we could do it once and have different loops
> for the two cases.
>
> Other than this part that I think can be optimized I don't see
> anything wrong with the idea behind the patch. To make the '%' case
> fast might be an important optimization for a lot of use cases. It's
> not uncommon that '%' matches a bigger part of the string than the
> rest of the pattern.

Are you sure? The big remaining char-matching bottleneck will surely be
in the code that scans for a place to start matching a %. But that's
exactly where we can't use byte matching for cases where the charset
might include AB and BA as characters - the pattern might contain %BA
and the string AB. However, this isn't a danger for UTF8, which leads me
to think that we do indeed need a special case for UTF8, but for a
different improvement from that proposed in the original patch. I'll
post an updated patch shortly.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Tom Lane skrev:
> You could imagine trying to do % a byte at a time (and indeed that's
> what I'd been thinking it did) but that gets you out of sync which
> breaks the _ case.

It is only when you have a pattern like '%_' when this is a problem and
we could detect this and do byte by byte when it's not. Now we check
(*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops for
the two cases.

Other than this part that I think can be optimized I don't see anything
wrong with the idea behind the patch. To make the '%' case fast might be
an important optimization for a lot of use cases. It's not uncommon that
'%' matches a bigger part of the string than the rest of the pattern.

It's easy to make a mistake when one is used to thinking about simple
fixed size characters like ascii. Strange that this simple topic can be
so difficult to think about... :-)

/Dennis
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allow both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; if we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by
>> mistake.
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.

Unless I hear differently, my present intention is to apply the
suggested improvement universally. I'll wait a day or two before
completing the patch.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> Under *no* circumstances use __inline__, as it will certainly break
> every non-gcc compiler. Use "inline", which we #define appropriately
> at need.

OK. (this was from upstream patch.)

> I thought we'd concluded that this explanation is pseudo-science?
> [...]
> spellcheck...

Right. I'm waiting on a consensus about the UTF8-ness of the solution
before I adjust comments.

cheers

andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Attached is my current WIP patch.

A few quick eyeball comments:

> ! static __inline__ int

Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler. Use "inline", which we #define appropriately at
need.

> ! * UTF8 has disjoint representations for first-bytes and
> ! * not-first-bytes of MB characters, and thus it is
> ! * impossible to make a false match in which an MB pattern
> ! * character is matched to the end of one data character
> ! * plus the start of another.
> ! * In character sets without that property, we have to use the
> ! * slow way to ensure we don't make out-of-sync matches.

I thought we'd concluded that this explanation is pseudo-science?

> ! * Branch for non-utf8 multi-byte charsets and also for single-byte
> ! * charsets which don't gain any benefir from the above optimisation.

spellcheck...

			regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote:
> ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allow both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; if we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by
>> mistake.
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.

Yeah, I think that's right.

Attached is my current WIP patch. If we decide that this optimisation
can in fact be applied to all backend encodings, that will be easily
incorporated. It will simplify the code further.

Note that all the common code in the MatchText and do_like_escape
functions has been factored - and the bytea functions just call the
single-byte text versions - AFAICS the effect will be identical to
having the specialised versions. (I'm always happy when code volume can
be reduced.)

cheers

andrew

Index: src/backend/utils/adt/like.c
===
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c	27 Feb 2007 23:48:08 -	1.68
--- src/backend/utils/adt/like.c	18 May 2007 02:47:41 -
***
*** 28,48
  #define LIKE_ABORT						(-1)

! static int	MatchText(char *t, int tlen, char *p, int plen);
! static int	MatchTextIC(char *t, int tlen, char *p, int plen);
! static int	MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);
! static int	MBMatchText(char *t, int tlen, char *p, int plen);
! static int	MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
--- 28,50
  #define LIKE_ABORT						(-1)

! static int	SB_MatchText(char *t, int tlen, char *p, int plen);
! static int	SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);
! static int	MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);
+ static int	UTF8_MatchText(char *t, int tlen, char *p, int plen);
+ static int	GenericMatchText(char *s, int slen, char* p, int plen);
+ static int	mbtexticlike(text *str, text *pat);
+
  /*
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *
   */
! static __inline__ int
  wchareq(char *p1, char *p2)
  {
  	int			p1_len;
***
*** 72,86
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
- *
- * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
- * Is it worth refactoring to avoid duplicated code? They might become
- * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 74,86
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison. This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)	((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)	(*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
  	do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***
*** 90,122
  		*(dst)++ = *(src)++; \
  	} while (0)

! #define MatchText	MBMatchText
! #define MatchTextIC	MBMatchTextIC
  #define do_like_escape	MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
Re: [PATCHES] UTF8MatchText
ITAGAKI Takahiro <[EMAIL PROTECTED]> writes:
> Yes, I only used the 'disjoint representations for first-bytes and
> not-first-bytes of MB characters' feature in UTF8. Other encodings
> allow both [AB] and [BA] for MB character patterns. UTF8Match() does
> not cope with those encodings; if we have '[AB][AB]' in a table and
> search it with LIKE '%[BA]%', we judge that they are matched by
> mistake.

AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.

			regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane <[EMAIL PROTECTED]> wrote:
> On the strength of this closer reading, I would say that the patch
> isn't relying on UTF8's first-byte-vs-not-first-byte property after
> all. All that it's relying on is that no MB character is a prefix of
> another one, which seems like a necessary property for any sane
> encoding; plus that characters are considered equal only if they're
> bytewise equal. So are we sure it doesn't work for non-UTF8 encodings?
> Maybe that earlier conclusion was based on a misunderstanding of what
> the patch really does.

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings allow
both [AB] and [BA] for MB character patterns. UTF8Match() does not cope
with those encodings; if we have '[AB][AB]' in a table and search it
with LIKE '%[BA]%', we judge that they are matched by mistake.

I've also misunderstood it before, and Dennis Bjorklund pointed it out.

http://archives.postgresql.org/pgsql-hackers/2007-03/msg01377.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Is it legal to follow escape by anything other than _ % or escape? Certainly, but once you've compared the first byte you can handle any remaining bytes via the main loop. And in fact the code is already depending on being able to do that --- the use of CHAREQ rather than BYTEEQ is just wasted cycles. > One more thing - I'm thinking of rolling up the bytea matching routines > as well as the text routines to eliminate all the duplication of logic. +1, I think, but I wonder why we had the duplication in the first place. Is there any likelihood that bytea's semantics should diverge from the text case? regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote: * At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already paid the price of determining the character length once, not to mention matching the bytes of the characters once, and then throw that knowledge away. I think BYTEEQ would make more sense in the backslash path. Probably, although the use of CHAREQ is in the present code. Is it legal to follow escape by anything other than _ % or escape? So the actual optimization here is that we do bytewise comparison and advancing, but only when we are either at the start of a character (on both sides, and the pattern char is not wildcard) or we are in the middle of a character (on both sides) and we've already proven that both sides matched for the previous byte(s) of the character. I think that's correct. On the strength of this closer reading, I would say that the patch isn't relying on UTF8's first-byte-vs-not-first-byte property after all. All that it's relying on is that no MB character is a prefix of another one, which seems like a necessary property for any sane encoding; plus that characters are considered equal only if they're bytewise equal. So are we sure it doesn't work for non-UTF8 encodings? Maybe that earlier conclusion was based on a misunderstanding of what the patch really does. Indeed. One more thing - I'm thinking of rolling up the bytea matching routines as well as the text routines to eliminate all the duplication of logic. I can do it by a little type casting from bytea* to text* and back again, or if that's not acceptable by some preprocessor magic. I think the casting is likely to be safe enough in this case - I don't think a null byte will hurt us anywhere in this code - and presumably the varlena stuff is all the same. Does that sound reasonable? 
cheers andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes: > From my WIP patch, here's where the difference appears to be - note > that UTF8 branch has two NextByte calls at the bottom, unlike the other > branch: Oh, I see: NextChar is still "real" but the patch is willing to have t and p pointing into the middle of an MB character. That's a bit risky. I think it works but it's making at least the following undocumented assumptions: * At a pattern backslash, it applies CHAREQ() but then advances byte-by-byte over the matched characters (implicitly assuming that none of these bytes will look like the magic characters). While that works for backend-safe encodings, it seems a bit strange; you've already paid the price of determining the character length once, not to mention matching the bytes of the characters once, and then throw that knowledge away. I think BYTEEQ would make more sense in the backslash path. * At pattern % or _, it's critical that we do NextChar not NextByte on the data side. Else t is pointing into the middle of an MB sequence when p isn't, and we have various out-of-sync conditions to worry about, notably possibly calling NextChar when t is not pointing at the first byte of the character, which will result in a wrong answer about the character length. * We *must* do NextChar not NextByte for _ since we have to match it to exactly one logical character, not byte. You could imagine trying to do % a byte at a time (and indeed that's what I'd been thinking it did) but that gets you out of sync which breaks the _ case. So the actual optimization here is that we do bytewise comparison and advancing, but only when we are either at the start of a character (on both sides, and the pattern char is not wildcard) or we are in the middle of a character (on both sides) and we've already proven that both sides matched for the previous byte(s) of the character. 
On the strength of this closer reading, I would say that the patch isn't relying on UTF8's first-byte-vs-not-first-byte property after all. All that it's relying on is that no MB character is a prefix of another one, which seems like a necessary property for any sane encoding; plus that characters are considered equal only if they're bytewise equal. So are we sure it doesn't work for non-UTF8 encodings? Maybe that earlier conclusion was based on a misunderstanding of what the patch really does. regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. That's not what I see in (what I think is) the latest submission, which includes this snippet: [ scratches head... ] OK, then I think I totally missed what this patch is trying to accomplish; because this code looks just the same as the existing multibyte-character operations. Where does the performance improvement come from? That's what bothered me. The trouble is that we have so much code that looks *almost* identical. From my WIP patch, here's where the difference appears to be - note that the UTF8 branch has two NextByte calls at the bottom, unlike the other branch:

#ifdef UTF8_OPT
	/*
	 * UTF8 is optimised to do byte at a time matching in most cases,
	 * thus saving expensive calls to NextChar.
	 *
	 * UTF8 has disjoint representations for first-bytes and
	 * not-first-bytes of MB characters, and thus it is
	 * impossible to make a false match in which an MB pattern
	 * character is matched to the end of one data character
	 * plus the start of another.
	 * In character sets without that property, we have to use the
	 * slow way to ensure we don't make out-of-sync matches.
	 */
	else if (*p == '_')
	{
		NextChar(t, tlen);
		NextByte(p, plen);
		continue;
	}
	else if (!BYTEEQ(t, p))
	{
		/*
		 * Not the single-character wildcard and no explicit match? Then
		 * time to quit...
		 */
		return LIKE_FALSE;
	}
	NextByte(t, tlen);
	NextByte(p, plen);
#else
	/*
	 * Branch for non-utf8 multi-byte charsets and also for single-byte
	 * charsets which don't gain any benefit from the above optimisation.
	 */
	else if ((*p != '_') && !CHAREQ(t, p))
	{
		/*
		 * Not the single-character wildcard and no explicit match? Then
		 * time to quit...
		 */
		return LIKE_FALSE;
	}
	NextChar(t, tlen);
	NextChar(p, plen);
#endif   /* UTF8_OPT */

cheers andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Except that the entire point of this patch is to dumb down NextChar to >> be the same as NextByte for UTF8 strings. > That's not what I see in (what I think is) the latest submission, which > includes this snippet: [ scratches head... ] OK, then I think I totally missed what this patch is trying to accomplish; because this code looks just the same as the existing multibyte-character operations. Where does the performance improvement come from? regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior. From what I can see the code is quite careful about when it calls NextByte vs NextChar, and after _ it calls NextChar. Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. That's not what I see in (what I think is) the latest submission, which includes this snippet:

+ /* Set up for utf8 characters */
+ #define CHAREQ(p1, p2)	wchareq(p1, p2)
+ #define NextChar(p, plen) \
+ 	do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+
+ /*
+  * UTF8MatchText -- specialized version of MBMatchText for UTF8
+  */
+ static int
+ UTF8MatchText(char *t, int tlen, char *p, int plen)

Am I looking at the wrong thing? This is from around April 9th I think. cheers andrew
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Wait a second ... I just thought of a counterexample that destroys the >> entire concept. Consider the pattern 'A__B', which clearly is supposed >> to match strings of four *characters*. With the proposed patch in >> place, it would match strings of four *bytes*. Which is not the correct >> behavior. > From what I can see the code is quite careful about when it calls > NextByte vs NextChar, and after _ it calls NextChar. Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. regards, tom lane
Re: [PATCHES] UTF8MatchText
Tom Lane wrote: Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior. From what I can see the code is quite careful about when it calls NextByte vs NextChar, and after _ it calls NextChar. So I'll test for this, but I think it's safe. cheers andrew
Re: [PATCHES] UTF8MatchText
Tom Lane wrote: UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that property, we have to use the slow way to ensure we don't make out-of-sync matches. Thanks. I will include this info in the comments. cheers andrew
Re: [PATCHES] UTF8MatchText
Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior. regards, tom lane
Re: [PATCHES] UTF8MatchText
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Ok, I have studied some more and I think I understand what's going on. > AIUI, we are switching from some expensive char-wise comparisons to > cheap byte-wise comparisons in the UTF8 case because we know that in > UTF8 the magic characters ('_', '%' and '\') aren't a part of any other > character sequence. Is that putting it too mildly? Do we need stronger > conditions than that? If it's correct, are there other MBCS for which > this is true? I don't think this is a correct analysis. If it were correct then we could use the optimization for all backend charsets because none of them allow MB characters to contain non-high-bit-set bytes. But it was stated somewhere upthread that that doesn't actually work. Clearly it's a necessary property that we not falsely detect the magic pattern characters, but that's not sufficient. I think the real issue is that UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that property, we have to use the slow way to ensure we don't make out-of-sync matches. regards, tom lane
Re: [PATCHES] UTF8MatchText
I wrote: ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: Case Sensitive and Case Insensitive. That would make for a total of six functions, but in the case of both UTF8 and other MBCS we don't need a special Case Insensitive function - instead we downcase both the string and the pattern and then use the Case Sensitive function. That leaves a total of four functions. What is not clear to me is why the UTF8 optimisation works, and why it doesn't apply to other MBCS. At the very least we need a comment on that. I also find the existing function naming convention somewhat annoying - having foo() and MB_foo() is less than clear. I'd rather have SB_foo() and MB_foo(). That's not your fault, of course. If you supply me with some explanation on the UTF8 optimisation issue, I'll prepare a revised patch along these lines. Ok, I have studied some more and I think I understand what's going on. AIUI, we are switching from some expensive char-wise comparisons to cheap byte-wise comparisons in the UTF8 case because we know that in UTF8 the magic characters ('_', '%' and '\') aren't a part of any other character sequence. Is that putting it too mildly? Do we need stronger conditions than that? If it's correct, are there other MBCS for which this is true? cheers andrew
Re: [PATCHES] UTF8MatchText
Itagaki, I find this still fairly unclean. It certainly took me some time to get me head around what's going on. ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: case Sensitive and Case Insensitive. That would make for a total of six functions, but in the case of both UTF8 and other MBCS we don't need a special Case Insensitive function - instead we downcase both the string and the pattern and then use the Case Sensitive function. That leaves a total of four functions. What is not clear to me is why the UTF8 optimisation work, and why it doesn't apply to other MBCS. At the very least we need a comment on that. I also find the existing function naming convention somewhat annoying - having foo() and MB_foo() is less than clear. I'd rather have SB_foo() and MB_foo(). That's not your fault, of course. If you supply me with some explanation on the UTF8 optimisation issue, I'll prepare a revised patch along these lines. cheers andrew ITAGAKI Takahiro wrote: Bruce Momjian <[EMAIL PROTECTED]> wrote: I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically: Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case should be ignored? The same is true of MBMatchText() and MBMatchTextIC(). Now, I'll split the patch into two changes. 1. DropMBMatchTextIC.patch Drop MBMatchTextIC() and use MBMatchText() instead. 2. UTF8MatchText.patch Add UTF8MatchText() as a specialized version of MBMatchText(). 
As future work, it might be good to research the performance of rewriting "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in the planner so that we can avoid calling lower() for a constant pattern on the right-hand side and can use functional indexes (lower(col)). I think we will never need MBMatchTextIC() in the future unless we move to a wide-character server encoding :) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [PATCHES] UTF8MatchText
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- ITAGAKI Takahiro wrote: > Bruce Momjian <[EMAIL PROTECTED]> wrote: > > > > I do not understand this patch. You have defined two functions, > > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > > > is that one calls CHAREQ and the other calls ICHAREQ, but just above > > > those two functions you define the macros identically: > > > > Why are there two functions? Also, can't you use one function and just > > pass a boolean to indicate whether case should be ignored? > > The same is true of MBMatchText() and MBMatchTextIC(). > Now, I'll split the patch into two changes. > > 1. DropMBMatchTextIC.patch > Drop MBMatchTextIC() and use MBMatchText() instead. > > 2. UTF8MatchText.patch > Add UTF8MatchText() as a specialized version of MBMatchText(). > > > As a future work, it might be good to research the performance of rewriting > "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that > we can avoid to call lower() for constant pattern in the right-hand side and > can use functional indexes (lower(col)). I think we never need MBMatchTextIC() > in the future unless we move to wide-character server encoding :) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] [ Attachment, skipping... ] -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] UTF8MatchText
Patch removed, updated version submitted. --- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). >
> x1000 of LIKE '%foo%' on 1 rows tables [ms]
>  encoding  |  HEAD |    P1 |    P2 |    P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
> Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---(end of broadcast)--- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] UTF8MatchText
Bruce Momjian <[EMAIL PROTECTED]> wrote: > > I do not understand this patch. You have defined two functions, > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > > is that one calls CHAREQ and the other calls ICHAREQ, but just above > > those two functions you define the macros identically: > > Why are there two functions? Also, can't you use one function and just > pass a boolean to indicate whether case should be ignored? The same is true of MBMatchText() and MBMatchTextIC(). Now, I'll split the patch into two changes. 1. DropMBMatchTextIC.patch: drop MBMatchTextIC() and use MBMatchText() instead. 2. UTF8MatchText.patch: add UTF8MatchText() as a specialized version of MBMatchText(). As future work, it might be good to research the performance of rewriting "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in the planner so that we can avoid calling lower() for a constant pattern on the right-hand side and can use functional indexes (lower(col)). I think we will never need MBMatchTextIC() in the future unless we move to a wide-character server encoding :) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center DropMBMatchTextIC.patch Description: Binary data UTF8MatchText.patch Description: Binary data
Re: [PATCHES] UTF8MatchText
Bruce Momjian wrote: > > I do not understand this patch. You have defined two functions, > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > is that one calls CHAREQ and the other calls ICHAREQ, but just above > those two functions you define the macros identically: > > #define CHAREQ(p1, p2)wchareq(p1, p2) > #define ICHAREQ(p1, p2) wchareq(p1, p2) > > Why are there two functions? Also, can't you use one function and just > pass a boolean to indicate whether case it be ignored? Sorry, typo: Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case should be ignored? -- > > --- > > ITAGAKI Takahiro wrote: > > "Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > > > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > > ITAGAKI> by bytes, not only UTF-8. > > > > > > This is false, particularly for EUC. > > > > Umm, I see. I updated the optimization to be used only for UTF8 case. > > I also added some inlining hints that are useful on my machine (Pentium 4). > > > > > > x1000 of LIKE '%foo% on 1 rows tables [ms] > > encoding | HEAD | P1 | P2 | P3 > > ---+---+---+---+--- > > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > > LATIN1| 7083 | 7130 | 7057 | 7031 > > UTF8 | 17974 | 10859 | 10839 | 9682 > > EUC_JP| 17032 | 17557 | 17599 | 15240 > > > > - P1: UTF8MatchText() > > - P2: P1 + __inline__ GenericMatchText() > > - P3: P2 + __inline__ wchareq() > > (The attached patch is P3.) > > > > Regards, > > --- > > ITAGAKI Takahiro > > NTT Open Source Software Center > > > > [ Attachment, skipping... ] > > > > > ---(end of broadcast)--- > > TIP 7: You can help support the PostgreSQL project by donating at > > > > http://www.postgresql.org/about/donate > > -- > Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us > EnterpriseDB http://www.enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. 
+ > > ---(end of broadcast)--- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] UTF8MatchText
I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically:

#define CHAREQ(p1, p2)	wchareq(p1, p2)
#define ICHAREQ(p1, p2)	wchareq(p1, p2)

Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case it be ignored? --- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). >
> x1000 of LIKE '%foo%' on 1 rows tables [ms]
>  encoding  |  HEAD |    P1 |    P2 |    P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
> Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---(end of broadcast)--- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] UTF8MatchText
I assume this replaces all your earlier multi-byte LIKE patches. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). >
> x1000 of LIKE '%foo%' on 1 rows tables [ms]
>  encoding  |  HEAD |    P1 |    P2 |    P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
> Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---(end of broadcast)--- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
[PATCHES] UTF8MatchText
"Andrew - Supernews" <[EMAIL PROTECTED]> wrote: > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > ITAGAKI> by bytes, not only UTF-8. > > This is false, particularly for EUC. Umm, I see. I updated the optimization to be used only for UTF8 case. I also added some inlining hints that are useful on my machine (Pentium 4). x1000 of LIKE '%foo% on 1 rows tables [ms] encoding | HEAD | P1 | P2 | P3 ---+---+---+---+--- SQL_ASCII | 7094 | 7120 | 7063 | 7031 LATIN1| 7083 | 7130 | 7057 | 7031 UTF8 | 17974 | 10859 | 10839 | 9682 EUC_JP| 17032 | 17557 | 17599 | 15240 - P1: UTF8MatchText() - P2: P1 + __inline__ GenericMatchText() - P3: P2 + __inline__ wchareq() (The attached patch is P3.) Regards, --- ITAGAKI Takahiro NTT Open Source Software Center utf8matchtext.patch Description: Binary data ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate