Re: [HACKERS] patch: utf8_to_unicode (trivial)
Robert Haas robertmh...@gmail.com writes: Anyway, it's not really important enough to me to have a protracted argument about it. Let's wait and see if anyone else has an opinion, and perhaps a consensus will emerge. Well, nobody else seems to care, so I went ahead and committed the shorter form of the patch, ie just rename export the function. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Joseph Adams joeyadams3.14...@gmail.com writes: I've attached another patch that moves utf8_to_unicode to src/port per Robert Haas's suggestion. This patch itself is not quite as elegant as the first one because it puts platform-independent code that belongs in wchar.c into src/port . It also uses unsigned int instead of pg_wchar because the typedef of pg_wchar isn't available to the frontend, if I'm not mistaken. I'm not sure whether I like the old patch better or the new one. FWIW, I *don't* like this version, specifically because it fails to utilize the pg_wchar datatype. The function in question is neither big enough nor mutable enough that it's urgent to not duplicate it between the backend and psql, so I don't see much value in moving it to src/port. I think the correct things to do are to apply the original patch (modulo a stylistic change, namely put the new function where the old one was to miminize the size of the diff), and to back-patch the bug fix in mbprint.c. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Joseph Adams joeyadams3.14...@gmail.com writes: I've attached another patch that moves utf8_to_unicode to src/port per Robert Haas's suggestion. This patch itself is not quite as elegant as the first one because it puts platform-independent code that belongs in wchar.c into src/port . It also uses unsigned int instead of pg_wchar because the typedef of pg_wchar isn't available to the frontend, if I'm not mistaken. I'm not sure whether I like the old patch better or the new one. FWIW, I *don't* like this version, specifically because it fails to utilize the pg_wchar datatype. The function in question is neither big enough nor mutable enough that it's urgent to not duplicate it between the backend and psql, so I don't see much value in moving it to src/port. Well, we'd better at least add a comment noting that the two versions should match. But I think it would be better to unify them. However, in the back-branches, I'd just fix the incorrect copy. YMMV. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Robert Haas robertmh...@gmail.com writes: On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: FWIW, I *don't* like this version, specifically because it fails to utilize the pg_wchar datatype. The function in question is neither big enough nor mutable enough that it's urgent to not duplicate it between the backend and psql, so I don't see much value in moving it to src/port. Well, we'd better at least add a comment noting that the two versions should match. But I think it would be better to unify them. However, in the back-branches, I'd just fix the incorrect copy. Yeah, I did the latter part already because I figured it was uncontroversial. What to do in HEAD is still under debate. As for the two versions should match, the only way they'd be likely to diverge would be if the requirements change on one end or the other. It's not unreasonable to suppose, for example, that we might want the backend's version to start throwing an elog instead of just returning -1 for a bad character. It would be a lot harder to do that if we've pushed the code into src/port. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Sun, Aug 15, 2010 at 10:20 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Aug 15, 2010 at 7:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: FWIW, I *don't* like this version, specifically because it fails to utilize the pg_wchar datatype. The function in question is neither big enough nor mutable enough that it's urgent to not duplicate it between the backend and psql, so I don't see much value in moving it to src/port. Well, we'd better at least add a comment noting that the two versions should match. But I think it would be better to unify them. However, in the back-branches, I'd just fix the incorrect copy. Yeah, I did the latter part already because I figured it was uncontroversial. What to do in HEAD is still under debate. As for the two versions should match, the only way they'd be likely to diverge would be if the requirements change on one end or the other. It's not unreasonable to suppose, for example, that we might want the backend's version to start throwing an elog instead of just returning -1 for a bad character. It would be a lot harder to do that if we've pushed the code into src/port. Not really. You'd just write a wrapper to call the version in src/port and then elog if it returned -1. Unless -1 is actually a valid result, I guess. Anyway, it's not really important enough to me to have a protracted argument about it. Let's wait and see if anyone else has an opinion, and perhaps a consensus will emerge. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams joeyadams3.14...@gmail.com wrote: In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 , but no corresponding utf8_to_unicode . However, there is a static function called utf2ucs that does what utf8_to_unicode would do. I'd like this function to be available because the JSON code needs to convert UTF-8 to and from Unicode codepoints, and I'm currently using a separate UTF-8 to codepoint function for that. This patch renames utf2ucs to utf8_to_unicode and makes it public. It also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that it's equivalent to the one in wchar.c . This is a patch against CVS HEAD for application. It compiles and tests successfully. Comments? Thanks, I feel obliged to respond this since I'm supposed to be covering your GSoC project while Magnus is on vacation, but I actually know very little about this topic. What's undeniable, however, is that the coding in the two versions of utf8ucs() in the tree right now don't match. src/backend/utils/mb/wchar.c has: else if ((*c 0xf8) == 0xf0) while src/bin/psql/mbprint.c, which is otherwise identical, has: else if ((*c 0xf0) == 0xf0) I'm inclined to believe that your patch is right to think that the former version is correct, because it used to match the latter version until Tom Lane changed it in 2007, and I suspect he simply failed to update both copies. But I'd like someone who actually understands what this code is doing to confirm that. http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php I suspect we need to not only fix this, but back-patch it at least to 8.2, which is the first release where there are two copies of this function. I am not sure whether earlier releases need to be changed, or not. But again, someone who understands the issues better than I do needs to weigh in here. In terms of making this function non-static, I'm inclined to think that a better approach would be to move it to src/port. That gets rid of the need to have two copies in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company I've attached another patch that moves utf8_to_unicode to src/port per Robert Haas's suggestion. This patch itself is not quite as elegant as the first one because it puts platform-independent code that belongs in wchar.c into src/port . It also uses unsigned int instead of pg_wchar because the typedef of pg_wchar isn't available to the frontend, if I'm not mistaken. I'm not sure whether I like the old patch better or the new one. Joey Adams utf8-to-unicode-port.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Fri, Aug 13, 2010 at 3:12 AM, Joseph Adams joeyadams3.14...@gmail.com wrote: I've attached another patch that moves utf8_to_unicode to src/port per Robert Haas's suggestion. This patch itself is not quite as elegant as the first one because it puts platform-independent code that belongs in wchar.c into src/port . It also uses unsigned int instead of pg_wchar because the typedef of pg_wchar isn't available to the frontend, if I'm not mistaken. Well, right now, in addition to having two copies of utf2ucs(), we have two declarations of pg_wchar, one in src/bin/psql/mbprint.c and the other in src/include/mb/pg_wchar.h; so both existing copies of the function are able to use that typedef. It seems like we might want to move the typedef to the same place as the declaration of the renamed utf2ucs(), but I'm not quite sure where that should be. The only header in src/port is pthread-win32.h, and we're sure not going to put it there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Excerpts from Robert Haas's message of vie ago 13 12:00:32 -0400 2010: On Fri, Aug 13, 2010 at 3:12 AM, Joseph Adams joeyadams3.14...@gmail.com wrote: I've attached another patch that moves utf8_to_unicode to src/port per Robert Haas's suggestion. This patch itself is not quite as elegant as the first one because it puts platform-independent code that belongs in wchar.c into src/port . It also uses unsigned int instead of pg_wchar because the typedef of pg_wchar isn't available to the frontend, if I'm not mistaken. Well, right now, in addition to having two copies of utf2ucs(), we have two declarations of pg_wchar, one in src/bin/psql/mbprint.c and the other in src/include/mb/pg_wchar.h; so both existing copies of the function are able to use that typedef. It seems like we might want to move the typedef to the same place as the declaration of the renamed utf2ucs(), but I'm not quite sure where that should be. The only header in src/port is pthread-win32.h, and we're sure not going to put it there. src/include/port.h? -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010: On Fri, Aug 13, 2010 at 12:11 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: src/include/port.h? Oh, hey, look at that. Any thought on what to about the fact that our two existing copies of utf2ucs() don't match? (one tests against 0xf8 where the other against 0xf0) I'm not sure why it's masking 0xf8 instead of 0xf0. It seems like c 0xf8 == 0xf8 signals start of a 5-byte sequence which is not valid per RFC 3629, according to wikipedia: http://en.wikipedia.org/wiki/UTF-8#Description (Moreover, 0xf5 to 0xf7 signal start of a 4-byte sequence for codepoints that apparently are not supposed to be valid). So apparently it's good that the code returns an invalid code in those cases, i.e. wchar.c is right and mbprint is wrong. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010: Oh, hey, look at that. Any thought on what to about the fact that our two existing copies of utf2ucs() don't match? (one tests against 0xf8 where the other against 0xf0) I'm not sure why it's masking 0xf8 instead of 0xf0. Because it wants to verify that this is in fact a 4-byte UTF8 code. Compare the code (and comments) for pg_utf_mblen. AFAICS the version in mbprint.c is flat out wrong, and the only reason nobody's noticed is that it should never get passed a more-than-4-byte sequence anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Fri, Aug 13, 2010 at 1:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Robert Haas's message of vie ago 13 12:50:13 -0400 2010: Oh, hey, look at that. Any thought on what to about the fact that our two existing copies of utf2ucs() don't match? (one tests against 0xf8 where the other against 0xf0) I'm not sure why it's masking 0xf8 instead of 0xf0. Because it wants to verify that this is in fact a 4-byte UTF8 code. Compare the code (and comments) for pg_utf_mblen. AFAICS the version in mbprint.c is flat out wrong, and the only reason nobody's noticed is that it should never get passed a more-than-4-byte sequence anyway. Should we fix it, then, and if so how far should we go back? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
Robert Haas robertmh...@gmail.com writes: On Fri, Aug 13, 2010 at 1:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: AFAICS the version in mbprint.c is flat out wrong, and the only reason nobody's noticed is that it should never get passed a more-than-4-byte sequence anyway. Should we fix it, then, and if so how far should we go back? Yes, I'd vote for back-patching, at least to all versions in which the wchar.c version looks like that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: utf8_to_unicode (trivial)
On Fri, Aug 13, 2010 at 12:11 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: src/include/port.h? Oh, hey, look at that. Any thought on what to about the fact that our two existing copies of utf2ucs() don't match? (one tests against 0xf8 where the other against 0xf0) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] patch: utf8_to_unicode (trivial)
In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 , but no corresponding utf8_to_unicode . However, there is a static function called utf2ucs that does what utf8_to_unicode would do. I'd like this function to be available because the JSON code needs to convert UTF-8 to and from Unicode codepoints, and I'm currently using a separate UTF-8 to codepoint function for that. This patch renames utf2ucs to utf8_to_unicode and makes it public. It also fixes the version of utf2ucs in src/bin/psql/mbprint.c so that it's equivalent to the one in wchar.c . This is a patch against CVS HEAD for application. It compiles and tests successfully. Comments? Thanks, Joey Adams utf8-to-unicode.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers