Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 11/12/13, 1:57 AM, Tatsuo Ishii wrote:
> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.

I think ISO 2022 is something in that direction, but it's not ASCII-safe, AFAICT.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
> I have been thinking about this for years, and I think the key idea is implementing a universal encoding. The universal encoding should have the following characteristics to support N>2 encodings in a database:
>
> 1) no loss in round-trip encoding conversion
> 2) no mapping table is necessary to convert from/to existing encodings
>
> Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily.

Isn't this essentially what the MULE internal encoding is?

> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.

This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

Have a nice day,
--
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Martijn van Oosterhout <klep...@svana.org> writes:
> On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote:
>> Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily.
> Isn't this essentially what the MULE internal encoding is?

MULE is completely evil. It has N different encodings for the same character, not to mention no support code available.

>> Currently there is no such universal encoding, so I think the only way is to invent one ourselves.
> This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

Agreed.

> A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

From the standpoint of what will happen with existing library code (like strcoll), I'm not sure it's all that easy.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> Isn't this essentially what the MULE internal encoding is?

No. MULE is not powerful enough, and it is overly complicated to deal with different encodings (character sets).

> This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, and indexing, and inventing encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit.

That is probably a misunderstanding. We don't need to modify existing text-handling modules such as text functions, regular expressions, indexing, etc. We just convert from the universal encoding X to the original encoding before calling them. The conversion is easy and fast because it only requires skipping the encoding identifier and encoding length parts. Basically, encoding X should be used in the lower layers of PostgreSQL; higher-layer modules such as those living in src/backend/utils/adt should not be aware of it.

> A better idea, it seems to me, is to (if Postgres is configured appropriately) embed the non-round-trippable characters in the custom (private use) part of the Unicode character set. In other words, adjust the mapping tables on demand and voila.

Using Unicode incurs encoding-conversion overhead because it needs to look up mapping tables. That would be a huge handicap for large data, and that is what I want to avoid in the first place.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> MULE is completely evil. It has N different encodings for the same character,

What's wrong with that? That is what it aims at in the first place.

> not to mention no support code available.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Tatsuo Ishii <is...@postgresql.org> writes:
>> MULE is completely evil. It has N different encodings for the same character,
> What's wrong with that? That is what it aims at in the first place.

It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> Tatsuo Ishii <is...@postgresql.org> writes:
>>> MULE is completely evil. It has N different encodings for the same character,
>> What's wrong with that? That is what it aims at in the first place.
> It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal.

You don't need to consider that, because as far as I know there is no place in PostgreSQL where a MULE-encoded text value consists of multiple encodings. BTW, cases where the same character is assigned different code points are pretty common in many character sets (Unicode, for example).

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Tatsuo Ishii <is...@postgresql.org> writes:
> BTW, cases where the same character is assigned different code points are pretty common in many character sets (Unicode, for example).

This is widely considered a security bug; read section 10 of RFC 3629 (the definition of UTF-8) [1], and search the CVE database a bit if you still doubt it's a threat. I'm going to push back very hard on any suggestion that Postgres should build itself around a text representation with that kind of weakness designed in.

regards, tom lane

[1] http://tools.ietf.org/html/rfc3629#section-10
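The concrete weakness Tom refers to is the overlong-encoding problem: RFC 3629 section 10 requires UTF-8 decoders to reject non-shortest-form sequences, because otherwise one character (say '/') has several byte representations and naive security filters can be bypassed. A minimal sketch of the lead-byte part of such a check — not PostgreSQL's actual validator, and deliberately ignoring continuation-byte and range checks:

```c
#include <assert.h>
#include <stddef.h>

/* Reject non-shortest-form ("overlong") UTF-8 sequences.
 * E.g. 0xC0 0xAF is an overlong encoding of '/' (U+002F): a naive
 * decoder maps it to the same character as the single byte 0x2F. */
static int
utf8_is_shortest_form(const unsigned char *s, size_t len)
{
    if (len == 1)                 /* 0xxxxxxx: ASCII */
        return s[0] < 0x80;
    if (len == 2)                 /* 110xxxxx: code point must be >= U+0080 */
        return (s[0] & 0xE0) == 0xC0 && s[0] >= 0xC2;  /* C0, C1 overlong */
    if (len == 3)                 /* 1110xxxx: code point must be >= U+0800 */
        return (s[0] & 0xF0) == 0xE0 &&
               !(s[0] == 0xE0 && (s[1] & 0xE0) == 0x80);  /* E0 80..9F overlong */
    if (len == 4)                 /* 11110xxx: code point must be >= U+10000 */
        return (s[0] & 0xF8) == 0xF0 &&
               !(s[0] == 0xF0 && (s[1] & 0xF0) == 0x80);  /* F0 80..8F overlong */
    return 0;
}
```

A full validator must additionally verify every continuation byte and the U+10FFFF upper bound; this sketch only demonstrates why "same character, different bytes" is rejected by design in UTF-8.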
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
> I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea is implementing a universal encoding. The universal encoding should have the following characteristics to support N>2 encodings in a database:

1) no loss in round-trip encoding conversion
2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems, such as allowing multiple encodings in pg_database, can be solved easily. Currently there is no such universal encoding, so I think the only way is to invent one ourselves. At this point, the design of the encoding I have in mind is:

1) A 1-byte encoding identifier plus a 7-byte body (8 bytes in total). The encoding identifier's value is between 0x80 and 0xff, and each value is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and so on. The encodings should be limited to database-safe encodings. The encoding body is the raw character as represented in the existing encoding. This form is called a "word".

2) We also have a multibyte representation of the universal encoding. The first byte gives the length of the multibyte character (similar to the first byte of UTF-8). The second byte is the encoding identifier explained above. The rest of the character is the same as above.

#1 and #2 are logically the same and can be converted to each other, and we can use whichever we like. Form #1 is easy to handle because each word has a fixed length (8 bytes), so it would probably be used for temporary data in memory. The second form saves space and would be used in the stored data itself.

If we want a table encoded differently from the database encoding, the table is stored in the universal encoding. pg_class should record that fact to avoid confusion about which encoding a table is using. I think the majority of tables in a database use the same encoding as the database; only a few tables want a different one. This design pushes the penalty onto that minority. If we need to join two tables with different encodings, we convert them to the same encoding (this succeeds if the encodings are compatible); if the conversion fails, the join fails too. We could extend this technique to a design that allows each column to have a different encoding.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
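As a concrete illustration of form #1, here is a sketch in C. The identifier values and all names are hypothetical, since this encoding does not exist yet; the point is that packing keeps the original bytes verbatim (property 1, lossless round trips) and needs no mapping table (property 2):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding identifiers in the 0x80..0xff range. */
#define UNIV_ENC_UTF8   0x80
#define UNIV_ENC_EUC_JP 0x81

/* Form #1: one fixed 8-byte word per character --
 * a 1-byte encoding identifier followed by up to 7 bytes of the
 * character in its original encoding, zero-padded. */
typedef struct
{
    uint8_t enc_id;      /* 0x80..0xff, names the source encoding */
    uint8_t body[7];     /* raw bytes in that encoding, zero-padded */
} UnivWord;

/* Packing needs no mapping table: the original bytes are kept
 * verbatim, which is what makes round trips lossless. */
static UnivWord
univ_pack(uint8_t enc_id, const uint8_t *ch, size_t ch_len)
{
    UnivWord w = {0};

    w.enc_id = enc_id;
    memcpy(w.body, ch, ch_len > 7 ? 7 : ch_len);
    return w;
}

/* Recovering the original representation is just skipping the
 * identifier byte -- the "easy and fast" conversion claimed above. */
static const uint8_t *
univ_body(const UnivWord *w)
{
    return w->body;
}
```

The variable-length form #2 would prepend a length byte and use only as many body bytes as the character needs; the fixed 8-byte word trades space for simple pointer arithmetic in memory.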
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> In a way, it is similar to using the data type serial. The column will be displayed as integer, and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here.

Yes, I agree with you that serial being a synonym causes almost no problems. But that's because serial is not an SQL-standard data type but a type unique to PostgreSQL. On the other hand, nchar is an established data type in the SQL standard. I think most people will expect psql \d and pg_dump to output nchar, as they specified it in their DDL. If they get char as output for nchar columns from pg_dump, wouldn't they get in trouble when importing the schema/data from PostgreSQL into other database products? The documentation for pg_dump says that pg_dump pays attention to easing migration to other DBMSs. I like this idea and want to respect it.

http://www.postgresql.org/docs/current/static/app-pgdump.html
--
Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. ...
--use-set-session-authorization
Output SQL-standard SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to determine object ownership. This makes the dump more standards-compatible, ...
--

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau <maumau...@gmail.com> writes:
> On the other hand, nchar is an established data type in the SQL standard. I think most people will expect psql \d and pg_dump to output nchar, as they specified it in DDL.

This argument seems awfully weak. You've been able to say

    create table nt (nf national character varying(22));

in Postgres since around 1997, but I don't recall one single bug report about how that is displayed as just character varying(22).

The other big problem with this line of argument is that you're trying to claim better spec compliance for what is at best a rather narrow interpretation with really minimal added functionality. (In fact, until you have a solution for the problem that incoming and outgoing data must be in the database's primary encoding, you don't actually have *any* added functionality, just syntactic sugar that does nothing useful.) Unless you can demonstrate by lawyerly reading of the spec that the spec requires exactly the behavior this patch implements, you do not have a leg to stand on here. But you can't demonstrate that, because it doesn't.

I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution.

regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau wrote:
> Let me repeat myself: I think the biggest and most immediate issue is that PostgreSQL does not support national character types, at least officially. "Officially" means being described in the manual. So I don't have a strong objection to the current (hidden) implementation of nchar types in PostgreSQL as mere synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features.

I agree with you there. Actually it is somewhat documented in http://www.postgresql.org/docs/9.3/static/features-sql-standard.html as F421, but that requires that you read the SQL standard.

> However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose that fact so easily and produce an irreversible result as below?

I don't think that it is a problem. According to the SQL standard, the user requested a CHAR or VARCHAR with an encoding of the DBMS's choice. PostgreSQL chooses the database encoding.

In a way, it is similar to using the data type serial. The column will be displayed as integer, and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Again, for serial the behaviour is well documented, which seconds your request for more documentation. Would you like to write a patch for that?

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <pete...@gmx.net> wrote:
> On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote:
>> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms
> If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET cs, then for some cs, NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.

Interesting.

Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas <robertmh...@gmail.com>
> Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same.

It doesn't seem that there are any semantics incompatible with the SQL standard, as follows:

- In the first step, cs is the database encoding, which is used for char/varchar/text.
- In the second (or final) step, where multiple encodings per database are supported, cs is the national character encoding, which is specified with CREATE DATABASE ... NATIONAL CHARACTER ENCODING cs. If the NATIONAL CHARACTER ENCODING clause is omitted, cs is the database encoding, as in step 1.

Let me repeat myself: I think the biggest and most immediate issue is that PostgreSQL does not support national character types, at least officially. "Officially" means being described in the manual. So I don't have a strong objection to the current (hidden) implementation of nchar types in PostgreSQL as mere synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features.

However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose that fact so easily and produce an irreversible result as below?

--
Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get national character instead of char character set xxx as the output of psql \d and pg_dump when I specified national character in DDL. In addition, that makes it easy to use the pg_dump output for importing data into other DBMSs.
--

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Arul Shaji Arulappan wrote:
> Attached is a patch that implements the first set of changes discussed in this thread originally. They are:
> (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
>     - psql \d can display the user-specified data types.
>     - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR.
>     - Groundwork to implement additional features for NCHAR/NVARCHAR in the future (for example, a separate encoding for nchar columns).
> (ii) Support for NCHAR/NVARCHAR in ECPG
> (iii) Documentation changes to reflect the new data types

If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility. I know that NCHAR/NVARCHAR is in the SQL standard, but I still think it is a wart.

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility.

Yes, you are right. In the previous discussion:

- That would be nice if available, but it is hard to implement multiple encodings in one database.
- Some people (I'm not sure how many) are using NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support the national character feature syntactically and document it in the manual. This is the first step.
- As the second step, we can implement multiple encodings in one database. According to the SQL standard, NCHAR(n) is equivalent to CHAR(n) CHARACTER SET cs, where cs is an implementation-defined character set.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
MauMau wrote:
> From: Albe Laurenz <laurenz.a...@wien.gv.at>
>> If I understood the discussion correctly, the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility.
> Yes, you are right. In the previous discussion:
> - That would be nice if available, but it is hard to implement multiple encodings in one database.

Granted.

> - Some people (I'm not sure how many) are using NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support the national character feature syntactically and document it in the manual. This is the first step.

I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING.

I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (a new reserved word).

Regarding the standard-compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help.

> - As the second step, we can implement multiple encodings in one database. According to the SQL standard, NCHAR(n) is equivalent to CHAR(n) CHARACTER SET cs, where cs is an implementation-defined character set.

That second step would definitely have benefits. But I don't think that it requires the first step that your patch implements; the two are in fact orthogonal. I don't think that there is any need to change NCHAR even if we get per-column encoding; it is just syntactic sugar to support SQL feature F421. Why not tackle the second step first?

Yours,
Laurenz Albe
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Albe Laurenz <laurenz.a...@wien.gv.at>
> I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING.

Ouch, that's just a mistake in my mail. You are correct.

> I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (a new reserved word).

Agreed. But I'm in favor of supporting other DBMSs' syntax if it doesn't complicate the spec or implementation too much, because it can help migration to PostgreSQL. I understand PostgreSQL has made such efforts before, like PL/pgSQL (which is similar to PL/SQL), the text data type, optional AS in SELECT statements, etc.

> But I don't think that this requires the first step that your patch implements; the two are in fact orthogonal.

(It's not my patch.)

> Regarding the standard-compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help.
> I don't think that there is any need to change NCHAR even if we get per-column encoding; it is just syntactic sugar to support SQL feature F421.

Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get national character instead of char character set xxx as the output of psql \d and pg_dump when I specified national character in DDL. In addition, that makes it easy to use the pg_dump output for importing data into other DBMSs.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote:
> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms

If, per the SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET cs, then for some cs, NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Peter Eisentraut <pete...@gmx.net>
> On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote:
>> 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:
>> - to avoid code conversion between the server and the client, for performance
>> - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes)
>> This use case is described in chapter 6 of the Oracle Database Globalization Support Guide.
> But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.

I (or, Oracle's use case) meant the following, for example:

    initdb -E EUC_JP

    CREATE DATABASE mydb ENCODING EUC_JP NATIONAL ENCODING UTF-8;

    CREATE TABLE mytable (
        col1 char(10),   -- EUC_JP text
        col2 nchar(10)   -- UTF-8 text
    );

    client encoding = EUC_JP

That is:

1. Currently, the user is handling only Japanese text. To avoid unnecessary conversion, he uses EUC_JP for both client and server.
2. He needs to store a limited amount of international (non-Japanese) text in a few columns for a new feature of the system. Because that international text is limited, he is willing to accept the code-conversion cost and the extra bytes per character for those few columns, while keeping most of his text efficient.

Regards
MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Greg Stark <st...@mit.edu>
> If it's not lossy then what's the point? From the client's point of view it'll be functionally equivalent to text then.

Sorry, what Tatsuo-san suggested was "same or compatible", not lossy. I quote the relevant part below. This is enough for the use case I mentioned in my previous mail several hours ago (actually, it is what the Oracle manual describes...).

http://www.postgresql.org/message-id/20130920.085853.1628917054830864151.t-is...@sraoss.co.jp

[Excerpt]
> What about limiting NCHAR to databases whose encoding is the same or compatible (i.e. an encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding on the server side, and thus we can treat NCHAR exactly the same as CHAR afterward. I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow the latter, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if the NCHAR encoding is UTF-8 and the database encoding is UTF-8. It will also succeed if the NCHAR encoding is Shift JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and Shift JIS. However, it will not succeed if the NCHAR encoding is Shift JIS and the database encoding is ISO-8859-1, because there is no conversion between them.

Regards
MauMau
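Tatsuo Ishii's rule quoted above — allow an NCHAR encoding only if it equals the database encoding or a conversion between the two is defined — can be sketched as follows. The encoding IDs and the conversion matrix here are made up for illustration; the real server would consult its conversion catalog (pg_conversion) instead:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding IDs for the sketch. */
enum enc { ENC_UTF8, ENC_SJIS, ENC_EUC_JP, ENC_LATIN1 };

/* Sketch of "is a conversion defined?": assume conversions exist
 * among the Japanese encodings and UTF-8, but not between the
 * Japanese encodings and LATIN1 (matching the examples above). */
static bool
conversion_exists(enum enc from, enum enc to)
{
    static const bool conv[4][4] = {
        /* to:  UTF8   SJIS   EUCJP  LATIN1 */
        {       true,  true,  true,  true  },   /* from UTF8 */
        {       true,  true,  true,  false },   /* from SJIS */
        {       true,  true,  true,  false },   /* from EUC_JP */
        {       true,  false, false, true  },   /* from LATIN1 */
    };
    return conv[from][to];
}

/* Ishii's rule: an NCHAR column is permitted only when its encoding
 * is the database encoding itself, or can be converted to it. */
static bool
nchar_allowed(enum enc nchar_enc, enum enc db_enc)
{
    return nchar_enc == db_enc || conversion_exists(nchar_enc, db_enc);
}
```

Under this rule, NCHAR in Shift JIS is accepted in a UTF-8 database but rejected in an ISO-8859-1 one, exactly as in the quoted examples.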
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Peter Eisentraut pete...@gmx.net That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. Yes, so Tatsuo san suggested restricting the server encoding / NCHAR encoding combination to those with lossless conversion. I thought one main point of this exercise was to avoid these conversions and be able to go straight from the client encoding into NCHAR. It's slightly different. Please see the following excerpt: http://www.postgresql.org/message-id/B1A7485194DE4FDAB8FA781AFB570079@maumau 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client, for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of the Oracle Database Globalization Support Guide. Regards MauMau
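Tatsuo san's restriction amounts to requiring that a round trip through the other encoding is lossless. A hedged Python sketch of such a check (the function name is hypothetical; a real implementation would consult PostgreSQL's conversion catalog rather than test sample text):

```python
# Hypothetical losslessness check: can `text` be stored in `target_enc`
# and recovered intact? PostgreSQL would instead look up pg_conversion;
# this only illustrates the "same or compatible encoding" restriction.

def converts_losslessly(text: str, target_enc: str) -> bool:
    """True if `text` survives a round trip through `target_enc`."""
    try:
        return text.encode(target_enc).decode(target_enc) == text
    except (UnicodeEncodeError, LookupError):
        return False
```

Japanese text round-trips through Shift-JIS, but not through ISO-8859-1, mirroring the CREATE TABLE examples in the excerpt.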
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client, for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of the Oracle Database Globalization Support Guide. But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tatsuo Ishii is...@postgresql.org I don't think the bind placeholder is a problem. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case. Yes, I've learned that. Agreed. If we allow an encoding for NCHAR different from the database encoding, we can convert text from the client encoding to the NCHAR encoding in nchar_in(), for example. We can retrieve the NCHAR encoding from pg_database and store it in a global variable at session start. The problem is an ordinary query (simple protocol Q message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are in trouble, because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is: does that really happen? I don't think there's any text editor that can create mixed SHIFT-JIS and EUC-JP text. So my guess is that when a user wants to use NCHAR for SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, all the user needs to do is set the client encoding to SHIFT-JIS and everything should be fine. MauMau, is my guess correct? Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When the Q message is received, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the Q message, NCHAR values are converted from the database encoding to the NCHAR encoding. Thank you very much, Tatsuo san. Everybody, are there any other challenges we should consider to support NCHAR/NVARCHAR types as distinct types?
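The single-encoding assumption in pg_client_to_server can be illustrated with a small Python sketch (the function below is a hypothetical stand-in, not PostgreSQL's implementation). Note that bytes sent in a different Japanese encoding than the declared client encoding often do not even raise an error; they silently become mojibake:

```python
# Hypothetical stand-in for pg_client_to_server: the entire message is
# converted using the single declared client encoding.

def client_to_server(message: bytes, client_enc: str,
                     server_enc: str = "utf-8") -> bytes:
    """Convert a wire message from the client encoding to the server encoding."""
    return message.decode(client_enc).encode(server_enc)

if __name__ == "__main__":
    kanji = "漢".encode("euc_jp")                # EUC-JP bytes b'\xb4\xc1'
    ok = client_to_server(kanji, "euc_jp")       # correct declaration
    bad = client_to_server(kanji, "shift_jis")   # wrong declaration: no error,
    print(ok, bad)                               # but half-width katakana mojibake
```

This is why a mixed SHIFT-JIS/EUC-JP message cannot be handled: the routine has no way to tell which bytes belong to which encoding.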
Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Fri, Sep 20, 2013 at 8:32 PM, MauMau maumau...@gmail.com wrote: I don't think that you'll be able to get consensus around that path on this mailing list. I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel varchar/text case is a wart. I think text was introduced for a positive reason to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html Although the type text is not in the SQL standard, several other SQL database management systems have it as well. And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea why we won't get consensus? Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 9/23/13 2:53 AM, MauMau wrote: Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When receiving the Q message, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the Q message, NCHAR values are converted from the database encoding to the NCHAR encoding. That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. I thought one main point of this exercise was to avoid these conversions and be able to go straight from the client encoding into NCHAR.
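The lossiness concern can be made concrete with a hedged Python sketch of the two-hop chain Peter describes (the helper name is hypothetical). A character that exists in the client and server encodings but not in the NCHAR encoding is lost at the second hop:

```python
# Hypothetical sketch of the chain: client encoding -> server encoding ->
# NCHAR encoding. Returns the NCHAR bytes, or None if the final hop
# cannot represent the text (i.e. the chain is lossy).

def through_chain(text: str, client_enc: str, server_enc: str, nchar_enc: str):
    wire = text.encode(client_enc)                       # client -> wire
    stored = wire.decode(client_enc).encode(server_enc)  # wire -> server
    try:
        return stored.decode(server_enc).encode(nchar_enc)  # server -> NCHAR
    except UnicodeEncodeError:
        return None  # lossy: the NCHAR encoding lacks this character
```

A kanji survives UTF-8 -> UTF-8 -> Shift-JIS, but an emoji does not, which is exactly why the server/NCHAR combinations would have to be restricted to lossless pairs.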
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. Sorry, I didn't mean to imitate EnterpriseDB. My intent is just to increase the popularity of PostgreSQL (or prevent a drop in popularity?). NCHAR is so basic that we can and should accept proper support for it. Aliasing would be nice to some extent, if its official support were documented in the PG manual. However, mere aliasing loses the NCHAR type information through pg_dump. This works against one benefit of pg_dump -- allowing migration from PG to other DBMSs, possibly for performance comparison: http://www.postgresql.org/docs/current/static/app-pgdump.html Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. In addition, distinct types for NCHAR/NVARCHAR allow future extensions such as a separate encoding for NCHAR, e.g. UTF-16. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? Some other people in this community have suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Thanks for your idea; it sounds interesting, although I don't fully understand it yet. The idea is very simple:

CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
  id serial,
  ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
  koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
  json_data json ENCODING 'bson'    -- will use bson_to_json to read and json_to_bson to write
);

The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that is not even being considered in these discussions yet. If the ENCODING machinery received not only the encoding name but also the type OID, it would be possible to write encoders for TYPEs and arrays of TYPEs (I had to do this using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could be unpacked afterwards directly in the DB as normal database types). I hope I made my point a little bit clearer. Regards, Valentine Gogichashvili
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. I don't think the bind placeholder is a problem. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case. The problem is an ordinary query (simple protocol Q message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are in trouble, because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is: does that really happen? I don't think there's any text editor that can create mixed SHIFT-JIS and EUC-JP text. So my guess is that when a user wants to use NCHAR for SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, all the user needs to do is set the client encoding to SHIFT-JIS and everything should be fine. MauMau, is my guess correct? -- Tatsuo Ishii SRA OSS, Inc.
Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote: For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. As far as I can tell, the whole reason for introducing NCHAR is to support SHIFT-JIS; there hasn't been a call for any other encoding, that I can remember anyway. So rather than this whole NCHAR thing, why not just add a type sjistext, and a few type casts, and call it a day... Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii is...@postgresql.org wrote: What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
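Robert's point that validation happens at message receipt, before textin() ever runs, can be illustrated with a hedged Python sketch (the helper name is hypothetical): bytes that are perfectly valid EUC-JP are rejected outright when the database encoding is UTF-8, so no type input function gets a chance to interpret them.

```python
# Hypothetical sketch of the early sanitization step Robert describes:
# input bytes are checked against the database encoding on receipt, so
# type input functions like textin() never see invalid byte sequences.

def is_valid_in(raw: bytes, db_encoding: str) -> bool:
    """Would these bytes pass the database-encoding validity check?"""
    try:
        raw.decode(db_encoding)
        return True
    except UnicodeDecodeError:
        return False
```

An NCHAR column holding a different encoding would need this check relaxed or moved, which is the design question being raised.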
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Thu, Sep 19, 2013 at 6:42 PM, MauMau maumau...@gmail.com wrote: National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. We'd have to go back and search the archives to figure out why that item was added to the TODO, but I'd be surprised if anyone ever had it in mind to create additional types that behave just like existing types but with different names. I don't think that you'll be able to get consensus around that path on this mailing list. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited.
If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 9/20/13 2:22 PM, Robert Haas wrote: I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here. Also note that we already have NCHAR [VARYING]. It's mapped to char or varchar, respectively, in the parser, just like int, real, etc. are handled.
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tatsuo Ishii is...@postgresql.org What about limiting NCHAR to a database which has the same or a compatible encoding (one for which an encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding on the server side, so we can treat NCHAR exactly the same as CHAR afterward. I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow this, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. Thanks for the idea; it sounds flexible enough for wider use. Your cooperation in devising an implementation with as little code as possible would be much appreciated. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Valentine Gogichashvili val...@gmail.com the whole NCHAR appeared as a hack for systems that did not have it from the beginning. It would not be needed if all text were magically stored in UNICODE or UTF from the beginning and the idea of a character were the same as the idea of a rune, not a byte. I guess so, too. PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? Some other people in this community have suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Thanks for your idea; it sounds interesting, although I don't fully understand it yet. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Martijn van Oosterhout klep...@svana.org As far as I can tell the whole reason for introducing NCHAR is to support SHIFT-JIS, there hasn't been call for any other encodings, that I can remember anyway. Could you elaborate on this and give some sources? So rather than this whole NCHAR thing, why not just add a type sjistext, and a few type casts and call it a day... The main reason for supporting NCHAR types is to ease migration from other DBMSs without requiring DDL changes, so sjistext would not serve that purpose. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com I don't think that you'll be able to get consensus around that path on this mailing list. I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel the varchar/text case is a wart. I think text was introduced for a positive reason: to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html Although the type text is not in the SQL standard, several other SQL database management systems have it as well. And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea of why we won't get consensus? I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited. If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two? I agree with you about the arbitrary number. Tatsuo san gave us a good suggestion. Let's consider how to implement that. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii is...@postgresql.org wrote: What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. OK, I see your point. Let's consider that design. I'll learn the code regarding this. Does anybody, especially Tatsuo san, Tom san, Peter san, have any good idea? 
Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Wed, Sep 18, 2013 at 6:42 PM, MauMau maumau...@gmail.com wrote: It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. No, those are desirable if possible features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. That may be what's important to you, but it's not what's important to me. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com That may be what's important to you, but it's not what's important to me. National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of an explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precision and float In addition, the SQL standard itself admits: The key words NATIONAL CHARACTER are used to specify the character type with an implementation-defined character set. Special syntax (N'string') is provided for representing literals in that character set. ... NATIONAL CHARACTER is equivalent to the corresponding character string type with a specification of CHARACTER SET CSN, where CSN is an implementation-defined character set name. A national character string literal is equivalent to a character string literal with the N replaced by introducer <character set specification>, where <character set specification> is an implementation-defined character set name. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, the implementation-defined character set described in the SQL standard is the database character set.
- In the future, the character set for NCHAR can be selected at database creation, like Oracle's CREATE DATABASE NATIONAL CHARACTER SET AL16UTF16. The default is the database character set. Could you tell me what kind of specification we should implement if we officially support national character types? Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. Agreed. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. What about limiting to use NCHAR with a database which has same encoding or compatible encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. 
I suppose the encoding used for NCHAR should be defined at initdb time or at creation of the database (if we allow the latter, we need to add a new column to record which encoding is used for NCHAR). For example, CREATE TABLE t1(t NCHAR(10)) will succeed if NCHAR is UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR is SHIFT-JIS and the database encoding is UTF-8, because there is a conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because there is no conversion between them. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
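The admission rule Tatsuo describes can be sketched in a few lines. This is a minimal illustration, not PostgreSQL internals: the function name and the table of defined conversions are hypothetical, chosen only to mirror the three examples in the mail.

```python
# Hypothetical sketch of the proposed rule: an NCHAR column may be
# created only if the NCHAR encoding equals the database encoding, or
# an encoding conversion is defined between the two.

# Pairs for which a conversion is assumed to be defined (symmetric),
# mirroring the examples above; illustrative, not an exhaustive list.
CONVERSIONS = {
    frozenset({"UTF8", "SJIS"}),
    frozenset({"UTF8", "EUC_JP"}),
    frozenset({"UTF8", "LATIN1"}),
}

def nchar_allowed(nchar_encoding: str, db_encoding: str) -> bool:
    """True if an NCHAR column may be created in this database."""
    if nchar_encoding == db_encoding:
        return True
    return frozenset({nchar_encoding, db_encoding}) in CONVERSIONS

nchar_allowed("UTF8", "UTF8")    # same encoding: allowed
nchar_allowed("SJIS", "UTF8")    # conversion exists: allowed
nchar_allowed("SJIS", "LATIN1")  # no conversion defined: rejected
```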
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Hi, That may be what's important to you, but it's not what's important to me. National character type support may be important to some potential users of PostgreSQL and to the popularity of PostgreSQL, not to me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria include national character support. The whole NCHAR business appeared as a hack for systems that did not have Unicode support from the beginning. It would not be needed if all text had magically been stored in Unicode/UTF from the beginning and the idea of a character were the same as the idea of a rune, not a byte. PostgreSQL has very powerful possibilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added? It would make it possible to have a database that talks to the clients in UTF8 and stores text and varchar data in the encoding that is most appropriate for the situation. It would make it impossible (or complicated) to give the database a non-UTF8 default encoding (I wonder who would need that in this case), as conversions would not be possible from the broader charsets into the default database encoding. One could define an additional DATABASE property like LC_ENCODING that would work for the ENCODING property of a column the way LC_COLLATE works for the COLLATE property of a column. Text operations should work automatically, as in memory all strings will be converted to the database encoding. This approach would also open the possibility of implementing custom ENCODINGs for column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage. Regards, -- Valentine Gogichashvili
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
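Robert's point about the high-bit property can be checked directly with any codec library. A quick Python illustration (the thread is about server internals, not Python, so this is only an analogy):

```python
# In every supported PostgreSQL server encoding, a byte with the high
# bit clear IS the corresponding ASCII character. UTF-16 breaks that
# assumption: ordinary non-ASCII characters produce high-bit-clear
# bytes that are not ASCII text at all.

text = "あ"                    # U+3042, no ASCII characters involved
b = text.encode("utf-16-be")   # two bytes: 0x30 0x42

# Both bytes have the high bit clear...
assert all(byte < 0x80 for byte in b)
# ...so byte-oriented code scanning for ASCII would see "0B" here.
assert b == b"0B"
```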
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Robert Haas robertmh...@gmail.com writes: On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I share Robert's misgivings about difficulties in dealing with characters that are not representable in the database's principal encoding. Still, you probably won't find out about many of those until you try it. regards, tom lane
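Tom's surrogate-pair point is easy to demonstrate with standard codecs; a small Python check (illustrative only):

```python
# UTF-16 is only fixed-width inside the Basic Multilingual Plane.
# Characters beyond U+FFFF need a surrogate pair: two 16-bit code
# units, i.e. four bytes. So code assuming "one character = one code
# unit" silently breaks outside the BMP.

bmp_char = "漢"      # U+6F22, inside the BMP
astral_char = "𝄞"    # U+1D11E, MUSICAL SYMBOL G CLEF, outside the BMP

assert len(bmp_char.encode("utf-16-be")) == 2     # one code unit
assert len(astral_char.encode("utf-16-be")) == 4  # surrogate pair
# UTF-8 is variable-width too, but it keeps the ASCII-safety property:
assert len(astral_char.encode("utf-8")) == 4
```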
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 18.09.2013 16:16, Robert Haas wrote: On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. I'm not too thrilled about complicating the system for that, either. If you really need to deal with many different languages, you can do that today by using UTF-8 everywhere. Sure, it might not be the most efficient encoding for some characters, but it works.
There is one reason, however, that makes it a lot more compelling: we already support having databases with different encodings in the same cluster, but the encoding used in the shared catalogs, for usernames and database names for example, is not well-defined. If we dealt with different encodings in the same database, that inconsistency would go away. - Heikki
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Robert Haas robertmh...@gmail.com On Mon, Sep 16, 2013 at 8:49 AM, MauMau maumau...@gmail.com wrote: 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. ... 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. No, those are desirable if possible features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. 1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores N prefix. But this is not documented. 2. Declare support for national character support in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item national character support in PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. And when declaring that, we had better implement NCHAR types as distinct types with their own OIDs so that we can extend NCHAR behavior in the future. As the first stage, I think it's okay to treat NCHAR types exactly the same as CHAR/VARCHAR types. 
For example, in ECPG: switch (type) { case OID_FOR_CHAR: case OID_FOR_VARCHAR: case OID_FOR_TEXT: case OID_FOR_NCHAR: /* new code */ case OID_FOR_NVARCHAR: /* new code */ some processing; break; } And in JDBC, just call methods for non-national character types. Currently, those national character methods throw SQLException. public void setNString(int parameterIndex, String value) throws SQLException { setString(parameterIndex, value); } Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. Yes, you are probably right -- I'm not sure UTF-16 really has benefits that UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? I have no clear idea. Anyway, I don't strongly push for UTF-16 at the cost of complicating the encoding handling. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
From: Tom Lane t...@sss.pgh.pa.us Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I feel so. I guess the reason Windows, Java, and Oracle chose UTF-16 is that it was still UCS-2, covering only the BMP, when they adopted it. So character handling was easier and faster thanks to the fixed-width encoding. Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
-----Original Message----- From: pgsql-hackers-ow...@postgresql.org On Behalf Of MauMau Hello, I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs. [Reasons why we need NCHAR] -- 1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only major database that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users. 2. Enhance the open image of PostgreSQL by implementing more features of the SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented. 3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR. The use case we have is for customer(s) who are modernizing their databases on mainframes. These applications are typically written in COBOL, which does have extensive support for National Characters. Supporting National Characters as in-built data types in PostgreSQL is, not to exaggerate, an important criterion in their decision to use PostgreSQL or not. (So is Embedded COBOL. But that is a separate issue.) 4.
I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of Oracle Database Globalization Support Guide. -- I think we need to do the following: [Minimum requirements] -- 1. Accept NCHAR/NVARCHAR as data type names and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented. 2. Declare support for national characters in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item national character support in the PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. -- Agreed. This is our minimum requirement too. Rgds, Arul Shaji [Optional requirements] -- 1. Implement client driver support, such as: - NCHAR host variable type (e.g. NCHAR var_name[12];) in ECPG, as specified in the SQL standard. - national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0. I think at first we can treat these national-character-specific features the same as CHAR/VARCHAR. 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings.
This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support. 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. This would definitely be a welcome addition. -- I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values. How do you think we should implement NCHAR support? Regards MauMau
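The storage figures cited in point 4 above (2 bytes per kanji in ShiftJIS/EUC_JP versus 3 in UTF-8) can be verified with any Unicode-aware codec library; a quick Python check:

```python
# Most kanji sit in the U+0800..U+FFFF range, which UTF-8 encodes in
# three bytes, while the legacy Japanese double-byte encodings use two.

kanji = "漢"  # U+6F22
assert len(kanji.encode("shift_jis")) == 2
assert len(kanji.encode("euc_jp")) == 2
assert len(kanji.encode("utf-8")) == 3
```

So for predominantly-kanji text, UTF-8 does cost roughly 50% more storage than ShiftJIS or EUC_JP, which is the trade-off driving this use case.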
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Hello, I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs. [Reasons why we need NCHAR] -- 1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only major database that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users. 2. Enhance the open image of PostgreSQL by implementing more features of the SQL standard. NCHAR may be a wrong and unnecessary feature of the SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented. 3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR. 4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of Oracle Database Globalization Support Guide. -- I think we need to do the following: [Minimum requirements] -- 1. Accept NCHAR/NVARCHAR as data type names and N'...' syntactically. This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented. 2. Declare support for national characters in the manual. 1 is not sufficient because users don't want to depend on undocumented behavior.
This is exactly what the TODO item national character support in the PostgreSQL TODO wiki is about. 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - To implement additional features for NCHAR/NVARCHAR in the future, as described below. -- [Optional requirements] -- 1. Implement client driver support, such as: - NCHAR host variable type (e.g. NCHAR var_name[12];) in ECPG, as specified in the SQL standard. - national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0. I think at first we can treat these national-character-specific features the same as CHAR/VARCHAR. 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data. I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support. 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters. -- I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values. How do you think we should implement NCHAR support? Regards MauMau
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Boguk, Maksym maks...@fast.au.fujitsu.com writes: Hi, my task is implementing ANSI NATIONAL character string types as part of PostgreSQL core. No, that's not a given. You have a problem to solve, ie store some UTF8 strings in a database that's mostly just 1-byte data. It is not clear that NATIONAL CHARACTER is the best solution to that problem. And I don't think that you're going to convince anybody that this is an improvement in spec compliance, because there's too much gap between what you're doing here and what it says in the spec. Both of these approaches require dump/restore of the whole database, which is not always an option. That's a disadvantage, agreed, but it's not a large enough one to reject the approach, because what you want to do also has very significant disadvantages. I think it is extremely likely that we will end up rejecting a patch based on NATIONAL CHARACTER altogether. It will require too much duplicative code, it requires too many application-side changes to make use of the functionality, and it will break any applications that are relying on the current behavior of that syntax. But the real problem is that you're commandeering syntax defined in the SQL spec for what is in the end quite a narrow usage. I agree that the use-case will be very handy for some applications ... but if we were ever to try to achieve real spec compliance for the SQL features around character sets, this doesn't look like a step on the way to that. I think you'd be well advised to take a hard look at the specialized-database-encoding approach. From here it looks like a 99% solution for about 1% of the effort; and since it would be quite uninvasive to the system as a whole, it's unlikely that such a patch would get rejected. regards, tom lane
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
On 03.09.2013 05:28, Boguk, Maksym wrote: Target usage: ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. For example, if I have a ru-RU.koi8r encoded database with mostly Russian text inside, it would be nice to be able to store Japanese text in one field without converting the whole database to UTF8 (converting such a database to UTF8 could easily almost double the database size even if only one field in the whole database uses any symbols outside of the ru-RU.koi8r encoding). Ok. What has been done: 1) Addition of new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding (independent of the database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. What needs to be done: 1) Full set of string functions and operators for NATIONAL types (we could not use generic text functions because they assume that the strings will have the database encoding). Now only a basic set is implemented. 2) Need to implement some way to define a default collation for NATIONAL types. 3) Need to implement some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem... it will be defined later in the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently.
Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN. That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. - Heikki
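Heikki's remark that SJIS can only be a client encoding is easy to see concretely. A short Python check (purely illustrative of the byte-level property):

```python
# Some SJIS multi-byte characters have a trail byte with the high bit
# clear, so byte-oriented server code would mistake part of a character
# for an ASCII character. The classic offender is the kanji 表 (U+8868),
# whose second byte is 0x5C, the ASCII backslash.

b = "表".encode("shift_jis")
assert b == b"\x95\x5c"
assert b"\\" in b  # a naive parser would "see" an escape character here
```

This is exactly the property a server encoding must not have, and why any SCSU-like scheme would need all multi-byte bytes to keep the high bit set.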
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
Heikki Linnakangas hlinnakan...@vmware.com writes: On 03.09.2013 05:28, Boguk, Maksym wrote: Target usage: ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. +1. I'm not sure that SCSU satisfies the requirement (which I read as that Russian text should be pretty much 1 byte/character). But surely we could devise a variant that does. For instance, it could look like koi8r (or any other single-byte encoding of your choice) with one byte value, say 255, reserved as a prefix. 255 means that a UTF8 character follows. The main complication here is that you don't want to allow more than one way to represent a character --- else you break text hashing, for instance. So you'd have to take care that you never emit the 255+UTF8 representation for a character that can be represented in the single-byte encoding. In particular, you'd never encode ASCII that way, and thus this would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule. Ideally we could make a variant like this for each supported single-byte encoding, and thus you could optimize a database for mostly but not entirely LATIN1 text, etc. regards, tom lane
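Tom's escape-prefix scheme can be sketched in a few dozen lines. This is a minimal illustration under his stated rules (koi8r base, 0xFF reserved as the prefix, never escape a character the single-byte encoding can represent), not a proposed implementation; function names are hypothetical. Note one wrinkle the sketch must handle: in KOI8-R the byte 0xFF is itself assigned (to Ъ), so that one character must always go through the escape.

```python
ESC = 0xFF  # reserved prefix byte, per the sketch above

def encode_hybrid(text: str) -> bytes:
    """Encode as koi8r, falling back to ESC + UTF-8 per character."""
    out = bytearray()
    for ch in text:
        try:
            b = ch.encode("koi8_r")
            if b[0] != ESC:          # never emit the reserved byte bare
                out += b
                continue
        except UnicodeEncodeError:
            pass                     # not representable in koi8r
        out.append(ESC)
        out += ch.encode("utf-8")    # canonical escape form
    return bytes(out)

def decode_hybrid(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == ESC:
            i += 1
            lead = data[i]           # UTF-8 sequence length from lead byte
            n = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
            out.append(data[i:i + n].decode("utf-8"))
            i += n
        else:
            out.append(bytes([data[i]]).decode("koi8_r"))
            i += 1
    return "".join(out)

# Russian stays 1 byte/character; anything else costs ESC + UTF-8 bytes.
assert len(encode_hybrid("привет")) == 6
assert decode_hybrid(encode_hybrid("привет 漢字 ok")) == "привет 漢字 ok"
```

Because the encoder never uses the escape form for a koi8r-representable character (ASCII included), each character has exactly one representation, preserving the hashing property Tom mentions, and every multi-byte sequence starts with a high-bit-set byte.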
Re: [HACKERS] UTF8 national character data type support WIP patch and list of open issues.
1) Addition of new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding (independent of the database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. Hi, my task is implementing ANSI NATIONAL character string types as part of PostgreSQL core. And the requirement that drivers and client applications know about it is the reason why it could not be done as an add-on (these new types need a fixed OID for most drivers, from my experience). Implementing them as a UTF8 data type is a first step which allows having NATIONAL characters with an encoding different from the database encoding (and maybe even supporting multiple encodings for common string types in the future). 1) Full set of string functions and operators for NATIONAL types (we could not use generic text functions because they assume that the strings will have the database encoding). Now only a basic set is implemented. 2) Need to implement some way to define a default collation for NATIONAL types. 3) Need to implement some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem... it will be defined later in the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database.
Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN. That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. Both of these approaches require dump/restore of the whole database, which is not always an option. Implementing a UTF8 NATIONAL character as a new datatype would provide the option of using pg_upgrade to the latest version and getting the required functionality without prolonged downtime. PS: is it possible to reserve some narrow type OID range in PostgreSQL core for future use? Kind Regards, Maksym