Re: [HACKERS] Proposal - Support for National Characters functionality
Hi everyone, I will try to answer all the questions related to the proposed National Characters support.

> > 2) Provide support for the new GUC nchar_collation to provide the database with information about the default collation that needs to be used for the new data types.
>
> A GUC seems like completely the wrong tack to be taking. In the first place, that would mandate just one value (at a time anyway) of collation, which is surely not much of an advance over what's already possible. In the second place, what happens if you change the value? All your indexes on nchar columns are corrupt, that's what. Actually the data itself would be corrupt, if you intend that this setting determines the encoding and not just the collation. If you really are speaking only of collation, it's not clear to me exactly what this proposal offers that can't be achieved today (with greater security, functionality and spec compliance) by using COLLATE clauses on plain text columns. Actually, you really haven't answered at all what it is you want to do that COLLATE can't do.

I think I gave a wrong description there... it will not be a GUC but a GUC-type value which will be initialized during CREATE DATABASE and will be read-only afterwards, very similar to lc_collate. So I think the name national_lc_collate would be better. The function of this value is to provide information about the default collation for NATIONAL CHARACTERS inside the database. That does not limit the user's ability to use an alternative collation for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword. E.g. if we have a second encoding inside the database, we should have information about the collation used somewhere.

> > 4) Because all symbols from non-UTF8 encodings could be represented as UTF8 (but the reverse is not true), comparison between N* types and the regular string types inside the database will be performed in UTF8 form.
> I believe that in some Far Eastern character sets there are some characters that map to the same Unicode glyph, but that some people would prefer to keep separate. So transcoding to UTF8 isn't necessarily lossless. This is one of the reasons why we've resisted adopting ICU or standardizing on UTF8 as the One True Database Encoding. Now this may or may not matter for comparison to strings that were in some other encoding to start with --- but as soon as you base your design on the premise that UTF8 is a universal encoding, you are sliding down a slippery slope to a design that will meet resistance.

Would converting both sides to pg_wchar before comparison fix this problem? In any case, if the database is going to use more than one encoding, some universal form has to be used to allow comparisons between them. After some analysis I think pg_wchar is a better candidate for that role than UTF8.

> > 6) Client input/output of NATIONAL strings: NATIONAL strings will respect the client_encoding setting, and their values will be transparently converted to the requested client_encoding before being sent to (or received from) the client (the same mechanics as used for the usual string types). So no mixed encoding in client input/output will be supported/available.
>
> If you have this restriction, then I'm really failing to see what benefit there is over what can be done today with COLLATE.

There are two targets for this project:

1. A legacy database with a non-UTF8 encoding, which should support old non-UTF8 applications and new UTF8 applications. In that case the old applications will use the legacy database encoding (and because these applications are legacy, they don't work with the new NATIONAL CHARACTERS data/tables). The new applications will use client-side UTF8 encoding and will be able to store international texts in NATIONAL CHARACTER columns.
Dump/restore of the whole database to change the database encoding to UTF8 is not always possible, so there is a need for some easy-to-use workaround.

2. Better compatibility with the ANSI SQL standard.

Kind Regards, Maksym

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
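The pg_wchar idea above — comparing strings only after both sides are converted to a common code-point form — can be sketched in Python, with Python's str playing the role of pg_wchar. The encodings and sample text here are illustrative assumptions, not part of the proposal:

```python
# Compare two byte strings that carry the same text in different
# encodings by first decoding both into a common code-point form
# (Python's str stands in for PostgreSQL's pg_wchar here).
def compare_common_form(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    return a.decode(a_enc) == b.decode(b_enc)

latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'

# Raw byte comparison fails even though the text is identical...
assert latin1_bytes != utf8_bytes
# ...but comparison in the common decoded form succeeds.
assert compare_common_form(latin1_bytes, "latin-1", utf8_bytes, "utf-8")
```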
Re: [HACKERS] Proposal - Support for National Characters functionality
Boguk, Maksym wrote:

> I think I gave a wrong description there... it will not be a GUC but a GUC-type value which will be initialized during CREATE DATABASE and will be read-only afterwards, very similar to lc_collate. So I think the name national_lc_collate would be better. The function of this value is to provide information about the default collation for NATIONAL CHARACTERS inside the database. That does not limit the user's ability to use an alternative collation for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.

This seems a bit odd. I mean, if I want the option for differing encodings, surely I need to be able to set them for each column, not at the database level. Also, as far as I understand, what we want to control here is the encoding that the strings are in (the mapping of bytes to characters), not the collation (the way a set of strings are ordered). So it doesn't make sense to set the NATIONAL CHARACTER option using the COLLATE keyword.

-- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal - Support for National Characters functionality
Alvaro Herrera alvhe...@2ndquadrant.com writes:

> Also, as far as I understand, what we want to control here is the encoding that the strings are in (the mapping of bytes to characters), not the collation (the way a set of strings are ordered). So it doesn't make sense to set the NATIONAL CHARACTER option using the COLLATE keyword.

My thought is that we should simply ignore the NATIONAL CHARACTER syntax, which is not the first nor the last brain-damaged feature design in the SQL standard. It's basically useless for what we want because there's noplace to specify which encoding you mean. Instead, let's consider that COLLATE can define not only the collation but also the encoding of a string datum. Contrary to what I think you meant above, that seems perfectly sensible to me, because after all a collation is necessarily a bunch of rules about how to order a particular set of characters. If the data representation you use is unable to represent that set of characters, it's not a very meaningful combination, is it?

There's still the problem of how do you get a string of a nondefault encoding into the database in the first place. If you have to convert to DB encoding to get it in there, then what's the use of a further conversion? This consideration may well kill the whole concept. (It certainly kills NATIONAL CHARACTER syntax just as much.)

regards, tom lane
Re: [HACKERS] Proposal - Support for National Characters functionality
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]

> Boguk, Maksym wrote:
> > I think I gave a wrong description there... it will not be a GUC but a GUC-type value which will be initialized during CREATE DATABASE and will be read-only afterwards, very similar to lc_collate. So I think the name national_lc_collate would be better. The function of this value is to provide information about the default collation for NATIONAL CHARACTERS inside the database. That does not limit the user's ability to use an alternative collation for NATIONAL CHARACTERS during CREATE TABLE via the COLLATE keyword.
>
> This seems a bit odd. I mean, if I want the option for differing encodings, surely I need to be able to set them for each column, not at the database level. Also, as far as I understand, what we want to control here is the encoding that the strings are in (the mapping of bytes to characters), not the collation (the way a set of strings are ordered). So it doesn't make sense to set the NATIONAL CHARACTER option using the COLLATE keyword.

Yes, that is our idea too. For the SQL syntax

  Create table tbl1 (col1 nchar);

what should be the encoding and collation for col1? The idea is to have them in a separate encoding and collation (if needed) from that of the rest of the table. We have the following options:

a) Have GUC variables that determine the default encoding and collation for nchar/nvarchar columns. Note that the collation variable is a default only; users can still override it per column.

b) Have the encoding name and collation as part of the syntax, e.g. (col1 nchar encoding UTF-8 COLLATE C). Ugly, but workable.

c) Be rigid and say nchar/nvarchar columns are by default UTF-8 (or something else). One cannot change the default, but it can be overridden when declaring the column, with a syntax similar to (b).

Rgds, Arul Shaji
-- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal - Support for National Characters functionality
From: Tom Lane [mailto:t...@sss.pgh.pa.us]

> Alvaro Herrera alvhe...@2ndquadrant.com writes:
> > Also, as far as I understand, what we want to control here is the encoding that the strings are in (the mapping of bytes to characters), not the collation (the way a set of strings are ordered). So it doesn't make sense to set the NATIONAL CHARACTER option using the COLLATE keyword.
>
> My thought is that we should simply ignore the NATIONAL CHARACTER syntax, which is not the first nor the last brain-damaged feature design in the SQL standard. It's basically useless for what we want because there's noplace to specify which encoding you mean. Instead, let's consider that COLLATE can define not only the collation but also the encoding of a string datum.

Yes, I don't have a problem with this. If I understand you correctly, this will be simpler syntax-wise, but still get nchar/nvarchar data types into a table in a different encoding from the rest of the table.

> There's still the problem of how do you get a string of a nondefault encoding into the database in the first place.

Yes, that is the bulk of the work. It will need changes in a whole lot of places. Is a step-by-step approach worth exploring? Something similar to:

Step 1: Support nchar/nvarchar data types. Restrict them to UTF-8 databases to begin with.
Step 2: Support multiple encodings in a database. Remove the restriction imposed in step 1.

Rgds, Arul Shaji
Re: [HACKERS] Proposal - Support for National Characters functionality
-----Original Message----- From: Tatsuo Ishii [mailto:is...@postgresql.org]

> Also I don't understand why you need UTF-16 support as a database encoding, because UTF-8 and UTF-16 are logically equivalent; they are just different representations (encodings) of Unicode. That means if we already support UTF-8 (I'm sure we already do), there's no particular reason we need to add UTF-16 support. Maybe you just want to support UTF-16 as a client encoding?

Given below is a design draft for this functionality.

Core new functionality (new code):

1) Create and register independent NCHAR/NVARCHAR/NTEXT data types.

2) Provide support for the new GUC nchar_collation to provide the database with information about the default collation that needs to be used for the new data types.

3) Create encoding conversion subroutines to convert strings between the database encoding and UTF8 (from national strings to regular strings and back). PostgreSQL already has all the required support (used for conversion between the database encoding and client_encoding), so the amount of new code there will be minimal.

4) Because all symbols from non-UTF8 encodings can be represented as UTF8 (but the reverse is not true), comparison between N* types and the regular string types inside the database will be performed in UTF8 form. To achieve this, new IMPLICIT casts may need to be created: NCHAR -> CHAR, NVARCHAR -> VARCHAR, NTEXT -> TEXT. Casting in the reverse direction will be available too, but only as EXPLICIT. However, these casts could fail if national strings cannot be represented in the database encoding in use. All these casts will use the subroutines created in 3). Casting/conversion between N* types will follow the same rules/mechanics as used for casting/conversion between the usual (CHAR(N)/VARCHAR(N)/TEXT) string types.

5) Comparison between NATIONAL string values will be performed via specialized UTF8-optimized functions (with respect to the nchar_collation setting).
6) Client input/output of NATIONAL strings: NATIONAL strings will respect the client_encoding setting, and their values will be transparently converted to the requested client_encoding before being sent to (or received from) the client (the same mechanics as used for the usual string types). So no mixed encoding in client input/output will be supported/available.

7) Create a set of regression tests for these new data types.

Additional changes:

1) ECPG support for these new types.

2) Support in the database drivers for the new data types.

Rgds, Arul Shaji

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
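The asymmetry in point 4 of the draft — implicit casts toward UTF8 always succeed, while explicit casts back can fail — can be sketched in Python. The encodings chosen here are illustrative assumptions, not the actual cast implementation:

```python
# Point 4 of the draft: converting any database encoding to UTF-8
# always succeeds, but the reverse cast can fail when the national
# string contains characters the database encoding cannot hold.
def implicit_cast_to_utf8(raw: bytes, db_encoding: str) -> bytes:
    # Always succeeds: every character of a supported server
    # encoding has a Unicode (UTF-8) representation.
    return raw.decode(db_encoding).encode("utf-8")

def explicit_cast_from_utf8(raw: bytes, db_encoding: str) -> bytes:
    # May raise UnicodeEncodeError, the analogue of the cast failure.
    return raw.decode("utf-8").encode(db_encoding)

# LATIN1 text survives the round trip through UTF-8...
assert explicit_cast_from_utf8(
    implicit_cast_to_utf8(b"caf\xe9", "latin-1"), "latin-1") == b"caf\xe9"

# ...but Japanese text stored as UTF-8 cannot be cast down to LATIN1.
try:
    explicit_cast_from_utf8("日本語".encode("utf-8"), "latin-1")
    raise AssertionError("expected the cast to fail")
except UnicodeEncodeError:
    pass
```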
Re: [HACKERS] Proposal - Support for National Characters functionality
Arulappan, Arul Shaji a...@fast.au.fujitsu.com writes:

> Given below is a design draft for this functionality. Core new functionality (new code): 1) Create and register independent NCHAR/NVARCHAR/NTEXT data types. 2) Provide support for the new GUC nchar_collation to provide the database with information about the default collation that needs to be used for the new data types.

A GUC seems like completely the wrong tack to be taking. In the first place, that would mandate just one value (at a time anyway) of collation, which is surely not much of an advance over what's already possible. In the second place, what happens if you change the value? All your indexes on nchar columns are corrupt, that's what. Actually the data itself would be corrupt, if you intend that this setting determines the encoding and not just the collation. If you really are speaking only of collation, it's not clear to me exactly what this proposal offers that can't be achieved today (with greater security, functionality and spec compliance) by using COLLATE clauses on plain text columns. Actually, you really haven't answered at all what it is you want to do that COLLATE can't do.

> 4) Because all symbols from non-UTF8 encodings could be represented as UTF8 (but the reverse is not true), comparison between N* types and the regular string types inside the database will be performed in UTF8 form.

I believe that in some Far Eastern character sets there are some characters that map to the same Unicode glyph, but that some people would prefer to keep separate. So transcoding to UTF8 isn't necessarily lossless. This is one of the reasons why we've resisted adopting ICU or standardizing on UTF8 as the One True Database Encoding. Now this may or may not matter for comparison to strings that were in some other encoding to start with --- but as soon as you base your design on the premise that UTF8 is a universal encoding, you are sliding down a slippery slope to a design that will meet resistance.
> 6) Client input/output of NATIONAL strings: NATIONAL strings will respect the client_encoding setting, and their values will be transparently converted to the requested client_encoding before being sent to (or received from) the client (the same mechanics as used for the usual string types). So no mixed encoding in client input/output will be supported/available.

If you have this restriction, then I'm really failing to see what benefit there is over what can be done today with COLLATE.

regards, tom lane
Re: [HACKERS] Proposal - Support for National Characters functionality
On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:

> > Does support for alternative multi-byte encodings have something to do with the Han unification controversy? I don't know terribly much about this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other encodings, such as Shift Jis. There are lots of discussions on this. Here is the one from Microsoft: http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of encodings), it has the unusual feature of providing multiple ways to encode the same character. This is not even a Han unification issue; those have largely been addressed. For example, the square-root symbol exists twice (0x8795 and 0x81E3), as do many other mathematical symbols. Here's the code page, which you can browse online: http://msdn.microsoft.com/en-us/goglobal/cc305152

This means that to be round-trippable, Unicode would have to double those characters, but that would make it hard/impossible to round-trip with any other character set that had those characters. No easy solution here.

Something that has been done before [1] is to map the doubles to the private use area of the Unicode space (0xE000-0xF8FF). It gives you round-trip support at the expense of having to handle those characters yourself. But since Postgres doesn't do anything meaningful with Unicode characters, this might be acceptable.

[1] Python does a similar trick to handle filenames coming from disk in an unknown encoding: http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

Have a nice day,

-- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
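Martijn's Shift-JIS duplication can be observed directly with Python's cp932 codec (Microsoft's Shift-JIS variant), using the two byte values cited above; treat this as a sketch of the problem rather than anything Postgres-specific:

```python
# cp932 encodes the square-root symbol twice: once in the standard
# JIS X 0208 rows (0x81E3) and once in the NEC special-character
# rows (0x8795). Both decode to the same code point, U+221A.
s1 = b"\x81\xe3".decode("cp932")
s2 = b"\x87\x95".decode("cp932")
assert s1 == s2 == "\u221a"   # both are '√'

# Re-encoding picks one canonical byte sequence, so the round trip
# through Unicode is lossy for the 0x8795 spelling.
assert "\u221a".encode("cp932") == b"\x81\xe3"
```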
Re: [HACKERS] Proposal - Support for National Characters functionality
On Mon, Jul 15, 2013 at 4:37 AM, Robert Haas robertmh...@gmail.com wrote:

> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> > Yes, from what I know almost all use utf8 without problems. For a long time I haven't seen any request for multi-encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages.

What cannot be represented as UTF-8? UTF-8 can represent every character in the Unicode character set, whereas UTF-16 can encode characters 0 to 0x10FFFF.

Does support for alternative multi-byte encodings have something to do with the Han unification controversy? I don't know terribly much about this, so apologies if that's just wrong.

-- Peter Geoghegan
Re: [HACKERS] Proposal - Support for National Characters functionality
> On Mon, Jul 15, 2013 at 4:37 AM, Robert Haas robertmh...@gmail.com wrote:
> > On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> > > Yes, from what I know almost all use utf8 without problems. For a long time I haven't seen any request for multi-encoding support.
> >
> > Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages.
>
> What cannot be represented as UTF-8? UTF-8 can represent every character in the Unicode character set, whereas UTF-16 can encode characters 0 to 0x10FFFF.
>
> Does support for alternative multi-byte encodings have something to do with the Han unification controversy? I don't know terribly much about this, so apologies if that's just wrong.

There's a famous problem regarding conversion between Unicode and other encodings, such as Shift Jis. There are lots of discussions on this. Here is the one from Microsoft: http://support.microsoft.com/kb/170559/EN-US

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Proposal - Support for National Characters functionality
> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> > Yes, from what I know almost all use utf8 without problems. For a long time I haven't seen any request for multi-encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages. If we chose to do it, I think that per-column encoding support would end up looking a lot like per-column collation support: it would be yet another per-column property along with typoid, typmod, and typcollation. I'm not entirely sure it's worth it, although FWIW I do believe Oracle has something like this.

> Yes, the idea is that users will be able to declare columns of type NCHAR or NVARCHAR which will use the pre-determined encoding type. If we say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding irrespective of the database encoding. It will be up to us to restrict what Unicode encodings we want to support for NCHAR/NVARCHAR columns. This is based on my interpretation of the SQL standard. As you allude to above, Oracle has a similar behaviour (they support UTF-16 as well). Support for UTF-16 will be difficult without linking with some external libraries such as ICU.

Can you please elaborate more on this? Why exactly do you need ICU? Also I don't understand why you need UTF-16 support as a database encoding, because UTF-8 and UTF-16 are logically equivalent; they are just different representations (encodings) of Unicode. That means if we already support UTF-8 (I'm sure we already do), there's no particular reason we need to add UTF-16 support. Maybe you just want to support UTF-16 as a client encoding?

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
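Ishii-san's point that UTF-8 and UTF-16 are two byte-level encodings of the same character set can be illustrated in Python; the sample strings are arbitrary:

```python
# UTF-8 and UTF-16 are different byte encodings of the same Unicode
# code-point sequence, so converting between them is always lossless.
for text in ["hello", "日本語", "\U0001F600"]:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    # A round trip through either encoding recovers the identical string.
    assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == text
```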
Re: [HACKERS] Proposal - Support for National Characters functionality
On Mon, Jul 15, 2013 at 8:58 AM, Tatsuo Ishii is...@postgresql.org wrote:

> Also I don't understand why you need UTF-16 support as a database encoding, because UTF-8 and UTF-16 are logically equivalent; they are just different representations (encodings) of Unicode. That means if we already support UTF-8 (I'm sure we already do), there's no particular reason we need to add UTF-16 support.

To be fair, there is a small reason to support UTF-16 even with UTF-8 available. I personally do not find it compelling, but perhaps I am not best placed to judge such things. As Wikipedia says in the English UTF-8 article:

"Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text but rarely for HTML documents. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space in UTF-16 than in UTF-8."

This is the only advantage of UTF-16 over UTF-8 as a server encoding. I'm inclined to take the fact that there have been so few (no?) complaints from PostgreSQL's large Japanese user base about the lack of UTF-16 support as suggesting that it isn't considered a compelling feature in the CJK realm.

-- Peter Geoghegan
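The size trade-off quoted above is easy to check in Python; the sample strings are illustrative:

```python
# Characters in U+0800..U+FFFF (which covers most CJK) take three
# bytes in UTF-8 but only two in UTF-16, while ASCII takes one byte
# in UTF-8 and two in UTF-16.
cjk = "日本語"        # three characters above U+0800
ascii_text = "hello"  # five ASCII characters

assert len(cjk.encode("utf-8")) == 9              # 3 bytes per character
assert len(cjk.encode("utf-16-le")) == 6          # 2 bytes per character
assert len(ascii_text.encode("utf-8")) == 5       # 1 byte per character
assert len(ascii_text.encode("utf-16-le")) == 10  # 2 bytes per character
```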
Re: [HACKERS] Proposal - Support for National Characters functionality
On 7/15/13 1:26 AM, Arulappan, Arul Shaji wrote:

> Yes, the idea is that users will be able to declare columns of type NCHAR or NVARCHAR which will use the pre-determined encoding type. If we say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding irrespective of the database encoding. It will be up to us to restrict what Unicode encodings we want to support for NCHAR/NVARCHAR columns.

I would try implementing this as an extension at first, with a new data type that is internally encoded differently. We have citext as precedent for successfully implementing text-like data types in user space.
Re: [HACKERS] Proposal - Support for National Characters functionality
On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:

> Yes, from what I know almost all use utf8 without problems. For a long time I haven't seen any request for multi-encoding support.

Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages. If we chose to do it, I think that per-column encoding support would end up looking a lot like per-column collation support: it would be yet another per-column property along with typoid, typmod, and typcollation. I'm not entirely sure it's worth it, although FWIW I do believe Oracle has something like this. At any rate, it seems like quite a lot of work.

Another idea would be to do something like what we do for range types - i.e. allow a user to declare a type that is a differently-encoded version of some base type. But even that seems pretty hard.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal - Support for National Characters functionality
> On Fri, Jul 5, 2013 at 2:35 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> > Yes, from what I know almost all use utf8 without problems. For a long time I haven't seen any request for multi-encoding support.
>
> Well, not *everything* can be represented as UTF-8; I think this is particularly an issue with Asian languages. If we chose to do it, I think that per-column encoding support would end up looking a lot like per-column collation support: it would be yet another per-column property along with typoid, typmod, and typcollation. I'm not entirely sure it's worth it, although FWIW I do believe Oracle has something like this.

Yes, the idea is that users will be able to declare columns of type NCHAR or NVARCHAR which will use the pre-determined encoding type. If we say that NCHAR is UTF-8 then the NCHAR column will be of UTF-8 encoding irrespective of the database encoding. It will be up to us to restrict what Unicode encodings we want to support for NCHAR/NVARCHAR columns. This is based on my interpretation of the SQL standard. As you allude to above, Oracle has a similar behaviour (they support UTF-16 as well). Support for UTF-16 will be difficult without linking with some external libraries such as ICU.

> At any rate, it seems like quite a lot of work.

Thanks for putting my mind at ease ;-)

Rgds, Arul Shaji

> Another idea would be to do something like what we do for range types - i.e. allow a user to declare a type that is a differently-encoded version of some base type. But even that seems pretty hard.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal - Support for National Characters functionality
Ishii-san,

Thank you for your positive and early response.

-----Original Message----- From: Tatsuo Ishii [mailto:is...@postgresql.org] Sent: Friday, 5 July 2013 3:02 PM To: Arulappan, Arul Shaji Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Proposal - Support for National Characters functionality

> Arul Shaji,
>
> NCHAR support has been on our TODO list for some time, and I would like to welcome efforts trying to implement it. However I have a few questions:
>
> > This is a proposal to implement functionalities for the handling of National Characters.
> >
> > [Introduction]
> > The aim of this proposal is to eventually have a way to represent 'National Characters' in a uniform way, even in non-UTF8 encoded databases. Many of our customers in the Asian region are now, as part of their platform modernization, moving away from mainframes where they have used National Characters representation in COBOL and other databases. Having stronger support for national characters representation will also make it easier for these customers to look at PostgreSQL more favourably when migrating from other well known RDBMSs, which all have varying degrees of NCHAR/NVARCHAR support.
> >
> > [Specifications]
> > Broadly speaking, the national characters implementation will ideally include the following:
> > - Support for NCHAR/NVARCHAR data types
> > - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8 databases
>
> I think this is not a trivial piece of work because we do not have a framework to allow mixed encodings in a database. I'm interested in how you are going to solve the problem.

I would be lying if I said I have the design already spec'ed out. I will be working on this in the coming weeks and hope to design a working solution in consultation with the community.

> > - Support for UTF16 column encoding and representing NCHAR and NVARCHAR columns in UTF16 encoding in all databases.
>
> Why do you need UTF-16 as the database encoding? UTF-8 is already supported, and any UTF-16 character can be represented in UTF-8 as far as I know.
Yes, that's correct. However, there are advantages in using UTF-16 encoding for those characters that are always going to take at least two bytes to represent. Having said that, my intention is to use UTF-8 for NCHAR as well. Supporting UTF-16 would be even more complicated, as it is not supported natively on some Linux platforms. I only included it to give an option.

> > - Support for a NATIONAL_CHARACTER_SET GUC variable that will determine the encoding that will be used in NCHAR/NVARCHAR columns.
>
> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's encoding is fixed to UTF-8?

If we are going to support only UTF-8 for NCHAR, then we obviously don't need the GUC variable.

Rgds, Arul Shaji

> > The above points are at the moment a 'wishlist' only. Our aim is to tackle them one-by-one as we progress. I will send a detailed proposal later with more technical details. The main aim at the moment is to get some feedback on the above to know if this feature is something that would benefit PostgreSQL in general, and if users maintaining DBs in non-English speaking regions will find this beneficial.
> >
> > Rgds, Arul Shaji
> >
> > P.S.: It has been quite some time since I sent a correspondence to this list. Our mail server adds a standard legal disclaimer to all outgoing mails, which I know this list is not a huge fan of. I used to have an exemption for the mails I send to this list. If the disclaimer appears, apologies in advance. I will rectify that on the next one.

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Proposal - Support for National Characters functionality
-----Original Message----- From: Claudio Freire [mailto:klaussfre...@gmail.com] Sent: Friday, 5 July 2013 3:41 PM To: Tatsuo Ishii Cc: Arulappan, Arul Shaji; PostgreSQL-Dev Subject: Re: [HACKERS] Proposal - Support for National Characters functionality

> On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii is...@postgresql.org wrote:
> > > - Support for a NATIONAL_CHARACTER_SET GUC variable that will determine the encoding that will be used in NCHAR/NVARCHAR columns.
> >
> > You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's encoding is fixed to UTF-8?
>
> Not only that, but I don't think it can be a GUC. Maybe a compile-time switch, but if it were a GUC, how do you handle an existing database in UTF-8 when the setting is switched to UTF-16? Re-encode everything? Store the encoding along each value? It's a mess. Either fix it at UTF-8, or make it a compile-time thing, I'd say.

Agreed that, to begin with, we should only support UTF-8 encoding for NCHAR columns. If that is the case, do we still need a compile-time option to turn NCHAR functionality on/off?

Rgds, Arul Shaji
Re: [HACKERS] Proposal - Support for National Characters functionality
On 07/05/2013 02:12 AM, Arulappan, Arul Shaji wrote:
> > > - Support for UTF16 column encoding and representing NCHAR and NVARCHAR columns in UTF16 encoding in all databases.
> >
> > Why do you need UTF-16 as the database encoding? UTF-8 is already supported, and any UTF-16 character can be represented in UTF-8 as far as I know.
>
> Yes, that's correct. However, there are advantages in using UTF-16 encoding for those characters that are always going to take at least two bytes to represent.

Any suggestion to store data in UTF-16 is likely to be a complete non-starter. I suggest you research our previously stated requirements for server-side encodings.

cheers

andrew
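The storage trade-off argued over in this exchange is easy to quantify. The following is a small illustrative sketch (Python, not part of the proposal or any patch) comparing how many bytes the same text occupies under each encoding: common CJK ideographs take three bytes in UTF-8 but two in UTF-16, while for ASCII the advantage reverses.

```python
# Compare the storage size of the same string under UTF-8 and UTF-16.
# "utf-16-le" is used so the byte counts exclude a byte-order mark.

def encoded_sizes(s: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for string s."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

cjk = "漢字"       # two common CJK ideographs (both in the BMP)
ascii_text = "ab"  # two ASCII characters

print(encoded_sizes(cjk))         # CJK: 6 bytes as UTF-8, 4 as UTF-16
print(encoded_sizes(ascii_text))  # ASCII: 2 bytes as UTF-8, 4 as UTF-16
```

So the claimed UTF-16 advantage is real for CJK-heavy data, at the cost of doubling the size of ASCII, which is part of why the trade-off is contentious.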
Re: [HACKERS] Proposal - Support for National Characters functionality
On 7/4/13 10:11 PM, Arulappan, Arul Shaji wrote:
> The main aim at the moment is to get some feedback on the above to know if this feature is something that would benefit PostgreSQL in general, and if users maintaining DBs in non-English speaking regions will find this beneficial.

For European languages, I think everyone has moved to using Unicode, so the demand for supporting multiple encodings is approaching zero. The CJK realm might have different requirements.
Re: [HACKERS] Proposal - Support for National Characters functionality
Yes, as far as I know almost everyone uses UTF-8 without problems. It has been a long time since I last saw a request for multi-encoding support.

On 5.7.2013 20:28, Peter Eisentraut pete...@gmx.net wrote:
> On 7/4/13 10:11 PM, Arulappan, Arul Shaji wrote:
> > The main aim at the moment is to get some feedback on the above to know if this feature is something that would benefit PostgreSQL in general, and if users maintaining DBs in non-English speaking regions will find this beneficial.
>
> For European languages, I think everyone has moved to using Unicode, so the demand for supporting multiple encodings is approaching zero. The CJK realm might have different requirements.
[HACKERS] Proposal - Support for National Characters functionality
This is a proposal to implement functionalities for the handling of National Characters.

[Introduction]
The aim of this proposal is to eventually have a way to represent 'National Characters' in a uniform way, even in non-UTF8 encoded databases. Many of our customers in the Asian region are now, as part of their platform modernization, moving away from mainframes where they have used National Characters representation in COBOL and other databases. Having stronger support for national characters representation will also make it easier for these customers to look at PostgreSQL more favourably when migrating from other well-known RDBMSs, all of which have varying degrees of NCHAR/NVARCHAR support.

[Specifications]
Broadly speaking, the national characters implementation will ideally include the following:
- Support for NCHAR/NVARCHAR data types
- Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8 databases
- Support for UTF-16 column encoding, and representing NCHAR and NVARCHAR columns in UTF-16 encoding in all databases
- Support for a NATIONAL_CHARACTER_SET GUC variable that will determine the encoding used in NCHAR/NVARCHAR columns

The above points are at the moment a 'wishlist' only. Our aim is to tackle them one by one as we progress. I will send a detailed proposal later with more technical details. The main aim at the moment is to get some feedback on the above to know whether this feature is something that would benefit PostgreSQL in general, and whether users maintaining DBs in non-English-speaking regions will find this beneficial.

Rgds,
Arul Shaji

P.S.: It has been quite some time since I sent a correspondence to this list. Our mail server adds a standard legal disclaimer to all outgoing mails, which I know this list is not a huge fan of. I used to have an exemption for the mails I send to this list. If the disclaimer appears, apologies in advance. I will rectify that on the next one.
Re: [HACKERS] Proposal - Support for National Characters functionality
Arul Shaji,

NCHAR support has been on our TODO list for some time, and I would like to welcome efforts to implement it. However, I have a few questions:

> This is a proposal to implement functionalities for the handling of National Characters.
>
> [Introduction]
> The aim of this proposal is to eventually have a way to represent 'National Characters' in a uniform way, even in non-UTF8 encoded databases. Many of our customers in the Asian region are now, as part of their platform modernization, moving away from mainframes where they have used National Characters representation in COBOL and other databases. Having stronger support for national characters representation will also make it easier for these customers to look at PostgreSQL more favourably when migrating from other well-known RDBMSs, all of which have varying degrees of NCHAR/NVARCHAR support.
>
> [Specifications]
> Broadly speaking, the national characters implementation will ideally include the following:
> - Support for NCHAR/NVARCHAR data types
> - Representing NCHAR and NVARCHAR columns in UTF-8 encoding in non-UTF8 databases

I think this is not trivial work, because we do not have a framework to allow mixed encodings in a database. I'm interested in how you are going to solve the problem.

> - Support for UTF16 column encoding and representing NCHAR and NVARCHAR columns in UTF16 encoding in all databases.

Why do you need UTF-16 as the database encoding? UTF-8 is already supported, and any UTF-16 character can be represented in UTF-8 as far as I know.

> - Support for NATIONAL_CHARACTER_SET GUC variable that will determine the encoding that will be used in NCHAR/NVARCHAR columns.

You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's encoding is fixed to UTF-8?

> The above points are at the moment a 'wishlist' only. Our aim is to tackle them one-by-one as we progress. I will send a detailed proposal later with more technical details.
> The main aim at the moment is to get some feedback on the above to know if this feature is something that would benefit PostgreSQL in general, and if users maintaining DBs in non-English speaking regions will find this beneficial.
>
> Rgds,
> Arul Shaji
>
> P.S.: It has been quite some time since I sent a correspondence to this list. Our mail server adds a standard legal disclaimer to all outgoing mails, which I know this list is not a huge fan of. I used to have an exemption for the mails I send to this list. If the disclaimer appears, apologies in advance. I will rectify that on the next one.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
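Tatsuo's observation that any UTF-16 character can be represented in UTF-8 holds even outside the Basic Multilingual Plane: code points above U+FFFF need a surrogate pair in UTF-16 but encode directly as a single four-byte sequence in UTF-8. A quick illustrative check (the specific code point is my choice, not from the thread):

```python
# Every UTF-16-encodable code point is also UTF-8-encodable.
# U+20BB7 (a rare ideograph) lies outside the BMP: UTF-16 needs a
# surrogate pair for it, while UTF-8 uses one 4-byte sequence.

ch = "\U00020bb7"

utf16 = ch.encode("utf-16-le")  # 4 bytes: one surrogate pair
utf8 = ch.encode("utf-8")       # 4 bytes: one UTF-8 sequence

assert len(utf16) == 4 and len(utf8) == 4
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == ch
print("round trip OK")
```

This supports the narrow point that UTF-8 loses nothing relative to UTF-16 at the code-point level; the separate Han-unification concern raised in this thread is about mapping legacy charsets into Unicode at all, not about UTF-8 versus UTF-16.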
Re: [HACKERS] Proposal - Support for National Characters functionality
On Fri, Jul 5, 2013 at 2:02 AM, Tatsuo Ishii is...@postgresql.org wrote:
> > - Support for NATIONAL_CHARACTER_SET GUC variable that will determine the encoding that will be used in NCHAR/NVARCHAR columns.
>
> You said NCHAR's encoding is UTF-8. Why do you need the GUC if NCHAR's encoding is fixed to UTF-8?

Not only that, but I don't think it can be a GUC. Maybe a compile-time switch, but if it were a GUC, how do you handle an existing database in UTF-8 when the setting is switched to UTF-16? Re-encode everything? Store the encoding along each value? It's a mess. Either fix it at UTF-8, or make it a compile-time thing, I'd say.
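The corruption scenario behind this objection is easy to demonstrate: bytes written under one encoding become garbage, or outright undecodable, when reinterpreted under another. A tiny illustrative sketch (the "setting switch" is simulated; NATIONAL_CHARACTER_SET is only a proposed name, not an existing GUC):

```python
# Bytes stored while the proposed setting hypothetically said UTF-8...
stored = "café".encode("utf-8")  # 5 bytes: 'é' is a 2-byte sequence

# ...then reread after the setting is "switched" to UTF-16:
try:
    reread = stored.decode("utf-16-le")
except UnicodeDecodeError:
    reread = None  # odd byte count: not even structurally valid UTF-16

print(stored, reread)
```

Nothing re-encodes the stored bytes when a setting changes, which is why a per-database value fixed at CREATE DATABASE time (like lc_collate), a compile-time switch, or a single fixed encoding are the options discussed here.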