On Sun, Jun 27, 2010 at 5:18 PM, Georg Seifert <georg.seif...@gmx.de> wrote:
> Hi,
>
> Does anyone has information on how to use Unicode code points higher than
> 0xFFFF.
> I need to add some supplementary multilingual plane code points to a
> NSString.
>
> I can use something like this:
> NSString *aString = @"\U0001ABCD"; //this prints fine but the
> [aString length] is 2
>
> But if I have the unicode value as a int (unichar is to small)
> int Char = 0x1ABCD;
> NSString *aString = [NSString stringWithFormat:@"%C", Char]; //The
> resulting string contains one character with a unicode value of "ABCD".
>
> What is the recommended way to use/create UTF-32 strings in Cocoa.
>

Others have pointed out some other solutions to this problem, but I thought I'd toss in my $0.14143^2.

Taken from http://regexkit.sourceforge.net/RegexKitLite/index.html#RegexKitLiteCookbook, specifically the "Enhanced Copy To Clipboard Functionality" section, which is only visible when using Safari, so be sure that's the browser you're using.

Use C99 \u character escapes

Normally, Unicode characters are embedded in string literals as the character's UTF-8 byte sequence using \ddd octal escapes. When this option is enabled, the C99 \u and \U character escape sequences are used instead. gcc will issue a warning if \u character escape sequences are present and the compiler is not configured to use the C99 (or later) standard (i.e., gcc -std=(c|gnu)99).

Under the C99 standard, \u and \U are used to specify a universal character name, which is a character encoded in the ISO/IEC 10646 character set (essentially identical to Unicode in this context). Ultimately, a universal character name is translated into the sequence of bytes needed to represent the designated character in the C environment's execution character set. Usually, although certainly not always, a string literal should be encoded as UTF-8, which happens to be the default execution character set for gcc.

This is an important point to remember because the more convenient and easier to use \u escape sequences are not guaranteed to convert into a specific sequence of bytes, unlike an octal \ddd or hex \xhh escape sequence. There is currently no way to specify that a particular string literal should always be translated using a specific character set encoding. This may result in undefined behavior if the \u universal character name is not translated into the expected character set, which in this case must be UTF-8.

Escaped Unicode in NSString literals

Prior to Xcode 3.0, gcc only supported the use of ASCII characters (i.e., characters ≤ 127) in constant NSString literals. If one needed to include Unicode characters in an NSString, one would typically convert the string into UTF-8, and then create an NSString at run time using the stringWithUTF8String: method, with the UTF-8 encoded C string passed as the argument. For example, "€1.99", which contains the € euro symbol, would be created using the following:

    NSString *euroString = [NSString stringWithUTF8String:"\342\202\2541.99"];
    // or with C99 \u character escapes:
    NSString *euroString = [NSString stringWithUTF8String:"\u20ac1.99"];

One of the obvious disadvantages of this approach is that it instantiates a new, autoreleased NSString each time it's used, unlike a constant NSString literal like @"$1.99".

Beginning with Xcode 3.0 and gcc 4.0, constant NSString literals that contain Unicode characters can be specified directly in source code using the standard @"" syntax. For example:

    NSString *euroString = @"\342\202\2541.99";
    // or with C99 \u character escapes:
    NSString *euroString = @"\u20ac1.99";

The compiler converts these strings to UTF-16 using the endianness of the target architecture. Since the Mach-O object file format allows for multiple architectures, each architecture gets the string encoded in its own native UTF-16 byte ordering, so there are no issues with proper byte ordering.
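
The octal escapes in the euro example are nothing more than the UTF-8 bytes of U+20AC written in base 8. As a rough sketch of where they come from (utf8_encode is a made-up helper name for illustration, not part of any library; it is plain C, which also compiles as Objective-C):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes into out and returns the number of bytes
   written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x20AC, buf) fills buf with 0xE2 0x82 0xAC, which are \342 \202 \254 in octal, the same three escapes used in the euro string.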

Within the object file itself, these strings are essentially identical to their ASCII-only counterparts: effectively they are pre-instantiated objects. The only real difference is that the compiler sets some internal CFString bits differently so that the CFString object knows that the string's data is encoded as UTF-16 and not simple 8-bit data.

Although this functionality has been present since the release of 10.5, it has only recently been documented in The Objective-C 2.0 Programming Language - Compiler Directives, under the @"string" entry. A copy of the relevant text is provided below:

    On Mac OS X v10.4 and earlier, the string must be 7-bit ASCII-encoded. On Mac OS X v10.5 and later (with Xcode 3.0 and later), you can also use UTF-16 encoded strings. (The runtime from Mac OS X v10.2 and later supports UTF-16 encoded strings, so if you use Mac OS X v10.5 to compile an application for Mac OS X v10.2 and later, you can use UTF-16 encoded strings.)

--------

Some other points:

To encode a Unicode code point that is > 0xFFFF, you have two options if you are using Xcode >= 3.0:

1) The easy way: use \U (note: \U, with an uppercase U), which takes the form "\UHHHHHHHH", where H is [0-9a-fA-F].

2) The hard way: use \u (note: \u, with a lowercase u), which requires you to manually convert the code point into a UTF-16 surrogate pair.

An example of the character "𝄞", or U+1D11E, MUSICAL SYMBOL G CLEF:

1) @"\U0001D11E"

2) @"\uD834\uDD1E"

Note: There is only a single \, or backslash, in the above. Some people have said you should use @"\\U0001D11E", which would give you the string "\U0001D11E", not quite what you want.
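
The manual conversion that option 2 asks for is a few lines of arithmetic. Here is a minimal sketch in plain C (it also compiles as Objective-C); utf16_surrogates is a made-up helper name for illustration, not an existing API:

```c
#include <stdint.h>

/* Hypothetical helper: split a code point in the range
   U+10000..U+10FFFF into a UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp is outside that range
   (such code points need no surrogates or are invalid). */
int utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                             /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800 + (cp >> 10));   /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
    return 1;
}
```

For U+1D11E this yields 0xD834 and 0xDD1E, matching the \uD834\uDD1E literal above. When the code point is only known at run time, as with the int in the original question, one possible approach is to store the two values in a unichar array and pass it to NSString's +stringWithCharacters:length: with a length of 2.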

If you're stuck using a tool chain where Xcode < 3.0, the "preferred" way (for some value of preferred) is to use the UTF-8 encoding of the character as follows:

3) [NSString stringWithUTF8String:"\360\235\204\236"]

Of course, if you're using Xcode >= 3.0 -AND- your source code is kept in UTF-8 (it might work for some other encodings, but UTF-8 is the preferred and recommended source code encoding anyway), the absolute easiest way by far is:

4) @"𝄞"

In other words, you can just paste text that can be represented in Unicode directly into your source code. The compiler will convert it to a UTF-16 encoded constant NSString and store it in your object file.
_______________________________________________
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)