Re: Nicest UTF
From: Asmus Freytag [EMAIL PROTECTED]: I use ... and UTF-32 for most internal processing that I write myself. Let people say UTF-32 is wasteful if they want; I don't tend to store huge amounts of text in memory at once, so the overhead is much less important than having one code unit per character. For performance-critical applications, on the other hand, you need to use whichever UTF gives you the right balance of speed and average storage size for your data. If you have very large amounts of data, you'll be sensitive to cache overruns, enough so that UTF-32 may be disqualified from the start. I have encountered systems for which that was true.

For both of these, I'd recommend UTF-8. It's compact (especially when parsing source code, which is mostly ASCII even if it contains other languages), and it's fast to process: just use the byte processing functions. I've done natural language word processing on UTF-8 as well, and it's damn fast even there, even though it was case insensitive.

My test was word counting to get word frequencies. All strings were entered, as UTF-8, into a special scanner which I invented, in both uppercase and lowercase form, so the scanner held both uppercase and lowercase UTF-8 variants. The scanner itself only does byte (case sensitive) searching. So, despite being case insensitive over UTF-8, it was totally blastingly fast. (One person reported counting words at 1 MB/second of pure text, from within a mixed Basic / C environment.) Keep in mind that the counter must look up through thousands of words (every single word it has come across in the text) on every single word lookup.

Anyhow, from my experience, UTF-8 is great for speed and RAM.
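As an illustration of the "just use the byte processing functions" point (and not of the actual scanner described above), a minimal word counter over a UTF-8 buffer might look like this in C. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so scanning for ASCII delimiters byte by byte can never split a character; the delimiter set here is a deliberate simplification.

    /* Minimal sketch: counting words in a UTF-8 buffer with plain byte
     * processing. Works because every byte of a multi-byte UTF-8 sequence
     * is >= 0x80, so ASCII delimiters (space, tab, newline, ...) never
     * appear in the middle of an encoded character. The delimiter set and
     * the notion of "word" are simplifications, not the scanner above. */
    #include <stddef.h>

    static int is_delim(unsigned char b)
    {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }

    size_t count_words_utf8(const unsigned char *buf, size_t len)
    {
        size_t count = 0, i = 0;
        while (i < len) {
            while (i < len && is_delim(buf[i])) i++;     /* skip delimiters */
            if (i < len) count++;                        /* start of a word */
            while (i < len && !is_delim(buf[i])) i++;    /* skip word bytes */
        }
        return count;
    }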
Re: Public Review Issues Update
Mark Davis wrote: All comments are reviewed at the next UTC meeting. Due to the volume, we don't reply to each and every one with what the disposition was. If actions were taken, they are recorded in the minutes of the meetings.

But what if an action was not taken? Do I have to keep reporting a particular problem until it's gone? Also, there is no way of telling which problems have already been reported a dozen times before. Assuming the reported comments are archived, why can't this archive be made accessible to the unicode list?

Theo

P.S. I know the Unicode Consortium is a non-profit organization and you are all very, very busy, and I look up to the people behind it. But I often find it hard to see the part where the Unicode Consortium is actually promoting use of the Unicode Standard. Sometimes it almost seems to discourage developers from using the Standard. Take the Book, for instance: one is not allowed to print the online version, so I bought the book, only to find that 2 of its 3 kilograms are just tables of glyphs which could have been on a CD. So you pay $75 to get $25 worth of value.
Re: Public Review Issues Update
Rick McGowan wrote: If you have comments for official UTC consideration, please post them by submitting your comments through our feedback reporting page: http://www.unicode.org/reporting.html

Hi Rick,

Please tell us, why does one never get feedback on submitted comments regarding Public Review Issues? The UTC is seeking public feedback, but apparently it's not considered necessary to give responders feedback on their comments. There isn't even an accessible list of reported 'problems' or comments (besides the errata list). Why should I even bother responding if I can't tell whether it's all in vain?

Best regards,
Theo
UAX 15 hangul composition
Don't know if this has been asked/reported before, but is the example code for Hangul composition in UAX #15 correct? The code is:

    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();
        char last = source.charAt(0);          // copy first char
        result.append(last);

        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);

            // 1. check to see if two current characters are L and V
            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {
                    // make syllable of form LV
                    last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // 2. check to see if two current characters are LV and T
            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                int TIndex = ch - TBase;
                if (0 <= TIndex && TIndex <= TCount) {
                    // make syllable of form LVT
                    last += TIndex;
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }

            // if neither case was true, just add the character
            last = ch;
            result.append(ch);
        }
        return result.toString();
    }

Suppose I feed it 0xAC00 0x11C3. 0xAC00 is an LV. This will take step 2:

    SIndex = 0xAC00 - 0xAC00 = 0
    TIndex = 0x11C3 - 0x11A7 = 28

which causes the (0 <= TIndex && TIndex <= TCount) test to be true. And the resulting output is 0xAC00 + 28 = 0xAC1C, which is not an LVT but an LV syllable! The TIndex <= TCount should be TIndex < TCount, I think. IMO the example would be clearer if the Hangul_Syllable_Type property were used.

A somewhat related question. I know next to nothing about Hangul [de]composition, so forgive me for asking silly questions. In the UnicodeData.txt file there are many more than the 19 L, 21 V, and 28 T jamos. Are the other jamos not used to compose syllables, or does the syllable block represent an incomplete set of compatibility characters? Which is it?

Theo
2nd attempt: final_sigma vs final_cased
Hi,

Is there somebody out there who can answer this question?

The casing context Final_Sigma is used in SpecialCasing.txt, but its specification is no longer present in the standard (at least I can't find it). Apparently this context is now called Final_Cased, but the specification for Final_Cased (section 3.13) is not identical to that of Final_Sigma (UAX #21, superseded).

regexp Final_Cased:
  Before C: [{cased=true}] [{wordBoundary!=true}]*
  After C:  !( [{wordBoundary!=true}]* [{cased=true}] )

regexp Final_Sigma:
  Before C: cased case-ignorable*
  After C:  !( case-ignorable* cased )

Is the old specification of Final_Sigma still valid for determining the final sigma casing context, or are there situations where it is inadequate? What I mean is: are these specifications actually the same with respect to final sigma?

Theo

PS. I sometimes have the feeling that I'm on the wrong list here. Most discussions are about characters, almost never about implementation issues. Is there a Unicode developers list perhaps that I'm unaware of?
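A sketch of the superseded UAX #21 Final_Sigma rule as code may make the comparison easier. The predicates is_cased() and is_case_ignorable() are assumed property lookups (not defined here), and word boundaries, which only the newer Final_Cased rule uses, are deliberately ignored; treat this as an illustration of the old rule only, not as an answer to whether the two rules coincide.

    /* Minimal sketch of the superseded UAX #21 Final_Sigma context test over
     * an array of code points. is_cased() and is_case_ignorable() are assumed
     * to be driven by UCD properties; they are not defined here. */
    #include <stddef.h>
    #include <stdint.h>

    extern int is_cased(uint32_t cp);            /* assumed property lookup */
    extern int is_case_ignorable(uint32_t cp);   /* assumed property lookup */

    int final_sigma_context(const uint32_t *text, size_t len, size_t pos)
    {
        size_t i;

        /* Before C: a cased character, followed only by case-ignorables. */
        i = pos;
        while (i > 0 && is_case_ignorable(text[i - 1])) i--;
        if (i == 0 || !is_cased(text[i - 1])) return 0;

        /* After C: NOT (case-ignorables followed by a cased character). */
        i = pos + 1;
        while (i < len && is_case_ignorable(text[i])) i++;
        if (i < len && is_cased(text[i])) return 0;

        return 1;
    }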
final_sigma vs final_cased
Hi,

The casing context Final_Sigma is used in SpecialCasing.txt, but its specification is no longer present in the standard (at least I can't find it). Apparently this context is now called Final_Cased, but the specification for Final_Cased (section 3.13) is not identical to that of Final_Sigma (UAX #21, superseded).

regexp Final_Cased:
  Before C: [{cased=true}] [{wordBoundary!=true}]*
  After C:  !( [{wordBoundary!=true}]* [{cased=true}] )

regexp Final_Sigma:
  Before C: cased case-ignorable*
  After C:  !( case-ignorable* cased )

Is the old specification of Final_Sigma still valid for determining the final sigma casing context, or are there situations where it is inadequate? What I mean is: are these specifications actually the same with respect to final sigma?

Theo
base character
According to the definition, a base character is: "A character that does not graphically combine with preceding characters, and that is neither a control nor a format character." What is this, expressed in terms of properties? Something like this?

    ccc==0 AND GC!=Cc AND GC!=Cf AND GC!=Cn

Theo
A binary file format for storing character properties
At this time there are about 160 different character properties defined in the UCD. In practice most applications probably use only a limited set of properties, but they should nevertheless be able to look up all the properties of a code point. Compiling in lookup tables for all defined properties (including Unihan) makes small applications rather big. This made me decide to create a binary file format for storing character properties and to initialize property lookup tables on demand.

Benefits of using run-time loadable lookup tables initialized from binary files are:
- no worries about total table size, since data will only be loaded on demand
- initializing lookup tables from a binary file is relatively fast
- property lookup files can be locale specific (useful for character names and case mappings, for example)
- new properties can be added quickly and never affect the layout or content of other tables
- any number of properties can be supported, including custom (non-Unicode) properties
- by initializing a lookup table from two sources (UCD-based and vendor-based), applications can overload the default property values assigned to PUA characters with private property values

The file format I've implemented is capable of storing any type of property. Each file contains the values for one property (no more squeezing as many property values as possible into as few bits as possible). The format is called UPR (Unicode PRoperties). I have written a tool to generate the necessary UPR files from the UCD. A small C library for reading a UPR file into a property lookup table, and a high-level library which provides property lookup functions for *all* Unicode properties in 4.0.0, are also available.

For more information on the file format and related software see: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. My primary development platform is UNIX/Linux, but you can compile and run it under Windows as well (less tested, however). The current version supports UCD 4.0.0; I will add support for 4.0.1 soon. Please check it out. Feedback is welcome.

Regards,
Theo Veenker
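To make the idea of demand-loaded, one-file-per-property tables concrete, here is a minimal C sketch. The record layout used (a count followed by sorted start/end/value triples in host byte order) is purely hypothetical and is not the actual UPR format; it only illustrates loading a single property table at run time and looking values up.

    /* Hypothetical layout: uint32 count, then count records of
     * { uint32 start; uint32 end; uint32 value; }, sorted and
     * non-overlapping, in host byte order. NOT the UPR format. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { uint32_t start, end, value; } PropRange;
    typedef struct { PropRange *ranges; uint32_t count; } PropTable;

    int prop_table_load(PropTable *t, const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        if (fread(&t->count, sizeof t->count, 1, f) != 1) { fclose(f); return -1; }
        t->ranges = malloc(t->count * sizeof *t->ranges);
        if (!t->ranges ||
            fread(t->ranges, sizeof *t->ranges, t->count, f) != t->count) {
            fclose(f); free(t->ranges); return -1;
        }
        fclose(f);
        return 0;
    }

    /* Binary search over the sorted ranges; dflt is returned for code
     * points not covered by any range. */
    uint32_t prop_table_get(const PropTable *t, uint32_t cp, uint32_t dflt)
    {
        uint32_t lo = 0, hi = t->count;
        while (lo < hi) {
            uint32_t mid = (lo + hi) / 2;
            if (cp < t->ranges[mid].start) hi = mid;
            else if (cp > t->ranges[mid].end) lo = mid + 1;
            else return t->ranges[mid].value;
        }
        return dflt;
    }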
Re: Suggestion: use of symbolic links in the FTP site
Tom Emerson wrote:

Philippe Verdy writes: Symbolic links are a bad idea on FTP. They are resolved by the client...

Really? Depends on your server: proftpd handles them fine.

I think it would also give the false feeling that a new 4.01 file exists when in fact it's the same as 4.00.

No, the filename of the link would match that of the file in the 4.00 directory. The link only provides you with a way to access the data file. I would not recommend renaming the link to 4.01 if it didn't change.

I'd rather suggest a simple parsable file which lists all files that are part of a Unicode version, with their respective versions.

The problem here is that you cannot easily grab all the files for a given release without first downloading that file, parsing the information out of it, and constructing pathnames.

It would not be difficult to write a script which automatically downloads the required files (using wget) for a given Unicode version if there were such a file as suggested by Philippe.

Theo
Re: Downloading UCD 4.0.0
Asmus Freytag wrote:

At 08:42 AM 4/19/2004, Theo Veenker wrote: Hi, Until now I have always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/, but in http://www.unicode.org/ucd/ I read this: "The complete set of all files for a given version of the UCD consists of the files in the update directory for that version, together with all the files unchanged from earlier versions, which are kept in their respective update directories." Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

Yes. :-( And depending on what version of the UCD you are trying to piece together, you may potentially need versions of some files from several earlier updates. A./ PS: we are looking into ways to make access to older versions more straightforward.

Please do! In the current situation, if you have a UCD parser which only handles a particular Unicode version, the set of required input files for this parser is more or less taken away when a new Unicode version is released. This makes it difficult to share such software. It is good to gently push developers to use the latest release, but making it hard to use an earlier release is IMO counter-productive.

Theo
Downloading UCD 4.0.0
Hi,

Until now I have always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/, but in http://www.unicode.org/ucd/ I read this: "The complete set of all files for a given version of the UCD consists of the files in the update directory for that version, together with all the files unchanged from earlier versions, which are kept in their respective update directories."

Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

Theo
Re: ISO 8859-11 (Thai) cross-mapping table
Marco Cimarosti wrote:

John Aurelio Cowan wrote: Marco Cimarosti scripsit: Talking about the format of mapping tables, I always wondered why not use ranges. In the case of ISO 8859-11, the table would become as compact as three lines.

Well, that wins for 8859-1 and 8859-11 and ISCII-88, where Unicode copied existing layouts precisely. But it wouldn't help other 8859-x much if at all.

All 8859 tables would be more succinct. Non-Latin sections use contiguous ranges of letters in alphabetical order or, at any rate, in the same order used by Unicode; this is also true for most other non-ISO charsets. Latin sections are a worse case, but they still benefit slightly, because characters shared with Latin-1 stay in the same positions.

And it requires binary search rather than direct array access, which would be a terrible lossage in CJK, where the real costs are.

I agree. In the case of CJK it simply doesn't pay.

If I may add my two cents: IMO using search algorithms to reduce table size doesn't pay in any case. If one uses fast one/two-stage lookup tables for both mappings (legacy to Unicode and vice versa), then most tables require about 3 KB or less of storage space, and roughly 10 to 30 times that for CJK encodings. Compared to the 256 MB in a typical PC, each lookup table would consume 0.001% (or 0.01-0.03% for CJK) of main memory. My point is that it is better to concentrate on processing speed than on table footprint.

Theo
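For the legacy-to-Unicode direction of a single-byte charset, the "direct array access" approach really is just one indexed load. In the sketch below the 256-entry table is assumed to be generated from the published mapping file; the declaration is a placeholder, not real ISO 8859-11 data, and unmapped byte values could simply hold a sentinel such as 0xFFFD.

    #include <stdint.h>

    extern const uint16_t iso8859_11_to_ucs[256];  /* assumed: generated from the mapping table */

    static uint32_t legacy_byte_to_ucs(unsigned char b)
    {
        return iso8859_11_to_ucs[b];   /* one indexed load, no searching at all */
    }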
\p{} and \g{} in regexp
Hi,

I have a few questions regarding Unicode regular expressions.

1) I'm working on a regexp matcher and I'd like to know which properties are never needed in a \p{...} item. Currently I have included the properties listed below, but for efficiency reasons I'd like to throw out what isn't really necessary:

  general category
  bidi class ?
  canonical combining class ?
  decomposition type
  line break
  east asian width
  arabic joining type ?
  arabic joining group ?
  script name
  block name
  age
  numeric type
  all binary properties

So can anyone tell me if the marked properties are really useful in a \p{...} item?

2) About grapheme clusters in a bracketed expression. It is clear what is meant by an expression like [a-z\g{aa}]. But how do I interpret something like [a-z\g{aa} && \p{foo}]? This reads as: accept any character in range a-z or grapheme cluster aa, provided it has the foo property. The problem is that \p{...} only applies to single code points, not to grapheme clusters. I can do three things:

  1. see if NFC of the characters in \g{...} yields a single character and work with that, otherwise fail
  2. only test the first (base) character of the cluster
  3. don't allow use of operators such as && and - (i.e. ^) in a bracketed expression in which one or more \g{...} are used

What would be the most appropriate thing to do?

Regards,
Theo
Re: unidata is big
andreas palsson wrote:

Hi. I would just like to know if someone could give me a tip on how to structure all the Unicode information in memory? The UNIDATA does contain quite a bit of information and I can't see any obvious method which is memory-efficient and gives fast access.

You might want to evaluate some of the open source libraries mentioned under Enabled Products on the Unicode site. For my own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/) I've created a separate table builder tool for each property or mapping. The tools organize data in planes, and for each plane all possible trie setups are determined (about 80 combinations of one, two or three stage tables). Then the cheapest setup is used. This still requires over 230 KB to store all data (except character names and comments) from the following files: UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt, Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt, BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt, DerivedNormalizationProperties.txt, and DerivedJoiningType.txt. For some mappings I've stored 32-bit code points where 16 bits would have been enough, but I decided API uniformity is more important than memory efficiency.

I wouldn't bother too much about memory efficiency; it's irrelevant these days. Even your mobile phone has enough memory to store all Unicode data 10 to 20 times over. Same thing for lookup speed: all you have to do to get it fast is wait (a few seasons).

Theo
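The one/two/three-stage idea mentioned above can be illustrated with a minimal two-stage lookup, here restricted to the BMP and to an 8-bit property value. The stage tables are assumed to be emitted by a generator tool and deduplicated so that identical 256-entry blocks are shared; the names are placeholders.

    #include <stdint.h>

    extern const uint16_t stage1[256];        /* assumed: generated; high byte -> block number */
    extern const uint8_t  stage2[][256];      /* assumed: generated; shared 256-entry blocks   */

    static uint8_t lookup_bmp_property(uint16_t cp)
    {
        /* Two indexed loads: pick the block via the high byte,
         * then the value via the low byte. */
        return stage2[stage1[cp >> 8]][cp & 0xFF];
    }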
Re: Whence UniData.txt? (was Re: unidata is big)
[EMAIL PROTECTED] wrote:

Theo's comment leads me to a question I've pondered recently. Assumptions: Many apps, from independent sources, need to access the Unicode character data. A lot of these apps aren't overly concerned with the slight overhead of parsing the data as needed from Unicode-supplied data files directly. Similarly, such apps benefit from being able to easily upgrade to new Unicode releases by simply replacing the data files. It isn't very user-friendly for every such app to store its own private copy of the character data files when a single shared copy would take up less space and be easier to maintain. It would seem to me that there is some value in establishing either (1) a standard location where programs can expect to find (or install) a local copy of the Unicode data files, or (2) a standard way to discover where such a local copy of these files exists. My preference would be (2), which would make it easy to configure a network of machines to share a single copy of the data files. Something as simple as an environment variable could work if developers were to agree on its name and semantics.

For applications that eat raw UCD files, this shouldn't be too difficult to achieve. Any well-designed app will/should have some parameter or environment variable that you can set (no?). But for apps/libraries that like their UCD files cooked it is a different story, because there is no recommended binary format for representing (compact) Unicode character data. Personally I would appreciate seeing such a recommendation, including your point (2). However, apps/libs which enrich the character data with custom properties would still need their own copy of the data.

The subject reminds me of the TZ database. There you have a large text-based database containing information on time zones and daylight saving times. You can compile the data into a binary format by running a utility included with the tz sources. They don't give any recommendation on where to store the (text and/or binary) data, but at least there is a 'standard' format, which allows for sharing data. It would be nice to have something like this for the UCD.

(I understand there may be different mechanisms for different platforms, but it would be even better if a standard mechanism were cross-platform.) So, are there any conventions for this evolving? Or would anyone like to propose one? Please, go ahead :o)

Theo
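As a sketch of point (2), discovery through an agreed environment variable could be as small as this; both the variable name and the fallback path below are hypothetical suggestions, not existing conventions.

    #include <stdlib.h>

    /* Resolve a shared UCD directory: honour the (hypothetical) environment
     * variable if set, otherwise fall back to an assumed default location. */
    const char *ucd_dir(void)
    {
        const char *dir = getenv("UNICODE_DATA_DIR");        /* hypothetical name  */
        return (dir && *dir) ? dir : "/usr/share/unicode/ucd"; /* assumed fallback */
    }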
grapheme length
Hi,

I'd like to know if there is something like a longest grapheme length. From UTR #18 I see there is no limit, but in practice, can someone give a rough estimate of how many code points the longest grapheme would occupy?

Theo
UCD 3.2.0
Hi all,

I'd like to make a few remarks about the UCD files. The following things I ran into when checking out the 3.2.0 release:

o In PropertyValueAliases-3.2.0.txt line 79:
    ccc; 202; ATBL ; Attached_Below_Left
  whereas in UnicodeData-3.2.0.html I read:
    200: Below left attached
    202: Below attached
  What is the correct value for attached below left, 200 or 202?

o In SpecialCasing-3.2.0.txt lines 234 and 235 are missing the closing semicolon. This problem also appeared in 3.1.1.

o Typo in UnicodeCharacterDatabase-3.2.0.html: DerivedNormalizationProperties should be DerivedNormalizationProps.

Minor points that I find a bit annoying:

o Many of the UCD files have a comment header with lines longer than 80 characters. Viewing these files with the page utility on an 80-column terminal window gives ugly output due to the forced line wrapping.

o All UCD files except CaseFolding-3.2.0.txt and SpecialCasing-3.2.0.txt *separate* columns by semicolons. For the two exceptions the semicolon *terminates* a column; why not keep it the same for all UCD files?

o UnicodeData-3.2.0.txt still uses this notation:
    1234;Blah, First;Lo;0;L;N;
    5678;Blah, Last;Lo;0;L;N;
  instead of
    1234..5678;Blah, First..Blah, Last;Lo;0;L;N;
  Since all other UCD files use the latter notation, why not change this one too? IMHO backward compatibility with existing UCD file parsers shouldn't be an issue in this particular case.

Regards,
Theo
mnemonic input
Hi all,

Suppose I want to enable mnemonic input in my software. Using mnemonics allows one to write e' (of course embedded in some escape sequence) instead of \u00e9 or &eacute;. Which sets of mnemonics are being used, or which should I use? I found the ISO-10646 charmap file which gives mnemonic.ds symbol names, but I can find no further references. I suppose only a minimal set of mnemonics is really useful, because to me it seems almost impossible to remember them all, even though there's logic behind the construction of mnemonics.

My questions are:
- Which mnemonic sets are available and actually used by people?
- Are there any recommended escape sequences to enter a mnemonic (like &name; does for SGML character entity names)?
- Does it make sense to use mnemonics for ideographic scripts?
- Is support for input of full Unicode character names (as I believe is supported in Perl) considered useful?

Thanks,
Theo
Re: mnemonic input
Marco Cimarosti wrote: Ooops! Of course, I was replying to a different question: "Does it make sense to use mnemonics for ideographic scripts?"

I hadn't even noticed you quoted the wrong question, but I understood it anyway. What I meant was: I can use mnemonic characters in a plain ASCII text file (instead of, or in addition to, writing \u), but could I also include Japanese this way (assuming I knew any Japanese), without using an input method to compose the characters? From RFC 1345, referred to by Keld Simonsen, I see I can (at least for most BMP stuff). Many thanks for the information.

Regards,
Theo
UCD 3.2.0
Dear list,

I don't know if this is the right place to ask this question, but I'll ask it anyway. I'm updating my software for the UCD 3.2.0 (beta) files, but in file ArabicShaping-5d2.txt at line 202 I read:

    0718; WAW; R; SYRIAC WAW

where SYRIAC WAW represents the joining group. Is this correct, or should it just read WAW?

Another question: when can I expect a new edition of the Unicode book?

Regards,
Theo Veenker
Writing/finding a UTF8, UTF16, UTF32 converter
Hi Unicode list,

I am dealing with Unicode for XML. I'm sorry if this bothers a few people, but reading the technical information is not very easy. The crossings-out and underlinings don't help, the information seems a bit scattered, and the really interesting information is not linked to in easy-to-find places. I think I have finally found what I wanted, the table "Table 3.1 UTF-8 Bit Distribution" on http://www.unicode.org/unicode/reports/tr27/.

Basically, I want to write some code that can convert UTF8, UTF16, and UTF32 to any of the other two formats. I suppose I could use UTF32 as a go-between to reduce the number of conversion possibilities. Anyhow, does anyone know of any existing source code that does this transformation? I don't feel like using Apple's Unicode converter because it seems so complex it will probably take MORE work for me to access it than to just write the conversion code myself. And even then I hear it doesn't do UTF32, so there is no use. And even then I would have to compile my code for Win32 also, so it's even more useless. If anyone knows of some existing code that does the transformation, that would help. I might end up rewriting it myself and just use the code as a working example.

All that bit shifting and bit masking is bound to slow down my UTF8/UTF16 processing; is there any accepted good way to speed this up? Some form of table perhaps?

--
This email was probably cleaned with Email Cleaner, by:
Theodore H. Smith - Macintosh Consultant / Contractor.
My website: www.elfdata.com/
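For reference, the UTF8-to-UTF32 direction is only a handful of lines when written straight from the Table 3.1 bit layout. The sketch below is not Apple's converter or any particular library's code, and it does only basic validation: overlong forms and surrogate code points would need extra checks in real conversion code.

    /* Decode one UTF-8 sequence into a UTF-32 code point. Returns the number
     * of bytes consumed, or 0 on error (truncated input, bad lead byte, bad
     * continuation byte, or a value above 0x10FFFF). */
    #include <stddef.h>
    #include <stdint.h>

    size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
    {
        if (len == 0) return 0;
        unsigned char b = s[0];
        size_t n;
        uint32_t cp;

        if      (b < 0x80)           { *out = b; return 1; }   /* ASCII fast path */
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 4; }
        else return 0;                                          /* bad lead byte */

        if (len < n) return 0;
        for (size_t i = 1; i < n; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;                /* not a continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);                     /* append 6 payload bits */
        }
        if (cp > 0x10FFFF) return 0;
        *out = cp;
        return n;
    }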
Unicode string functions
Hi people, this is my first post to this list.

I want to write many string functions for dealing with Unicode. I am writing both an XML engine and an XML editor (which I have imaginatively called XML Engine and XML Editor). Now, of course, XML 1.0 specifies that I need to deal with Unicode, which my engine doesn't currently do. I've written a lot of string functions before; most of my programs deal with data in fact, especially string data. I find dealing with strings quite fun for some reason, and I've written a plugin for my language (REALbasic) that speeds up many things to do with strings and adds loads of new string features to REALbasic.

Anyhow, I find this Unicode system rather confusing at first. I hope it will be easy to deal with, but I am not sure. Please correct me on the basic facts of Unicode here; I am just repeating them in order to see if anyone tells me I am wrong :o). OK: from what I can tell, Unicode is simply a numbering system, and an encoding system for these numbers (a very crude summary, but it is enough for me, for now). Unicode allows for compression, in UTF8 and UTF16, so that you can fit lower ASCII into 1 byte, while other numbers that aren't lower ASCII may need many more bytes. UTF32 represents the numbers as is, no compression needed.

Now, I have heard that dealing with UTF8/UTF16 is a real nuisance, because, say, to extract characters 6 through 8 of a string, you need to start at the beginning and loop the entire way through, detecting which bytes are multiple bytes. I'd rather avoid dealing with UTF compression schemes, for RAM's sake (more code), speed's sake (more to do), and bugs' sake (complex code is buggier code).

OK, so I have an easy solution. My language REALbasic has a TextConverter object that lets you convert from a huge range of text encoding schemes! This feature is provided by Apple and Microsoft, and REALbasic simply gives you access to it. So, I can convert from UTF8/UTF16 to UTF32, and use some string functions I will write that specifically deal with UTF32.

OK, so in short: if I am writing some string functions for UTF32 (with XML in mind), are there stumbling blocks I may come across? I know almost nothing about Unicode; can I just treat each char as a 4-byte character and not care about any other Unicode special features? If there are special features (zero-width characters, etc), what do I have to do about them? The string functions I'll need are:

* Character set functions (search through a string, based on whether each character of this string is inside a character set, and return the position of the first character found that is, or isn't, in the character set)
* String searching functions (search through one string for another)
* Character getting functions (get one character at a time)
* String extraction functions (get a segment of one string, and create a new string that is a copy of it)
* String creation functions (create a string that is just a number of characters long, each of the same character value)

And then of course, I'll have to reconvert back to UTF8/UTF16 on saving! Quite a bit of work, but nothing compared to what I have done so far.

Is it possible to forgo all characters that can't be expressed in UTF16 without taking more than 2 bytes? That way, I can halve the RAM needed for large XML documents. I'd rather not code two string function versions; one is enough. So it is 2 bytes, or 4 bytes. I read on unicode.org that all characters of almost all languages can be contained in 2 bytes. What languages need 4 bytes or more to describe? If they are some unheard-of tribe with 100 living members, perhaps I can just forgo them?

--
This email was probably cleaned with Email Cleaner, by:
Theodore H. Smith - Macintosh Consultant / Contractor.
My website: www.elfdata.com/
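As a sketch of the first item on that wish list, a character-set scan over a UTF32 buffer can treat each code point as a plain 32-bit value; combining marks and other special features are ignored here, which is exactly the simplification being asked about. The function names are made up for illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* Is code point cp one of the set_len code points in set? */
    static int in_set(uint32_t cp, const uint32_t *set, size_t set_len)
    {
        for (size_t i = 0; i < set_len; i++)
            if (set[i] == cp) return 1;
        return 0;
    }

    /* Return the index of the first code point whose membership in the set
     * equals `want` (1 = in set, 0 = not in set), or (size_t)-1 if none. */
    size_t find_charset(const uint32_t *str, size_t len,
                        const uint32_t *set, size_t set_len, int want)
    {
        for (size_t i = 0; i < len; i++)
            if (in_set(str[i], set, set_len) == want) return i;
        return (size_t)-1;
    }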