Re: Pure python implementation of string-like class
Ross Ridge wrote:
> Xavier Morel wrote:
>> Not if you're still within Unicode / Universal Character Set code space.
>
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than the entire Unicode code space.
>
> Ross Ridge

He implies that, but in later messages he

1. Implies that he wants to use the Unicode private spaces, which are in the Unicode code space.
2. Says explicitly that his needs concern Kanji encoding, which does fit in the existing Unicode code space, even if you take the largest estimates of the number of existing Kanji (~80000), and which is (I think) already partially represented by the CJK Unified Ideographs and CJK Unified Ideographs Extension A sets of regular Unicode.

-- 
http://mail.python.org/mailman/listinfo/python-list
Re: Pure python implementation of string-like class
Ross Ridge wrote:
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than the entire Unicode code space.

Xavier Morel wrote:
> He implies that ...

He explicitly said that the character set he wanted to use wouldn't fit in UTF-16.

> ... but in later messages he
> 1. Implies that he wants to use the Unicode private spaces, which are
> in the Unicode code space

He explicitly said that he wanted to use the U+60000000...U+7FFFFFFF range, which is outside the Unicode code space, despite his mistakenly calling them Unicode characters.

> 2. Says explicitly that his needs concern Kanji encoding...

I have no clue whether he really needs such a large character set, but if he does, then it makes sense for him to want to use an encoding that's wider than UTF-16.

As for the problem he actually posed, I'd suggest using tuples rather than lists, since tuples are immutable like strings. That would make it easier for the class to be used as a key in a dictionary. Hmm... thinking about it, it might actually make sense to use strings as the internal representation, as a lot of operations can be implemented by using the standard string operations but multiplying the offsets and lengths by 4.

Ross Ridge
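The "multiply offsets and lengths by 4" idea can be sketched roughly as follows. This is only an illustration under assumptions, not code from the thread: it is written for present-day Python 3 (so the packed data is a bytes object rather than a Python 2 str), and the class name PackedString and its method names are made up. One detail the sketch has to handle is that a byte-level search can match between two characters, so unaligned hits are skipped.

```python
WIDTH = 4  # bytes per character, as in the suggestion above


class PackedString:
    """Hypothetical string-like class storing 32-bit code values in bytes."""

    def __init__(self, codes=()):
        # Each code value becomes one big-endian 4-byte group.
        self._data = b"".join(int(c).to_bytes(WIDTH, "big") for c in codes)

    @classmethod
    def _wrap(cls, data):
        # Build an instance directly from already packed bytes.
        obj = cls()
        obj._data = data
        return obj

    def __len__(self):
        return len(self._data) // WIDTH

    def __eq__(self, other):
        if not isinstance(other, PackedString):
            return NotImplemented
        return self._data == other._data

    def __hash__(self):
        # bytes are immutable, so the object can serve as a dictionary key.
        return hash(self._data)

    def codes(self):
        d = self._data
        return tuple(int.from_bytes(d[i:i + WIDTH], "big")
                     for i in range(0, len(d), WIDTH))

    def substring(self, start, end):
        # Character offsets (assumed non-negative) are scaled to bytes.
        return self._wrap(self._data[start * WIDTH:end * WIDTH])

    def find(self, sub, start=0):
        pos = self._data.find(sub._data, start * WIDTH)
        # A byte-level match may straddle two characters; skip those.
        while pos != -1 and pos % WIDTH:
            pos = self._data.find(sub._data, pos + 1)
        return pos if pos == -1 else pos // WIDTH
```

The alignment check matters: searching [0x100, 0x1] for [0x01000000] finds the byte pattern at byte offset 2, which is not a character boundary, so find() has to report no match.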
Re: Pure python implementation of string-like class
Hi Ross.

Thanks a lot for the clarification. I didn't think my post could become a Unicode flame.

I don't know whether this mailing list is the right place to discuss Unicode issues, but for me, the roughly one million code points that UTF-16 provides are not enough. Unicode presumes that the same character always has the same code point. But unlike the simple and beautiful Roman alphabet, it is sometimes difficult to decide whether two kanji characters are the same or not, because a glyph varies for many reasons (for example, who wrote it, when, and where). So first of all we assign code points, and only then do we consider whether a character that appears in some Chinese historical book may be the same as a character in Unicode CJK Extension A. Identifying characters in this way is also one of my project's tasks. I think this explains why UTF-16 is enough for the majority but not for everyone.

Anyway, I suppose that implementing string-like classes is a generic Python issue. For example, it would be useful if a rich text class, which has style attributes such as bold on each character, also had string-like methods and could be handled like a string.

In article [EMAIL PROTECTED],
Ross Ridge [EMAIL PROTECTED] writes:

rridge> thinking about it, it might actually make sense to use strings as the
rridge> internal representation as a lot of operations can be implemented by using
rridge> the standard string operations but multiplying the offsets and lengths by
rridge> 4.

Ah, COOL! It sounds very nice. I'll try it.

Thanks again.

-- kayama
Re: Pure python implementation of string-like class
Akihiro KAYAMA wrote:
> As the character set is wider than UTF-16 (U+10FFFF), I can't use
> Python's native unicode string class.

Have you tried using Python compiled in Wide Unicode mode (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then, which should be enough for most purposes.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/
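Whether a given interpreter is a narrow or a wide build can be checked at run time through sys.maxunicode. A small sketch, runnable on Python 2 as well as Python 3; note that from Python 3.3 onwards the narrow/wide distinction no longer exists and the value is always 0x10FFFF.

```python
import sys

# 0xFFFF on narrow (UTF-16) builds of older Pythons,
# 0x10FFFF on wide (UCS-4) builds and on every Python >= 3.3.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide else "narrow")
```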
Re: Pure python implementation of string-like class
Akihiro KAYAMA wrote:
> Hi all.
>
> I would like to ask how I can implement a string-like class using a
> tuple or list. Does anyone know of example code for a pure Python
> implementation of a string-like class?
>
> I am trying to use Python for text processing that involves a large
> character set. As the character set is wider than UTF-16 (U+10FFFF),
> I can't use Python's native unicode string class.

Wider than UTF-16 doesn't make sense.

> So I want to prepare my own string class, which provides convenient
> string methods such as split, join, find and others like the usual
> string class, but uses a sequence of integers as its internal
> representation instead of a native string.
>
> Obviously, subclassing str doesn't help. The implementation of each
> string method in the Python source tree (stringobject.c) is far from
> Python code, so I have started from scratch, like below:
>
>     def startswith(self, prefix, start=-1, end=-1):
>         assert start < 0, "not implemented"
>         assert end < 0, "not implemented"
>         if isinstance(prefix, (str, unicode)):
>             prefix = MyString(prefix)
>         n = len(prefix)
>         return self[0:n] == prefix
>
> but I found that achieving correctness and completeness is not a
> trivial task for me. It also smells of reinventing the wheel, though
> I can't find any hints on Google or in the Python Cookbook.
>
> I don't care about efficiency as a starting point. Any comments are
> welcome. Thanks.

The UTF-16 encoding is capable of representing the whole of Unicode. There should be no need to do anything special to use UTF-16.

regards
Steve
-- 
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
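For reference, this is how UTF-16 reaches every code point up to U+10FFFF using pairs of 16-bit units. The function name to_surrogates is made up for this sketch; the arithmetic is the standard surrogate-pair scheme, and the result can be compared against Python 3's own utf-16-be codec.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000            # 20 bits of payload
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10FFFF)
encoded = chr(0x10FFFF).encode("utf-16-be")
print(hex(high), hex(low), encoded)
```

The pair carries exactly 20 bits of payload, which is why the scheme tops out at U+10FFFF and cannot express anything larger.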
Re: Pure python implementation of string-like class
Hi bearophile.

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] writes:

bearophileHUGS> Maybe you can create your class using an array of 'L' with the array
bearophileHUGS> standard module.

Thanks for your suggestion. I'm currently using an ordinary list as the internal representation. As I understand it, compared to a list, the array module offers efficiency but no convenient functions for implementing the various string methods. As Python's list is already fast enough, I want to speed up my coding work first.

-- kayama
Re: Pure python implementation of string-like class
Hi And.

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] writes:

and-google> Akihiro KAYAMA wrote:
and-google> > As the character set is wider than UTF-16 (U+10FFFF), I can't use
and-google> > Python's native unicode string class.
and-google>
and-google> Have you tried using Python compiled in Wide Unicode mode
and-google> (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
and-google> which should be enough for most purposes.

From my quick survey, Python's Unicode support is intentionally restricted to the UTF-16 range (U+0000...U+10FFFF), regardless of the --enable-unicode=ucs4 option.

Python 2.4.1 (#2, Sep 3 2005, 22:35:47)
[GCC 2.95.4 20020320 [FreeBSD]] on freebsd4
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\U0010FFFF"
u'\U0010ffff'
>>> len(u"\U0010FFFF")
1
>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

A simple patch to unicodeobject.c that disables the Unicode range checking could solve this, but I don't want to maintain a specialized Python binary for my project.

-- kayama
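The same limit is still present in current Python 3, where chr() refuses anything above 0x10FFFF. A short sketch (the helper name fits_in_str is made up here) showing that plain integers, unlike native strings, can carry the larger values:

```python
def fits_in_str(cp):
    """Return True if Python's native str can hold this code value."""
    try:
        chr(cp)
    except (ValueError, OverflowError):
        return False
    return True


for cp in (0x10FFFF, 0x110000, 0x60000000):
    print(hex(cp), fits_in_str(cp))

# Integers in a tuple have no such limit, which is why a sequence of
# integers works as an internal representation for wider code values.
wide_text = (0x60000000, 0x7FFFFFFF)
```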
Re: Pure python implementation of string-like class
Steve Holden wrote:
> Wider than UTF-16 doesn't make sense.

It makes perfect sense.

Ross Ridge
Re: Pure python implementation of string-like class
Steve Holden wrote:
> Wider than UTF-16 doesn't make sense.

Ross Ridge wrote:
> It makes perfect sense.

Alan Kennedy wrote:
> UTF-16 is a Unicode Transformation Format, meaning that it is a
> mechanism for representing all unicode code points, even the ones with
> ordinals greater than 0xFFFF, using series of 16-bit values.

It's an encoding format that only supports encoding 1,112,064 different characters, making it a little more than 20 bits wide. While this is enough to encode all code points currently assigned by Unicode, it's not sufficient to encode the private use area of ISO 10646-1 that Akihiro Kayama wants to use.

Ross Ridge
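The 1,112,064 figure and the "a little more than 20 bits" remark can be checked with a few lines of arithmetic. The variable names below are just for this sketch; the constants are the code space limit and the surrogate range.

```python
import math

CODE_SPACE = 0x10FFFF + 1          # code points U+0000..U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # values reserved for UTF-16 pairs

encodable = CODE_SPACE - SURROGATES
bits_needed = math.log2(encodable)

# The top of the private range mentioned in the thread needs far more bits.
private_top = 0x7FFFFFFF
print(encodable, bits_needed, private_top.bit_length())
```

The surrogate values are subtracted because they are reserved for building pairs and do not stand for characters themselves.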
Re: Pure python implementation of string-like class
Akihiro KAYAMA wrote:
> Sorry for my terrible English. I am living in Japan, and we have a
> large number of characters called Kanji. UTF-16 (U+0000...U+10FFFF) is
> enough for practical use in this country also, but for academic
> purposes, I need a large codespace over 20 bits. I wish I could use
> unicode's private space (U+60000000...U+7FFFFFFF) in Python.
>
> -- kayama

I think the Kanji are part of the Han script as far as Unicode is concerned. You should check it (CJK Unified Ideographs and CJK Unified Ideographs Extension A); they may not all be there, but the 27502 characters from these two tables should be enough for most uses.

Oh, by the way, the Unicode code space only goes up to 10FFFF, while UCS-4's encoding allows code values up to and including 7FFFFFFF. The upper Unicode private space is Plane Sixteen (U+100000–U+10FFFF), the other private spaces being a part of the Basic Multilingual Plane (U+E000–U+F8FF) and Plane Fifteen (U+F0000–U+FFFFF), and even UTF-32 doesn't go beyond 10FFFF.

Since the Dai Kan-Wa jiten only lists about 50,000 kanji (even though it probably isn't perfectly complete), it fits with ease in either plane fifteen or plane sixteen (65536 code points each).
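As a rough illustration of the "it fits in the private planes" argument, a project-local character number could be mapped onto the supplementary private use areas like this. The function name and the numbering scheme are invented for the sketch; the ranges stop at xxFFFD because the last two code points of every plane are noncharacters.

```python
PLANE_15 = range(0xF0000, 0xFFFFE)    # U+F0000..U+FFFFD, private use
PLANE_16 = range(0x100000, 0x10FFFE)  # U+100000..U+10FFFD, private use


def private_code_point(index):
    """Map a hypothetical project-local character number to a code point."""
    if index < 0:
        raise IndexError("index must be non-negative")
    if index < len(PLANE_15):
        return PLANE_15[index]
    index -= len(PLANE_15)
    if index < len(PLANE_16):
        return PLANE_16[index]
    raise IndexError("private use planes exhausted")


capacity = len(PLANE_15) + len(PLANE_16)
# About 50,000 entries, the dictionary size quoted above, fit in one plane.
last = private_code_point(50000 - 1)
print(capacity, hex(last), chr(last).encode("utf-16-be"))
```

Such code points are ordinary Unicode, so native strings and UTF-16 handle them; the catch is the capacity of roughly 131,000 entries across both planes.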
Re: Pure python implementation of string-like class
Ross Ridge wrote:
> Steve Holden wrote:
>> Wider than UTF-16 doesn't make sense.
>
> It makes perfect sense.
>
> Ross Ridge

Not if you're still within the Unicode / Universal Character Set code space.

While UCS-4 technically goes beyond any Unicode Transformation Format (UTF-7, 8, 16 and 32 stop at 10FFFF), it also goes beyond the range of the UCS itself (0-10FFFF). UTF-32 is the restriction of UCS-4 to the Unicode standard.

While it could be argued that the Unicode/UCS limit of 10FFFF was chosen _because_ of the limitations of UTF-16, that's probably irrelevant to the discussion.
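That the UTFs stop at 10FFFF can be observed with the codecs in present-day Python 3, which reject byte sequences that would decode to a larger value. The helper name decodes is made up for this sketch.

```python
def decodes(raw, encoding):
    """Return True if the codec accepts these bytes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


top_utf32 = (0x10FFFF).to_bytes(4, "big")
over_utf32 = (0x110000).to_bytes(4, "big")
top_utf8 = b"\xf4\x8f\xbf\xbf"   # U+10FFFF in UTF-8
over_utf8 = b"\xf4\x90\x80\x80"  # would be 0x110000, not valid UTF-8

for raw, enc in ((top_utf32, "utf-32-be"), (over_utf32, "utf-32-be"),
                 (top_utf8, "utf-8"), (over_utf8, "utf-8")):
    print(enc, raw, decodes(raw, enc))
```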
Re: Pure python implementation of string-like class
Xavier Morel wrote:
> Not if you're still within Unicode / Universal Character Set code space.

Akihiro Kayama in his original post made it clear that he wanted to use a character set larger than the entire Unicode code space.

Ross Ridge
Pure python implementation of string-like class
Hi all.

I would like to ask how I can implement a string-like class using a tuple or list. Does anyone know of example code for a pure Python implementation of a string-like class?

I am trying to use Python for text processing that involves a large character set. As the character set is wider than UTF-16 (U+10FFFF), I can't use Python's native unicode string class.

So I want to prepare my own string class, which provides convenient string methods such as split, join, find and others like the usual string class, but uses a sequence of integers as its internal representation instead of a native string.

Obviously, subclassing str doesn't help. The implementation of each string method in the Python source tree (stringobject.c) is far from Python code, so I have started from scratch, like below:

    def startswith(self, prefix, start=-1, end=-1):
        assert start < 0, "not implemented"
        assert end < 0, "not implemented"
        if isinstance(prefix, (str, unicode)):
            prefix = MyString(prefix)
        n = len(prefix)
        return self[0:n] == prefix

but I found that achieving correctness and completeness is not a trivial task for me. It also smells of reinventing the wheel, though I can't find any hints on Google or in the Python Cookbook.

I don't care about efficiency as a starting point. Any comments are welcome. Thanks.

-- kayama
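One possible shape for such a class, written for present-day Python 3 and backed by a tuple of integers, is sketched below. Only the name MyString comes from the post; everything else (the method set, the use of slice.indices() to normalise start and end) is an assumption of this sketch, and it does not try to match every edge case of the built-in str methods.

```python
class MyString:
    """String-like sketch backed by a tuple of integer code values."""

    def __init__(self, chars=()):
        if isinstance(chars, MyString):
            chars = chars._codes
        elif isinstance(chars, str):
            chars = [ord(c) for c in chars]
        self._codes = tuple(chars)

    def __len__(self):
        return len(self._codes)

    def __eq__(self, other):
        if not isinstance(other, MyString):
            return NotImplemented
        return self._codes == other._codes

    def __hash__(self):
        # Tuples are hashable, so instances can be dictionary keys.
        return hash(self._codes)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return MyString(self._codes[index])
        return self._codes[index]

    def __repr__(self):
        return "MyString(%r)" % (list(self._codes),)

    def find(self, sub, start=0, end=None):
        sub = MyString(sub)
        # slice.indices() clamps start/end and resolves negative values.
        start, end, _ = slice(start, end).indices(len(self))
        n = len(sub)
        for i in range(start, end - n + 1):
            if self._codes[i:i + n] == sub._codes:
                return i
        return -1

    def startswith(self, prefix, start=0, end=None):
        prefix = MyString(prefix)
        start, end, _ = slice(start, end).indices(len(self))
        stop = start + len(prefix)
        return stop <= end and self._codes[start:stop] == prefix._codes

    def split(self, sep):
        sep = MyString(sep)
        if not len(sep):
            raise ValueError("empty separator")
        parts, pos = [], 0
        while True:
            hit = self.find(sep, pos)
            if hit == -1:
                parts.append(self[pos:])
                return parts
            parts.append(self[pos:hit])
            pos = hit + len(sep)

    def join(self, items):
        out = []
        for k, item in enumerate(items):
            if k:
                out.extend(self._codes)
            out.extend(MyString(item)._codes)
        return MyString(out)
```

Because the methods accept a native str, a list of integers or another MyString, code values above U+10FFFF can be mixed freely with ordinary text, for example MyString([0x60000001, 0x2C, 0x60000002]).split(",").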
Re: Pure python implementation of string-like class
Maybe you can create your class using an array of 'L' with the array standard module.

Bye,
bearophile
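A minimal sketch of what the array suggestion could look like in present-day Python 3. The helper find_sub is hypothetical and not part of the array module. The 'L' typecode is an unsigned long of at least 4 bytes, so it can hold values such as 0x7FFFFFFF, but arrays are mutable and unhashable, so a tuple copy is needed for dictionary keys.

```python
from array import array

codes = array("L", [0x60000001, 0x41, 0x7FFFFFFF])


def find_sub(haystack, needle, start=0):
    """Naive subsequence search over two arrays of the same typecode."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:  # slices of arrays are arrays
            return i
    return -1


position = find_sub(codes, array("L", [0x41, 0x7FFFFFFF]))
single = codes.index(0x41)       # built-in search for one element
key = tuple(codes)               # hashable copy for use as a dict key
print(codes.itemsize, position, single)
```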