Re: Pure python implementation of string-like class

2006-02-26 Thread Xavier Morel
Ross Ridge wrote:
> Xavier Morel wrote:
>> Not if you're still within Unicode / Universal Character Set code space.
>
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than the entire Unicode code space.
>
>   Ross Ridge
 
He implies that, but in later messages he
1. Implies that he wants to use the Unicode private spaces, which are in
the Unicode code space
2. Says explicitly that his needs concern Kanji encoding, which do fit
in the existing Unicode code space, even if you take the largest
estimates of the number of existing Kanji (~80000), and which are (I
think) already partially represented by the CJK Unified Ideographs and
CJK Unified Ideographs Extension A sets of regular Unicode.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Pure python implementation of string-like class

2006-02-26 Thread Ross Ridge
Ross Ridge wrote:
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than the entire Unicode code space.

Xavier Morel wrote:
> He implies that ...

He explicitly said that the character set he wanted to use wouldn't fit in
UTF-16.

> ... but in later messages he
> 1. Implies that he wants to use the Unicode private spaces, which are in
> the Unicode code space

He explicitly said that he wanted to use the U+60000000...U+7FFFFFFF
range, which is outside of the Unicode code space, despite him
mistakenly calling them Unicode characters.

> 2. Says explicitly that his needs concern Kanji encoding...

I have no clue whether he really needs such a large character set, but
if he does then it makes sense for him to want to use an encoding
that's wider than UTF-16.

As for the problem he actually posed, I'd suggest using tuples rather
than lists, since tuples are immutable like strings.  That would make
it easier for the class to be used as a key in a dictionary.  Hmm...
thinking about it, it might actually make sense to use strings as the
internal representation, as a lot of operations can be implemented by using
the standard string operations but multiplying the offsets and lengths by
4.
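
As a rough illustration of the "strings with offsets multiplied by 4" idea,
here is a minimal sketch in modern Python 3 (not code from this thread). The
class name PackedString and its methods are hypothetical, and it assumes every
character code fits in an unsigned 32-bit integer. One caveat the sketch has to
handle: a native search can match across a character boundary, so only hits on
a 4-byte boundary count.

```python
import struct


class PackedString:
    """Hypothetical string-like class: each code is stored as 4 big-endian bytes."""

    WIDTH = 4  # bytes per character

    def __init__(self, codes=()):
        codes = list(codes)
        # ">%dI" packs every code as an unsigned 32-bit big-endian integer.
        self._data = struct.pack(">%dI" % len(codes), *codes)

    def __len__(self):
        return len(self._data) // self.WIDTH

    def __eq__(self, other):
        return isinstance(other, PackedString) and self._data == other._data

    def __hash__(self):
        # bytes are immutable and hashable, so instances can be dict keys.
        return hash(self._data)

    def codes(self):
        """Return the stored character codes as a list of integers."""
        return list(struct.unpack(">%dI" % len(self), self._data))

    def find(self, sub, start=0):
        # Delegate to bytes.find, but accept only matches that begin on a
        # 4-byte boundary; otherwise retry from one byte further on.
        pos = self._data.find(sub._data, start * self.WIDTH)
        while pos != -1 and pos % self.WIDTH:
            pos = self._data.find(sub._data, pos + 1)
        return -1 if pos == -1 else pos // self.WIDTH

    def startswith(self, prefix):
        # A match at offset 0 is always aligned, so no extra check is needed.
        return self._data.startswith(prefix._data)


s = PackedString([0x60000001, 0x41, 0x7FFFFFFF])
print(len(s))                          # 3
print(s.find(PackedString([0x41])))    # 1
```

The alignment loop in find is the price of reusing the native search; methods
such as split or replace would need the same check.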

   Ross Ridge



Re: Pure python implementation of string-like class

2006-02-26 Thread Akihiro KAYAMA

Hi Ross. 

Thanks a lot for the clarification. I didn't think my post could start a
Unicode flame war.

I don't know whether this mailing list is the right place to talk about
Unicode issues, but for me, the roughly one million code points that
UTF-16 provides are not enough. Unicode presumes that the same character
always has the same code point. But unlike the simple and beautiful Roman
alphabet, it is sometimes difficult to decide whether two kanji characters
are the same or not, because a glyph varies for many reasons (e.g. who
wrote it, when, and where). So first of all we assign code points, and
only afterwards consider whether a character that appears in some Chinese
historical book may be the same as a character in Unicode CJK Extension
A. Identifying characters in this way is also one of my project's tasks.
I think this explains why UTF-16 is enough for the majority but
not for all.

Anyway, I suppose that implementing string-like classes is a generic
Python issue. For example, it would be useful if a rich text class
that has style attributes such as bold on each character also had
string-like methods and could be handled like a string.

In article [EMAIL PROTECTED],
Ross Ridge [EMAIL PROTECTED] writes:

rridge> thinking about it, it might actually make sense to use strings as the
rridge> internal representation, as a lot of operations can be implemented by using
rridge> the standard string operations but multiplying the offsets and lengths by
rridge> 4.

Ah, COOL! It sounds very nice. I'll try it.
Thanks again.

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread and-google
Akihiro KAYAMA wrote:
> As the character set is wider than UTF-16 (U+10FFFF), I can't use
> Python's native unicode string class.

Have you tried using Python compiled in Wide Unicode mode
(--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
which should be enough for most purposes.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/



Re: Pure python implementation of string-like class

2006-02-25 Thread Steve Holden
Akihiro KAYAMA wrote:
> Hi all.
>
> I would like to ask how I can implement a string-like class using a tuple
> or list. Does anyone know of any example code for a pure Python
> implementation of a string-like class?
>
> I am trying to use Python for text processing that involves a large
> character set. As the character set is wider than UTF-16 (U+10FFFF), I
> can't use Python's native unicode string class.
 
"Wider than UTF-16" doesn't make sense.

> So I want to write my own string class, which provides convenience
> string methods such as split, join, find and others like the usual string
> class, but uses a sequence of integers as its internal representation
> instead of a native string.  Obviously, subclassing str doesn't
> help.
>
> The implementation of each string method in the Python source
> tree (stringobject.c) is far from Python code, so I have started from
> scratch, like below:
>
>     def startswith(self, prefix, start=-1, end=-1):
>         # Only the default start/end are handled so far.
>         assert start < 0, "not implemented"
>         assert end < 0, "not implemented"
>         if isinstance(prefix, (str, unicode)):
>             prefix = MyString(prefix)
>         n = len(prefix)
>         return self[0:n] == prefix
>
> but I found it is not a trivial task for me to achieve correctness
> and completeness. It also smells like reinventing the wheel, though I
> can't find any hints on Google or in the Python Cookbook.
>
> I don't care about efficiency as a starting point. Any comments are welcome.
> Thanks.
 
The UTF-16 encoding is capable of representing the whole of Unicode. 
There should be no need to do anything special to use UTF-16.
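
To make the UTF-16 limit concrete, the following is a small sketch (modern
Python 3, not part of the original exchange) of how a code point above U+FFFF
is split into a surrogate pair. The function name to_surrogates is made up for
this example. Only 20 bits survive the split, which is why nothing above
U+10FFFF can be expressed this way.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))              # 0xdbff 0xdfff

# Cross-check against Python's own UTF-16 encoder.
expected = chr(0x10FFFF).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```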

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/



Re: Pure python implementation of string-like class

2006-02-25 Thread Akihiro KAYAMA
Hi bearophile.

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] writes:

bearophileHUGS> Maybe you can create your class using an array of 'L' with the array
bearophileHUGS> standard module.

Thanks for your suggestion. I'm currently using an ordinary list as the
internal representation. As far as I understand, compared to a list,
the array module offers efficiency but no convenient functions for
implementing the various string methods. As Python's list is already fast
enough, I want to speed up my coding work first.

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread Akihiro KAYAMA
Hi And.

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] writes:

and-google> Akihiro KAYAMA wrote:
and-google> > As the character set is wider than UTF-16 (U+10FFFF), I can't use
and-google> > Python's native unicode string class.
and-google>
and-google> Have you tried using Python compiled in Wide Unicode mode
and-google> (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
and-google> which should be enough for most purposes.

From my quick survey, Python's Unicode support is intentionally
restricted to the UTF-16 range (U+0000...U+10FFFF), regardless of the
--enable-unicode=ucs4 option.

 Python 2.4.1 (#2, Sep  3 2005, 22:35:47)
 [GCC 2.95.4 20020320 [FreeBSD]] on freebsd4
 Type "help", "copyright", "credits" or "license" for more information.
 >>> u"\U0010FFFF"
 u'\U0010ffff'
 >>> len(u"\U0010FFFF")
 1
 >>> u"\U00110000"
 UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9:
 illegal Unicode character

A simple patch to unicodeobject.c that disables the Unicode range check
could solve this, but I don't want to maintain a specialized Python
binary for my project.
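
For readers on a current interpreter, the same limit can be checked as follows.
This is a sketch that assumes Python 3.3 or later, where every build uses the
full U+0000...U+10FFFF range; it is not the 2.4.1 build shown in the session
above.

```python
import sys

# On Python 3.3+ the native str type always covers the whole Unicode code space.
print(hex(sys.maxunicode))      # 0x10ffff

top = chr(0x10FFFF)             # the last code point is accepted
print(len(top))                 # 1

try:
    chr(0x110000)               # one past the end of the code space
except ValueError as exc:
    print("rejected:", exc)
```

So the restriction described here still applies to native strings: values such
as 0x60000000 have to be kept in some other container.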

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge

Steve Holden wrote:
> "Wider than UTF-16" doesn't make sense.

It makes perfect sense.

  Ross Ridge



Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge
Steve Holden wrote:
> "Wider than UTF-16" doesn't make sense.

Ross Ridge wrote:
> It makes perfect sense.

Alan Kennedy wrote:
> UTF-16 is a Unicode Transformation Format, meaning that it is a
> mechanism for representing all unicode code points, even the ones with
> ordinals greater than 0xFFFF, using series of 16-bit values.

It's an encoding format that only supports encoding 1,112,064 different
characters, making it a little more than 20 bits wide.  While this is
enough to encode all code points currently assigned by Unicode, it's
not sufficient to encode the private use area of ISO 10646-1 that
Akihiro Kayama wants to use.
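
The 1,112,064 figure can be reproduced with a few lines of arithmetic. This is
a Python sketch with illustrative variable names; the private-use start value
0x60000000 is the one quoted elsewhere in this thread.

```python
# Code points in the Unicode code space U+0000..U+10FFFF (17 planes of 65,536).
code_space = 0x10FFFF + 1

# U+D800..U+DFFF are reserved as UTF-16 surrogates and encode no character.
surrogates = 0xDFFF - 0xD800 + 1

encodable = code_space - surrogates
print(encodable)                 # 1112064
print(encodable.bit_length())    # 21, i.e. a little more than 20 bits

# The ISO 10646-1 private-use range in question starts far above that limit.
private_start = 0x60000000
print(private_start > 0x10FFFF)  # True
```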

   Ross Ridge



Re: Pure python implementation of string-like class

2006-02-25 Thread Xavier Morel
Akihiro KAYAMA wrote:
> Sorry for my terrible English. I am living in Japan, and we have a
> large number of characters called Kanji. UTF-16 (U+0000...U+10FFFF) is
> enough for practical use in this country also, but for academic
> purposes, I need a large codespace over 20 bits. I wish I could use
> unicode's private space (U+60000000...U+7FFFFFFF) in Python.
>
> -- kayama

I think the Kanji are part of the Han script as far as Unicode is
concerned; you should check (CJK Unified Ideographs and CJK Unified
Ideographs Extension A). They may not all be there, but the 27502
characters from these two tables should be enough for most uses.

Oh, by the way, the Unicode code space only goes up to 10FFFF. While
UCS-4's encoding allows code values up to and including 7FFFFFFF, the
upper Unicode private space is Plane Sixteen (100000–10FFFF), the other
private spaces being a part of the Basic Multilingual Plane
(U+E000–U+F8FF) and Plane Fifteen (U+F0000–U+FFFFF), and even UTF-32
doesn't go beyond 10FFFF.

Since the Dai Kan-Wa jiten only lists about 50,000 kanji (even though
it probably isn't perfectly complete) it fits with ease in both plane
fifteen and sixteen (65536 code points each).


Re: Pure python implementation of string-like class

2006-02-25 Thread Xavier Morel
Ross Ridge wrote:
> Steve Holden wrote:
>> "Wider than UTF-16" doesn't make sense.
>
> It makes perfect sense.
>
>   Ross Ridge
 

Not if you're still within Unicode / Universal Character Set code space.
While UCS-4 technically goes beyond any Unicode Transformation Format
(UTF-7, 8, 16 and 32 stop at 10FFFF) it also goes beyond the range of
the UCS itself (0-10FFFF). UTF-32 is the limitation of UCS-4 to the
Unicode standard.

While it could be argued that the Unicode/UCS limit of 10FFFF was chosen
_because_ of the limitations of UTF-16, it's probably irrelevant to the
discussion.


Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge

Xavier Morel wrote:
> Not if you're still within Unicode / Universal Character Set code space.

Akihiro Kayama in his original post made it clear that he wanted to use
a character set larger than the entire Unicode code space.

  Ross Ridge



Pure python implementation of string-like class

2006-02-24 Thread Akihiro KAYAMA

Hi all.

I would like to ask how I can implement a string-like class using a tuple
or list. Does anyone know of any example code for a pure Python
implementation of a string-like class?

I am trying to use Python for text processing that involves a large
character set. As the character set is wider than UTF-16 (U+10FFFF), I
can't use Python's native unicode string class.

So I want to write my own string class, which provides convenience
string methods such as split, join, find and others like the usual string
class, but uses a sequence of integers as its internal representation
instead of a native string.  Obviously, subclassing str doesn't
help.

The implementation of each string method in the Python source
tree (stringobject.c) is far from Python code, so I have started from
scratch, like below:

    def startswith(self, prefix, start=-1, end=-1):
        # Only the default start/end are handled so far.
        assert start < 0, "not implemented"
        assert end < 0, "not implemented"
        if isinstance(prefix, (str, unicode)):
            prefix = MyString(prefix)
        n = len(prefix)
        return self[0:n] == prefix

but I found it is not a trivial task for me to achieve correctness
and completeness. It also smells like reinventing the wheel, though I
can't find any hints on Google or in the Python Cookbook.

I don't care about efficiency as a starting point. Any comments are welcome.
Thanks.
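
One possible shape for such a class, given as a minimal sketch in modern
Python 3 rather than the Python 2 of this post: subclass tuple so that
immutability, hashing, comparison and iteration come for free, and add the
string methods on top. The name CodeString and the method set are hypothetical,
and negative start/end values are not handled.

```python
class CodeString(tuple):
    """Hypothetical immutable string-like sequence of integer character codes."""

    def __new__(cls, value=()):
        if isinstance(value, str):
            # Accept native strings by converting them to code points.
            value = (ord(ch) for ch in value)
        return super().__new__(cls, value)

    def __getitem__(self, index):
        result = super().__getitem__(index)
        # Slices come back as plain tuples, so rewrap them.
        return CodeString(result) if isinstance(index, slice) else result

    def find(self, sub, start=0, end=None):
        sub = CodeString(sub)
        end = len(self) if end is None else end
        for i in range(start, end - len(sub) + 1):
            if self[i:i + len(sub)] == sub:
                return i
        return -1

    def startswith(self, prefix, start=0, end=None):
        prefix = CodeString(prefix)
        end = len(self) if end is None else end
        return self[start:end][:len(prefix)] == prefix

    def split(self, sep):
        sep = CodeString(sep)
        if not sep:
            raise ValueError("empty separator")
        parts, pos = [], 0
        while True:
            hit = self.find(sep, pos)
            if hit == -1:
                parts.append(self[pos:])
                return parts
            parts.append(self[pos:hit])
            pos = hit + len(sep)

    def join(self, pieces):
        out = []
        for n, piece in enumerate(pieces):
            if n:
                out.extend(self)      # separator between pieces
            out.extend(CodeString(piece))
        return CodeString(out)


s = CodeString([0x60000001, 0x2C, 0x60000002, 0x2C, 0x41])
print(s.find([0x60000002]))                        # 2
print(CodeString(",").join(s.split(",")) == s)     # True
```

The linear scans are deliberately naive, in keeping with not caring about
efficiency as a starting point.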

-- kayama


Re: Pure python implementation of string-like class

2006-02-24 Thread bearophileHUGS
Maybe you can create your class using an array of 'L' with the array
standard module.
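
For example, something along these lines (a sketch in modern Python 3; the
find helper is hypothetical and only shows the kind of method that would still
have to be written by hand, since array offers index and count for single
items but no substring search):

```python
from array import array

# 'L' holds unsigned integers of at least 4 bytes; the exact size is platform dependent.
text = array("L", [0x60000001, 0x2C, 0x60000002])

print(text.itemsize >= 4)                             # True
print(text.index(0x2C))                               # 1
print(text.count(0x2C))                               # 1
print(text[1:] == array("L", [0x2C, 0x60000002]))     # True


def find(haystack, needle, start=0):
    """Naive subsequence search over two arrays of the same typecode."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


print(find(text, array("L", [0x2C, 0x60000002])))     # 1
```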

Bye,
bearophile
