Re: Pure python implementation of string-like class

2006-02-26 Thread Akihiro KAYAMA

Hi Ross. 

Thanks a lot for the clarification. I didn't expect my post to turn into a
Unicode flame war.

I don't know whether this mailing list is the right place to discuss
Unicode issues, but for me, the roughly one million code points that
UTF-16 provides are not enough. Unicode presumes that the same character
always has the same code point. But unlike the simple and beautiful Roman
alphabet, it is sometimes difficult to decide whether two kanji
characters are the "same" or not, because a glyph varies for many
reasons (e.g. who wrote it, when, and where). So we first assign code
points, and only afterwards consider whether "this character, which
appears in this Chinese historical book, may be the same character as
this one in Unicode CJK Extension A". Identifying characters in this way
is also one of my project's tasks. I think this explains why UTF-16 is
enough for the majority but not for everyone.

Anyway, I suppose that implementing string-like classes is a general
Python issue. For example, it would be useful if a rich text class that
carries style attributes such as bold on each character also had
string-like methods and could be handled like a string.

In article <[EMAIL PROTECTED]>,
"Ross Ridge" <[EMAIL PROTECTED]> writes:

rridge> thinking about it, it might actually make sense to use strings as the
rridge> internal representation as a lot of operations can be implemented by using
rridge> the standard string operations but multiplying the offsets and lengths by
rridge> 4.

Ah, COOL! It sounds very nice. I'll try it.
Thanks again.

-- kayama
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Pure python implementation of string-like class

2006-02-26 Thread Ross Ridge
Ross Ridge wrote:
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than entire Unicode code space.

Xavier Morel wrote:
> He implies that ...

He explicitly said that the character set he wanted to use wouldn't fit in
UTF-16.

>... but in later messages he
> 1. Implies that he wants to use the Unicode private spaces, which are in
> the Unicode code space

He explicitly said that he wanted to use the "U+6000...U+7FFF"
range, which is outside of the Unicode code space, despite him
mistakenly calling them Unicode characters.

> 2. Says explicitly that  his needs concern Kanji encoding...

I have no clue whether he really needs such a large character set, but
if he does then it makes sense for him to want to use an encoding
that's wider than UTF-16.

As for the problem he actually posed, I'd suggest using tuples rather
than lists, since tuples are immutable like strings.  That would make
it easier for the class to be used as a key in a dictionary.  Hmm...
thinking about it, it might actually make sense to use strings as the
internal representation, as a lot of operations can be implemented by using
the standard string operations but multiplying the offsets and lengths by
4.
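
A minimal sketch of the "multiply the offsets by 4" idea, written for
current Python 3 rather than the 2.4 of this thread. The class name
PackedString and its methods are hypothetical, and the code points used
below are arbitrary values above 0x10FFFF. Each character is stored as a
big-endian 32-bit integer inside a bytes object, so the built-in byte
search does most of the work:

```python
import struct


class PackedString:
    """Immutable sequence of code points, 4 bytes each (hypothetical name)."""

    WIDTH = 4  # bytes used per code point

    def __init__(self, codepoints=()):
        cps = list(codepoints)
        # Pack every code point as a big-endian unsigned 32-bit integer.
        self._data = struct.pack(">%dI" % len(cps), *cps)

    def __len__(self):
        return len(self._data) // self.WIDTH

    def __eq__(self, other):
        return isinstance(other, PackedString) and self._data == other._data

    def __hash__(self):
        # Hashable because the underlying bytes object is immutable.
        return hash(self._data)

    def startswith(self, prefix):
        # A prefix match always begins at byte 0, so it is always aligned.
        return self._data.startswith(prefix._data)

    def find(self, sub, start=0):
        """Return the character index of sub at or after start, or -1."""
        pos = self._data.find(sub._data, start * self.WIDTH)
        # A byte-level hit may straddle two characters; skip unaligned hits.
        while pos != -1 and pos % self.WIDTH:
            pos = self._data.find(sub._data, pos + 1)
        return pos // self.WIDTH if pos != -1 else -1


s = PackedString([0x60000000, 0x41, 0x7FFFFFFF])
print(len(s), s.find(PackedString([0x41])))
```

The alignment loop in find() is the one place where the byte-level
shortcut is not enough on its own: a match that starts in the middle of
a 4-byte unit is not a match between characters.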

   Ross Ridge



Re: Pure python implementation of string-like class

2006-02-26 Thread Xavier Morel
Ross Ridge wrote:
> Xavier Morel wrote:
>> Not if you're still within Unicode / Universal Character Set code space.
> 
> Akihiro Kayama in his original post made it clear that he wanted to use
> a character set larger than entire Unicode code space.
> 
>   Ross Ridge
> 
He implies that, but in later messages he
1. Implies that he wants to use the Unicode private spaces, which are in 
the Unicode code space
2. Says explicitly that his needs concern Kanji encoding, which does fit
in the existing Unicode code space, even if you take the largest
estimates of the number of existing Kanji (~8), and which is (I
think) already partially represented by the CJK Unified Ideograms and
CJK Unified Ideograms Extension A sets of regular Unicode.


Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge

Xavier Morel wrote:
> Not if you're still within Unicode / Universal Character Set code space.

Akihiro Kayama in his original post made it clear that he wanted to use
a character set larger than entire Unicode code space.

  Ross Ridge



Re: Pure python implementation of string-like class

2006-02-25 Thread Xavier Morel
Ross Ridge wrote:
> Steve Holden wrote:
>> "Wider than UTF-16" doesn't make sense.
> 
> It makes perfect sense.
> 
>   Ross Ridge
> 

Not if you're still within Unicode / Universal Character Set code space. 
While UCS-4 technically goes beyond any Unicode Transformation Format 
(UTF-7, 8, 16 and 32 stop at 10) it also goes beyond the range of 
the UCS itself (0-10). UTF-32 is the limitation of UCS-4 to the 
Unicode standard.

While it could be argued that the Unicode/UCS limit of 10 was chosen
_because_ of the limitations of UTF-16, it's probably irrelevant to the
discussion.


Re: Pure python implementation of string-like class

2006-02-25 Thread Xavier Morel
Akihiro KAYAMA wrote:
> Sorry for my terrible English. I am living in Japan, and we have a
> large number of characters called Kanji. UTF-16(U+...U+10) is
> enough for practical use in this country also, but for academic
> purpose, I need a large codespace over 20-bits. I wish I could use
> unicode's private space (U+6000...U+7FFF) in Python.
> 
> -- kayama

I think the Kanji are part of the Han script as far as Unicode is
concerned; you should check (CJK Unified Ideograms and CJK Unified
Ideograms Extension A). They may not all be there, but the 27502
characters from these two tables should be enough for most uses.

Oh, by the way, the Unicode code space only goes up to 10, while
UCS-4's encoding allows code values up to and including 7FFF. The
upper Unicode private space is Plane Sixteen (10–10), the other
private spaces being a part of the Basic Multilingual Plane
(U+E000–U+F8FF) and Plane Fifteen (U+F–U+F), and even UTF-32
doesn't go beyond 10.

Since the Dai Kan-Wa jiten "only" lists about 50,000 kanji (even though 
it probably isn't perfectly complete) it fits with ease in both plane 
fifteen and sixteen (65535 code points each).


Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge
Steve Holden wrote:
>"Wider than UTF-16" doesn't make sense.

Ross Ridge wrote:
> It makes perfect sense.

Alan Kennedy wrote:
> UTF-16 is a "Unicode Transformation Format", meaning that it is a
> mechanism for representing all unicode code points, even the ones with
> ordinals greater than 0x, using series of 16-bit values.

It's an encoding format that only supports encoding 1,112,064 different
characters, making it a little more than 20 bits wide.   While this is
enough to encode all code points currently assigned by Unicode, it's
not sufficient to encode the private use area of ISO 10646-1 that
Akihiro Kayama wants to use.
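
The 1,112,064 figure can be checked with a little arithmetic: it is the
size of the Unicode code space minus the surrogate range that UTF-16
reserves for itself. The short check below is only an illustration, and
the variable names are made up for it:

```python
import math

CODE_SPACE = 0x110000          # code points U+0000 through U+10FFFF
SURROGATES = 0xE000 - 0xD800   # U+D800 through U+DFFF, reserved for pairs

# Everything UTF-16 can encode: the code space without the surrogates.
encodable = CODE_SPACE - SURROGATES

print(encodable)               # 1112064
print(math.log2(encodable))    # a little more than 20
```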

   Ross Ridge



Re: Pure python implementation of string-like class

2006-02-25 Thread Alan Kennedy
[Steve Holden]
>>"Wider than UTF-16" doesn't make sense.

[Ross Ridge]
> It makes perfect sense.

No it doesn't.

UTF-16 is a "Unicode Transformation Format", meaning that it is a
mechanism for representing all unicode code points, even the ones with
ordinals greater than 0x, using series of 16-bit values.

http://en.wikipedia.org/wiki/UTF-16

"""
UTF-16 represents a character above hexadecimal  as a surrogate
pair of code values from the range D800-DFFF. For example, the
character at code point hexadecimal 1 becomes the code value
sequence D800 DC00, and the character at hexadecimal 10FFFD, the upper
limit of Unicode, becomes the code value sequence DBFF DFFD. Unicode
and ISO/IEC 10646 do not assign characters to any of the code points in
the D800-DFFF range, so an individual code value from a surrogate pair
does not ever represent a character.
"""
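
The surrogate-pair mapping described in the quote can be sketched in a
few lines. The function name to_surrogate_pair is hypothetical; the
arithmetic is the standard UTF-16 rule of subtracting 0x10000 and
splitting the remaining 20 bits into two 10-bit halves:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


for cp in (0x10000, 0x10FFFD):
    print(" ".join("%04X" % unit for unit in to_surrogate_pair(cp)))
```

The two 10-bit halves are why the scheme tops out at 0x10FFFF: 20 bits
above 0x10000 is all that a pair of surrogates can carry.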

So UTF-16 has no "width" to compare to, no more than utf-8 does.

I wonder what character set the OP is dealing with, if it's not
representable with Unicode. Presumably it's not a modern character set?

--
alan kennedy
--
email alan:  http://xhaus.com/contact/alan



Re: Pure python implementation of string-like class

2006-02-25 Thread Ross Ridge

Steve Holden wrote:
> "Wider than UTF-16" doesn't make sense.

It makes perfect sense.

  Ross Ridge



Re: Pure python implementation of string-like class

2006-02-25 Thread Akihiro KAYAMA
Hi Steve.

In article <[EMAIL PROTECTED]>,
Steve Holden <[EMAIL PROTECTED]> writes:

steve> Akihiro KAYAMA wrote:
steve> > Hi all.
steve> > 
steve> > I would like to ask how I can implement string-like class using tuple
steve> > or list. Does anyone know about some example codes of pure python
steve> > implementation of string-like class?
steve> > 
steve> > Because I am trying to use Python for a text processing which is
steve> > composed of a large character set. As the character set is wider than
steve> > UTF-16(U+10), I can't use Python's native unicode string class.
steve> > 
steve> "Wider than UTF-16" doesn't make sense.

Sorry for my terrible English. I live in Japan, and we have a
large number of characters called Kanji. UTF-16(U+...U+10) is
enough for practical use in this country too, but for academic
purposes I need a code space larger than 20 bits. I wish I could use
Unicode's private space (U+6000...U+7FFF) in Python.

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread Akihiro KAYAMA
Hi And.

In article <[EMAIL PROTECTED]>,
[EMAIL PROTECTED] writes:

and-google> Akihiro KAYAMA wrote:
and-google> > As the character set is wider than UTF-16(U+10), I can't use
and-google> > Python's native unicode string class.
and-google> 
and-google> Have you tried using Python compiled in Wide Unicode mode
and-google> (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
and-google> which should be enough for most purposes.

From my quick survey, Python's Unicode support is intentionally
restricted to the UTF-16 range(U+...U+10), regardless of the
--enable-unicode=ucs4 option.

> Python 2.4.1 (#2, Sep  3 2005, 22:35:47) 
> [GCC 2.95.4 20020320 [FreeBSD]] on freebsd4
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u"\U0010"
> u'\U0010'
> >>> len(u"\U0010")
> 1
> >>> u"\U0011"
> UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: 
> illegal Unicode character

A simple patch to unicodeobject.c that disables the Unicode range check
could solve this, but I don't want to maintain a specialized Python
binary for my project.
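
The same limit can be observed from Python code without reading the C
source. This check assumes Python 3.3 or later (not the 2.4.1 build shown
above), where sys.maxunicode is fixed and chr() rejects anything past it:

```python
import sys

# On Python 3.3+ the highest accepted code point is always 0x10FFFF.
print(hex(sys.maxunicode))

top = chr(0x10FFFF)      # the last valid code point is accepted
rejected = False
try:
    chr(0x110000)        # one past the end of the Unicode code space
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```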

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread Akihiro KAYAMA
Hi bearophile.

In article <[EMAIL PROTECTED]>,
[EMAIL PROTECTED] writes:

bearophileHUGS> Maybe you can create your class using an array of 'L' with the array
bearophileHUGS> standard module.

Thanks for your suggestion. I'm currently using an ordinary list as the
internal representation. As I understand it, compared to a list, the
array module offers efficiency but no convenient functions for
implementing the various string methods. As Python's list is already
fast enough, I want to speed up my coding work first.
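
For comparison, a small illustration of what the array module does and
does not provide, using current Python 3. The code points are arbitrary
example values; 'L' is the unsigned long typecode, which is at least 4
bytes wide:

```python
from array import array

# Compact storage: each item is a C unsigned long rather than a Python object.
text = array("L", [0x60000000, 0x4E00, 0x7FFFFFFF])

print(text.itemsize)        # bytes per item, platform dependent (at least 4)
print(text[1:])             # slicing returns another array
print(text.index(0x4E00))   # single-item search is available

# Substring-style helpers such as find() or split() are not provided.
print(hasattr(text, "find"), hasattr(text, "split"))
```

So an array saves memory but leaves the string methods to be written by
hand, just as with a list or tuple.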

-- kayama


Re: Pure python implementation of string-like class

2006-02-25 Thread Steve Holden
Akihiro KAYAMA wrote:
> Hi all.
> 
> I would like to ask how I can implement string-like class using tuple
> or list. Does anyone know about some example codes of pure python
> implementation of string-like class?
> 
> Because I am trying to use Python for a text processing which is
> composed of a large character set. As the character set is wider than
> UTF-16(U+10), I can't use Python's native unicode string class.
> 
"Wider than UTF-16" doesn't make sense.

> So I want to prepare my own string class, which provides convenience
> string methods such as split, join, find and others like usual string
> class, but it uses a sequence of integer as a internal representation
> instead of a native string.  Obviously, subclassing of str doesn't
> help.
> 
> The implementation of each string methods in the Python source
> tree(stringobject.c) is far from python code, so I have started from
> scratch, like below:
> 
> def startswith(self, prefix, start=-1, end=-1):
>     assert start < 0, "not implemented"
>     assert end < 0, "not implemented"
>     if isinstance(prefix, (str, unicode)):
>         prefix = MyString(prefix)
>     n = len(prefix)
>     return self[0:n] == prefix
> 
> but I found it's not a trivial task for myself to achive correctness
> and completeness. It smells "reinventing the wheel" also, though I
> can't find any hints in google and/or Python cookbook.
> 
> I don't care efficiency as a starting point. Any comments are welcome.
> Thanks.
> 
The UTF-16 encoding is capable of representing the whole of Unicode. 
There should be no need to do anything special to use UTF-16.

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/



Re: Pure python implementation of string-like class

2006-02-25 Thread and-google
Akihiro KAYAMA wrote:
> As the character set is wider than UTF-16(U+10), I can't use
> Python's native unicode string class.

Have you tried using Python compiled in Wide Unicode mode
(--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
which should be enough for most purposes.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/



Re: Pure python implementation of string-like class

2006-02-24 Thread bearophileHUGS
Maybe you can create your class using an array of 'L' with the array
standard module.

Bye,
bearophile



Pure python implementation of string-like class

2006-02-24 Thread Akihiro KAYAMA

Hi all.

I would like to ask how I can implement a string-like class using a tuple
or list. Does anyone know of example code for a pure Python
implementation of a string-like class?

I am trying to use Python for text processing that involves
a large character set. As the character set is wider than
UTF-16(U+10), I can't use Python's native unicode string class.

So I want to write my own string class, which provides convenient
string methods such as split, join, find and others like the usual string
class, but uses a sequence of integers as its internal representation
instead of a native string.  Obviously, subclassing str doesn't
help.

The implementation of each string method in the Python source
tree (stringobject.c) is far from Python code, so I have started from
scratch, like below:

def startswith(self, prefix, start=-1, end=-1):
    # Only whole-string matching is supported so far.
    assert start < 0, "not implemented"
    assert end < 0, "not implemented"
    # Accept native strings by converting them to MyString first.
    if isinstance(prefix, (str, unicode)):
        prefix = MyString(prefix)
    n = len(prefix)
    return self[0:n] == prefix

but I found that it is not a trivial task for me to achieve correctness
and completeness. It also smells like "reinventing the wheel", though I
can't find any hints on Google or in the Python Cookbook.

I don't care about efficiency as a starting point. Any comments are welcome.
Thanks.
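
One possible shape for such a class, as a rough sketch only. It is
written for current Python 3 rather than the Python 2 of the post, the
name CodePointString is hypothetical, and only a handful of methods are
covered (negative start values and end arguments are not handled). It
stores the code points in a tuple, which also makes instances hashable:

```python
class CodePointString:
    """String-like wrapper over a tuple of integers (hypothetical name)."""

    def __init__(self, source=()):
        if isinstance(source, CodePointString):
            source = source._cps
        elif isinstance(source, str):
            # Native strings are converted one character at a time.
            source = (ord(ch) for ch in source)
        self._cps = tuple(source)

    def __len__(self):
        return len(self._cps)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return CodePointString(self._cps[index])
        return self._cps[index]

    def __eq__(self, other):
        return isinstance(other, CodePointString) and self._cps == other._cps

    def __hash__(self):
        return hash(self._cps)

    def startswith(self, prefix, start=0):
        prefix = CodePointString(prefix)._cps
        return self._cps[start:start + len(prefix)] == prefix

    def find(self, sub, start=0):
        """Return the lowest index of sub at or after start, or -1."""
        sub = CodePointString(sub)._cps
        for i in range(start, len(self._cps) - len(sub) + 1):
            if self._cps[i:i + len(sub)] == sub:
                return i
        return -1

    def split(self, sep):
        sep = CodePointString(sep)
        if len(sep) == 0:
            raise ValueError("empty separator")
        parts, begin = [], 0
        while True:
            hit = self.find(sep, begin)
            if hit == -1:
                parts.append(self[begin:])
                return parts
            parts.append(self[begin:hit])
            begin = hit + len(sep)

    def join(self, parts):
        out = []
        for n, part in enumerate(parts):
            if n:
                out.extend(self._cps)  # separator between consecutive parts
            out.extend(CodePointString(part)._cps)
        return CodePointString(out)


s = CodePointString([0x60000000, 0x2C, 0x60000001])
print(s.find(","), len(s.split(",")))
```

Accepting native strings, lists, and other instances in every method
keeps call sites short, at the cost of a conversion on each call.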

-- kayama