Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-08 Thread Travis Oliphant
Thank you, Martin and Stephen, for the suggestions and comments.

For your information:

We decided that all NumPy arrays of unicode strings will use UCS4 for 
internal representation.  When an element of the array is selected, a 
unicodescalar (which inherits directly from the unicode builtin type but 
has attributes and methods of arrays) will be returned.   On wide 
builds, the scalar is a perfect match.  On narrow builds, surrogate 
pairs will be used if they are necessary as the data is copied over to 
the scalar.

Best regards,

-Travis


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Travis E. Oliphant

This is a design question which is why I'm posting here.  Recently the 
NumPy developers have become more aware of the difference between UCS2 
and UCS4 builds of Python.  NumPy arrays can be of Unicode type.  In 
other words a NumPy array can be made of up fixed-data-length unicode 
strings.

Currently that means that they are unicode strings of basic size UCS2 
or UCS4 depending on the platform.  It is this duality that has some 
people concerned.  For all other data-types, NumPy allows the user to 
explicitly request a bit-width for the data-type.

So, we are thinking of introducing another data-type to NumPy to 
differentiate between UCS2 and UCS4 unicode strings.  (This also means a 
unicode scalar object, i.e. string of each of these, exactly one of 
which will inherit from the Python type).

Before embarking on this journey, however, we are seeking advice from 
individuals wiser to the way of Unicode on this list.

Perhaps all we need to do is be more careful on input and output of 
Unicode data-types so that transfer of unicode can be handled correctly 
on each platform.

Any thoughts?

-Travis Oliphant



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Martin v. Löwis
Travis E. Oliphant wrote:
 Currently that means that they are unicode strings of basic size UCS2 
 or UCS4 depending on the platform.  It is this duality that has some 
 people concerned.  For all other data-types, NumPy allows the user to 
 explicitly request a bit-width for the data-type.

Why is that a desirable property? Also: Why does have NumPy support for
Unicode arrays in the first place?

 Before embarking on this journey, however, we are seeking advice from 
 individuals wiser to the way of Unicode on this list.

My initial reaction is: use whatever Python uses in NumPy Unicode.
Upon closer inspection, it is not all that clear what operations
are supported on a Unicode array, and how these operations relate
to the Python Unicode type.

In any case, I think NumPy should have only a single Unicode array
type (please do explain why having zero of them is insufficient).

If the purpose of the type is to interoperate with a Python
unicode object, it should use the same width (as this will
allow for mempcy).

If the purpose is to support arbitrary Unicode characters, it should
use 4 bytes (as two bytes are insufficient to represent arbitrary
Unicode characters).

If the purpose is something else, please explain what the purpose
is.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Martin v. Löwis
Travis E. Oliphant wrote:
 Numpy supports arrays of arbitrary fixed-length records.  It is
 much more than numeric-only data now.  One of the fields that a
 record can contain is a string.  If strings are supported, it makes
 sense to support unicode strings as well.

Hmm. How do you support strings in fixed-length records? Strings are
variable-sized, after all.

On common application is that you have a C struct in some API which
has a fixed-size array for string data (either with a length field,
or null-terminated), in this case, it is moderately useful to model
such a struct in Python. However, transferring this to Unicode is
pointless - there aren't any similar Unicode structs that need
support.

 This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the C struct case. Then why do you need Unicode
support there? Which common file format has embedded fixed-size
Unicode data?

 Perhaps you should explain why you think NumPy shouldn't support
 Unicode

I think I said Unicode arrays, not Unicode. Unicode arrays are
a pointless data type, IMO. Unicode always comes in strings
(i.e. variable sized, either null-terminated or with an introducing
length). On disk/on the wire Unicode comes as UTF-8 more often
than not.

Using UCS-2/UCS-2 as an on-disk represenationis also questionable
practice (although admittedly Microsoft uses that a lot).

 That is currently what is done.  The current unicode data-type is 
 exactly what Python uses.

Then I wonder how this goes along with the use case allow to
map arbitrary files.

 The chararray subclass gives to unicode and string arrays all the 
 methods of unicode and strings (operating on an element-by-element
 basis).

For strings, I can see use cases (although I wonder how you deal
with data formats that also support variable-sized strings, as
most data formats supporting strings do).

 Please explain why having zero of them is *sufficient*.

Because I (still) cannot imagine any specific application that
might need such a feature (IOWYAGNI).

 If the purpose is to support arbitrary Unicode characters, it
 should use 4 bytes (as two bytes are insufficient to represent
 arbitrary Unicode characters).
 
 
 And Python does not support arbitrary Unicode characters on narrow 
 builds?  Then how is \U0010 represented?

It's represented using UTF-16. Try this for yourself:

py len(u\U0010)
2
py u\U0010[0]
u'\udbff'
py u\U0010[1]
u'\udfff'

This has all kinds of non-obvious implications.

 The purpose is to represent bytes as they might exist in a file or 
 data-stream according to the users specification.

See, and this is precisely the statement that I challenge. Sure,
they might exist - but I'd rather expect that they don't.

If they exist, Unicode might come as variable-sized UTF-8, UTF-16,
or UTF-32. In either case, NumPy should already support that by
mapping a string object onto the encoded bytes, to which you then
can apply .decode() should you need to process the actual Unicode
data.

 The purpose is 
 whatever the user wants them for.  It's the same purpose as having an
  unsigned 64-bit data-type --- because users may need it to represent
  data as it exists in a file.

No. I would expect you have 64-bit longs because users *do* need them,
and because there wouldn't be an easy work-around if users wouldn't have
them. For Unicode, it's different: users don't directly need them
(atleast not many users), and if they do, there is an easy work-around
for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are
24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles).
Can I represent them all in NumPy? Can I have NumPy transparently
map a sequence of run list records (which are variable-sized)
map as an array of run list records?

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Stephen J. Turnbull
 Travis == Travis E Oliphant [EMAIL PROTECTED] writes:

Travis Numpy supports arrays of arbitrary fixed-length records.
Travis It is much more than numeric-only data now.  One of the
Travis fields that a record can contain is a string.  If strings
Travis are supported, it makes sense to support unicode strings
Travis as well.

That is not obvious.  A string is really an array of bytes, which for
historical reasons in some places (primarily the U.S. of A.) can be
used to represent text.  Unicode, on the other hand, is intended to
represent text streams robustly and does so in a universal but
flexible way ... but all of the different Unicode transformation
formats are considered to represent the *identical* text stream.  Some
applications may specify a transformation format, others will not.

In any case, internally Python is only going to support *one*; all the
others must be read in through codecs anyway.  See below.

Travis This allows NumPy to memory-map arbitrary data-files on
Travis disk.

In the case where a transformation format *is* specified, I don't see
why you can't use a byte array field (ie, ordinary string) of
appropriate size for this purpose, and read it through a codec when it
needs to be treated as text.  This is going to be necessary in
essentially all of the cases I encounter, because the files are UTF-8
and sane internal representations are either UTF-16 or UTF-32.  In
particular, Python's internal representation is 16 or 32 bits wide.

Travis Perhaps you should explain why you think NumPy shouldn't
Travis support Unicode

Because it can't, not in the way you would like to, if I understand
you correctly.  Python chooses *one* of the many standard
representations for internal use, and because of the way the standard
is specified, it doesn't matter which one!  And none of the others can
be represented directly, all must be decoded for internal use and
encoded when written back to external media.  So any memory mapping
application is inherently nonportable, even across Python
implementations.

Travis And Python does not support arbitrary Unicode characters
Travis on narrow builds?  Then how is \U0010 represented?

In a way incompatible with the concept of character array.  Now what
do you do?

The point is that Unicode is intentionally designed in such a way that
a plethora of representations is possible, but all are easily and
reliably interconverted.  Implementations are then free to choose an
appropriate internal representation, knowing that conversion from
external representations is cheap and standardized.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of TsukubaTennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can do free software business;
  ask what your business can do for free software.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com