Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-08 Thread Travis Oliphant
Thank you, Martin and Stephen, for the suggestions and comments.

For your information:

We decided that all NumPy arrays of unicode strings will use UCS4 for 
internal representation.  When an element of the array is selected, a 
unicode scalar (which inherits directly from the unicode builtin type but 
has attributes and methods of arrays) will be returned.   On wide 
builds, the scalar is a perfect match.  On narrow builds, surrogate 
pairs will be used if they are necessary as the data is copied over to 
the scalar.
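
The narrow-build copy described above is just the standard UTF-16
surrogate computation. A minimal sketch in plain Python (not the actual
NumPy source) of what happens to one supplementary code point:

def ucs4_to_utf16_units(code_point):
    # Return the UTF-16 code unit(s) for a single UCS-4 code point.
    if code_point <= 0xFFFF:
        return [code_point]               # BMP: one code unit suffices
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)        # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)       # low (trail) surrogate
    return [high, low]

assert ucs4_to_utf16_units(0x10FFFF) == [0xDBFF, 0xDFFF]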

Best regards,

-Travis




Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Stephen J. Turnbull
> "Travis" == Travis E Oliphant <[EMAIL PROTECTED]> writes:

Travis> Numpy supports arrays of arbitrary fixed-length "records".
Travis> It is much more than numeric-only data now.  One of the
Travis> fields that a record can contain is a string.  If strings
Travis> are supported, it makes sense to support unicode strings
Travis> as well.

That is not obvious.  A string is really an array of bytes, which for
historical reasons in some places (primarily the U.S. of A.) can be
used to represent text.  Unicode, on the other hand, is intended to
represent text streams robustly and does so in a universal but
flexible way ... but all of the different Unicode transformation
formats are considered to represent the *identical* text stream.  Some
applications may specify a transformation format, others will not.

In any case, internally Python is only going to support *one*; all the
others must be read in through codecs anyway.  See below.

Travis> This allows NumPy to memory-map arbitrary data-files on
Travis> disk.

In the case where a transformation format *is* specified, I don't see
why you can't use a byte array field (ie, ordinary "string") of
appropriate size for this purpose, and read it through a codec when it
needs to be treated as text.  This is going to be necessary in
essentially all of the cases I encounter, because the files are UTF-8
and sane internal representations are either UTF-16 or UTF-32.  In
particular, Python's internal representation is 16 or 32 bits wide.
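
Concretely, the byte-field approach might look like this (present-day
NumPy spelling; the field names and sizes are invented for illustration):

import numpy as np

# A fixed-size record with a raw byte field; the same dtype would work
# with np.memmap for the memory-mapping case discussed in this thread.
record = np.dtype([('id', np.uint32), ('name', 'S16')])
table = np.zeros(2, dtype=record)
table[0]['name'] = u'caf\xe9'.encode('utf-8')    # store UTF-8 bytes

# Decode through a codec only when the field is treated as text.
assert table[0]['name'].decode('utf-8') == u'caf\xe9'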

Travis> Perhaps you should explain why you think NumPy "shouldn't
Travis> support Unicode"

Because it can't, not in the way you would like to, if I understand
you correctly.  Python chooses *one* of the many standard
representations for internal use, and because of the way the standard
is specified, it doesn't matter which one!  And none of the others can
be represented directly, all must be decoded for internal use and
encoded when written back to external media.  So any memory mapping
application is inherently nonportable, even across Python
implementations.

Travis> And Python does not support arbitrary Unicode characters
Travis> on narrow builds?  Then how is \U0010 represented?

In a way incompatible with the concept of character array.  Now what
do you do?

The point is that Unicode is intentionally designed in such a way that
a plethora of representations is possible, but all are easily and
reliably interconverted.  Implementations are then free to choose an
appropriate internal representation, knowing that conversion from
external representations is "cheap" and standardized.
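
That cheapness is easy to demonstrate (present-day Python spelling):
every transformation format round-trips to the identical text.

s = u'caf\xe9 \U0010FFFF'
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    assert s.encode(codec).decode(codec) == s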


-- 
School of Systems and Information Engineering   http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba   Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Martin v. Löwis
Travis E. Oliphant wrote:
> Numpy supports arrays of arbitrary fixed-length "records".  It is
> much more than numeric-only data now.  One of the fields that a
> record can contain is a string.  If strings are supported, it makes
> sense to support unicode strings as well.

Hmm. How do you support strings in fixed-length records? Strings are
variable-sized, after all.

One common application is that you have a C struct in some API which
has a fixed-size array for string data (either with a length field,
or null-terminated); in this case, it is moderately useful to model
such a struct in Python. However, transferring this to Unicode is
pointless - there aren't any similar Unicode structs that need
support.

> This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the "C struct" case. Then why do you need Unicode
support there? Which common file format has embedded fixed-size
Unicode data?

> Perhaps you should explain why you think NumPy "shouldn't support
> Unicode"

I think I said "Unicode arrays", not Unicode. Unicode arrays are
a pointless data type, IMO. Unicode always comes in strings
(i.e. variable sized, either null-terminated or with an introducing
length). On disk/on the wire Unicode comes as UTF-8 more often
than not.

Using UCS-2/UCS-4 as an on-disk representation is also questionable
practice (although admittedly Microsoft uses that a lot).

> That is currently what is done.  The current unicode data-type is 
> exactly what Python uses.

Then I wonder how this goes along with the use case of mapping
arbitrary files.

> The chararray subclass gives to unicode and string arrays all the 
> methods of unicode and strings (operating on an element-by-element
> basis).

For strings, I can see use cases (although I wonder how you deal
with data formats that also support variable-sized strings, as
most data formats supporting strings do).

> Please explain why having zero of them is *sufficient*.

Because I (still) cannot imagine any specific application that
might need such a feature (IOWYAGNI).

>> If the purpose is to support arbitrary Unicode characters, it
>> should use 4 bytes (as two bytes are insufficient to represent
>> arbitrary Unicode characters).
> 
> 
> And Python does not support arbitrary Unicode characters on narrow 
> builds?  Then how is \U0010FFFF represented?

It's represented using UTF-16. Try this for yourself:

py> len(u"\U0010FFFF")
2
py> u"\U0010FFFF"[0]
u'\udbff'
py> u"\U0010FFFF"[1]
u'\udfff'

This has all kinds of non-obvious implications.
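
One such implication, sketched in present-day Python spelling: len()
and indexing count code units, not characters, so the two build types
disagree about the same string (counting through UTF-32 agrees on both):

s = u"a\U0010FFFFb"
# narrow build: len(s) == 4, and s[1], s[2] are lone surrogate halves
# wide build:   len(s) == 3, and s[1] is the whole character
n_chars = len(s.encode('utf-32-le')) // 4   # 3 on either build
assert n_chars == 3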

> The purpose is to represent bytes as they might exist in a file or
> data-stream according to the user's specification.

See, and this is precisely the statement that I challenge. Sure,
they "might" exist - but I'd rather expect that they don't.

If they exist, "Unicode" might come as variable-sized UTF-8, UTF-16,
or UTF-32. In any of these cases, NumPy should already support that by
mapping a string object onto the encoded bytes, to which you then
can apply .decode() should you need to process the actual Unicode
data.
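
In present-day NumPy spelling, that suggestion is roughly a one-liner
per access (np.char.decode is the modern element-wise form; the 2006
API differed):

import numpy as np

raw = np.array([b'caf\xc3\xa9', b'na\xc3\xafve'], dtype='S8')  # UTF-8 bytes
text = np.char.decode(raw, 'utf-8')    # decoded, element by element
assert text[0] == u'caf\xe9'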

> The purpose is 
> whatever the user wants them for.  It's the same purpose as having an
>  unsigned 64-bit data-type --- because users may need it to represent
>  data as it exists in a file.

No. I would expect you to have 64-bit longs because users *do* need
them, and because there wouldn't be an easy work-around if users didn't
have them. For Unicode, it's different: users don't directly need them
(at least, not many users do), and if they do, there is an easy
work-around for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are
24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles).
Can I represent them all in NumPy? Can I have NumPy transparently
map a sequence of run list records (which are variable-sized)
as an array of run list records?
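
For reference, the odd-width integers Martin mentions are easy in plain
present-day Python, which is roughly the work-around being argued for.
A sketch, using the documented NTFS run-list layout (a header byte whose
low nibble gives the byte width of the run-length field and whose high
nibble gives the byte width of the run-offset field):

def parse_run_list(data):
    # Each entry: header byte, then a little-endian unsigned run length
    # and a little-endian signed run offset (relative, in the real
    # format, to the previous run's offset).
    runs, pos = [], 0
    while pos < len(data) and data[pos] != 0:    # 0x00 ends the list
        len_size = data[pos] & 0x0F              # low nibble: length width
        off_size = data[pos] >> 4                # high nibble: offset width
        pos += 1
        run_len = int.from_bytes(data[pos:pos + len_size], 'little')
        pos += len_size
        run_off = int.from_bytes(data[pos:pos + off_size], 'little', signed=True)
        pos += off_size
        runs.append((run_len, run_off))
    return runs

# header 0x31: 1-byte length, 3-byte (24-bit) offset
assert parse_run_list(bytes([0x31, 0x38, 0x00, 0x10, 0x00])) == [(0x38, 0x1000)]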

Regards,
Martin



Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Travis E. Oliphant
Martin v. Löwis wrote:
> Travis E. Oliphant wrote:
> 
>>Currently that means that they are "unicode" strings of basic size UCS2 
>>or UCS4 depending on the platform.  It is this duality that has some 
>>people concerned.  For all other data-types, NumPy allows the user to 
>>explicitly request a bit-width for the data-type.
> 
> 
> Why is that a desirable property? Also: why does NumPy have support
> for Unicode arrays in the first place?
> 


Numpy supports arrays of arbitrary fixed-length "records".  It is much 
more than numeric-only data now.  One of the fields that a record can 
contain is a string.  If strings are supported, it makes sense to 
support unicode strings as well.

This allows NumPy to memory-map arbitrary data-files on disk.

Perhaps you should explain why you think NumPy "shouldn't support Unicode"

> 
> My initial reaction is: use whatever Python uses in "NumPy Unicode".
> Upon closer inspection, it is not all that clear what operations
> are supported on a Unicode array, and how these operations relate
> to the Python Unicode type.

That is currently what is done.  The current unicode data-type is 
exactly what Python uses.

The chararray subclass gives to unicode and string arrays all the 
methods of unicode and strings (operating on an element-by-element basis).

When you extract an element from the unicode data-type you get a Python 
unicode object (every NumPy data-type has a corresponding "type-object" 
that determines what is returned when an element is extracted).  All of 
these types are in a hierarchy of data-types which inherit from the 
basic Python types when available.
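
In present-day NumPy spelling, the hierarchy Travis describes is visible
directly (np.str_ is the scalar type today; the exact names differed in
2006):

import numpy as np

a = np.array([u'spam', u'eggs'])     # fixed-width unicode dtype, '<U4'
element = a[0]
assert isinstance(element, str)      # the scalar subclasses the builtin
assert element.upper() == u'SPAM'    # so all string methods are available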

> 
> In any case, I think NumPy should have only a single "Unicode array"
> type (please do explain why having zero of them is insufficient).
> 

Please explain why having zero of them is *sufficient*.

> If the purpose of the type is to interoperate with a Python
> unicode object, it should use the same width (as this will
> allow for memcpy).
> 
> If the purpose is to support arbitrary Unicode characters, it should
> use 4 bytes (as two bytes are insufficient to represent arbitrary
> Unicode characters).

And Python does not support arbitrary Unicode characters on narrow 
builds?  Then how is \U0010FFFF represented?

> 
> If the purpose is something else, please explain what the purpose
> is.

The purpose is to represent bytes as they might exist in a file or 
data-stream according to the user's specification.  The purpose is 
whatever the user wants them for.  It's the same purpose as having an 
unsigned 64-bit data-type --- because users may need it to represent 
data as it exists in a file.





Re: [Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Martin v. Löwis
Travis E. Oliphant wrote:
> Currently that means that they are "unicode" strings of basic size UCS2 
> or UCS4 depending on the platform.  It is this duality that has some 
> people concerned.  For all other data-types, NumPy allows the user to 
> explicitly request a bit-width for the data-type.

Why is that a desirable property? Also: why does NumPy have support
for Unicode arrays in the first place?

> Before embarking on this journey, however, we are seeking advice from 
> individuals wiser to the way of Unicode on this list.

My initial reaction is: use whatever Python uses in "NumPy Unicode".
Upon closer inspection, it is not all that clear what operations
are supported on a Unicode array, and how these operations relate
to the Python Unicode type.

In any case, I think NumPy should have only a single "Unicode array"
type (please do explain why having zero of them is insufficient).

If the purpose of the type is to interoperate with a Python
unicode object, it should use the same width (as this will
allow for memcpy).

If the purpose is to support arbitrary Unicode characters, it should
use 4 bytes (as two bytes are insufficient to represent arbitrary
Unicode characters).

If the purpose is something else, please explain what the purpose
is.

Regards,
Martin


[Python-Dev] Help with Unicode arrays in NumPy

2006-02-07 Thread Travis E. Oliphant

This is a design question which is why I'm posting here.  Recently the 
NumPy developers have become more aware of the difference between UCS2 
and UCS4 builds of Python.  NumPy arrays can be of Unicode type.  In 
other words, a NumPy array can be made up of fixed-data-length unicode 
strings.

Currently that means that they are "unicode" strings of basic size UCS2 
or UCS4 depending on the platform.  It is this duality that has some 
people concerned.  For all other data-types, NumPy allows the user to 
explicitly request a bit-width for the data-type.
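
In today's NumPy, which (as the resolution at the top of this thread
reports) settled on UCS4 internally, the fixed data length is visible in
the itemsize:

import numpy as np

a = np.array([u'ab', u'wxyz'], dtype='U4')   # up to 4 code points each
assert a.dtype.itemsize == 16                # 4 characters * 4 bytes (UCS4)
assert a[0] == u'ab'                         # shorter values stored padded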

So, we are thinking of introducing another data-type to NumPy to 
differentiate between UCS2 and UCS4 unicode strings.  (This also means 
a unicode scalar object, i.e. a string type, for each of these, exactly 
one of which will inherit from the Python unicode type.)

Before embarking on this journey, however, we are seeking advice from 
individuals wiser in the ways of Unicode on this list.

Perhaps all we need to do is be more careful on input and output of 
Unicode data-types so that transfer of unicode can be handled correctly 
on each platform.

Any thoughts?

-Travis Oliphant


