Re: [Python-Dev] Help with Unicode arrays in NumPy
Thank you, Martin and Stephen, for the suggestions and comments.

For your information: we decided that all NumPy arrays of unicode strings will use UCS4 for the internal representation. When an element of the array is selected, a unicode scalar (which inherits directly from the unicode builtin type but has the attributes and methods of arrays) will be returned. On wide builds, the scalar is a perfect match. On narrow builds, surrogate pairs will be used where necessary as the data is copied over to the scalar.

Best regards,

-Travis

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
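The narrow-build copy step described above can be sketched in plain Python. This is an illustrative computation of a UTF-16 surrogate pair from a UCS4 code point, not NumPy's actual code:

```python
# Sketch (assumption, not NumPy source): splitting a supplementary
# code point (> 0xFFFF) into a UTF-16 surrogate pair, as would happen
# when UCS4 array data is copied to a scalar on a narrow build.
def to_surrogate_pair(cp):
    assert cp > 0xFFFF, "BMP characters need no surrogate pair"
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)   # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)  # low (trail) surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

For U+10FFFF (the largest code point) this yields exactly the pair U+DBFF, U+DFFF that a narrow build stores.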
[Python-Dev] Help with Unicode arrays in NumPy
This is a design question, which is why I'm posting here.

Recently the NumPy developers have become more aware of the difference between UCS2 and UCS4 builds of Python. NumPy arrays can be of Unicode type; in other words, a NumPy array can be made up of fixed-data-length unicode strings. Currently that means they are unicode strings of basic size UCS2 or UCS4, depending on the platform. It is this duality that has some people concerned. For all other data-types, NumPy allows the user to explicitly request a bit-width for the data-type.

So, we are thinking of introducing another data-type to NumPy to differentiate between UCS2 and UCS4 unicode strings. (This also means a unicode scalar object, i.e. a string type, for each of these, exactly one of which will inherit from the Python unicode type.)

Before embarking on this journey, however, we are seeking advice from individuals wiser in the ways of Unicode on this list. Perhaps all we need to do is be more careful on input and output of Unicode data-types so that transfer of unicode can be handled correctly on each platform.

Any thoughts?

-Travis Oliphant
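The factor-of-two storage difference under discussion is easy to see with the stdlib alone, using the little-endian UTF-16/UTF-32 encodings as stand-ins for UCS2/UCS4 storage (this is an illustration, not NumPy code):

```python
# Stand-in illustration of UCS2-style vs UCS4-style storage widths
# for the same three-character string.
s = "abc"
ucs2_bytes = len(s.encode("utf-16-le"))  # 2 bytes per BMP character
ucs4_bytes = len(s.encode("utf-32-le"))  # 4 bytes per character
print(ucs2_bytes, ucs4_bytes)  # 6 12
```

A fixed-data-length array element of n characters therefore occupies 2n bytes on one build and 4n on the other, which is the duality the proposal wants to make explicit.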
Re: [Python-Dev] Help with Unicode arrays in NumPy
Travis E. Oliphant wrote:
> Currently that means that they are unicode strings of basic size UCS2
> or UCS4 depending on the platform. It is this duality that has some
> people concerned. For all other data-types, NumPy allows the user to
> explicitly request a bit-width for the data-type.

Why is that a desirable property? Also: why does NumPy have support for Unicode arrays in the first place?

> Before embarking on this journey, however, we are seeking advice from
> individuals wiser in the ways of Unicode on this list.

My initial reaction is: use whatever Python uses for NumPy's Unicode type. Upon closer inspection, it is not at all clear what operations are supported on a Unicode array, and how these operations relate to the Python Unicode type.

In any case, I think NumPy should have only a single Unicode array type (please do explain why having zero of them is insufficient). If the purpose of the type is to interoperate with a Python unicode object, it should use the same width (as this will allow for memcpy). If the purpose is to support arbitrary Unicode characters, it should use 4 bytes (as two bytes are insufficient to represent arbitrary Unicode characters). If the purpose is something else, please explain what the purpose is.

Regards,
Martin
Re: [Python-Dev] Help with Unicode arrays in NumPy
Travis E. Oliphant wrote:
> Numpy supports arrays of arbitrary fixed-length records. It is much
> more than numeric-only data now. One of the fields that a record can
> contain is a string. If strings are supported, it makes sense to
> support unicode strings as well.

Hmm. How do you support strings in fixed-length records? Strings are variable-sized, after all.

One common application is that you have a C struct in some API which has a fixed-size array for string data (either with a length field, or null-terminated); in this case, it is moderately useful to model such a struct in Python. However, transferring this to Unicode is pointless - there aren't any similar Unicode structs that need support.

> This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the C struct case. Then why do you need Unicode support there? Which common file format has embedded fixed-size Unicode data?

> Perhaps you should explain why you think NumPy shouldn't support
> Unicode

I think I said Unicode arrays, not Unicode. Unicode arrays are a pointless data type, IMO. Unicode always comes in strings (i.e. variable-sized, either null-terminated or with an introducing length). On disk or on the wire, Unicode comes as UTF-8 more often than not. Using UCS-2/UCS-4 as an on-disk representation is also questionable practice (although admittedly Microsoft uses it a lot).

> That is currently what is done. The current unicode data-type is
> exactly what Python uses.

Then I wonder how this goes along with the use case "memory-map arbitrary files".

> The chararray subclass gives to unicode and string arrays all the
> methods of unicode and strings (operating on an element-by-element
> basis).

For strings, I can see use cases (although I wonder how you deal with data formats that also support variable-sized strings, as most data formats supporting strings do).

> Please explain why having zero of them is *sufficient*.
Because I (still) cannot imagine any specific application that might need such a feature (IOW: YAGNI).

> > If the purpose is to support arbitrary Unicode characters, it should
> > use 4 bytes (as two bytes are insufficient to represent arbitrary
> > Unicode characters).
>
> And Python does not support arbitrary Unicode characters on narrow
> builds? Then how is \U0010FFFF represented?

It's represented using UTF-16. Try this for yourself:

py> len(u"\U0010FFFF")
2
py> u"\U0010FFFF"[0]
u'\udbff'
py> u"\U0010FFFF"[1]
u'\udfff'

This has all kinds of non-obvious implications.

> The purpose is to represent bytes as they might exist in a file or
> data-stream according to the user's specification.

See, and this is precisely the statement that I challenge. Sure, they might exist - but I'd rather expect that they don't. If they do exist, the Unicode might come as variable-sized UTF-8, UTF-16, or UTF-32. In either case, NumPy should already support that by mapping a string object onto the encoded bytes, to which you can then apply .decode() should you need to process the actual Unicode data.

> The purpose is whatever the user wants them for. It's the same purpose
> as having an unsigned 64-bit data-type --- because users may need it to
> represent data as it exists in a file.

No. I would expect you have 64-bit longs because users *do* need them, and because there wouldn't be an easy work-around if users didn't have them. For Unicode, it's different: users don't directly need them (at least not many users), and if they do, there is an easy work-around for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are 24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles). Can I represent them all in NumPy? Can I have NumPy transparently map a sequence of run list records (which are variable-sized) as an array of run list records?
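The work-around suggested here - keep the encoded bytes in a fixed-size string field and decode only when the text is needed - might look like the following sketch (the record layout and field names are purely illustrative, and it assumes NumPy's structured dtypes):

```python
import numpy as np

# Hypothetical fixed-length record: an 8-byte UTF-8 text field plus an
# integer, as might be memory-mapped from a file.
rec = np.dtype([("name", "S8"), ("value", "<i4")])
a = np.zeros(2, dtype=rec)
a[0] = ("caf\u00e9".encode("utf-8"), 42)

# The array stores raw bytes; decode only at the point where the bytes
# must be processed as text.
print(a[0]["name"].decode("utf-8"), int(a[0]["value"]))  # café 42
```

No Unicode array type is involved: the file's own encoding travels through untouched, and .decode() produces a Python unicode object on demand.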
Regards,
Martin
Re: [Python-Dev] Help with Unicode arrays in NumPy
>>>>> "Travis" == Travis E Oliphant [EMAIL PROTECTED] writes:

 Travis> Numpy supports arrays of arbitrary fixed-length records. It is
 Travis> much more than numeric-only data now. One of the fields that a
 Travis> record can contain is a string. If strings are supported, it
 Travis> makes sense to support unicode strings as well.

That is not obvious. A string is really an array of bytes, which for historical reasons in some places (primarily the U.S. of A.) can be used to represent text. Unicode, on the other hand, is intended to represent text streams robustly, and does so in a universal but flexible way ... but all of the different Unicode transformation formats are considered to represent the *identical* text stream. Some applications may specify a transformation format, others will not. In any case, internally Python is only going to support *one*; all the others must be read in through codecs anyway. See below.

 Travis> This allows NumPy to memory-map arbitrary data-files on disk.

In the case where a transformation format *is* specified, I don't see why you can't use a byte array field (i.e., an ordinary string) of appropriate size for this purpose, and read it through a codec when it needs to be treated as text. This is going to be necessary in essentially all of the cases I encounter, because the files are UTF-8 and sane internal representations are either UTF-16 or UTF-32. In particular, Python's internal representation is 16 or 32 bits wide.

 Travis> Perhaps you should explain why you think NumPy shouldn't
 Travis> support Unicode

Because it can't, not in the way you would like it to, if I understand you correctly. Python chooses *one* of the many standard representations for internal use, and because of the way the standard is specified, it doesn't matter which one! And none of the others can be represented directly; all must be decoded for internal use and encoded when written back to external media.
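The claim that all transformation formats carry the identical text stream is easy to check with the stdlib codecs alone:

```python
# Every standard Unicode transformation format represents the
# *identical* text stream, so decoding exactly reverses encoding,
# whatever format the external medium happened to use.
text = "caf\u00e9 \U0001D11E"  # a BMP and a non-BMP character
for enc in ("utf-8", "utf-16", "utf-32"):
    assert text.encode(enc).decode(enc) == text
print("all formats round-trip")
```

This is why the choice of internal representation is an implementation detail: conversion at the I/O boundary is cheap and lossless.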
So any memory mapping application is inherently nonportable, even across Python implementations.

 Travis> And Python does not support arbitrary Unicode characters on
 Travis> narrow builds? Then how is \U0010FFFF represented?

In a way incompatible with the concept of a character array. Now what do you do?

The point is that Unicode is intentionally designed in such a way that a plethora of representations is possible, but all are easily and reliably interconverted. Implementations are then free to choose an appropriate internal representation, knowing that conversion from external representations is cheap and standardized.

--
School of Systems and Information Engineering   http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba, Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
Ask not how you can do free software business;
ask what your business can do for free software.