Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 18:18 GMT+02:00 Chris Barker :

> On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted  wrote:
>
>> I remember advocating for UCS-4 adoption in the HDF5 library many years
>> ago (2007?), but I had no success and UTF-8 was decided to be the best
>> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
>> don't think there is any going back
>>
>
> This is the key point -- we can argue all we want about the best encoding
> for fixed-length unicode-supporting strings (I think numpy and HDF have
> very similar requirements), but that is not our decision to make -- many
> other systems have chosen utf-8, so it's a really good idea for numpy to be
> able to deal with that cleanly and easily and consistently.
>

​Agreed.  But it would also be a good idea to spread the word that simple
UCS-4 encoding in combination with compression can be a perfectly good
system for storing large amounts of unicode data too.


>
> I have made many anti utf-8 points in this thread because while we need to
> deal with utf-8 for interplay with other systems, I am very sure that it is
> not the best format for a default, naive-user-of-numpy unicode-supporting
> dtype. Nor is it the best encoding for a mostly-ascii compact in memory
> format.
>

​I resonate a lot with this feeling too :)​


>
> So I think numpy needs to support at least:
>
> utf-8
> latin-1
> UCS-4
>
> And it maybe should support one-byte encoding suitable for non-european
> languages, and maybe utf-16 for Java and Windows compatibility, and 
>
> So that seems to point to "support as many encodings as possible" And
> python has the machinery to do so -- so why not?
>
> (I'm taking Julian's word for it that having a parameterized dtype would
> not have a major impact on current code)
>
> If we go with a parameterized by encoding string dtype, then we can pick
> sensible defaults, and let users use what they know best fits their
> use-cases.
>
> As for python2 -- it is on the way out, I think we should keep the 'U' and
> 'S' dtypes as they are for backward compatibility and move forward with the
> new one(s) in a way that is optimized for py3. And it would map to a py2
> Unicode type.
>
> The only catch I see in that is what to do with bytes -- we should have a
> numpy dtype that matches the bytes model -- fixed length bytes that map to
> python bytes objects. (this is almost what the void type is, yes?) but then
> under py2, would a bytes object (py2 string) map to numpy 'S' or numpy
> bytes objects??
>
> @Francesc: -- one more question for you:
>
> How important is it for pytables to match the numpy storage to the hdf
> storage byte for byte? i.e. would it be a killer if encoding / decoding
> happened every time at the boundary? I'm guessing yes, as this would have
> been solved long ago if not.
>

​The PyTables team decided some time ago that it was a waste of time and
resources to maintain the internal HDF5 interface, and that it would be
better to switch to h5py for the low-level I/O communication with HDF5 (btw, we
just received a small NumFOCUS grant to continue the ongoing work on
this; thanks guys!).  This means that PyTables will be basically agnostic
about this sort of encoding issue, and that the important package to take
into account for interfacing NumPy and HDF5 is just h5py.

-- 
Francesc Alted


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Chris Barker
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted  wrote:

> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is any going back
>

This is the key point -- we can argue all we want about the best encoding
for fixed-length unicode-supporting strings (I think numpy and HDF have
very similar requirements), but that is not our decision to make -- many
other systems have chosen utf-8, so it's a really good idea for numpy to be
able to deal with that cleanly and easily and consistently.

I have made many anti utf-8 points in this thread because while we need to
deal with utf-8 for interplay with other systems, I am very sure that it is
not the best format for a default, naive-user-of-numpy unicode-supporting
dtype. Nor is it the best encoding for a mostly-ascii compact in memory
format.

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support one-byte encodings suitable for non-European
languages, and maybe utf-16 for Java and Windows compatibility, and ...

So that seems to point to "support as many encodings as possible". And
python has the machinery to do so -- so why not?
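
To make "the machinery" concrete, here is a rough sketch (purely illustrative,
not any proposed numpy API; to_cell/from_cell are made-up helper names) of how
a stdlib codec could back a fixed-width, NULL-padded cell in an arbitrary
encoding:

import codecs

def to_cell(s, encoding, width):
    # encode with any stdlib codec, then NULL-pad to the fixed cell width
    raw = codecs.encode(s, encoding)
    if len(raw) > width:
        raise ValueError("encoded string needs %d bytes, cell holds %d"
                         % (len(raw), width))
    return raw.ljust(width, b"\x00")

def from_cell(buf, encoding):
    # decode the whole cell, then strip the padding as *characters*,
    # which stays safe for multi-byte code units like UTF-16
    return codecs.decode(buf, encoding).rstrip("\x00")

for enc in ("latin-1", "utf-8", "utf-16-le", "cp1252"):
    assert from_cell(to_cell("café", enc, 16), enc) == "café"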

(I'm taking Julian's word for it that having a parameterized dtype would
not have a major impact on current code)

If we go with a parameterized by encoding string dtype, then we can pick
sensible defaults, and let users use what they know best fits their
use-cases.

As for python2 -- it is on the way out, I think we should keep the 'U' and
'S' dtypes as they are for backward compatibility and move forward with the
new one(s) in a way that is optimized for py3. And it would map to a py2
Unicode type.

The only catch I see in that is what to do with bytes -- we should have a
numpy dtype that matches the bytes model -- fixed length bytes that map to
python bytes objects. (this is almost what the void type is, yes?) but then
under py2, would a bytes object (py2 string) map to numpy 'S' or numpy
bytes objects??
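
For reference, a quick sketch of how the existing pieces behave on py3 today
(just illustrating the question, nothing new proposed):

import numpy as np

s = np.array([b"ab", b"cdef"], dtype="S4")
s[0]          # b'ab' -- indexing an 'S' array strips the trailing NUL padding
s.tobytes()   # b'ab\x00\x00cdef' -- the buffer itself is fixed-width, NUL-padded

v = s.view("V4")   # same memory reinterpreted as opaque fixed-width bytes (void)
v[0]               # a numpy.void scalar that keeps all four bytes, padding included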

@Francesc: -- one more question for you:

How important is it for pytables to match the numpy storage to the hdf
storage byte for byte? i.e. would it be a killer if encoding / decoding
happened every time at the boundary? I'm guessing yes, as this would have
been solved long ago if not.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 13:27 GMT+02:00 Neal Becker :

> So while compression+ucs-4 might be OK for out-of-core representation,
> what about in-core?  blosc+ucs-4?  I don't think that works for mmap, does
> it?
>

​Correct, the real problem is mmap for an out-of-core, HDF5 representation,
I presume.

For in-memory, there are several compressed data containers, like:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional data
containers)
https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others).



>
> On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted  wrote:
>
>> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer :
>>
>>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>>>
 It's worthwhile enough that both major HDF5 bindings don't support
 Unicode arrays, despite user requests for years. The sticking point seems
 to be the difference between HDF5's view of a Unicode string array (defined
 in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
 string array (because of UCS-4, defined by the number of
 characters/codepoints/whatever). So there are HDF5 files out there
 that none of our HDF5 bindings can read, and it is impossible to write
 certain data efficiently.


 I would really like to hear more from the authors of these libraries
 about what exactly it is they feel they're missing. Is it that they want
 numpy to enforce the length limit early, to catch errors when the array is
 modified instead of when they go to write it to the file? Is it that they
 really want an O(1) way to look at a array and know the maximum number of
 bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
 is really annoying and files that need it are rare so they haven't had the
 motivation to implement it? My impression is similar to Julian's: you
 *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
 dozen lines of code, which is nothing compared to all the other hoops these
 libraries are already jumping through, so if this is really the roadblock
 then I must be missing something.

>>>
>>> I actually agree with you. I think it's mostly a matter of convenience
>>> that h5py matched up HDF5 dtypes with numpy dtypes:
>>> fixed width ASCII -> np.string_/bytes
>>> variable length ASCII -> object arrays of np.string_/bytes
>>> variable length UTF-8 -> object arrays of unicode
>>>
>>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>>> there's not an easy fix.
>>>
>>> We absolutely could fix h5py by mapping everything to object arrays of
>>> Python unicode strings, as has been discussed (
>>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
>>> would be a fine but non-ideal solution, since there is currently no fixed
>>> width UTF-8 support.
>>>
>>> For fixed width ASCII arrays, this would mean increased convenience for
>>> Python 3 users, at the price of decreased convenience for Python 2 users
>>> (arrays now contain boxed Python objects), unless we made the h5py behavior
>>> dependent on the version of Python. Hence, we're back here, waiting for
>>> better dtypes for encoded strings.
>>>
>>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>>> handling ASCII arrays as strings) and UTF-8 with length equal to the number
>>> of bytes.
>>>
>>
>> Well, I'll say upfront that I have not read this discussion in full,
>> but apparently some opinions from developers of HDF5 Python packages would
>> be welcome here, so here I go :)
>>
>> As a long-time developer of one of the Python HDF5 packages (PyTables), I
>> have always been of the opinion that plain ASCII (for byte strings) and
>> UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing
>> large amounts of data, especially for disk storage (but also using
>> compressed in-memory containers).  My rationale is that, although UCS-4 may
>> require way too much space, compression would reduce that to basically the
>> space that is required by compressed UTF-8 (I won't go into detail, but
>> basically this is possible by using the shuffle filter).
>>
>> I remember advocating for UCS-4 adoption in the HDF5 library many years
>> ago (2007?), but I had no success and UTF-8 was decided to be the best
>> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
>> don't think there is any going back (not even adding UCS-4 support to it,
>> although I continue to think it would be a good idea).  So, I suppose that
>> if HDF5 is found to be an important format for NumPy users (and I think
>> this is the case), a solution for representing Unicode characters by using
>> UTF-8 in NumPy would be desirable (at the risk of making the implementation
>> more complex).
>>
>> ​Francesc
>> ​
>>
>>>
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@python.org
>>>

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Neal Becker
So while compression+ucs-4 might be OK for out-of-core representation, what
about in-core?  blosc+ucs-4?  I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted  wrote:

> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer :
>
>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>>
>>> It's worthwhile enough that both major HDF5 bindings don't support
>>> Unicode arrays, despite user requests for years. The sticking point seems
>>> to be the difference between HDF5's view of a Unicode string array (defined
>>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>>> string array (because of UCS-4, defined by the number of
>>> characters/codepoints/whatever). So there are HDF5 files out there that
>>> none of our HDF5 bindings can read, and it is impossible to write certain
>>> data efficiently.
>>>
>>>
>>> I would really like to hear more from the authors of these libraries
>>> about what exactly it is they feel they're missing. Is it that they want
>>> numpy to enforce the length limit early, to catch errors when the array is
>>> modified instead of when they go to write it to the file? Is it that they
>>> really want an O(1) way to look at a array and know the maximum number of
>>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
>>> is really annoying and files that need it are rare so they haven't had the
>>> motivation to implement it? My impression is similar to Julian's: you
>>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
>>> dozen lines of code, which is nothing compared to all the other hoops these
>>> libraries are already jumping through, so if this is really the roadblock
>>> then I must be missing something.
>>>
>>
>> I actually agree with you. I think it's mostly a matter of convenience
>> that h5py matched up HDF5 dtypes with numpy dtypes:
>> fixed width ASCII -> np.string_/bytes
>> variable length ASCII -> object arrays of np.string_/bytes
>> variable length UTF-8 -> object arrays of unicode
>>
>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>> there's not an easy fix.
>>
>> We absolutely could fix h5py by mapping everything to object arrays of
>> Python unicode strings, as has been discussed (
>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
>> would be a fine but non-ideal solution, since there is currently no fixed
>> width UTF-8 support.
>>
>> For fixed width ASCII arrays, this would mean increased convenience for
>> Python 3 users, at the price of decreased convenience for Python 2 users
>> (arrays now contain boxed Python objects), unless we made the h5py behavior
>> dependent on the version of Python. Hence, we're back here, waiting for
>> better dtypes for encoded strings.
>>
>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>> handling ASCII arrays as strings) and UTF-8 with length equal to the number
>> of bytes.
>>
>
> Well, I'll say upfront that I have not read this discussion in full,
> but apparently some opinions from developers of HDF5 Python packages would
> be welcome here, so here I go :)
>
> As a long-time developer of one of the Python HDF5 packages (PyTables), I
> have always been of the opinion that plain ASCII (for byte strings) and
> UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing
> large amounts of data, especially for disk storage (but also using
> compressed in-memory containers).  My rationale is that, although UCS-4 may
> require way too much space, compression would reduce that to basically the
> space that is required by compressed UTF-8 (I won't go into detail, but
> basically this is possible by using the shuffle filter).
>
> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is any going back (not even adding UCS-4 support to it,
> although I continue to think it would be a good idea).  So, I suppose that
> if HDF5 is found to be an important format for NumPy users (and I think
> this is the case), a solution for representing Unicode characters by using
> UTF-8 in NumPy would be desirable (at the risk of making the implementation
> more complex).
>
> ​Francesc
> ​
>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>>
>
>
> --
> Francesc Alted
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 3:34 GMT+02:00 Stephan Hoyer :

> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array (defined
>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>> string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
>>
>>
>> I would really like to hear more from the authors of these libraries
>> about what exactly it is they feel they're missing. Is it that they want
>> numpy to enforce the length limit early, to catch errors when the array is
>> modified instead of when they go to write it to the file? Is it that they
>> really want an O(1) way to look at a array and know the maximum number of
>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
>> is really annoying and files that need it are rare so they haven't had the
>> motivation to implement it? My impression is similar to Julian's: you
>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
>> dozen lines of code, which is nothing compared to all the other hoops these
>> libraries are already jumping through, so if this is really the roadblock
>> then I must be missing something.
>>
>
> I actually agree with you. I think it's mostly a matter of convenience
> that h5py matched up HDF5 dtypes with numpy dtypes:
> fixed width ASCII -> np.string_/bytes
> variable length ASCII -> object arrays of np.string_/bytes
> variable length UTF-8 -> object arrays of unicode
>
> This was tenable in a Python 2 world, but on Python 3 it's broken and
> there's not an easy fix.
>
> We absolutely could fix h5py by mapping everything to object arrays of
> Python unicode strings, as has been discussed (
> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
> be a fine but non-ideal solution, since there is currently no fixed width
> UTF-8 support.
>
> For fixed width ASCII arrays, this would mean increased convenience for
> Python 3 users, at the price of decreased convenience for Python 2 users
> (arrays now contain boxed Python objects), unless we made the h5py behavior
> dependent on the version of Python. Hence, we're back here, waiting for
> better dtypes for encoded strings.
>
> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
> handling ASCII arrays as strings) and UTF-8 with length equal to the number
> of bytes.
>

Well, I'll say upfront that I have not read this discussion in full,
but apparently some opinions from developers of HDF5 Python packages would
be welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I
have always been of the opinion that plain ASCII (for byte strings) and
UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing
large amounts of data, especially for disk storage (but also using
compressed in-memory containers).  My rationale is that, although UCS-4 may
require way too much space, compression would reduce that to basically the
space that is required by compressed UTF-8 (I won't go into detail, but
basically this is possible by using the shuffle filter).
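
A rough illustration of that, assuming the python-blosc package is installed
(the exact ratios depend on the text and the codec, so treat the numbers as
hand-waving rather than a benchmark):

import blosc
import numpy as np

words = ["NumPy", "cadena", "χορδή", "цепочка"] * 1000
u = np.array(words)                      # dtype '<U7': UCS-4, 4 bytes per character
utf8 = "\n".join(words).encode("utf-8")  # the same text as UTF-8

packed_ucs4 = blosc.compress(u.tobytes(), typesize=4, shuffle=blosc.SHUFFLE)
packed_utf8 = blosc.compress(utf8, typesize=1, shuffle=blosc.NOSHUFFLE)

len(u.tobytes()), len(packed_ucs4)   # big raw buffer, but the shuffle filter helps a lot
len(utf8), len(packed_utf8)          # already compact, compresses less dramatically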

I remember advocating for UCS-4 adoption in the HDF5 library many years ago
(2007?), but I had no success and UTF-8 was decided to be the best
candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
don't think there is any going back (not even adding UCS-4 support to it,
although I continue to think it would be a good idea).  So, I suppose that
if HDF5 is found to be an important format for NumPy users (and I think
this is the case), a solution for representing Unicode characters by using
UTF-8 in NumPy would be desirable (at the risk of making the implementation
more complex).

​Francesc
​

>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>


-- 
Francesc Alted


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:

> It's worthwhile enough that both major HDF5 bindings don't support Unicode
> arrays, despite user requests for years. The sticking point seems to be the
> difference between HDF5's view of a Unicode string array (defined in size
> by the bytes of UTF-8 data) and numpy's current view of a Unicode string
> array (because of UCS-4, defined by the number of
> characters/codepoints/whatever). So there are HDF5 files out there that
> none of our HDF5 bindings can read, and it is impossible to write certain
> data efficiently.
>
>
> I would really like to hear more from the authors of these libraries about
> what exactly it is they feel they're missing. Is it that they want numpy to
> enforce the length limit early, to catch errors when the array is modified
> instead of when they go to write it to the file? Is it that they really
> want an O(1) way to look at a array and know the maximum number of bytes
> needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is
> really annoying and files that need it are rare so they haven't had the
> motivation to implement it? My impression is similar to Julian's: you
> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
> dozen lines of code, which is nothing compared to all the other hoops these
> libraries are already jumping through, so if this is really the roadblock
> then I must be missing something.
>

I actually agree with you. I think it's mostly a matter of convenience that
h5py matched up HDF5 dtypes with numpy dtypes:
fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and
there's not an easy fix.

We absolutely could fix h5py by mapping everything to object arrays of
Python unicode strings, as has been discussed (
https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
be a fine but non-ideal solution, since there is currently no fixed width
UTF-8 support.

For fixed width ASCII arrays, this would mean increased convenience for
Python 3 users, at the price of decreased convenience for Python 2 users
(arrays now contain boxed Python objects), unless we made the h5py behavior
dependent on the version of Python. Hence, we're back here, waiting for
better dtypes for encoded strings.

So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
handling ASCII arrays as strings) and UTF-8 with length equal to the number
of bytes.
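
(For anyone who has not met surrogateescape: a minimal, plain-Python
illustration of why it is attractive for messy ASCII-ish data -- bytes that
are not valid ASCII still round-trip losslessly through str:)

raw = b"temp_\xb0C"   # a legacy header field; \xb0 is not valid ASCII
s = raw.decode("ascii", errors="surrogateescape")
repr(s)               # 'temp_\udcb0C' -- the bad byte became a lone surrogate
s.encode("ascii", errors="surrogateescape") == raw   # True: nothing was lost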


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern  wrote:

> The proposal is for only latin-1 and UTF-32 to be supported at first, and
> the eventual support of UTF-8 will be constrained by specification of the
> width in terms of characters rather than bytes, which conflicts with the
> use cases of UTF-8 that have been brought forth.
>
>   https://mail.python.org/pipermail/numpy-discussion/
> 2017-April/076668.html
>

thanks -- I had forgotten (clearly) it was that limited.

But my question now is -- if there is an encoding-parameterized string
dtype, then is it much more effort to have it support all the encodings in
the stdlib?

It seems that would solve everyone's issue.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker  wrote:

> But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:
>
> * Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.
>
> For THAT -- utf-8 is critical.
>
> But if I understand Julian's proposal -- he wants to create a
parameterized text dtype that you can set the encoding on, and then numpy
will use the encoding (and python's machinery) to encode / decode when
passing to/from python strings.
>
> It seems this would support all our desires:
>
> I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.
>
> Thomas would get latin-1 for binary interchange with mostly-ascii data
>
> The HDF-5 folks would get utf-8 for binary interchange (If we can workout
the null-padding issue)
>
> Even folks that had weird JAVA or Windows-generated UTF-16 data files
could do the binary interchange thing
>
> I'm now lost as to what the hang-up is.

The proposal is for only latin-1 and UTF-32 to be supported at first, and
the eventual support of UTF-8 will be constrained by specification of the
width in terms of characters rather than bytes, which conflicts with the
use cases of UTF-8 that have been brought forth.

  https://mail.python.org/pipermail/numpy-discussion/2017-April/076668.html

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>
> On Apr 26, 2017 12:09 PM, "Robert Kern"  wrote:

>> It's worthwhile enough that both major HDF5 bindings don't support
Unicode arrays, despite user requests for years. The sticking point seems
to be the difference between HDF5's view of a Unicode string array (defined
in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
string array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.
>
> I would really like to hear more from the authors of these libraries
about what exactly it is they feel they're missing. Is it that they want
numpy to enforce the length limit early, to catch errors when the array is
modified instead of when they go to write it to the file? Is it that they
really want an O(1) way to look at a array and know the maximum number of
bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
is really annoying and files that need it are rare so they haven't had the
motivation to implement it?

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/379

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer  wrote:

>
> Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
> myself have already given), but we seem to be talking past each other here.
>

yeah -- I think it's not clear what the use cases we are talking about are.


> I am still -1 on any new string encoding support unless that includes at
> least UTF-8, with length indicated by the number of bytes.
>

I've said multiple times that utf-8 support is key to any "exchange binary
data" use case (memory mapping?) -- so yes, absolutely.

I _think_ this may be some of the source for the confusion:

The name of this thread is: "proposal: smaller representation of string
arrays".

And I got the impression, maybe mistaken, that folks were suggesting that
internally encoding strings in numpy as "UTF-8, with length indicated by
the number of bytes." was THE solution to the

" the 'U' dtype takes up way too much memory, particularly  for
mostly-ascii data" problem.

I do not think it is a good solution to that problem.

I think a good solution to that problem is latin-1 encoding. (bear with me
here...)

But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:

* Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.

For THAT -- utf-8 is critical.

But if I understand Julian's proposal -- he wants to create a parameterized
text dtype that you can set the encoding on, and then numpy will use the
encoding (and python's machinery) to encode / decode when passing to/from
python strings.

It seems this would support all our desires:

I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.

Thomas would get latin-1 for binary interchange with mostly-ascii data

The HDF-5 folks would get utf-8 for binary interchange (if we can work out
the null-padding issue)

Even folks that had weird JAVA or Windows-generated UTF-16 data files could
do the binary interchange thing

I'm now lost as to what the hang-up is.

-CHB

PS: null padding is a pain, python strings seem to preserve the zeros, which
is odd -- is there a unicode code-point at \x00?

But you can use it to strip properly with the unicode sandwich:

In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00'

In [64]: ut16.decode('utf-16')
Out[64]: 'some text\x00\x00\x00'

In [65]: ut16.decode('utf-16').strip('\x00')
Out[65]: 'some text'

In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16')
Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 12:09 PM, "Robert Kern"  wrote:

On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
[...]
> I have read every mail and it has been a large waste of time, Everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:

  https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_
level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of
proprietary file formats that I've written code for, I can assure you that
I have memory mapped many string arrays in my time, usually embedded as
columns in structured arrays. It is not "nice to have"; it is "have done
many times and needs better support".


Since concrete examples are often helpful in focusing discussions, here's
some code for reading a lab-internal EEG file format:

https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py

See in particular _header_dtype with its embedded string fields, and the
code in _channel_names_from_header -- both of these really benefit from
having a quick and easy way to talk about fixed width strings of single
byte characters. (The history here of course is that the original tools for
reading/writing this format are written in C, and they just read in
sizeof(struct header) and cast to the header.)

_get_full_string in that file is also interesting: it's a nasty hack I
implemented because in some cases I actually needed *fixed width* strings,
not NUL padded ones, and didn't know a better way to do it. (Yes, there's
void, but I have no idea how those work. They're somehow related to buffer
objects, whatever those are?) In other cases though that file really does
want NUL padding.

Of course that file is python 2 and blissfully ignorant of unicode.
Thinking about what we'd want if porting to py3:

For the "pull out this fixed width chunk of the file" problem (what
_get_full_string does) then I definitely don't care about unicode; this
isn't text. np.void or an array of np.uint8 aren't actually too terrible I
suspect, but it'd be nice if there were a fixed-width dtype where indexing
gave back a native bytes or bytearray object, or something similar like
np.bytes_.

For the arrays of single-byte-encoded-NUL-padded text, then the fundamental
problem is just to convert between a chunk of bytes in that format and
something that numpy can handle. One way to do that would be with a dtype
that represented ascii-encoded-fixed-width-NUL-padded text, or any
ascii-compatible encoding. But honestly I'd be just as happy with
np.encode/np.decode ufuncs that converted between the existing S dtype and
any kind of text array; the existing U dtype would be fine given that.
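
Something very close to those already exists as the (non-ufunc, not
especially fast) helpers in np.char; a small sketch of the conversion being
described:

import numpy as np

raw = np.array([b"spam", b"caf\xe9"], dtype="S4")   # single-byte (latin-1) data
text = np.char.decode(raw, "latin-1")               # -> a '<U4' array of real str values
back = np.char.encode(text, "latin-1")              # -> an 'S4' array again
text                  # array(['spam', 'café'], dtype='<U4')
(back == raw).all()   # True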

The other thing that might be annoying in practice is that when writing
py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on
py3", but there's no dtype with similar behavior. Maybe there's no good
solution and this just needs a few version-dependent convenience functions
stuck in a private utility library, dunno.


> What you save by having utf8 in the numpy array is replacing a decoding
> and encoding step with a stripping null padding step.
> That doesn't seem very worthwhile compared to all their other overheads
> involved.

It's worthwhile enough that both major HDF5 bindings don't support Unicode
arrays, despite user requests for years. The sticking point seems to be the
difference between HDF5's view of a Unicode string array (defined in size
by the bytes of UTF-8 data) and numpy's current view of a Unicode string
array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.


I would really like to hear more from the authors of these libraries about
what exactly it is they feel they're missing. Is it that they want numpy to
enforce the length limit early, to catch errors when the array is modified
instead of when they go to write it to the file? Is it that they really
want an O(1) way to look at an array and know the maximum number of bytes
needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is
really annoying and files that need it are rare so they haven't had the
motivation to implement it? My impression is similar to Julian's: you
*could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
dozen lines of code, which is nothing compared to all the other hoops these
libraries are already jumping through, so if this is really the roadblock
then I must be missing something.

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker  wrote:

> When a numpy user wants to put a string into a numpy array, they should
> know how long a string they can fit -- with "length" defined how python
> strings define it.
>

Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
myself have already given), but we seem to be talking past each other here.

I am still -1 on any new string encoding support unless that includes at
least UTF-8, with length indicated by the number of bytes.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern  wrote:

> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases.
>

isn't UTF-32 pretty compressible also? lots of zeros in there

here's an example with pure ascii  Lorem Ipsum text:

In [17]: len(text)
Out[17]: 446


In [18]: len(utf8)
Out[18]: 446

# the same -- it's pure ascii

In [20]: len(utf32)
Out[20]: 1788

# four times as big -- of course.

In [22]: len(bz2.compress(utf8))
Out[22]: 302

# so from 446 to 302, not that great -- probably it would be better for
# longer text
# -- but are we compressing whole arrays or individual strings?

In [23]: len(bz2.compress(utf32))
Out[23]: 319

# almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters.

OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii
characters:

In [29]: len(text)
Out[29]: 672

In [30]: utf8 = text.encode("utf-8")

In [31]: len(utf8)
Out[31]: 1180

# not bad, really -- still smaller than utf-16 :-)

In [33]: len(bz2.compress(utf8))
Out[33]: 495

# pretty good then -- better than 50%

In [34]: utf32 = text.encode("utf-32")

In [35]: len(utf32)
Out[35]: 2692


In [36]: len(bz2.compress(utf32))
Out[36]: 515

# still not quite as good as utf-8, but close.

So: utf-8 compresses better than utf-32, but only by a little bit -- at
least with bz2.

But it is a lot smaller uncompressed.

>>> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
> >>
> >> It's not just HDF5. Counting bytes is the Right Way to measure the size
> of UTF-8 encoded text:
> >> http://utf8everywhere.org/#myths
>

It's really the only way with utf-8 -- which is why it is an impedance
mismatch with python strings.
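
The mismatch in two lines (plain Python, just to make it concrete):

s = "naïve café"
len(s)                    # 10 -- characters, which is what Python promises
len(s.encode("utf-8"))    # 12 -- bytes, which is what a UTF-8 field must hold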


>> I also firmly believe (though clearly this is not universally agreed
> upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
> applications.
>

fortunately, we don't need to agree to that to agree that:


> So if we're adding any new string encodings, it needs to be one of them.
>

Yup -- the most important one to add -- I don't think it is "The Right Way"
for all applications -- but it is "The Right Way" for text interchange.

And regardless of what any of us think -- it is widely used.

> (1) object arrays of strings. (We have these already; whether a
> strings-only specialization would permit useful things like string-oriented
> ufuncs is a question for someone who's willing to implement one.)
>

This is the right way to get variable length strings -- but I'm concerned
that it doesn't mesh well with numpy uses like npz files, raw dumping of
array data, etc. It should not be the only way to get proper Unicode
support, nor the default when you do:

array(["this", "that"])


> > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
> All python encodings should be permitted. An additional function to
> truncate encoded data without mangling the encoding would be handy.
>

I think that's necessary -- at least when you pass in a python string...


> I think it makes more sense for this to be NULL-padded than
> NULL-terminated but it may be necessary to support both; note that
> NULL-termination is complicated for encodings like UCS4.
>

Is it, if you know it's UCS4? Or if you know the size of the code unit (I
think that's the term)?


> This also includes the legacy UCS4 strings as a special case.
>

what's special about them? I think the only thing should be that they are
the default.
>

> > (3) a dtype for fixed-length byte strings. This doesn't look very
> different from an array of dtype u8, but given we have the bytes type,
> accessing the data this way makes sense.
>
> The void dtype is already there for this general purpose and mostly works,
> with a few niggles.
>

I'd never noticed that! And if I had I never would have guessed I could use
it that way.


> If it worked more transparently and perhaps rigorously with `bytes`, then
> it would be quite suitable.
>

Then we should fix a bit of those things -- and call it something like
"bytes", please.

-CHB

>
> --

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg  wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right.


I think it's really clear that you don't want to mess with the bytes in any
way without knowing the encoding -- for UTF-16, the code unit is two bytes,
so a "null" is two zero bytes in a row.

So generic "null padded" or "null terminated" is dangerous -- it would have
to be "Null-padded utf-8" or whatever.

> Though I
> think it might have been something like "make everything in
> hdf5/something similar work"


That would be nice :-), but I suspect HDF-5 is the same as everything else
-- there are files in the wild where someone jammed the wrong thing into a
text array ...

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith  wrote:

> UTF-8 does not match the character-oriented Python text model. Plenty
> of people argue that that isn't the "correct" model for Unicode text
> -- maybe so, but it is the model python 3 has chosen. I wrote a much
> longer rant about that earlier.
>
> So I think the easy to access, and particularly defaults, numpy string
> dtypes should match it.
>
>
> This seems a little vague?
>

sorry -- that's what I get for trying to be concise...


> The "character-oriented Python text model" is just that str supports O(1)
> indexing of characters.
>

not really -- I think the performance characteristics are an implementation
detail (though it did influence the design, I'm sure)

I'm referring to the fact that a python string appears (to the user -- also
under the hood, but again, implementation detail)  to be a sequence of
characters, not a sequence of bytes, not a sequence of glyphs, or
graphemes, or anything else. Every Python string has a length, and that
length is the number of characters, and if you index you get a string of
length-1, and it has one character it it, and that character matches to a
code point of a single value.
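
In miniature (just restating that model in code):

s = "num☃py"
len(s)                  # 6 -- six characters, however they might be encoded
s[3], len(s[3])         # ('☃', 1) -- indexing yields a length-1 str, one code point
len(s.encode("utf-8"))  # 8 -- the byte count is an encoding detail, not a property of s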

Someone could implement a python string using utf-8 under the hood, and
none of that would change (and I think micropython may have done that...)

Sure, you might get two characters when you really expect a single
grapheme, but it's at least a consistent oddity. (well, not always, as some
graphemes can be represented by either a single code point or two combined
-- human language really sucks!)

The UTF-8 Manifesto (http://utf8everywhere.org/) makes the very good point
that a character-oriented interface is not the only one that makes sense,
and may not make sense at all. However:

1) Python has chosen that interface

2) It is a good interface (probably the best for computer use) if you need
to choose only one

utf8everywhere is mostly arguing for utf-8 over utf16 -- and secondarily
for utf-8 everywhere as the best option for working at the C level. That's
probably true.

(I also think the utf-8 fans are in a bit of a fantasy world -- this would
all be easier, yes, if one encoding was used for everything, all the time,
but other than that, utf-8 is not a panacea -- we are still going to have
encoding headaches no matter how you slice it)

So where does numpy fit? well, it does operate at the C level, but people
work with it from python, so exposing the details of the encoding to the
user should be strictly opt-in.

When a numpy user wants to put a string into a numpy array, they should
know how long a string they can fit -- with "length" defined how python
strings define it.

Using utf-8 for the default string in numpy would be like using float16 for
default float--not a good idea!

I believe Julian said there would be no default -- you would need to
specify, but I think there does need to be one:

np.array(["a string", "another string"])

needs to do something.

if we make a parameterized dtype that accepts any encoding, then we could
do:

np.array(["a string", "another string"], dtype=no.stringtype["utf-8"])

If folks really want that.

I'm afraid that that would lead to errors -- "cool, utf-8 is just like
ascii, but with full Unicode support!"

> But... Numpy doesn't. If you want to access individual characters inside a
> string inside an array, you have to pull out the scalar first, at which
> point the data is copied and boxed into a Python object anyway, using
> whatever representation the interpreter prefers.
>


> So AFAICT​ it makes literally no difference to the user whether numpy's
> internal representation allows for fast character access.
>

agreed -- unless someone wants to do a view that makes an N-D array of
strings look like a 1-D array of characters... Which seems odd, but there
was recently a big debate on the netcdf CF conventions list about that very
issue...

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg 
wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right. (I think she got around it, by weird
> trickery, inverting the endianess or so and thus putting the zero bytes
> first).
> Maybe will ask her if this discussion is interesting to her. Though I
> think it might have been something like "make everything in
> hdf5/something similar work" without any actual use case, I don't know.

I don't think that will be an issue for an encoding-parameterized dtype.
The decoding machinery of that would have access to the full-width buffer
for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes
for UTF-16). It's only if you have to hack around at a higher level with
numpy's S arrays, which return Python byte strings that strip off the
trailing NULL bytes, that you have to worry about such things. Getting a
Python scalar from the numpy S array loses information in such cases.
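
A small illustration of that loss, with UTF-16 bytes jammed into today's S
dtype:

import numpy as np

a = np.array(["a".encode("utf-16-le")], dtype="S2")  # the stored item is b'a\x00'
a[0]           # b'a' -- the scalar silently dropped the trailing NUL byte
a.tobytes()    # b'a\x00' -- it is still present in the array's buffer
# a[0].decode("utf-16-le") now fails: the scalar lost a byte the encoding needed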

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
>
> On 26.04.2017 19:08, Robert Kern wrote:
> > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> > mailto:jtaylor.deb...@googlemail.com>>
> > wrote:
> >
> >> Indeed,
> >> Most of this discussion is irrelevant to numpy.
> >> Numpy only really deals with the in memory storage of strings. And in
> >> that it is limited to fixed length strings (in bytes/codepoints).
> >> How you get your messy strings into numpy arrays is not very relevant
to
> >> the discussion of a smaller representation of strings.
> >> You couldn't get messy strings into numpy without first sorting it out
> >> yourself before, you won't be able to afterwards.
> >> Numpy will offer a set of encodings, the user chooses which one is best
> >> for the use case and if the user screws it up, it is not numpy's
problem.
> >>
> >> You currently only have a few ways to even construct string arrays:
> >> - array construction and loops
> >> - genfromtxt (which is again just a loop)
> >> - memory mapping which I seriously doubt anyone actually does for the S
> >> and U dtype
> >
> > I fear that you decided that the discussion was irrelevant and thus did
> > not read it rather than reading it to decide that it was not relevant.
> > Because several of us have showed that, yes indeed, we do memory-map
> > string arrays.
> >
> > You can add to this list C APIs, like that of libhdf5, that need to
> > communicate (Unicode) string arrays.
> >
> > Look, I know I can be tedious, but *please* go back and read this
> > discussion. We have concrete use cases outlined. We can give you more
> > details if you need them. We all feel the pain of the rushed, inadequate
> > implementation of the U dtype. But each of our pains is a little bit
> > different; you obviously aren't experiencing the same pains that I am.
>
> I have read every mail and it has been a large waste of time, Everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:


https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code of the full variety of
proprietary file formats that I've written code for, but I can assure you that
I have memory mapped many string arrays in my time, usually embedded as
columns in structured arrays. It is not "nice to have"; it is "have done
many times and needs better support".

> In any case it is still irrelevant. My proposal only _adds_ additional
> cases that can be mmapped. It does not prevent you from doing what you
> have been doing before.

You are the one who keeps worrying about the additional complexity, both in
code and mental capacity of our users, of adding new overlapping dtypes and
solutions, and you're not wrong about that. I think it behooves us to
consider if there are solutions that solve multiple related problems at
once instead of adding new dtypes piecemeal to solve individual problems.

> >> Having a new dtype changes nothing here. You still need to create numpy
> >> arrays from python strings which are well defined and clean.
> >> If you put something in that doesn't encode you get an encoding error.
> >> No oddities like surrogate escapes are needed, numpy arrays are not
> >> interfaces to operating systems nor does numpy need to _add_ support
for
> >> historical oddities beyond what it already has.
> >> If you want to represent bytes exactly as they came in don't use a text
> >> dtype (which includes the S dtype, use i1).
> >
> > Thomas Aldcroft has demonstrated the problem with this approach. numpy
> > arrays are often interfaces to files that have tons of historical
oddities.
>
> This does not matter for numpy, the text dtype is well defined as bytes
> with a specific encoding and null padding.

You cannot dismiss something as "not mattering for *numpy*" just because
your new, *proposed* text dtype doesn't support it.

You seem to have fixed on a course of action and are defining everyone
else's use cases as out-of-scope because your course of action doesn't
support them. That's backwards. Define the use cases first, determine the
requirements, then build a solution that meets those requirements. We
skipped those steps before, and that's why we're all feeling the pain.

> If you have an historical
> oddity that does not fit, do not use the text dtype but use a pure byte
> array instead.

That's his status quo, and he finds it unworkable. Now, I have proposed a
way out of that by supporting ASCII-surrogateescape as a specific encoding.
It's not an ISO standard encoding, b

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Sebastian Berg
On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote:
> On 26.04.2017 19:08, Robert Kern wrote:
> > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> > mailto:jtaylor.deb...@googlemail.co
> > m>>
> > wrote:
> > 
> > > Indeed,
> > > Most of this discussion is irrelevant to numpy.
> > > Numpy only really deals with the in memory storage of strings.
> > > And in
> > > that it is limited to fixed length strings (in bytes/codepoints).
> > > How you get your messy strings into numpy arrays is not very
> > > relevant to
> > > the discussion of a smaller representation of strings.
> > > You couldn't get messy strings into numpy without first sorting
> > > it out
> > > yourself before, you won't be able to afterwards.
> > > Numpy will offer a set of encodings, the user chooses which one
> > > is best
> > > for the use case and if the user screws it up, it is not numpy's
> > > problem.
> > > 
> > > You currently only have a few ways to even construct string
> > > arrays:
> > > - array construction and loops
> > > - genfromtxt (which is again just a loop)
> > > - memory mapping which I seriously doubt anyone actually does for
> > > the S
> > > and U dtype
> > 
> > I fear that you decided that the discussion was irrelevant and thus
> > did
> > not read it rather than reading it to decide that it was not
> > relevant.
> > Because several of us have showed that, yes indeed, we do memory-
> > map
> > string arrays.
> > 
> > You can add to this list C APIs, like that of libhdf5, that need to
> > communicate (Unicode) string arrays.
> > 
> > Look, I know I can be tedious, but *please* go back and read this
> > discussion. We have concrete use cases outlined. We can give you
> > more
> > details if you need them. We all feel the pain of the rushed,
> > inadequate
> > implementation of the U dtype. But each of our pains is a little
> > bit
> > different; you obviously aren't experiencing the same pains that I
> > am.
> 
> I have read every mail and it has been a large waste of time,
> Everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.
> In any case it is still irrelevant. My proposal only _adds_
> additional
> cases that can be mmapped. It does not prevent you from doing what
> you
> have been doing before.
> 
> > 
> > > Having a new dtype changes nothing here. You still need to create
> > > numpy
> > > arrays from python strings which are well defined and clean.
> > > If you put something in that doesn't encode you get an encoding
> > > error.
> > > No oddities like surrogate escapes are needed, numpy arrays are
> > > not
> > > interfaces to operating systems nor does numpy need to _add_
> > > support for
> > > historical oddities beyond what it already has.
> > > If you want to represent bytes exactly as they came in don't use
> > > a text
> > > dtype (which includes the S dtype, use i1).
> > 
> > Thomas Aldcroft has demonstrated the problem with this approach.
> > numpy
> > arrays are often interfaces to files that have tons of historical
> > oddities.
> 
> This does not matter for numpy, the text dtype is well defined as
> bytes
> with a specific encoding and null padding. If you have an historical
> oddity that does not fit, do not use the text dtype but use a pure
> byte
> array instead.
> 
> > 
> > > Concerning variable sized strings, this is simply not going to
> > > happen.
> > > Nobody is going to rewrite numpy to support it, especially not
> > > just for
> > > something as unimportant as strings.
> > > Best you are going to get (or better already have) is object
> > > arrays. It
> > > makes no sense to discuss it unless someone comes up with an
> > > actual
> > > proposal and the willingness to code it.
> > 
> > No one has suggested such a thing. At most, we've talked about
> > specializing object arrays.
> > 
> > > What is a relevant discussion is whether we really need a more
> > > compact
> > > but limited representation of text than 4-byte utf32 at all.
> > > Its usecase is for the most part just for python3 porting and
> > > saving
> > > some memory in some ascii heavy cases, e.g. astronomy.
> > > It is not that significant anymore as porting to python3 has
> > > mostly
> > > already happened via the ugly byte workaround and memory saving
> > > is
> > > probably not as significant in the context of numpy which is
> > > already
> > > heavy on memory usage.
> > > 
> > > My initial approach was to not add a new dtype but to make
> > > unicode
> > > parametrizable which would have meant almost no cluttering of
> > > numpys
> > > internals and keeping the api more or less consistent which would
> > > make
> > > this a relatively simple addition of minor functionality for
> > > people that
> > > want it.
> > > But adding a completely new partially redundant dtype for this
> > > usecase
> > > may be a too 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread josef . pktd
On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith  wrote:
> On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal"
>  wrote:
>
>
> UTF-8 does not match the character-oriented Python text model. Plenty
> of people argue that that isn't the "correct" model for Unicode text
> -- maybe so, but it is the model python 3 has chosen. I wrote a much
> longer rant about that earlier.
>
> So I think the easy to access, and particularly defaults, numpy string
> dtypes should match it.
>
>
> This seems a little vague? The "character-oriented Python text model" is
> just that str supports O(1) indexing of characters. But... Numpy doesn't. If
> you want to access individual characters inside a string inside an array,
> you have to pull out the scalar first, at which point the data is copied and
> boxed into a Python object anyway, using whatever representation the
> interpreter prefers. So AFAICT it makes literally no difference to the user
> whether numpy's internal representation allows for fast character access.

you can create a view on individual characters or bytes, AFAICS

>>> t = np.array(['abcdefg']*10)
>>> t2 = t.view([('s%d' % i, '<U1') for i in range(7)])
>>> t2['s5']
array(['f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f'],
      dtype='<U1')
>>> t.view('<U1')
> -n
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" <
chris.bar...@noaa.gov> wrote:


UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy to access, and particularly defaults, numpy string
dtypes should match it.


This seems a little vague? The "character-oriented Python text model" is
just that str supports O(1) indexing of characters. But... Numpy doesn't.
If you want to access individual characters inside a string inside an
array, you have to pull out the scalar first, at which point the data is
copied and boxed into a Python object anyway, using whatever representation
the interpreter prefers. So AFAICT​ it makes literally no difference to the
user whether numpy's internal representation allows for fast character
access.
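
For example, with the existing U dtype, getting at a character already means
boxing out a Python-level scalar first:

import numpy as np

a = np.array(['hello', 'world'])    # fixed-width UCS-4 storage ('<U5')
s = a[0]                            # copied out and boxed; np.str_ is a str subclass
print(type(s))                      # <class 'numpy.str_'>
print(s[1])                         # 'e' -- character indexing happens on the Python str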

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald 
wrote:
>
> On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer  wrote:
>>
>> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern 
wrote:
>>>
>>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>>>
>>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
that to size arrays by character length. The advantage over UTF-32 is that
it is easily compressible, probably by a factor of 4 in many cases. That
doesn't solve the in memory problem, but does have some advantages on disk
as well as making for easy display. We could compress it ourselves after
encoding by truncation.
>>>
>>> The major use case that we have for a UTF-8 array is HDF5, and it
specifies the width in bytes, not Unicode characters.
>>
>> It's not just HDF5. Counting bytes is the Right Way to measure the size
of UTF-8 encoded text:
>> http://utf8everywhere.org/#myths
>>
>> I also firmly believe (though clearly this is not universally agreed
upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, it needs to be
one of them.
>
> It seems to me that most of the requirements people have expressed in
this thread would be satisfied by:
>
> (1) object arrays of strings. (We have these already; whether a
strings-only specialization would permit useful things like string-oriented
ufuncs is a question for someone who's willing to implement one.)
>
> (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
All python encodings should be permitted. An additional function to
truncate encoded data without mangling the encoding would be handy. I think
it makes more sense for this to be NULL-padded than NULL-terminated but it
may be necessary to support both; note that NULL-termination is complicated
for encodings like UCS4. This also includes the legacy UCS4 strings as a
special case.
>
> (3) a dtype for fixed-length byte strings. This doesn't look very
different from an array of dtype u8, but given we have the bytes type,
accessing the data this way makes sense.

The void dtype is already there for this general purpose and mostly works,
with a few niggles. On Python 3, it uses 'int8' ndarrays underneath the
scalars (fortunately, they do not appear to be mutable views). It also
accepts `bytes` strings that are too short (pads with NULs) and too long
(truncates). If it worked more transparently and perhaps rigorously with
`bytes`, then it would be quite suitable.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 19:08, Robert Kern wrote:
> On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor
> mailto:jtaylor.deb...@googlemail.com>>
> wrote:
> 
>> Indeed,
>> Most of this discussion is irrelevant to numpy.
>> Numpy only really deals with the in memory storage of strings. And in
>> that it is limited to fixed length strings (in bytes/codepoints).
>> How you get your messy strings into numpy arrays is not very relevant to
>> the discussion of a smaller representation of strings.
>> You couldn't get messy strings into numpy without first sorting it out
>> yourself before, you won't be able to afterwards.
>> Numpy will offer a set of encodings, the user chooses which one is best
>> for the use case and if the user screws it up, it is not numpy's problem.
>>
>> You currently only have a few ways to even construct string arrays:
>> - array construction and loops
>> - genfromtxt (which is again just a loop)
>> - memory mapping which I seriously doubt anyone actually does for the S
>> and U dtype
> 
> I fear that you decided that the discussion was irrelevant and thus did
> not read it rather than reading it to decide that it was not relevant.
> Because several of us have showed that, yes indeed, we do memory-map
> string arrays.
> 
> You can add to this list C APIs, like that of libhdf5, that need to
> communicate (Unicode) string arrays.
> 
> Look, I know I can be tedious, but *please* go back and read this
> discussion. We have concrete use cases outlined. We can give you more
> details if you need them. We all feel the pain of the rushed, inadequate
> implementation of the U dtype. But each of our pains is a little bit
> different; you obviously aren't experiencing the same pains that I am.

I have read every mail and it has been a large waste of time; everything
has been said already many times in the last few years.
Even if you do memory-map string arrays -- of which I have not seen a
concrete use case in the mails beyond "would be nice to have" without
any backing in actual code, though I may have missed it --
it is in any case still irrelevant. My proposal only _adds_ additional
cases that can be mmapped. It does not prevent you from doing what you
have been doing before.

> 
>> Having a new dtype changes nothing here. You still need to create numpy
>> arrays from python strings which are well defined and clean.
>> If you put something in that doesn't encode you get an encoding error.
>> No oddities like surrogate escapes are needed, numpy arrays are not
>> interfaces to operating systems nor does numpy need to _add_ support for
>> historical oddities beyond what it already has.
>> If you want to represent bytes exactly as they came in don't use a text
>> dtype (which includes the S dtype, use i1).
> 
> Thomas Aldcroft has demonstrated the problem with this approach. numpy
> arrays are often interfaces to files that have tons of historical oddities.

This does not matter for numpy: the text dtype is well defined as bytes
with a specific encoding and null padding. If you have a historical
oddity that does not fit, do not use the text dtype; use a pure byte
array instead.

> 
>> Concerning variable sized strings, this is simply not going to happen.
>> Nobody is going to rewrite numpy to support it, especially not just for
>> something as unimportant as strings.
>> Best you are going to get (or better already have) is object arrays. It
>> makes no sense to discuss it unless someone comes up with an actual
>> proposal and the willingness to code it.
> 
> No one has suggested such a thing. At most, we've talked about
> specializing object arrays.
> 
>> What is a relevant discussion is whether we really need a more compact
>> but limited representation of text than 4-byte utf32 at all.
>> Its usecase is for the most part just for python3 porting and saving
>> some memory in some ascii heavy cases, e.g. astronomy.
>> It is not that significant anymore as porting to python3 has mostly
>> already happened via the ugly byte workaround and memory saving is
>> probably not as significant in the context of numpy which is already
>> heavy on memory usage.
>>
>> My initial approach was to not add a new dtype but to make unicode
>> parametrizable which would have meant almost no cluttering of numpys
>> internals and keeping the api more or less consistent which would make
>> this a relatively simple addition of minor functionality for people that
>> want it.
>> But adding a completely new partially redundant dtype for this usecase
>> may be a too large change to the api. Having two partially redundant
>> string types may confuse users more than our current status quo of our
>> single string type (U).
>>
>> Discussing whether we want to support truncated utf8 has some merit as
>> it is a decision whether to give the users an even larger gun to shoot
>> themselves in the foot with.
>> But I'd like to focus first on the 1 byte type to add a symmetric API
>> for python2 and python3.
>> utf8 can always be added later should we deem it

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I seriously doubt anyone actually does for the S
> and U dtype

I fear that you decided that the discussion was irrelevant and thus did not
read it rather than reading it to decide that it was not relevant. Because
several of us have showed that, yes indeed, we do memory-map string arrays.

You can add to this list C APIs, like that of libhdf5, that need to
communicate (Unicode) string arrays.

Look, I know I can be tedious, but *please* go back and read this
discussion. We have concrete use cases outlined. We can give you more
details if you need them. We all feel the pain of the rushed, inadequate
implementation of the U dtype. But each of our pains is a little bit
different; you obviously aren't experiencing the same pains that I am.

> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings which are well defined and clean.
> If you put something in that doesn't encode you get an encoding error.
> No oddities like surrogate escapes are needed, numpy arrays are not
> interfaces to operating systems nor does numpy need to _add_ support for
> historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in don't use a text
> dtype (which includes the S dtype, use i1).

Thomas Aldcroft has demonstrated the problem with this approach. numpy
arrays are often interfaces to files that have tons of historical oddities.

> Concerning variable sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> Best you are going to get (or better already have) is object arrays. It
> makes no sense to discuss it unless someone comes up with an actual
> proposal and the willingness to code it.

No one has suggested such a thing. At most, we've talked about specializing
object arrays.

> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its usecase is for the most part just for python3 porting and saving
> some memory in some ascii heavy cases, e.g. astronomy.
> It is not that significant anymore as porting to python3 has mostly
> already happened via the ugly byte workaround and memory saving is
> probably not as significant in the context of numpy which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable which would have meant almost no cluttering of numpys
> internals and keeping the api more or less consistent which would make
> this a relatively simple addition of minor functionality for people that
> want it.
> But adding a completely new partially redundant dtype for this usecase
> may be a too large change to the api. Having two partially redundant
> string types may confuse users more than our current status quo of our
> single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit as
> it is a decision whether to give the users an even larger gun to shoot
> themselves in the foot with.
> But I'd like to focus first on the 1 byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added later should we deem it a good idea.

What is your current proposal? A string dtype parameterized with the
encoding (initially supporting the latin-1 that you desire and maybe adding
utf-8 later)? Or a latin-1-specific dtype such that we will have to add a
second utf-8 dtype at a later date?

If you're not going to support arbitrary encodings right off the bat, I'd
actually suggest implementing UTF-8 and ASCII-surrogateescape first as they
seem to knock off more use cases straight away.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker - NOAA Federal
> > I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii, with 
> > a few extra characters" data. With all the sloppiness over the years, there 
> > are way to many files like that.
>
> That sloppiness that you mention is precisely the "unknown encoding" problem.

Exactly -- but from a practicality beats purity perspective, there is
a difference between "I have no idea whatsoever" and "I know it is
mostly ascii, and European, but there are some extra characters in
there"

Latin-1 has proven very useful for that case.

I suppose in most cases ascii with errors='replace' would be a good
choice, but I'd still rather not throw out potentially useful
information.

> Your previous advocacy has also touched on using latin-1 to decode existing 
> files with unknown encodings as well. If you want to advocate for using 
> latin-1 only for the creation of new data, maybe stop talking about existing 
> files? :-)

Yeah, I've been very unfocused in this discussion -- sorry about that.

> > Note: the primary use-case I have in mind is working with ascii text in 
> > numpy arrays efficiently-- folks have called for that. All I'm saying is 
> > use Latin-1 instead of ascii -- that buys you some useful extra characters.
>
> For that use case, the alternative in play isn't ASCII, it's UTF-8, which 
> buys you a whole bunch of useful extra characters. ;-)

UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy to access, and particularly defaults, numpy string
dtypes should match it.

It's become clear in this discussion that there is a strong desire to
support a numpy dtype that stores text in particular binary formats
(i.e. encodings). Rather than choose one or two, we might as well
support all encodings supported by python.

In that case, we'll have utf-8 for those that know they want that, and
we'll have latin-1 for those that incorrectly think they want that :-)

So what remains is to decide is implementation, syntax, and defaults.

Let's keep in mind that most of us on this list, and in this
discussion, are the folks that write interface code and the like. But
most numpy users are not as tuned in to the internals. So defaults
should be set to best support the more "naive" user.

> . If all we do is add a latin-1 dtype for people to use to create new 
> in-memory data, then someone is going to use it to read existing data in 
> unknown or ambiguous encodings.

If we add every encoding known to man, someone is going to use Latin-1
to read unknown encodings. Indeed, as we've all pointed out, there is
no correct encoding with which to read unknown encodings.

Frankly, if we have UTF-8 under the hood, I think people are even MORE
likely to use it inappropriately-- it's quite scary how many people
think UTF-8 == Unicode, and think all you need to do is "use utf-8",
and you don't need to change any of the rest of your code. Oh, and
once you've done that, you can use your existing ASCII-only tests and
think you have a working application :-)

-CHB
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Eric Wieser
> I think we can implement viewers for strings as ndarray subclasses. Then one
> could
> do `my_string_array.view(latin_1)`, and so on.  Essentially that just
> changes the default
> encoding of the 'S' array. That could also work for uint8 arrays if needed.
>
> Chuck

To handle structured data-types containing encoded strings, we'd also
need to subclass `np.void`.

Things would get messy when a structured dtype contains two strings in
different encodings (or more likely, one bytestring and one
textstring) - we'd need some way to specify which fields are in which
encoding, and using subclasses means that this can't be contained
within the dtype information.

So I think there's a strong argument for solving this with `dtype`s
rather than subclasses. This really doesn't seem hard though.
Something like (C-but-as-python):

def ENCSTRING_getitem(ptr, arr):  # The PyArrFuncs slot
    encoded = STRING_getitem(ptr, arr)
    return encoded.decode(arr.dtype.encoding)

def ENCSTRING_setitem(val, ptr, arr):  # The PyArrFuncs slot
    val = val.encode(arr.dtype.encoding)
    # todo: handle "safe" truncation, where safe might mean keep
    # codepoints, keep graphemes, or never allow
    STRING_setitem(val, ptr, arr)

We'd probably need to be careful to do a decode/encode dance when
copying from one encoding to another, but we [already have
bugs](https://github.com/numpy/numpy/issues/3258) in those cases
anyway.
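
For what it's worth, here is a runnable pure-Python analogue of those two
slots (a sketch only: it assumes NUL-padded fixed-width storage and simply
errors on overflow, and the `encstring_*` names are made up, not numpy API):

def encstring_setitem(value, buf, encoding):
    # Encode, check against the fixed field width, NUL-pad the rest.
    encoded = value.encode(encoding)
    if len(encoded) > len(buf):
        raise ValueError("encoded string is longer than the %d-byte field" % len(buf))
    buf[:] = encoded.ljust(len(buf), b'\x00')

def encstring_getitem(buf, encoding):
    # Strip the NUL padding and decode back to a Python str.
    return bytes(buf).rstrip(b'\x00').decode(encoding)

field = bytearray(8)                      # one 8-byte element of the hypothetical dtype
encstring_setitem('h\xe9llo', field, 'latin-1')
assert encstring_getitem(field, 'latin-1') == 'h\xe9llo'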

Is it reasonable that the user of such an array would want to work
with plain `builtin.unicode` objects, rather than some special numpy
scalar type?

Eric
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Charles R Harris
On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> On 26.04.2017 03:55, josef.p...@gmail.com wrote:
> > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
> >  wrote:
> >>
> >>
> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern 
> wrote:
> >>>
> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
> >>>  wrote:
> >>>
> > Presumably you're getting byte strings (with  unknown encoding.
> 
>  No -- thus is for creating and using mostly ascii string data with
>  python and numpy.
> 
>  Unknown encoding bytes belong in byte arrays -- they are not text.
> >>>
> >>> You are welcome to try to convince Thomas of that. That is the status
> quo
> >>> for him, but he is finding that difficult to work with.
> >>>
>  I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
>  with a few extra characters" data. With all the sloppiness over the
> years,
>  there are way to many files like that.
> >>>
> >>> That sloppiness that you mention is precisely the "unknown encoding"
> >>> problem. Your previous advocacy has also touched on using latin-1 to
> decode
> >>> existing files with unknown encodings as well. If you want to advocate
> for
> >>> using latin-1 only for the creation of new data, maybe stop talking
> about
> >>> existing files? :-)
> >>>
>  Note: the primary use-case I have in mind is working with ascii text
> in
>  numpy arrays efficiently-- folks have called for that. All I'm saying
> is use
>  Latin-1 instead of ascii -- that buys you some useful extra
> characters.
> >>>
> >>> For that use case, the alternative in play isn't ASCII, it's UTF-8,
> which
> >>> buys you a whole bunch of useful extra characters. ;-)
> >>>
> >>> There are several use cases being brought forth here. Some involve file
> >>> reading, some involve file writing, and some involve in-memory
> manipulation.
> >>> Whatever change we make is going to impinge somehow on all of the use
> cases.
> >>> If all we do is add a latin-1 dtype for people to use to create new
> >>> in-memory data, then someone is going to use it to read existing data
> in
> >>> unknown or ambiguous encodings.
> >>
> >>
> >>
> >> The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to
> >> size arrays by character length. The advantage over UTF-32 is that it is
> >> easily compressible, probably by a factor of 4 in many cases. That
> doesn't
> >> solve the in memory problem, but does have some advantages on disk as
> well
> >> as making for easy display. We could compress it ourselves after
> encoding by
> >> truncation.
> >>
> >> Note that for terminal display we will want something supported by the
> >> system, which is another problem altogether. Let me break the problem
> down
> >> into four categories
> >>
> >> Storage -- hdf5, .npy, fits, etc.
> >> Display -- ?
> >> Modification -- editing
> >> Parsing -- fits, etc.
> >>
> >> There is probably no one solution that is optimal for all of those.
> >>
> >> Chuck
> >>
> >>
> >>
> >> ___
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@python.org
> >> https://mail.python.org/mailman/listinfo/numpy-discussion
> >>
> >
> >
> > quoting Julian
> >
> > '''
> > I probably have formulated my goal with the proposal a bit better, I am
> > not very interested in a repetition of which encoding to use debate.
> > In the end what will be done allows any encoding via a dtype with
> > metadata like datetime.
> > This allows any codec (including truncated utf8) to be added easily (if
> > python supports it) and allows sidestepping the debate.
> >
> > My main concern is whether it should be a new dtype or modifying the
> > unicode dtype. Though the backward compatibility argument is strongly in
> > favour of adding a new dtype that makes the np.unicode type redundant.
> > '''
> >
> > I don't quite understand why this discussion goes in a direction of an
> > either one XOR the other dtype.
> >
> > I thought the parameterized 1-byte encoding that Julian mentioned
> > initially sounds useful to me.
> >
> > (I'm not sure I will use it much, but I also don't use float16 )
> >
> > Josef
>
> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Anne Archibald
On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer  wrote:

> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern  wrote:
>
>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
>> charlesr.har...@gmail.com> wrote:
>>
>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
>> that to size arrays by character length. The advantage over UTF-32 is that
>> it is easily compressible, probably by a factor of 4 in many cases. That
>> doesn't solve the in memory problem, but does have some advantages on disk
>> as well as making for easy display. We could compress it ourselves after
>> encoding by truncation.
>>
>> The major use case that we have for a UTF-8 array is HDF5, and it
>> specifies the width in bytes, not Unicode characters.
>>
>
> It's not just HDF5. Counting bytes is the Right Way to measure the size of
> UTF-8 encoded text:
> http://utf8everywhere.org/#myths
>
> I also firmly believe (though clearly this is not universally agreed upon)
> that UTF-8 is the Right Way to encode strings for *non-legacy*
> applications. So if we're adding any new string encodings, it needs to be
> one of them.
>

It seems to me that most of the requirements people have expressed in this
thread would be satisfied by:

(1) object arrays of strings. (We have these already; whether a
strings-only specialization would permit useful things like string-oriented
ufuncs is a question for someone who's willing to implement one.)

(2) a dtype for fixed byte-size, specified-encoding, NULL-padded data. All
python encodings should be permitted. An additional function to truncate
encoded data without mangling the encoding would be handy. I think it makes
more sense for this to be NULL-padded than NULL-terminated but it may be
necessary to support both; note that NULL-termination is complicated for
encodings like UCS4. This also includes the legacy UCS4 strings as a
special case.

(3) a dtype for fixed-length byte strings. This doesn't look very different
from an array of dtype u8, but given we have the bytes type, accessing the
data this way makes sense.

There seems to be considerable debate about what the "default" string type
should be, but since users must specify a length anyway, we might as well
force them to specify an encoding and thus dodge the debate about the right
default.

The other question - which I realize is how the thread started - is what to
do about backward compatibility. I'm not writing the code, so my opinion
doesn't matter much, but I think we're stuck maintaining what we have now -
ASCII and UCS4 strings - for a while yet. But they can be deprecated, or
simply reimplemented as shorthand names for ASCII- or UCS4-encoded
strings in the bytes-with-encoding dtype.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 03:55, josef.p...@gmail.com wrote:
> On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
>  wrote:
>>
>>
>> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern  wrote:
>>>
>>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
>>>  wrote:
>>>
> Presumably you're getting byte strings (with  unknown encoding.

 No -- thus is for creating and using mostly ascii string data with
 python and numpy.

 Unknown encoding bytes belong in byte arrays -- they are not text.
>>>
>>> You are welcome to try to convince Thomas of that. That is the status quo
>>> for him, but he is finding that difficult to work with.
>>>
 I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
 with a few extra characters" data. With all the sloppiness over the years,
 there are way to many files like that.
>>>
>>> That sloppiness that you mention is precisely the "unknown encoding"
>>> problem. Your previous advocacy has also touched on using latin-1 to decode
>>> existing files with unknown encodings as well. If you want to advocate for
>>> using latin-1 only for the creation of new data, maybe stop talking about
>>> existing files? :-)
>>>
 Note: the primary use-case I have in mind is working with ascii text in
 numpy arrays efficiently-- folks have called for that. All I'm saying is 
 use
 Latin-1 instead of ascii -- that buys you some useful extra characters.
>>>
>>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
>>> buys you a whole bunch of useful extra characters. ;-)
>>>
>>> There are several use cases being brought forth here. Some involve file
>>> reading, some involve file writing, and some involve in-memory manipulation.
>>> Whatever change we make is going to impinge somehow on all of the use cases.
>>> If all we do is add a latin-1 dtype for people to use to create new
>>> in-memory data, then someone is going to use it to read existing data in
>>> unknown or ambiguous encodings.
>>
>>
>>
>> The maximum length of an UTF-8 character is 4 bytes, so we could use that to
>> size arrays by character length. The advantage over UTF-32 is that it is
>> easily compressible, probably by a factor of 4 in many cases. That doesn't
>> solve the in memory problem, but does have some advantages on disk as well
>> as making for easy display. We could compress it ourselves after encoding by
>> truncation.
>>
>> Note that for terminal display we will want something supported by the
>> system, which is another problem altogether. Let me break the problem down
>> into four categories
>>
>> Storage -- hdf5, .npy, fits, etc.
>> Display -- ?
>> Modification -- editing
>> Parsing -- fits, etc.
>>
>> There is probably no one solution that is optimal for all of those.
>>
>> Chuck
>>
>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
> 
> 
> quoting Julian
> 
> '''
> I probably have formulated my goal with the proposal a bit better, I am
> not very interested in a repetition of which encoding to use debate.
> In the end what will be done allows any encoding via a dtype with
> metadata like datetime.
> This allows any codec (including truncated utf8) to be added easily (if
> python supports it) and allows sidestepping the debate.
> 
> My main concern is whether it should be a new dtype or modifying the
> unicode dtype. Though the backward compatibility argument is strongly in
> favour of adding a new dtype that makes the np.unicode type redundant.
> '''
> 
> I don't quite understand why this discussion goes in a direction of an
> either one XOR the other dtype.
> 
> I thought the parameterized 1-byte encoding that Julian mentioned
> initially sounds useful to me.
> 
> (I'm not sure I will use it much, but I also don't use float16 )
> 
> Josef

Indeed,
Most of this discussion is irrelevant to numpy.
Numpy only really deals with the in memory storage of strings. And in
that it is limited to fixed length strings (in bytes/codepoints).
How you get your messy strings into numpy arrays is not very relevant to
the discussion of a smaller representation of strings.
You couldn't get messy strings into numpy without first sorting it out
yourself before, you won't be able to afterwards.
Numpy will offer a set of encodings, the user chooses which one is best
for the use case and if the user screws it up, it is not numpy's problem.

You currently only have a few ways to even construct string arrays:
- array construction and loops
- genfromtxt (which is again just a loop)
- memory mapping which I seriously doubt anyone actually does for the S
and U dtype

Having a new dtype changes nothing here. You still need to create numpy
arrays from python strings which are well defined and clean.
If you put something in that doesn't encode you get an encoding error.
No oddities like surrogate escapes are needed, numpy arrays are not
interfa

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Stephan Hoyer
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases. That
> doesn't solve the in memory problem, but does have some advantages on disk
> as well as making for easy display. We could compress it ourselves after
> encoding by truncation.
>
> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
>

It's not just HDF5. Counting bytes is the Right Way to measure the size of
UTF-8 encoded text:
http://utf8everywhere.org/#myths

I also firmly believe (though clearly this is not universally agreed upon)
that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, it needs to be
one of them.
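
A two-line illustration of the bytes-vs-characters distinction in plain Python:

s = 'na\xefve'                      # 'naive' with a diaeresis: 5 characters
print(len(s))                       # 5 code points
print(len(s.encode('utf-8')))       # 6 bytes -- the '\xef' character needs two bytes in UTF-8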
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris 
wrote:

> The maximum length of an UTF-8 character is 4 bytes, so we could use that
to size arrays by character length. The advantage over UTF-32 is that it is
easily compressible, probably by a factor of 4 in many cases. That doesn't
solve the in memory problem, but does have some advantages on disk as well
as making for easy display. We could compress it ourselves after encoding
by truncation.

The major use case that we have for a UTF-8 array is HDF5, and it specifies
the width in bytes, not Unicode characters.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Aldcroft, Thomas
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:

> > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:
>
> > Eh... First, on Windows and MacOS, filenames are natively Unicode.
>
> Yeah, though once they are stored I. A text file -- who the heck
> knows? That may be simply unsolvable.
> > s. And then from in Python, if you want to actually work with those
> filenames you need to either have a bytestring type or else a Unicode type
> that uses surrogateescape to represent the non-ascii characters.
>
>
> > IMO if you have filenames that are arbitrary bytestrings and you need to
> represent this properly, you should just use bytestrings -- really, they're
> perfectly friendly :-).
>
> I thought the Python file (and Path) APIs all required (Unicode)
> strings? That was the whole complaint!
>
> And no, bytestrings are not perfectly friendly in py3.
>
> This got really complicated and sidetracked, but All I'm suggesting is
> that if we have a 1byte per char string type, with a fixed encoding,
> that that encoding be Latin-1, rather than ASCII.
>
> That's it, really.
>

Fully agreed.


>
> Having a settable encoding would work fine, too.
>

Yup.

At a simple level, I just want the things that currently work just fine in
Py2 to start working in Py3. That includes being able to read / manipulate
/ compute and write back to legacy binary FITS and HDF5 files that include
ASCII-ish text data (not strictly ASCII).  Memory mapping such files should
be supportable.  Swapping type from bytes to a 1-byte char str should be
possible without altering data in memory.
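
To illustrate the current gap (a sketch with plain-ASCII content; the exact
S-to-U cast behaviour for non-ASCII bytes is murkier):

import numpy as np

b = np.array([b'SIMPLE', b'BITPIX'], dtype='S6')   # ASCII-ish bytes, e.g. from a FITS file
u = b.astype('U6')                                 # usable as text, but it copies the data
print(b.itemsize, u.itemsize)                      # 6 vs 24 -- a 4x blow-up in memory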

BTW, I am saying "I want", but this functionality would definitely be
welcome in astropy.  I wrote a unicode sandwich workaround for the astropy
Table class (https://github.com/astropy/astropy/pull/5700) which should be
in the next release.  It would be way better to have this at a level lower
in numpy.

- Tom


>
> -CHB
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread josef . pktd
On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris
 wrote:
>
>
> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern  wrote:
>>
>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal
>>  wrote:
>>
>> >> Presumably you're getting byte strings (with  unknown encoding.
>> >
>> > No -- thus is for creating and using mostly ascii string data with
>> > python and numpy.
>> >
>> > Unknown encoding bytes belong in byte arrays -- they are not text.
>>
>> You are welcome to try to convince Thomas of that. That is the status quo
>> for him, but he is finding that difficult to work with.
>>
>> > I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
>> > with a few extra characters" data. With all the sloppiness over the years,
>> > there are way to many files like that.
>>
>> That sloppiness that you mention is precisely the "unknown encoding"
>> problem. Your previous advocacy has also touched on using latin-1 to decode
>> existing files with unknown encodings as well. If you want to advocate for
>> using latin-1 only for the creation of new data, maybe stop talking about
>> existing files? :-)
>>
>> > Note: the primary use-case I have in mind is working with ascii text in
>> > numpy arrays efficiently-- folks have called for that. All I'm saying is 
>> > use
>> > Latin-1 instead of ascii -- that buys you some useful extra characters.
>>
>> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
>> buys you a whole bunch of useful extra characters. ;-)
>>
>> There are several use cases being brought forth here. Some involve file
>> reading, some involve file writing, and some involve in-memory manipulation.
>> Whatever change we make is going to impinge somehow on all of the use cases.
>> If all we do is add a latin-1 dtype for people to use to create new
>> in-memory data, then someone is going to use it to read existing data in
>> unknown or ambiguous encodings.
>
>
>
> The maximum length of an UTF-8 character is 4 bytes, so we could use that to
> size arrays by character length. The advantage over UTF-32 is that it is
> easily compressible, probably by a factor of 4 in many cases. That doesn't
> solve the in memory problem, but does have some advantages on disk as well
> as making for easy display. We could compress it ourselves after encoding by
> truncation.
>
> Note that for terminal display we will want something supported by the
> system, which is another problem altogether. Let me break the problem down
> into four categories
>
> Storage -- hdf5, .npy, fits, etc.
> Display -- ?
> Modification -- editing
> Parsing -- fits, etc.
>
> There is probably no one solution that is optimal for all of those.
>
> Chuck
>
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>


quoting Julian

'''
I probably have formulated my goal with the proposal a bit better, I am
not very interested in a repetition of which encoding to use debate.
In the end what will be done allows any encoding via a dtype with
metadata like datetime.
This allows any codec (including truncated utf8) to be added easily (if
python supports it) and allows sidestepping the debate.

My main concern is whether it should be a new dtype or modifying the
unicode dtype. Though the backward compatibility argument is strongly in
favour of adding a new dtype that makes the np.unicode type redundant.
'''

I don't quite understand why this discussion goes in the direction of
either one dtype XOR the other.

The parameterized 1-byte encoding that Julian mentioned initially sounds
useful to me.

(I'm not sure I will use it much, but I also don't use float16 )

Josef
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <
> chris.bar...@noaa.gov> wrote:
>
> >> Presumably you're getting byte strings (with  unknown encoding.
> >
> > No -- thus is for creating and using mostly ascii string data with
> python and numpy.
> >
> > Unknown encoding bytes belong in byte arrays -- they are not text.
>
> You are welcome to try to convince Thomas of that. That is the status quo
> for him, but he is finding that difficult to work with.
>
> > I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
> with a few extra characters" data. With all the sloppiness over the years,
> there are way to many files like that.
>
> That sloppiness that you mention is precisely the "unknown encoding"
> problem. Your previous advocacy has also touched on using latin-1 to decode
> existing files with unknown encodings as well. If you want to advocate for
> using latin-1 only for the creation of new data, maybe stop talking about
> existing files? :-)
>
> > Note: the primary use-case I have in mind is working with ascii text in
> numpy arrays efficiently-- folks have called for that. All I'm saying is
> use Latin-1 instead of ascii -- that buys you some useful extra characters.
>
> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
> buys you a whole bunch of useful extra characters. ;-)
>
> There are several use cases being brought forth here. Some involve file
> reading, some involve file writing, and some involve in-memory
> manipulation. Whatever change we make is going to impinge somehow on all of
> the use cases. If all we do is add a latin-1 dtype for people to use to
> create new in-memory data, then someone is going to use it to read existing
> data in unknown or ambiguous encodings.
>


The maximum length of a UTF-8 encoded character is 4 bytes, so we could use that
to size arrays by character length. The advantage over UTF-32 is that it is
easily compressible, probably by a factor of 4 in many cases. That doesn't
solve the in memory problem, but does have some advantages on disk as well
as making for easy display. We could compress it ourselves after encoding
by truncation.
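
If we did go the truncate-after-encoding route, the cut has to land on a
code-point boundary; a rough sketch of such a helper (hypothetical, not
existing numpy API):

def truncate_utf8(s, width):
    # Encode and cut to at most `width` bytes; a cut mid-sequence leaves an
    # incomplete trailing character, which decode(errors='ignore') drops.
    cut = s.encode('utf-8')[:width]
    return cut.decode('utf-8', errors='ignore').encode('utf-8')

print(truncate_utf8('abcd\xe9', 5))    # b'abcd' -- the 2-byte '\xe9' no longer fits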

Note that for terminal display we will want something supported by the
system, which is another problem altogether. Let me break the problem down
into four categories


   1. Storage -- hdf5, .npy, fits, etc.
   2. Display -- ?
   3. Modification -- editing
   4. Parsing -- fits, etc.

There is probably no one solution that is optimal for all of those.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal
 wrote:
>> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:
>
>> Eh... First, on Windows and MacOS, filenames are natively Unicode.
>
> Yeah, though once they are stored I. A text file -- who the heck
> knows? That may be simply unsolvable.
>> s. And then from in Python, if you want to actually work with those 
>> filenames you need to either have a bytestring type or else a Unicode type 
>> that uses surrogateescape to represent the non-ascii characters.
>
>
>> IMO if you have filenames that are arbitrary bytestrings and you need to 
>> represent this properly, you should just use bytestrings -- really, they're 
>> perfectly friendly :-).
>
> I thought the Python file (and Path) APIs all required (Unicode)
> strings? That was the whole complaint!

No, the path APIs all accept bytestrings (and ones that return
pathnames like listdir return bytestrings if given bytestrings). Or at
least they're supposed to.

The really urgent need for surrogateescape was things like sys.argv
and os.environ where arbitrary bytes might come in (on some systems)
but the API is restricted to strs.
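
(For reference, the surrogateescape round trip in plain Python:)

raw = b'caf\xe9'                                    # e.g. a latin-1 filename on a UTF-8 system
s = raw.decode('utf-8', 'surrogateescape')          # the undecodable 0xe9 becomes '\udce9'
assert s.encode('utf-8', 'surrogateescape') == raw  # and it round-trips back exactly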

> And no, bytestrings are not perfectly friendly in py3.

I'm not saying you should use them everywhere or that they remove the
need for an ergonomic text dtype, but when you actually want to work
with bytes they're pretty good (esp. in modern py3).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:

>> Presumably you're getting byte strings (with  unknown encoding.
>
> No -- thus is for creating and using mostly ascii string data with python
and numpy.
>
> Unknown encoding bytes belong in byte arrays -- they are not text.

You are welcome to try to convince Thomas of that. That is the status quo
for him, but he is finding that difficult to work with.

> I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
with a few extra characters" data. With all the sloppiness over the years,
there are way to many files like that.

That sloppiness that you mention is precisely the "unknown encoding"
problem. Your previous advocacy has also touched on using latin-1 to decode
existing files with unknown encodings as well. If you want to advocate for
using latin-1 only for the creation of new data, maybe stop talking about
existing files? :-)

> Note: the primary use-case I have in mind is working with ascii text in
numpy arrays efficiently-- folks have called for that. All I'm saying is
use Latin-1 instead of ascii -- that buys you some useful extra characters.

For that use case, the alternative in play isn't ASCII, it's UTF-8, which
buys you a whole bunch of useful extra characters. ;-)

There are several use cases being brought forth here. Some involve file
reading, some involve file writing, and some involve in-memory
manipulation. Whatever change we make is going to impinge somehow on all of
the use cases. If all we do is add a latin-1 dtype for people to use to
create new in-memory data, then someone is going to use it to read existing
data in unknown or ambiguous encodings.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:

> Eh... First, on Windows and MacOS, filenames are natively Unicode.

Yeah, though once they are stored in a text file -- who the heck
knows? That may be simply unsolvable.
> s. And then from in Python, if you want to actually work with those filenames 
> you need to either have a bytestring type or else a Unicode type that uses 
> surrogateescape to represent the non-ascii characters.


> IMO if you have filenames that are arbitrary bytestrings and you need to 
> represent this properly, you should just use bytestrings -- really, they're 
> perfectly friendly :-).

I thought the Python file (and Path) APIs all required (Unicode)
strings? That was the whole complaint!

And no, bytestrings are not perfectly friendly in py3.

This got really complicated and sidetracked, but all I'm suggesting is
that if we have a 1-byte-per-char string type, with a fixed encoding,
that that encoding be Latin-1, rather than ASCII.

That's it, really.

Having a settable encoding would work fine, too.

-CHB
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
Actually, for what it's worth, the FITS spec says that in such values
trailing spaces are not significant, see page 7:
https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf
But they're not really relevant to numpy's situation, because as here you
need to do elaborate de-quoting before they can go into a data structure.
What I was wondering was whether people have data lying around with
fixed-width fields where the strings are space-padded, so that numpy needs
to support that.


I would say whether to strip space-padded strings should be the reader's
problem, not numpy's

-CHB
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
A compact dtype for mostly-ascii text:
>

I'm a little confused about exactly what you're trying to do.


Actually, *I* am not trying to do anything here -- I'm the one that said
computers are so big and fast now that we shouldn't whine about 4 bytes for
a character... but this whole conversation started with that request... and
I have sympathy -- no one likes to waste memory. After all, numpy supports
small numeric dtypes, too.

Do you need your in-memory format for this data to be compatible with
anything in particular?


Not for this requirement -- binary interchange is another requirement.

If you're not reading or writing files in this format, then it's just a
matter of storing a whole bunch of things that are already python strings
in memory. Could you use an object array? Or do you have an enormous number
so that you need a more compact, fixed-stride memory layout?


That's the whole point, yes. Object arrays would be a good solution to the
full Unicode problem, not the "why am I wasting so much space when all my
data are ascii?" problem.

Presumably you're getting byte strings (with  unknown encoding.


No -- this is for creating and using mostly ascii string data with python
and numpy.

Unknown encoding bytes belong in byte arrays -- they are not text.

I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with
a few extra characters" data. With all the sloppiness over the years, there
are way too many files like that.

Note: the primary use-case I have in mind is working with ascii text in
numpy arrays efficiently-- folks have called for that. All I'm saying is
use Latin-1 instead of ascii -- that buys you some useful extra characters.
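
(One property worth spelling out: latin-1 maps every byte value 0-255 to a
code point, so it decodes anything and round-trips losslessly:)

data = bytes(range(256))                           # every possible byte value
assert data.decode('latin-1').encode('latin-1') == data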


If your question is "what should numpy's default string dtype be?", well,
maybe default to object arrays;


Or UCS-4.

I think object arrays would be more problematic for npz storage, and raw
"tostring" dumping. (And pickle?) not sure how important that is.

And it would be good to have something that plays well with recarrays

anyone who just has a bunch of python strings to store is unlikely to be
surprised by this. Someone with more specific needs will choose a more
specific - that is, not default - string data type.


Exactly.

-CHB
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 10:13 AM, "Anne Archibald" 
wrote:


On Tue, Apr 25, 2017 at 6:05 PM Chris Barker  wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python
strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays;
underlying representation is python objects on the heap. Sadly UCS-4, so
zillions are going to be a memory problem.


It's possible to do much better than this when defining a specialized
variable-width string dtype. E.g. make the itemsize 8 bytes (like an object
array, assuming a 64 bit system), but then for strings that can be encoded
in 7 bytes or less of utf8 store them directly in the array; else store a
pointer to a raw utf8 string on the heap. (Possibly with a reference count
- there are some interesting tradeoffs there. I suspect 1-byte reference
counts might be the way to go; if a logical copy would make it overflow
then make an actual copy instead.) Anything involving the heap is going to
have some overhead, but we don't need full fledged Python objects and once
we give up mmap compatibility then there's a lot of room to tune.
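
A toy sketch of that storage decision (pure Python, purely illustrative; the
7-byte cutoff comes from the paragraph above, while the one-byte tag/length
field is an assumption, not a worked-out design):

INLINE_MAX = 7                      # an 8-byte slot minus one byte for a tag/length

def storage_plan(s):
    encoded = s.encode('utf-8')
    if len(encoded) <= INLINE_MAX:
        return ('inline', encoded)  # short string would live directly in the array slot
    return ('heap', encoded)        # slot would hold a pointer (+ small refcount) instead

print(storage_plan('cat'))                          # ('inline', b'cat')
print(storage_plan('a considerably longer string')) # ('heap', ...)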

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris  wrote:

>
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern 
> wrote:
>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
>> charlesr.har...@gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
>> peridot.face...@gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> other packages are waiting specifically for it. But specifying this
>> requires two pieces of information: What is the encoding? and How is the
>> length specified? I know they're not numpy-compatible, but FITS header
>> values are space-padded; does that occur elsewhere? Are there other ways
>> existing data specifies string length within a fixed-size field? There are
>> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
>> PKCS7, etc. - but they are probably too specialized to need? We should make
>> sure we can support all the ways that actually occur.
>> >
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend
>> towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to
>> me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to
>> N non-NULL bytes. Any extra space left over is padded with NULLs, but no
>> space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be up
>> to N-1 non-NULL bytes. There must always be space reserved for the
>> terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for
>> embedded NULLs. It's certainly possible to deal with them: just strip
>> trailing NULLs and leave any embedded ones alone. But I'm also sure that
>> there are some implementations somewhere that interpret the requirement as
>> "stop at the first NULL or the end of the fixed width, whichever comes
>> first", effectively being NULL-terminated just not requiring the reserved
>> space.
>>
>
> Thanks for the clarification. NULL-padded is what I meant.
>
> I'm wondering how much of the desired functionality we could get by simply
> subclassing ndarray in python. I think we mostly want to be able to view
> byte strings and convert to unicode if needed.
>
>
And I think the really tricky part is sorting and rich comparison.
Unfortunately, the comparison function is currently located in the c
structure. I suppose we could define a c wrapper function to go in the slot.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker"  wrote:


 - filenames

File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames in with our data, and thus in numpy arrays
-- we need to preserve them exactly and display them mostly right. Again,
euro-centric, but if you are euro-centric, then latin-1 is a good choice
for this.


Eh... First, on Windows and MacOS, filenames are natively Unicode. So you
don't care about preserving the bytes, only the characters. It's only Linux
and the other traditional unixes where filenames are natively bytestrings.
And then from in Python, if you want to actually work with those filenames
you need to either have a bytestring type or else a Unicode type that uses
surrogateescape to represent the non-ascii characters. I'm not seeing how
latin1 really helps anything here -- best case you still have to do
something like the wsgi "encoding dance" before you could use the
filenames. IMO if you have filenames that are arbitrary bytestrings and you
need to represent this properly, you should just use bytestrings -- really,
they're perfectly friendly :-).
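
For what it's worth, the round trip being described looks like this in plain
Python (surrogateescape is the error handler os.fsdecode/os.fsencode use on
POSIX systems; the filename here is made up):

raw = b'data_\xe9.txt'    # a filename as raw bytes from a POSIX filesystem
name = raw.decode('utf-8', errors='surrogateescape')
assert name.encode('utf-8', errors='surrogateescape') == raw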

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern 
wrote:
>>
>> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>> >
>> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:
>>
>> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>> >
>> > Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.
>>
>> Just to clarify some terminology (because it wasn't originally clear to
me until I looked it up in reference to HDF5):
>>
>> * "NULL-padded" implies that, for a fixed width of N, there can be up to
N non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.
>>
>> * "NULL-terminated" implies that, for a fixed width of N, there can be
up to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.
>>
>> I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.
>
> Thanks for the clarification. NULL-padded is what I meant.

Okay, however, the biggest use-case we have for UTF-8 arrays (HDF5) is
NULL-terminated.

> I'm wondering how much of the desired functionality we could get by
simply subclassing ndarray in python. I think we mostly want to be able to
view byte strings and convert to unicode if needed.

I'm not sure. Some of these fixed-width string arrays are embedded inside
structured arrays with other dtypes.
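
For instance (current NumPy, just to illustrate the constraint), a
fixed-width string field embedded in a structured dtype, which any
subclass-based approach would still have to handle:

import numpy as np

dt = np.dtype([('name', 'S8'), ('value', '<f8')])
rec = np.zeros(2, dtype=dt)
rec['name'] = [b'alpha', b'beta']
rec['value'] = [1.0, 2.0]
assert rec['name'][0] == b'alpha'   # the string field is a view into the record buffer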

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
> >
> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
> peridot.face...@gmail.com> wrote:
>
> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
> PKCS7, etc. - but they are probably too specialized to need? We should make
> sure we can support all the ways that actually occur.
> >
> >
> > Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> Just to clarify some terminology (because it wasn't originally clear to me
> until I looked it up in reference to HDF5):
>
> * "NULL-padded" implies that, for a fixed width of N, there can be up to N
> non-NULL bytes. Any extra space left over is padded with NULLs, but no
> space needs to be reserved for NULLs.
>
> * "NULL-terminated" implies that, for a fixed width of N, there can be up
> to N-1 non-NULL bytes. There must always be space reserved for the
> terminating NULL.
>
> I'm not really sure if "NULL-padded" also specifies the behavior for
> embedded NULLs. It's certainly possible to deal with them: just strip
> trailing NULLs and leave any embedded ones alone. But I'm also sure that
> there are some implementations somewhere that interpret the requirement as
> "stop at the first NULL or the end of the fixed width, whichever comes
> first", effectively being NULL-terminated just not requiring the reserved
> space.
>

Thanks for the clarification. NULL-padded is what I meant.

I'm wondering how much of the desired functionality we could get by simply
subclassing ndarray in python. I think we mostly want to be able to view
byte strings and convert to unicode if needed.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 11:53 AM, "Robert Kern"  wrote:

On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.


And to save anyone else having to check, numpy's current NUL-padded dtypes
only strip trailing NULs, so they can round-trip strings that contain NULs,
just not strings where NUL is the last character.

So the set of strings representable by str/bytes is a strict superset of
the set of strings representable by numpy U/S dtypes, which in turn is a
strict superset of the set of strings representable by a hypothetical
NUL-terminated dtype.

(Of course this doesn't matter for most practical purposes, because people
rarely make strings with embedded NULs.)
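
A quick check with current NumPy (Python 3) shows the same thing:

import numpy as np

a = np.array([b'abc\x00'], dtype='S4')
assert a[0] == b'abc'        # the trailing NUL does not round-trip

b = np.array([b'a\x00c'], dtype='S4')
assert b[0] == b'a\x00c'     # an embedded NUL is preserved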

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
peridot.face...@gmail.com> wrote:

>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
other packages are waiting specifically for it. But specifying this
requires two pieces of information: What is the encoding? and How is the
length specified? I know they're not numpy-compatible, but FITS header
values are space-padded; does that occur elsewhere? Are there other ways
existing data specifies string length within a fixed-size field? There are
some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
PKCS7, etc. - but they are probably too specialized to need? We should make
sure we can support all the ways that actually occur.
>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me
until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N
non-NULL bytes. Any extra space left over is padded with NULLs, but no
space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up
to N-1 non-NULL bytes. There must always be space reserved for the
terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for
embedded NULLs. It's certainly possible to deal with them: just strip
trailing NULLs and leave any embedded ones alone. But I'm also sure that
there are some implementations somewhere that interpret the requirement as
"stop at the first NULL or the end of the fixed width, whichever comes
first", effectively being NULL-terminated just not requiring the reserved
space.
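
In concrete terms, for a fixed width of N = 4 (plain bytes, just to show the
two conventions side by side):

# NULL-padded: up to 4 content bytes; NULs appear only as padding
padded_full  = b'abcd'          # full width used, no NUL required
padded_short = b'ab\x00\x00'    # shorter string, padded out to width 4

# NULL-terminated: at most 3 content bytes; one byte is always the terminator
terminated   = b'abc\x00'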

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Eric Wieser
Chuck: That sounds like something we want to deprecate, for the same reason
that python3 no longer allows str(b'123') to do the right thing.

Specifically, it seems like astype should always be forbidden to go between
unicode and byte arrays - so that would need to be written as:

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]:
array(['1', '2', '3'],
  dtype='|S1')

In [3]: a.view('U[ascii]')
Out[3]:
array([u'1', u'2', u'3'],
  dtype='U[ascii]')

On Tue, Apr 25, 2017, Charles R Harris <charlesr.har...@gmail.com> wrote:

On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald 
> wrote:
>
>>
>> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern 
>> wrote:
>>
>>> * HDF5 supports fixed-length and variable-length string arrays encoded
>>> in ASCII and UTF-8. In all cases, these strings are NULL-terminated
>>> (despite the documentation claiming that there are more options). In
>>> practice, the ASCII strings permit high-bit characters, but the encoding is
>>> unspecified. Memory-mapping is rare (but apparently possible). The two
>>> major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to
>>> support that HDF5 option. Compression is supported for fixed-length string
>>> arrays but not variable-length string arrays.
>>>
>>> * FITS supports fixed-length string arrays that are NULL-padded. The
>>> strings do not have a formal encoding, but in practice, they are typically
>>> mostly ASCII characters with the occasional high-bit character from an
>>> unspecific encoding. Memory-mapping is a common practice. These arrays can
>>> be quite large even if each scalar is reasonably small.
>>>
>>> * pandas uses object arrays for flexible in-memory handling of string
>>> columns. Lengths are not fixed, and None is used as a marker for missing
>>> data. String columns must be written to and read from a variety of formats,
>>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>>> with `unicode/str` objects instead of `bytes`.
>>>
>>> * There are a number of sometimes-poorly-documented,
>>> often-poorly-adhered-to, aging file format "standards" that include string
>>> arrays but do not specify encodings, or such specification is ignored in
>>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>>> difficult to perform.
>>>
>>> * In Python 3 environments, `unicode/str` objects are rather more
>>> common, and simple operations like equality comparisons no longer work
>>> between `bytes` and `unicode/str`, making it difficult to work with numpy
>>> string arrays that yield `bytes` scalars.
>>>
>>
>> It seems the greatest challenge is interacting with binary data from
>> other programs and libraries. If we were living entirely in our own data
>> world, Unicode strings in object arrays would generally be pretty
>> satisfactory. So let's try to get what is needed to read and write other
>> people's formats.
>>
>> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
>> map directly to numpy arrays; we can store it however we want, as
>> conversion is necessary anyway.
>>
>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> other packages are waiting specifically for it. But specifying this
>> requires two pieces of information: What is the encoding? and How is the
>> length specified? I know they're not numpy-compatible, but FITS header
>> values are space-padded; does that occur elsewhere? Are there other ways
>> existing data specifies string length within a fixed-size field? There are
>> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
>> PKCS7, etc. - but they are probably too specialized to need? We should make
>> sure we can support all the ways that actually occur.
>>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> For  byte strings, it looks like we need a parameterized type. This is for
> two uses, display and conversion to (Python) unicode. One could handle the
> display and conversion using view and astype methods. For instance, we
> already have
>
> In [1]: a = array([1,2,3], uint8) + 0x30
>
> In [2]: a.view('S1')
> Out[2]:
> array(['1', '2', '3'],
>   dtype='|S1')
>
> In [3]: a.view('S1').astype('U')
> Out[3]:
> array([u'1', u'2', u'3'],
>   dtype='<U1')
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
​
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald 
wrote:

>
> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern  wrote:
>
>> * HDF5 supports fixed-length and variable-length string arrays encoded in
>> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
>> the documentation claiming that there are more options). In practice, the
>> ASCII strings permit high-bit characters, but the encoding is unspecified.
>> Memory-mapping is rare (but apparently possible). The two major HDF5
>> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
>> HDF5 option. Compression is supported for fixed-length string arrays but
>> not variable-length string arrays.
>>
>> * FITS supports fixed-length string arrays that are NULL-padded. The
>> strings do not have a formal encoding, but in practice, they are typically
>> mostly ASCII characters with the occasional high-bit character from an
>> unspecific encoding. Memory-mapping is a common practice. These arrays can
>> be quite large even if each scalar is reasonably small.
>>
>> * pandas uses object arrays for flexible in-memory handling of string
>> columns. Lengths are not fixed, and None is used as a marker for missing
>> data. String columns must be written to and read from a variety of formats,
>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>> with `unicode/str` objects instead of `bytes`.
>>
>> * There are a number of sometimes-poorly-documented,
>> often-poorly-adhered-to, aging file format "standards" that include string
>> arrays but do not specify encodings, or such specification is ignored in
>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>> difficult to perform.
>>
>> * In Python 3 environments, `unicode/str` objects are rather more common,
>> and simple operations like equality comparisons no longer work between
>> `bytes` and `unicode/str`, making it difficult to work with numpy string
>> arrays that yield `bytes` scalars.
>>
>
> It seems the greatest challenge is interacting with binary data from other
> programs and libraries. If we were living entirely in our own data world,
> Unicode strings in object arrays would generally be pretty satisfactory. So
> let's try to get what is needed to read and write other people's formats.
>
> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
> map directly to numpy arrays; we can store it however we want, as
> conversion is necessary anyway.
>
> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
> PKCS7, etc. - but they are probably too specialized to need? We should make
> sure we can support all the ways that actually occur.
>

Agree with the UTF-8 fixed byte length strings, although I would tend
towards null terminated.

For  byte strings, it looks like we need a parameterized type. This is for
two uses, display and conversion to (Python) unicode. One could handle the
display and conversion using view and astype methods. For instance, we
already have

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]:
array(['1', '2', '3'],
  dtype='|S1')

In [3]: a.view('S1').astype('U')
Out[3]:
array([u'1', u'2', u'3'],
  dtype='<U1')
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge  wrote:

> On 04/25/2017 01:34 PM, Anne Archibald wrote:
> > I know they're not numpy-compatible, but FITS header values are
> > space-padded; does that occur elsewhere?
>
> Strings in FITS headers are delimited by single quotes.  Some keywords
> (only a handful) are required to have values that are blank-padded (in
> the FITS file) if the value is less than eight characters.  Whether you
> get trailing blanks when you read the header depends on the FITS
> reader.  I use astropy.io.fits to read/write FITS files, and that
> interface strips trailing blanks from character strings:
>
> TARGPROP= 'UNKNOWN '   / Proposer's name for the target
>
>  >>> fd = fits.open("test.fits")
>  >>> s = fd[0].header['targprop']
>  >>> len(s)
> 7
>

Actually, for what it's worth, the FITS spec says that in such values
trailing spaces are not significant, see page 7:
https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf
But they're not really relevant to numpy's situation, because as here you
need to do elaborate de-quoting before they can go into a data structure.
What I was wondering was whether people have data lying around with
fixed-width fields where the strings are space-padded, so that numpy needs
to support that.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker  wrote:

>
> This is essentially my rant about use-case (2):
>
> A compact dtype for mostly-ascii text:
>

I'm a little confused about exactly what you're trying to do. Do you need
your in-memory format for this data to be compatible with anything in
particular?

If you're not reading or writing files in this format, then it's just a
matter of storing a whole bunch of things that are already python strings
in memory. Could you use an object array? Or do you have an enormous number
so that you need a more compact, fixed-stride memory layout?

Presumably you're getting byte strings (with no NULLs) from somewhere and
need to store them in this memory structure in a way that makes them as
usable as possible in spite of their unknown encoding. Presumably the thing
to do is just copy them in there as-is and then use .astype to arrange for
python to decode them when accessed. So this is precisely the problem of
"how should I decode random byte strings?" that python has been struggling
with. My impression is that the community has established that there's no
one solution that makes everyone happy, but that most people can cope with
some combination of picking a one-byte encoding,
ascii-with-surrogateescapes, zapping bogus characters, and giving wrong
results. But I think that all the standard python alternatives are needed,
in general, and in terms of interpreting numpy arrays full of bytes.
Clearly your preferred solution is .astype("string[latin-9]"), but just as
clearly that's not going to work for everyone.

If your question is "what should numpy's default string dtype be?", well,
maybe default to object arrays; anyone who just has a bunch of python
strings to store is unlikely to be surprised by this. Someone with more
specific needs will choose a more specific - that is, not default - string
data type.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Phil Hodge

On 04/25/2017 01:34 PM, Anne Archibald wrote:
I know they're not numpy-compatible, but FITS header values are 
space-padded; does that occur elsewhere?


Strings in FITS headers are delimited by single quotes.  Some keywords 
(only a handful) are required to have values that are blank-padded (in 
the FITS file) if the value is less than eight characters.  Whether you 
get trailing blanks when you read the header depends on the FITS 
reader.  I use astropy.io.fits to read/write FITS files, and that 
interface strips trailing blanks from character strings:


TARGPROP= 'UNKNOWN '   / Proposer's name for the target

>>> fd = fits.open("test.fits")
>>> s = fd[0].header['targprop']
>>> len(s)
7

Phil

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:09 PM Robert Kern  wrote:

> * HDF5 supports fixed-length and variable-length string arrays encoded in
> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
> the documentation claiming that there are more options). In practice, the
> ASCII strings permit high-bit characters, but the encoding is unspecified.
> Memory-mapping is rare (but apparently possible). The two major HDF5
> bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
> HDF5 option. Compression is supported for fixed-length string arrays but
> not variable-length string arrays.
>
> * FITS supports fixed-length string arrays that are NULL-padded. The
> strings do not have a formal encoding, but in practice, they are typically
> mostly ASCII characters with the occasional high-bit character from an
> unspecific encoding. Memory-mapping is a common practice. These arrays can
> be quite large even if each scalar is reasonably small.
>
> * pandas uses object arrays for flexible in-memory handling of string
> columns. Lengths are not fixed, and None is used as a marker for missing
> data. String columns must be written to and read from a variety of formats,
> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
> with `unicode/str` objects instead of `bytes`.
>
> * There are a number of sometimes-poorly-documented,
> often-poorly-adhered-to, aging file format "standards" that include string
> arrays but do not specify encodings, or such specification is ignored in
> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
> difficult to perform.
>
> * In Python 3 environments, `unicode/str` objects are rather more common,
> and simple operations like equality comparisons no longer work between
> `bytes` and `unicode/str`, making it difficult to work with numpy string
> arrays that yield `bytes` scalars.
>

It seems the greatest challenge is interacting with binary data from other
programs and libraries. If we were living entirely in our own data world,
Unicode strings in object arrays would generally be pretty satisfactory. So
let's try to get what is needed to read and write other people's formats.

I'll note that this is numpy, so variable-width fields (e.g. CSV) don't map
directly to numpy arrays; we can store it however we want, as conversion is
necessary anyway.

Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other
packages are waiting specifically for it. But specifying this requires two
pieces of information: What is the encoding? and How is the length
specified? I know they're not numpy-compatible, but FITS header values are
space-padded; does that occur elsewhere? Are there other ways existing data
specifies string length within a fixed-size field? There are some
cryptographic length-specification tricks - ANSI X.293, ISO 10126, PKCS7,
etc. - but they are probably too specialized to need? We should make sure
we can support all the ways that actually occur.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker 
wrote:
>
> On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI  wrote:
>>
>> 2017-04-25 12:34 GMT-04:00 Chris Barker :
>> > I am totally euro-centric,
>
>> But Shift-JIS is not one-byte; it's two-byte (unless you allow only
>> half-width characters and nothing else). :-)
>
> bad example then -- are there other non-euro-centric one byte per char
encodings worth worrying about? I have no clue :-)

I've run into Windows-1251 in files (seismic and well log data from Russian
wells). Treating them as latin-1 does not make for a happy time. Both
encodings also technically derive from ASCII in the lower half, but most of
the actual language is written with the high-bit characters.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:05 PM Chris Barker  wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python
strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays;
underlying representation is python objects on the heap. Sadly UCS-4, so
zillions are going to be a memory problem.

2) User has to deal with fixed-width binary data from an external
program/library and wants to see it as python strings. This may be
systematically encoded in a known encoding (e.g. HDF5's
fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS'
zero-padded ASCII) or
ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating
FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length
may be signaled by null termination, null padding, or space padding.
-> Solvable with a fixed-storage-size encoded-string dtype, as long as it
has a parameter for how length is signaled (see the short sketch after this
list). Python tricks for dealing with wrong or unknown encodings can make
bogus data manageable.

3) User has to deal with fixed-width binary data from an external
program/library that really is binary bytes.
-> Solvable with a dtype that returns fixed-length byte strings.

4) User has a stupendous number (billions) of short strings which are
mostly but not entirely ASCII and wants to manipulate them as strings.
-> Not sure how to solve this. Maybe an object array with byte strings for
storage and encoding information in the dtype, allowing transparent
decoding? Or a fixed-storage-size array with a one-byte encoding that can
cope with all the characters the user will ever want to use?

5) User has a bunch of mystery-encoding strings(?) and wants to store them
in a numpy array.
-> If they're python strings already, no further harm is done by treating
this as case 1 when in python-land. If they need to be in fixed-width
fields for communication with an external program or library, this puts us
in case 2, unknown encoding variety; user will have to pick an encoding
that the external program is likely to be able to cope with; this may be
the one that originated the mystery strings in the first place.

6) User has python strings and wants to store them in non-object numpy
arrays for some reason but doesn't care about the actual memory layout.
-> Solvable with the current setup; fixed-width UCS-4 fields, padded with
Unicode NULL. Happily, this comes for free from arbitrary-encoding
fixed-storage-size dtypes, though a friendlier interface might be nice.
Also allows people to use UCS-2 or ASCII if they know their strings fit.

7) User has data in one binary format and it needs to go into another, with
perhaps casual inspection while in python-land. Such data is mostly ASCII
but might contain mystery characters; presenting gobbledygook to the user
is okay as long as the characters are output intact.
-> Reading and writing as a fixed-width one-byte encoding, preferably one
resembling the one the data is actually in, should work here. UTF-8 is
likely to mangle the data; ASCII-with-surrogateescape might do okay. The
key thing here is that both input and output files will have their own ways
of specifying string length and their own storage specifiers; user must
know these, and someone has to know and specify what to do with strings
that don't fit. Simple truncation will mangle UTF-8 if it is not known to
be UTF-8, but there's maybe not much that can be done about that.
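
To make the "how is length signaled" parameter from case 2 concrete, here is
a tiny plain-Python illustration of the three conventions applied to the same
fixed 8-byte field (illustrative only):

field_nul   = b'UNKNOWN\x00'    # NUL-terminated / NUL-padded
field_space = b'UNKNOWN '       # space-padded (FITS-style)

nul_terminated = field_nul.split(b'\x00', 1)[0].decode('ascii')
nul_padded     = field_nul.rstrip(b'\x00').decode('ascii')
space_padded   = field_space.rstrip(b' ').decode('ascii')

assert nul_terminated == nul_padded == space_padded == 'UNKNOWN'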

I guess my point is that a use case should specify:
* Where does the data come from (i.e. in what format)?
* Are there memory constraints in the storage format?
* How should access look to the user? In particular, what should misencoded
data look like?
* Where does the data need to go?

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
Now my proposal for the other use cases:

2) There be some way to store mostly ascii-compatible strings in a single
> byte-per-character array -- so not to be wasting space for "typical
> european-language-oriented data". Note: this should ALSO be compatible with
> Python's character-oriented string model. i.e. a Python String with length
> N will fit into a dtype of size N.
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> attempting to put in a string not compatible with the encoding would
> raise an EncodingError.
>
> This is also a use-case primarily for "casual" users -- but ones concerned
> with the size of the data storage and who know they are using european text.
>

more detail elsewhere -- but either ascii with surrageescape or latin-1
always are good options here. I prefer latin-1 (I really see no downside),
but others disagree...

But then we get to:


> 3) dtypes that support storage in particular encodings:
>

We need utf-8. We may need others. We may need a 1-byte per char compact
encoding that isn't close enough to ascii or latin-1 to be useful (say,
shift-jis). And I don't think we are going to come to a consensus on what
"single" encoding to use for 1-byte-per-char.

So really -- going back to Julian's earlier proposal:

dtype with an encoding specified
"size" in bytes

once defined, numpy would encode/decode to/from python strings "correctly"

we might need "null-terminated utf-8" as a special case.

That would support all the other use cases.
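
Until such a dtype exists, the intended semantics can be approximated today
with explicit encode/decode at the boundary (a rough stand-in, not the
proposed API; the helper names here are made up):

import numpy as np

def to_encoded(strings, encoding='latin-1', size=16):
    # encode on the way in; raises if a string can't be encoded
    return np.array([s.encode(encoding) for s in strings], dtype='S%d' % size)

def from_encoded(arr, encoding='latin-1'):
    # decode on the way out, back to Python strings
    return [b.decode(encoding) for b in arr]

arr = to_encoded(['r\xe9sum\xe9', 'na\xefve'])
assert from_encoded(arr) == ['r\xe9sum\xe9', 'na\xefve']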

Even the one-byte per char encoding. I'd like to see a clean alias to a
latin-1 encoding, but not a big deal.

That leaves a couple decisions:

 - error out or truncate if the passed-in string is too long?

 - error out or surrogateescape if there are invalid bytes in the data?

 - error out or something else if there are characters that can't be
encoded in the specified encoding.

And we still need a proper bytes type:

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
> three -- settable from a bytes or bytearray object (or other memoryview?),
> and returns a bytes object.
>
> You could use astype() to convert between bytes and a specified encoding
> with no change in binary representation. This could be used to store any
> binary data, including encoded text or anything else. this should map
> directly to the Python bytes model -- thus NOT null-terminted.
>
> This is a little different than 'S' behaviour on py3 -- it appears that
> with 'S', if ALL the trailing bytes are null, then it is truncated, but
> if there is a null byte in the middle, then it is preserved. I suspect that
> this is a legacy from Py2's use of "strings" as both text and binary data.
> But in py3, a "bytes" type should be about bytes, and not text, and thus
> null-values bytes are simply another value a byte can hold.
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker  wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
the use-cases, so I'm not sure if there is any consensus on the use cases
-- which I think we really do need to nail down first -- as Robert has made
clear.
>
> So I'll try again -- use-case only! we'll keep the possible solutions
separate.
>
> Do we need to write up a NEP for this? it seems we are going a bit in
circles, and we really do want to capture the final decision process.
>
> 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do::

... etc.

These aren't use cases but rather requirements. I'm looking for something
rather more concrete than that.

* HDF5 supports fixed-length and variable-length string arrays encoded in
ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite
the documentation claiming that there are more options). In practice, the
ASCII strings permit high-bit characters, but the encoding is unspecified.
Memory-mapping is rare (but apparently possible). The two major HDF5
bindings are waiting for a fixed-length UTF-8 numpy dtype to support that
HDF5 option. Compression is supported for fixed-length string arrays but
not variable-length string arrays.

* FITS supports fixed-length string arrays that are NULL-padded. The
strings do not have a formal encoding, but in practice, they are typically
mostly ASCII characters with the occasional high-bit character from an
unspecific encoding. Memory-mapping is a common practice. These arrays can
be quite large even if each scalar is reasonably small.

* pandas uses object arrays for flexible in-memory handling of string
columns. Lengths are not fixed, and None is used as a marker for missing
data. String columns must be written to and read from a variety of formats,
including CSV, Excel, and HDF5, some of which are Unicode-aware and work
with `unicode/str` objects instead of `bytes`.

* There are a number of sometimes-poorly-documented,
often-poorly-adhered-to, aging file format "standards" that include string
arrays but do not specify encodings, or such specification is ignored in
practice. This can make the usual "Unicode sandwich" at the I/O boundaries
difficult to perform.

* In Python 3 environments, `unicode/str` objects are rather more common,
and simple operations like equality comparisons no longer work between
`bytes` and `unicode/str`, making it difficult to work with numpy string
arrays that yield `bytes` scalars.
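
The last point is easy to trip over. On Python 3:

import numpy as np

names = np.array([b'alpha', b'beta'], dtype='S5')
assert names[0] == b'alpha'    # bytes compare fine
assert names[0] != 'alpha'     # comparison with str is silently False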

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI  wrote:

> 2017-04-25 12:34 GMT-04:00 Chris Barker :
> > I am totally euro-centric,
>


> But Shift-JIS is not one-byte; it's two-byte (unless you allow only
> half-width characters and nothing else). :-)


bad example then -- are there other non-euro-centric one byte per char
encodings worth worrying about? I have no clue :-)


> This I don't understand. As far as I can tell non-Western-European
> filenames are not unusual. If filenames are a reason, even if you're
> euro-centric (think Eastern Europe, say) I don't see how latin1 is a
> good choice.
>

right -- this is the age of Unicode -- Unicode is the correct choice.

But many of us have data in old files that are not proper Unicode -- and
that includes filenames.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Ambrose LI
2017-04-25 12:34 GMT-04:00 Chris Barker :
> I am totally euro-centric, but as I understand it, that is the whole point
> of the desire for a compact one-byte-per character encoding. If there is a
> strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we
> should support that. But this all started with "mostly ascii". My take on
> that is:

But Shift-JIS is not one-byte; it's two-byte (unless you allow only
half-width characters and nothing else). :-) In fact legacy CJK
encodings are all nominally two-byte (so that the width of a
character's internal representation matches that of its visual
representation).

>  - filenames
>
> File names are one of the key reasons folks struggled with the python3 data
> model (particularly on *nix) and why 'surrogateescape' was added. It's
> pretty common to store filenames in with our data, and thus in numpy arrays
> -- we need to preserve them exactly and display them mostly right. Again,
> euro-centric, but if you are euro-centric, then latin-1 is a good choice for
> this.

This I don't understand. As far as I can tell non-Western-European
filenames are not unusual. If filenames are a reason, even if you're
euro-centric (think Eastern Europe, say) I don't see how latin1 is a
good choice.

Lurker here, and I haven't touched numpy in ages. So I might be
blurting out nonsense.

-- 
Ambrose Li // http://o.gniw.ca / http://gniw.ca
If you saw this on CE-L: You do not need my permission to quote
me, only proper attribution. Always cite your sources, even if
you have to anonymize and/or cite it as "personal communication".
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
OK -- onto proposals:

1) The default behaviour for numpy arrays of strings is compatible with
> Python3's string model: i.e. fully unicode supporting, and with a character
> oriented interface. i.e. if you do::
>
>   arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or less
> characters.
>
> and arr[1] will return a native Python3 string object.
>
> This is the use-case for "casual" numpy users -- not the folks writing
> H5py and the like, or the ones writing Cython bindings to C++ libs.
>

I see two options here:

a) The current 'U' dtype -- fully meets the specs, and is already there.

b) Having a pointer-to-a-python string dtype:

-I take it that's what Pandas does and people seem happy.

-That would get us variable length strings, and potentially other nifty
string-processing.

   - It would lose the ability to interact at the binary level with other
systems -- but do any other systems use UCS-4 anyway?

   - how would it work with pickle and numpy zip storage?

Personally, I'm fine with (a), but (b) seems like it could be a nice
addition. As the 'U' type already exists, the choice to add a python-string
type is really orthogonal to the rest of this discussion.
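
Option (a) in concrete terms (current NumPy on Python 3):

import numpy as np

arr = np.array(("this", "that"))
assert arr.dtype == np.dtype('U4')    # any unicode string of <= 4 characters fits
assert arr.dtype.itemsize == 16       # but at 4 bytes per character (UCS-4)
assert isinstance(arr[1], str)        # indexing returns a native Python 3 str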

Note that I think using utf-8 internally to fit this need is a mistake -- it
does not match well with the Python string model.

That's it for use-case (1)

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern  wrote:

> > My question: What are those non-ASCII characters? How often are they
> truly latin-1/9 vs. some other text encoding vs. non-string binary data?
>
> I don't know that we can reasonably make that accounting relevant. Number
> of such characters per byte of text? Number of files with such characters
> out of all existing files?
>

I have a lot of mostly-english data -- usually not pure latin-1, but usually
mostly latin-1 -- the non-ascii characters are a handful of accented
characters (usually from spanish, some french), then a few "scientific"
characters: the degree symbol, the "micro" symbol.

I suspect that this is not an unusual pattern for mostly-english scientific
text.

if it's non-string binary data, I know it -- and I'd use a bytes type.

I have two options -- try to detect the encoding properly or use
_something_ and fix it up later. latin-1 is a great choice for the latter
option -- most of the text displays fine, and the wrong stuff is untouched,
so I can figure it out.

What I can say with assurance is that every time I have decided, as a
> developer, to write code that just hardcodes latin-1 for such cases, I have
> regretted it. While it's just personal anecdote, I think it's at least
> measuring the right thing. :-)
>

I've had the opposite experience -- so that's two anecdotes :-)

If it were, say, shift-jis, then yes, using latin-1 would be a bad idea --
but not really much worse than any other option other than properly decoding
it. In a way, using latin-1 is like the old py2 string -- it can be used as
text, even if it has arbitrary non-text garbage in it...

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
This is essentially my rant about use-case (2):

A compact dtype for mostly-ascii text:

On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer  wrote:

> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
> wrote:
>
>> On the other hand, if this is the use-case, perhaps we really want an
>>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>>> signaled more explicitly. I would suggest that "text[unknown]" should
>>> support operations like a string if it can be decoded as ASCII, and
>>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>>> bytes.
>>>
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
>> really is ascii, then it's perfect. If it really is latin-*, then you get
>> some extra useful stuff, and if it's corrupted somehow, you still get the
>> ascii text correct, and the rest won't  barf and can be passed on through.
>>
>
> I am totally in agreement with Thomas that "We are living in a messy
> world right now with messy legacy datasets that have character type data
> that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
>
> My question: What are those non-ASCII characters? How often are they truly
> latin-1/9 vs. some other text encoding vs. non-string binary data?
>

I am totally euro-centric, but as I understand it, that is the whole point
of the desire for a compact one-byte-per character encoding. If there is a
strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we
should support that. But this all started with "mostly ascii". My take on
that is:

We don't want to use pure-ASCII -- that is the hell that python2's default
encoding approach led to -- it is MUCH better to pass garbage through than
crash out with an EncodingError -- data are messy, and people are really
bad at writing comprehensive tests.

So we need something that handles ASCII properly, and can pass trhough
arbitrary bytes as well without crashing. Options are:

* ASCII With errors='ignore' or 'replace'

I think that is a very bad idea -- it is tossing away information that
_may_ have some use elsewhere::

  s = arr[i]
  arr[i] = s

should put the same bytes back into the array.

* ASCII with errors='surrogateescape'

This would preserve bytes and not crash out, so meets the key criteria.


* latin-1

This would do the exactly correct thing for ASCII, preserve the bytes, and
not crash out. But it would also allow additional symbols useful to
european languages and scientific computing. Seems like a win-win to me.
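
Both byte-preserving options round-trip; the difference is only what the
decoded text looks like in the meantime (plain Python, for illustration):

raw = b'caf\xe9 25\xb0'    # latin-1 bytes: "cafe" with an accent, then a degree sign

via_latin1 = raw.decode('latin-1')
assert via_latin1.encode('latin-1') == raw    # readable and reversible

via_escape = raw.decode('ascii', errors='surrogateescape')
assert via_escape.encode('ascii', errors='surrogateescape') == raw
# also reversible, but the non-ascii bytes display as lone surrogates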

As for my use-cases:

 - Messy data:

I have had a lot of data sets with european text in them, mostly ASCII and
an occasional non ASCII accented character or symbol -- most of these come
from legacy systems, and have an ugly arbitrary combination of MacRoman,
Win-something-or-other, and who knows what -- i.e. mojibake, though at
least mostly ascii.

The only way to deal with it "properly" is to examine each string and try
to figure out which encoding it is in, hope at least a single string is in
one encoding, and then decode/encode it properly. So numpy should support
that -- which would be handled by a 'bytes' type, just like in Python
itself.

But sometimes that isn't practical, and still doesn't work 100% -- in which
case, we can go with latin-1, and there will be some weird, incorrect
characters in there, and that is OK -- we fix them later when QA/QC or
users notice it -- really just like a typo.

But stripping the non-ascii characters out would be a worse solution. As
would "replace", as sometimes it IS the correct symbol! (european encodings
aren't totally incompatible...). And surrogateescape is worse, too -- any
"weird" character is the same to my users, and at least sometimes it will
be the right character -- however surrogateescape gets printed, it will
never look right. (and can it even be handled by a non-python system?)

 - filenames

File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames in with our data, and thus in numpy arrays
-- we need to preserve them exactly and display them mostly right. Again,
euro-centric, but if you are euro-centric, then latin-1 is a good choice
for this.

Granted, I should probably simply use a proper unicode type for filenames
anyway, but sometimes the data comes in already encoded as latin-something.

In the end I still see no downside to latin-1 over ascii-only -- only an
upside.

I don't think that silently (mis)interpreting non-ASCII characters as
> latin-1/9 is a good idea, which is why I think it would be a mistake to use
> 'latin-1' for text data with unknown encoding.
>

if it's totally unknown, then yes -- but for totally unknown, bytes is the
only reasonable option -- then run chardet or something over it.

but "some latin encoding" -- latin-1 is a good choice.

I could get behind a data type that compares equal to strings for ASCII
> only an

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern  wrote:

> Chris, you've mashed all of my emails together, some of them are in reply
> to you, some in reply to others. Unfortunately, this dropped a lot of the
> context from each of them, and appears to be creating some
> misunderstandings about what each person is advocating.
>

Sorry about that -- I was trying to keep an already really long thread from
getting even longer.

And I'm not sure it matters who's doing the advocating, but rather *what*
is being advocated -- I hope I didn't screw that up too badly.

Anyway, I think I made the mistake of mingling possible solutions in with
the use-cases, so I'm not sure if there is any consensus on the use cases
-- which I think we really do need to nail down first -- as Robert has made
clear.

So I'll try again -- use-case only! we'll keep the possible solutions
separate.

Do we need to write up a NEP for this? it seems we are going a bit in
circles, and we really do want to capture the final decision process.

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do::

  arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less
characters.

and arr[1] will return a native Python3 string object.

This is the use-case for "casual" numpy users -- not the folks writing H5py
and the like, or the ones writing Cython bindings to C++ libs.


2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so not to be wasting space for "typical
european-language-oriented data". Note: this should ALSO be compatible with
Python's character-oriented string model. i.e. a Python String with length
N will fit into a dtype of size N.

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

attempting to put in a string not compatible with the encoding would raise
an EncodingError.

This is also a use-case primarily for "casual" users -- but ones concerned
with the size of the data storage and who know they are using european text.

3) dtypes that support storage in particular encodings:

   Python strings would be encoded appropriately when put into the array. A
Python string would be returned when indexing.

   a) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange   with other systems (netcdf, HDF,
others???) at the binary level.

   b) There be a dtype that could store data in any encoding supported by
Python -- to facilitate bytes-level interchange with other systems. If we
need more than utf-8, then we might as well have the full set.

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object (or other memoryview?),
and returns a bytes object.

You could use astype() to convert between bytes and a specified encoding
with no change in binary representation. This could be used to store any
binary data, including encoded text or anything else. This should map
directly to the Python bytes model -- thus NOT null-terminated.

This is a little different than 'S' behaviour on py3 -- it appears that
with 'S', if ALL the trailing bytes are null, they are truncated, but
if there is a null byte in the middle, then it is preserved. I suspect that
this is a legacy from Py2's use of "strings" as both text and binary data.
But in py3, a "bytes" type should be about bytes, and not text, and thus
null-valued bytes are simply another value a byte can hold.
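
(A quick demonstration of the current 'S' behaviour described above:)

    import numpy as np

    a = np.array([b'ab\x00\x00', b'a\x00b\x00'], dtype='S4')
    a[0]    # -> b'ab'      (trailing nulls are dropped)
    a[1]    # -> b'a\x00b'  (interior null kept, trailing null still dropped)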

There are multiple ways to address these use cases -- please try to make
your comments clear about whether you think the use-case is unimportant, or
ill-defined, or if you think a given solution is a poor choice.

To facilitate that, I will put my comments on possible solutions in a
separate note, too.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith  wrote:
>
> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern 
wrote:
> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:
> >
> >> That said, AFAICT what people actually want in most use cases is
support
> >> for arrays that can hold variable-length strings, and the only place
where
> >> the current approach is *optimal* is when we need mmap compatibility
with
> >> legacy formats that use fixed-width-nul-padded fields (at which point
it's
> >> super convenient). It's not even possible to *represent* all Python
strings
> >> or bytestrings in current numpy unicode or string arrays (Python
> >> strings/bytestrings can have trailing nuls). So if we're talking about
> >> tweaks to the current system it probably makes sense to focus on this
use
> >> case specifically.
> >>
> >> From context I'm assuming FITS files use fixed-width-nul-padding for
> >> strings? Is that right? I know HDF5 doesn't.
> >
> > Yes, HDF5 does. Or at least, it is supported in addition to the
> > variable-length ones.
> >
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> Doh, I found that page but it was (and is) meaningless to me, so I
> went by http://docs.h5py.org/en/latest/strings.html, which says the
> options are fixed-width ascii, variable-length ascii, or
> variable-length utf-8 ... I guess it's just talking about what h5py
> currently supports.

It's okay, I made exactly the same mistake earlier in the thread. :-)

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-). Is it important for some other reason?

The lack of such a dtype seems to be the reason why neither h5py nor
PyTables supports that kind of HDF5 Dataset. The variable-length Datasets
can take up a lot of disk-space because they can't be compressed (even
accounting for the wasted padding space). I mean, they probably could have
implemented it with object arrays like h5py does with the variable-length
string Datasets, but they didn't.

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/624

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith  wrote:

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-).


Of course they do :)
https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682


> Also, further searching suggests that HDF5 actually supports all of
> nul termination, nul padding, and space padding, and that nul
> termination is the default? How much does it help to have in-memory
> compatibility with just one of these options (and not even the default
> one)? Would we need to add the other options to be really useful for
> HDF5?


h5py actually ignores this option and only uses null termination. I have
not heard any complaints about this (though I have heard complaints about
the lack of fixed-length UTF-8).

But more generally, you're right. h5py doesn't need a corresponding NumPy
dtype for each HDF5 string dtype, though that would certainly be
*convenient*. In fact, it already (ab)uses NumPy's dtype metadata with
h5py.special_dtype to indicate a homogeneous string type for object arrays.
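
(Roughly what that looks like -- a sketch; the file name is made up for
illustration:)

    import h5py

    # an object-dtype numpy array "tagged" so h5py writes it out as a
    # variable-length string Dataset
    str_dt = h5py.special_dtype(vlen=str)

    with h5py.File('example.h5', 'w') as f:
        dset = f.create_dataset('names', shape=(3,), dtype=str_dt)
        dset[:] = ['alpha', 'beta', 'gamma']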

I would guess h5py users have the same needs for efficient string
representations (including surrogate-escape options) as other scientific
users.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern  wrote:
> On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:
>
>> That said, AFAICT what people actually want in most use cases is support
>> for arrays that can hold variable-length strings, and the only place where
>> the current approach is *optimal* is when we need mmap compatibility with
>> legacy formats that use fixed-width-nul-padded fields (at which point it's
>> super convenient). It's not even possible to *represent* all Python strings
>> or bytestrings in current numpy unicode or string arrays (Python
>> strings/bytestrings can have trailing nuls). So if we're talking about
>> tweaks to the current system it probably makes sense to focus on this use
>> case specifically.
>>
>> From context I'm assuming FITS files use fixed-width-nul-padding for
>> strings? Is that right? I know HDF5 doesn't.
>
> Yes, HDF5 does. Or at least, it is supported in addition to the
> variable-length ones.
>
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

Doh, I found that page but it was (and is) meaningless to me, so I
went by http://docs.h5py.org/en/latest/strings.html, which says the
options are fixed-width ascii, variable-length ascii, or
variable-length utf-8 ... I guess it's just talking about what h5py
currently supports.

But also, is it important whether strings we're loading/saving to an
HDF5 file have the same in-memory representation in numpy as they
would in the file? I *know* [1] no-one is reading HDF5 files using
np.memmap :-). Is it important for some other reason?

Also, further searching suggests that HDF5 actually supports all of
nul termination, nul padding, and space padding, and that nul
termination is the default? How much does it help to have in-memory
compatibility with just one of these options (and not even the default
one)? Would we need to add the other options to be really useful for
HDF5? (Unlikely to happen within numpy itself, but potentially
something that could be done inside h5py or whatever if numpy's
user-defined dtype system were a little more useful.)

-n

[1] hope

-- 
Nathaniel J. Smith -- https://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith  wrote:

> That said, AFAICT what people actually want in most use cases is support
for arrays that can hold variable-length strings, and the only place where
the current approach is *optimal* is when we need mmap compatibility with
legacy formats that use fixed-width-nul-padded fields (at which point it's
super convenient). It's not even possible to *represent* all Python strings
or bytestrings in current numpy unicode or string arrays (Python
strings/bytestrings can have trailing nuls). So if we're talking about
tweaks to the current system it probably makes sense to focus on this use
case specifically.
>
> From context I'm assuming FITS files use fixed-width-nul-padding for
strings? Is that right? I know HDF5 doesn't.

Yes, HDF5 does. Or at least, it is supported in addition to the
variable-length ones.

https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Apr 21, 2017 2:34 PM, "Stephan Hoyer"  wrote:

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.


You may already know this, but probably not everyone reading does: the
reason why latin1 often gets special attention in discussions of Unicode
encoding is that latin1 is effectively "ucs1". It's the unique one byte
text encoding where byte N represents codepoint U+N.
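
(Concretely, at the Python level:)

    # byte N decodes to codepoint U+N, so any byte string round-trips losslessly
    data = bytes(range(256))
    assert data.decode('latin-1').encode('latin-1') == data
    assert [ord(c) for c in data.decode('latin-1')] == list(range(256))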

I can't think of any reason why this property is particularly important for
numpy's usage, because we always have a conversion step anyway to get data
in and out of an array. The potential arguments for latin1 that I can think
of are:
- if we have to implement our own en/decoding code for some reason then
it's the most trivial encoding
- if other formats standardize on latin1-with-nul-padding and we want
in-memory/mmap compatibility
- if we really want a fixed width encoding for some reason but don't care
which one, then it's in some sense the most obvious choice

I can't think of many reasons why having a fixed width encoding is
particularly important though... For our current style of string storage,
even calculating the length of a string is O(n), and AFAICT the only way to
actually take advantage of the theoretical O(1) character indexing is to
make a uint8 view. I guess it would be useful if we had a string slicing
ufunc... But why would we?
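
(The uint8-view trick, for the current fixed-width bytes dtype:)

    import numpy as np

    arr = np.array([b'abcd', b'ef'], dtype='S4')   # fixed width, NUL padded
    chars = arr.view(np.uint8).reshape(len(arr), arr.dtype.itemsize)
    chars[:, 2]   # O(1) access to the third byte of every element
                  # (a 'U' array would be viewed as uint32 to get codepoints)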

That said, AFAICT what people actually want in most use cases is support
for arrays that can hold variable-length strings, and the only place where
the current approach is *optimal* is when we need mmap compatibility with
legacy formats that use fixed-width-nul-padded fields (at which point it's
super convenient). It's not even possible to *represent* all Python strings
or bytestrings in current numpy unicode or string arrays (Python
strings/bytestrings can have trailing nuls). So if we're talking about
tweaks to the current system it probably makes sense to focus on this use
case specifically.

From context I'm assuming FITS files use fixed-width-nul-padding for
strings? Is that right? I know HDF5 doesn't.

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern 
wrote:
>>
>> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>> >
>> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern 
wrote:
>> >>
>> >> I am not unfamiliar with this problem. I still work with files that
have fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
>> >>
>> >> Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
>> >
>> > This is very simple and obvious but I will state for the record.
>>
>> I appreciate it. What is obvious to you is not obvious to me.
>>
>> > Reading an HDF5 file with character data currently gives arrays of
`bytes` [1].  In Py3 this cannot be compared to a string literal, and
comparing to (or assigning from) explicit byte strings everywhere in the
code quickly spins out of control.  This generally forces one to convert
the data to `U` type and incur the 4x memory bloat.
>> >
>> > In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>> >
>> > In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
>> > Out[23]: False
>> >
>> > In [24]: dat == b'yes'  # Right answer but not practical
>> > Out[24]: array([ True, False], dtype=bool)
>>
>> I'm curious why you think this is not practical. It seems like a very
practical solution to me.
>
> In Py3 most character data will be string, not bytes.  So every time you
want to interact with the bytes array (compare, assign, etc) you need to
explicitly coerce the right hand side operand to be a bytes-compatible
object.  For code that developers write, this might be possible but results
in ugly code.  But for the general science and engineering communities that
use numpy this is completely untenable.

Okay, so the problem isn't with (byte-)string literals, but with variables
being passed around from other sources. Eg.

def func(dat, scalar):
    return dat == scalar

Every one of those functions deepens the abstraction and moves that
unicode-by-default scalar farther away from the bytesish array, so it's
harder to demand that users of those functions be aware that they need to
pass in `bytes` strings. So you need to implement those functions
defensively, which complicates them.

> The only practical solution so far is to implement a unicode sandwich and
convert to the 4-byte `U` type at the interface.  That is precisely what we
are trying to eliminate.

What do you think about my ASCII-surrogateescape proposal? Do you think
that would work with your use cases?

In general, I don't think Unicode sandwiches will be eliminated by this or
the latin-1 dtype; the sandwich is usually the right thing to do and the
surrogateescape the wrong thing. But I'm keenly aware of the problems you
get when there just isn't a reliable encoding to use.
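
(For reference, the round-trip that ascii + surrogateescape gives at the
plain-Python level:)

    raw = b'caf\xe9'                        # 8-bit bytes, encoding unknown
    s = raw.decode('ascii', errors='surrogateescape')   # \xe9 -> '\udce9'
    assert s.encode('ascii', errors='surrogateescape') == raw   # round-trips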

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern  wrote:

> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
> aldcr...@head.cfa.harvard.edu> wrote:
> >
> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern 
> wrote:
> >>
> >> I am not unfamiliar with this problem. I still work with files that
> have fields that are supposed to be in EBCDIC but actually contain text in
> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
> encodings. In that experience, I have found that just treating the data as
> latin-1 unconditionally is not a pragmatic solution. It's really easy to
> implement, and you do get a program that runs without raising an exception
> (at the I/O boundary at least), but you don't often get a program that
> really runs correctly or treats the data properly.
> >>
> >> Can you walk us through the problems that you are having with working
> with these columns as arrays of `bytes`?
> >
> > This is very simple and obvious but I will state for the record.
>
> I appreciate it. What is obvious to you is not obvious to me.
>
> > Reading an HDF5 file with character data currently gives arrays of
> `bytes` [1].  In Py3 this cannot be compared to a string literal, and
> comparing to (or assigning from) explicit byte strings everywhere in the
> code quickly spins out of control.  This generally forces one to convert
> the data to `U` type and incur the 4x memory bloat.
> >
> > In [22]: dat = np.array(['yes', 'no'], dtype='S3')
> >
> > In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> > Out[23]: False
> >
> > In [24]: dat == b'yes'  # Right answer but not practical
> > Out[24]: array([ True, False], dtype=bool)
>
> I'm curious why you think this is not practical. It seems like a very
> practical solution to me.
>

In Py3 most character data will be string, not bytes.  So every time you
want to interact with the bytes array (compare, assign, etc) you need to
explicitly coerce the right hand side operand to be a bytes-compatible
object.  For code that developers write, this might be possible but results
in ugly code.  But for the general science and engineering communities that
use numpy this is completely untenable.

The only practical solution so far is to implement a unicode sandwich and
convert to the 4-byte `U` type at the interface.  That is precisely what we
are trying to eliminate.

- Tom


>
> --
> Robert Kern
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer  wrote:
>
> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
wrote:
>>>
>>> On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
really is ascii, then it's perfect. If it really is latin-*, then you get
some extra useful stuff, and if it's corrupted somehow, you still get the
ascii text correct, and the rest won't  barf and can be passed on through.
>
> I am totally in agreement with Thomas that "We are living in a messy
world right now with messy legacy datasets that have character type data
that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
>
> My question: What are those non-ASCII characters? How often are they
truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't know that we can reasonably make that accounting relevant. Number
of such characters per byte of text? Number of files with such characters
out of all existing files?

What I can say with assurance is that every time I have decided, as a
developer, to write code that just hardcodes latin-1 for such cases, I have
regretted it. While it's just personal anecdote, I think it's at least
measuring the right thing. :-)

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern  wrote:

> Let me make a counter-proposal for your latin-1 dtype (your #2) that might
> address your, Thomas's, and Julian's use cases:
>
> 2) We want a single-byte-per-character, NULL-terminated string dtype that
> can be used to represent mostly-ASCII textish data that may have some
> high-bit characters from some 8-bit encoding. It should be able to read
> arbitrary bytes (that is, up to the NULL-termination) and write them back
> out as the same bytes if unmodified. This lets us read this text from files
> where the encoding is unspecified (or is lying about the encoding) into
> `unicode/str` objects. The encoding is specified as `ascii` but the
> decoding/encoding is done with the `surrogateescape` option so that
> high-bit characters are faithfully represented in the `unicode/str` string
> but are not erroneously reinterpreted as other characters from an arbitrary
> encoding.
>
> I'd even be happy if Julian or someone wants to go ahead and implement
> this right now and leave the UTF-8 dtype for a later time.
>
> As long as this ASCII-surrogateescape dtype is not called np.realstring
> (it's *really* important to me that the bikeshed not be this color). ;-)
>

This sounds quite similar to my text[unknown] proposal, with the advantage
that the concept of "surrogateescape" already exists. Surrogate-escape
characters compare equal to themselves, which is maybe less than ideal, but
it looks like you can put them in real unicode strings, which is nice.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern 
wrote:
>>
>> I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.
>>
>> Can you walk us through the problems that you are having with working
with these columns as arrays of `bytes`?
>
> This is very simple and obvious but I will state for the record.

I appreciate it. What is obvious to you is not obvious to me.

> Reading an HDF5 file with character data currently gives arrays of
`bytes` [1].  In Py3 this cannot be compared to a string literal, and
comparing to (or assigning from) explicit byte strings everywhere in the
code quickly spins out of control.  This generally forces one to convert
the data to `U` type and incur the 4x memory bloat.
>
> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>
> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> Out[23]: False
>
> In [24]: dat == b'yes'  # Right answer but not practical
> Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very
practical solution to me.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
wrote:

> On the other hand, if this is the use-case, perhaps we really want an
>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>> signaled more explicitly. I would suggest that "text[unknown]" should
>> support operations like a string if it can be decoded as ASCII, and
>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>> bytes.
>>
>
> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really
> is ascii, then it's perfect. If it really is latin-*, then you get some
> extra useful stuff, and if it's corrupted somehow, you still get the ascii
> text correct, and the rest won't  barf and can be passed on through.
>

I am totally in agreement with Thomas that "We are living in a messy world
right now with messy legacy datasets that have character type data that are
*mostly* ASCII, but not infrequently contain non-ASCII characters."

My question: What are those non-ASCII characters? How often are they truly
latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't think that silently (mis)interpreting non-ASCII characters as
latin-1/9 is a good idea, which is why I think it would be a mistake to use
'latin-1' for text data with unknown encoding.

I could get behind a data type that compares equal to strings for ASCII
only and allows for *storing* other characters, but making blind
assumptions about characters 128-255 seems like a recipe for disaster.
Imagine text[unknown] as a one character string type, but it supports
.decode() like bytes and every character in the range 128-255 compares for
equality with other characters like NaN -- not even equal to itself.
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
Chris, you've mashed all of my emails together, some of them are in reply
to you, some in reply to others. Unfortunately, this dropped a lot of the
context from each of them, and appears to be creating some
misunderstandings about what each person is advocating.

On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker  wrote:
>
> On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern 
wrote:

>> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>
> I agree-- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through python's encoding/decoding machinery to do it? Do
numpy arrays HAVE to be storing utf-8 natively?

If the point is to have an array that transparently accepts/yields
`unicode/str` scalars while maintaining the in-memory encoding, yes. If
that's not the point, then IMO the status quo is fine, and *no* new dtypes
should be added, just maybe some utility functions to convert between the
bytes-ish arrays and the Unicode-holding arrays (which was one of my
proposals). I am mostly happy to live in a world where I read in data as
bytes-ish arrays, decode into `object` arrays holding `unicode/str`
objects, do my manipulations, then encode the array into a bytes-ish array
to give to the C API or file format.

>> or leave it be until someone else is willing to solve that problem. I
don't think we're at the bikeshedding stage yet; we're still disagreeing
about fundamental requirements.
>
> yeah -- though I've seen projects get stuck in the sorting out what to
do, so nothing gets done stage before -- I don't want Julian to get too
frustrated and end up doing nothing.

I understand, but not all tedious discussions that have not yet achieved
consensus are bikeshedding to be cut short. We couldn't really decide what
to do back in the pre-1.0 days, too, so we just did *something*, and that
something is now the very situation that Julian has a problem with.

We have more experience now, especially with the added wrinkles of Python
3; other projects have advanced and matured their Unicode string
array-handling (e.g. pandas and HDF5); now is a great time to have a real
discussion about what we *need* before we make decisions about what we
should *do*.

> So here I'll lay out what I think are the fundamental requirements:
>
> 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do:
>
> arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or less
characters
>
> and arr[1] will return a native Python string object.
>
> 2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so as not to waste space for "typical
european-oriented data".
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> attempting to put in a string that is not compatible with the encoding would
raise an EncodingError.
>
> I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the
encoding in this case.
>
> 3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)
>
> 4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object, and returns a bytes
object.
>  - you could use astype() to convert between bytes and a specified
encoding with no change in binary representation.

You'll need to specify what NULL-terminating behavior you want here.
np.string_ has NULL-termination. np.void (which could be made to work
better with `bytes`) does not. Both have use-cases for text encoding
(shakes fist at UTF-16).

> 2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all python built-in encodings -- though I think an
alias to the common cases would be good -- latin, utf-8. If so, the length
would have to be specified in bytes.
>
> 1) could be covered with the existing 'U': type - only downside being
some wasted space -- or with a pointer to a python string dtype -- which
would also waste space, though less for long-ish strings, and maybe give us
some better access to the nifty built-in string features.
>
>> > +1.  The key point is that there is a HUGE amount of legacy science
data in the form of FITS (astronomy-specific binary file format that has
been the primary file format for 20+ years) and HDF5 which uses a character
data type to store data which can be bytes 0-255.  Getting an
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
>
>> That says to me that these are properly represented by `bytes` objects,
not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.
>
> Well, yes -- BUT:  That strictness in python3 -- "data is either text or
bytes, and text in an unknown (or invalid

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern  wrote:

> I am not unfamiliar with this problem. I still work with files that have
> fields that are supposed to be in EBCDIC but actually contain text in
> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
> encodings. In that experience, I have found that just treating the data as
> latin-1 unconditionally is not a pragmatic solution. It's really easy to
> implement, and you do get a program that runs without raising an exception
> (at the I/O boundary at least), but you don't often get a program that
> really runs correctly or treats the data properly.
>

> Can you walk us through the problems that you are having with working with
> these columns as arrays of `bytes`?
>

This is very simple and obvious but I will state for the record.  Reading
an HDF5 file with character data currently gives arrays of `bytes` [1].  In
Py3 this cannot be compared to a string literal, and comparing to (or
assigning from) explicit byte strings everywhere in the code quickly spins
out of control.  This generally forces one to convert the data to `U` type
and incur the 4x memory bloat.

In [22]: dat = np.array(['yes', 'no'], dtype='S3')

In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'  # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)

- Tom

[1]: Using h5py or pytables.  Same with FITS, although astropy.io.fits does
some tricks under the hood to auto-convert to `U` type as needed.


>
>
> > So I would beg to actually move forward with a pragmatic solution that
> addresses very real and consequential problems that we face instead of
> waiting/praying for a perfect solution.
>
> Well, I outlined a solution: work with `bytes` arrays with utilities to
> convert to/from the Unicode-aware string dtypes (or `object`).
>
> A UTF-8-specific dtype and maybe a string-specialized `object` dtype
> address the very real and consequential problems that I face (namely and
> respectively, working with HDF5 and in-memory manipulation of string
> datasets).
>
> I'm happy to consider a latin-1-specific dtype as a second,
> workaround-for-specific-applications-only-you-have-been-
> warned-you're-gonna-get-mojibake option. It should not be *the* Unicode
> string dtype (i.e. named np.realstring or np.unicode as in the original
> proposal).
>
> --
> Robert Kern
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern  wrote:

> > I agree -- it is a VERY common case for scientific data sets. But a
> one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
> Unicode. The wasted space is not that big a deal with short strings...
>
> Unless if you have hundreds of billions of them.
>

Which is why a one-byte-per char encoding is a good idea.

Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>

I agree-- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through python's encoding/decoding machinery to do it? Do
numpy arrays HAVE to be storing utf-8 natively?


> or leave it be until someone else is willing to solve that problem. I
> don't think we're at the bikeshedding stage yet; we're still disagreeing
> about fundamental requirements.
>

yeah -- though I've seen projects get stuck in the sorting out what to do,
so nothing gets done stage before -- I don't want Julian to get too
frustrated and end up doing nothing.

So here I'll lay out what I think are the fundamental requirements:

1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode supporting, and with a character
oriented interface. i.e. if you do:

arr = np.array(("this", "that",))

you get an array that can store ANY unicode string with 4 or less characters

and arr[1] will return a native Python string object.

2) There be some way to store mostly ascii-compatible strings in a single
byte-per-character array -- so as not to waste space for "typical
european-oriented data".

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

attempting to put in a string that is not compatible with the encoding would
raise an EncodingError.

I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the encoding
in this case.

3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)

4) a fixed length bytes dtype -- pretty much what 'S' is now under python
three -- settable from a bytes or bytearray object, and returns a bytes
object.
 - you could use astype() to convert between bytes and a specified encoding
with no change in binary representation.

2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all python built-in encodings -- though I think an
alias to the common cases would be good -- latin, utf-8. If so, the length
would have to be specified in bytes.

1) could be covered with the existing 'U': type - only downside being some
wasted space -- or with a pointer to a python string dtype -- which would
also waste space, though less for long-ish strings, and maybe give us some
better access to the nifty built-in string features.

> +1.  The key point is that there is a HUGE amount of legacy science data
> in the form of FITS (astronomy-specific binary file format that has been
> the primary file format for 20+ years) and HDF5 which uses a character data
> type to store data which can be bytes 0-255.  Getting an decoding/encoding
> error when trying to deal with these datasets is a non-starter from my
> perspective.


That says to me that these are properly represented by `bytes` objects, not
> `unicode/str` objects encoding to and decoding from a hardcoded latin-1
> encoding.


Well, yes -- BUT: that strictness in python3 -- "data is either text or
bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- has
bitten Python 3 in the butt for a long time. Folks who deal in the messy real
world of binary data that is kinda-mostly text, but may have a bit of
binary data, or be in an unknown encoding, or be corrupted, were very, very
adamant about how this model DID NOT work for them. Very influential people
were seriously critical of python 3. Eventually, py3 added bytes string
formatting, surrogateescape, and other features that facilitate working
with messy almost-text.

Practicality beats purity -- if you have one-byte per char data that is
mostly european, then latin-1 or latin-9 let you work with it, have it
mostly work, and never crash out with an encoding error.

> - round-tripping of binary data (at least with Python's
> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
> re-encoded to get the same bytes back. You may get garbage, but you won't
> get an EncodingError.
> But what if the format I'm working with specifies another encoding? Am I
> supposed to encode all of my Unicode strings in the specified encoding,
> then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
> really important use case for me.


latin-1 would be only for the special case of mostly-ascii (or true latin)
one-byte-per-char encodings (which is a common use-case in scientific data
sets). I think it has only upside over ascii. It would be a fine idea to
support any one-byte-per-char encoding, too.

As 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern 
wrote:
>>
>> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>> >
>> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
wrote:
>>
>> >> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.
>> >
>> > +1.  The key point is that there is a HUGE amount of legacy science
data in the form of FITS (astronomy-specific binary file format that has
been the primary file format for 20+ years) and HDF5 which uses a character
data type to store data which can be bytes 0-255.  Getting an
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
>>
>> That says to me that these are properly represented by `bytes` objects,
not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.
>
> If you could go back 30 years and get every scientist in the world to do
the right thing, then sure.  But we are living in a messy world right now
with messy legacy datasets that have character type data that are *mostly*
ASCII, but not infrequently contain non-ASCII characters.

I am not unfamiliar with this problem. I still work with files that have
fields that are supposed to be in EBCDIC but actually contain text in
ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
encodings. In that experience, I have found that just treating the data as
latin-1 unconditionally is not a pragmatic solution. It's really easy to
implement, and you do get a program that runs without raising an exception
(at the I/O boundary at least), but you don't often get a program that
really runs correctly or treats the data properly.

Can you walk us through the problems that you are having with working with
these columns as arrays of `bytes`?

> So I would beg to actually move forward with a pragmatic solution that
addresses very real and consequential problems that we face instead of
waiting/praying for a perfect solution.

Well, I outlined a solution: work with `bytes` arrays with utilities to
convert to/from the Unicode-aware string dtypes (or `object`).

A UTF-8-specific dtype and maybe a string-specialized `object` dtype
address the very real and consequential problems that I face (namely and
respectively, working with HDF5 and in-memory manipulation of string
datasets).

I'm happy to consider a latin-1-specific dtype as a second,
workaround-for-specific-applications-only-you-have-
been-warned-you're-gonna-get-mojibake option. It should not be *the*
Unicode string dtype (i.e. named np.realstring or np.unicode as in the
original proposal).

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:
>
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:
>
>>> In this case, we want something compatible with Python's string (i.e.
full Unicode supporting) and I think should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).
>>
>>
>> Yes, but NumPy doesn't really implement string operations, so
fortunately this is pretty irrelevant to us -- except for our API for
specifying dtype size.
>
> Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.

We have the freedom to make the error message not suck. :-)

> When len(a_string) <= 12
>
> AND that will only  occur if there are non-ascii characters in the
string, and maybe only if there are more than N non-ascii characters. i.e.
it is very likely to be a run-time error that may not have shown up in
tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
they need to truncate it, and THAT is non-obvious how to do, too -- you
don't want to truncate the encoded bytes naively, you could end up with an
invalid bytestring. but you don't know how many characters to truncate,
either.

If this becomes the right strategy for dealing with these problems (and I'm
not sure that it is), we can easily make a utility function that does this
for people.
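
(For instance, something along these lines -- a hypothetical helper, not an
existing numpy function:)

    def utf8_truncate(s, max_bytes):
        """Truncate a str so its UTF-8 encoding fits in max_bytes,
        without splitting a multi-byte character."""
        encoded = s.encode('utf-8')
        if len(encoded) <= max_bytes:
            return s
        cut = max_bytes
        # 0b10xxxxxx marks a UTF-8 continuation byte; back up so we
        # never cut a code point in half
        while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
            cut -= 1
        return encoded[:cut].decode('utf-8')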

This discussion is why I want to be sure that we have our use cases
actually mapped out. For this kind of in-memory manipulation, I'd use an
object array (a la pandas), then convert to the uniform-width string dtype
when I needed to push this out to a C API, HDF5 file, or whatever actually
requires a string-dtype array. The required width gets computed from the
data after all of the manipulations are done. Doing in-memory assignments
to a fixed-encoding, fixed-width string dtype will always have this kind of
problem. You should only put up with it if you have a requirement to write
to a format that specifies the width and the encoding. That specified
encoding is frequently not latin-1!

>> I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that
your data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but not sure what the point of that is in
this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.

But what if the format I'm working with specifies another encoding? Am I
supposed to encode all of my Unicode strings in the specified encoding,
then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
really important use case for me.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern  wrote:

> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
> aldcr...@head.cfa.harvard.edu> wrote:
> >
> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
> wrote:
>
> >> - round-tripping of binary data (at least with Python's
> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
> re-encoded to get the same bytes back. You may get garbage, but you won't
> get an EncodingError.
> >
> > +1.  The key point is that there is a HUGE amount of legacy science data
> in the form of FITS (astronomy-specific binary file format that has been
> the primary file format for 20+ years) and HDF5 which uses a character data
> type to store data which can be bytes 0-255.  Getting an decoding/encoding
> error when trying to deal with these datasets is a non-starter from my
> perspective.
>
> That says to me that these are properly represented by `bytes` objects,
> not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
> encoding.
>

If you could go back 30 years and get every scientist in the world to do
the right thing, then sure.  But we are living in a messy world right now
with messy legacy datasets that have character type data that are *mostly*
ASCII, but not infrequently contain non-ASCII characters.

So I would beg to actually move forward with a pragmatic solution that
addresses very real and consequential problems that we face instead of
waiting/praying for a perfect solution.

- Tom


>
> --
> Robert Kern
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
wrote:

>> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.
>
> +1.  The key point is that there is a HUGE amount of legacy science data
in the form of FITS (astronomy-specific binary file format that has been
the primary file format for 20+ years) and HDF5 which uses a character data
type to store data which can be bytes 0-255.  Getting an decoding/encoding
error when trying to deal with these datasets is a non-starter from my
perspective.

That says to me that these are properly represented by `bytes` objects, not
`unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker 
wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really
short strings. I think we are all thinking in terms of longer strings,
maybe a name field, where you might assign 32 bytes or so -- then someone
has an accented character in their name, and then get 30 or 31 characters --
no big deal.
>>
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon
to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5
character ASCII strings.  This has been a significant blocker to Python 3
adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

Unless if you have hundreds of billions of them.

>> BTW, for those new to the list or with a short memory, this topic has
been discussed fairly extensively at least 3 times before.  Hopefully the
*fourth* time will be the charm!
>
> yes, let's hope so!
>
> The big difference now is that Julian seems to be committed to actually
making it happen!
>
> Thanks Julian!
>
> Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.
>
> I have strong opinions, but would still rather see any of the ideas on
the table implemented than nothing.

FWIW, I prefer nothing to just adding a special case for latin-1. Solve the
HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone
else is willing to solve that problem. I don't think we're at the
bikeshedding stage yet; we're still disagreeing about fundamental
requirements.

--
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:

> BTW -- maybe we should keep the pathological use-case in mind: really
>> short strings. I think we are all thinking in terms of longer strings,
>> maybe a name field, where you might assign 32 bytes or so -- then someone
>> has an accented character in their name, and then get 30 or 31 characters --
>> no big deal.
>>
>
> I wouldn't call it a pathological use case, it doesn't seem so uncommon to
> have large datasets of short strings.
>

It's pathological for using a variable-length encoding.


> I personally deal with a database of hundreds of billions of 2 to 5
> character ASCII strings.  This has been a significant blocker to Python 3
> adoption in my world.
>

I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

BTW, for those new to the list or with a short memory, this topic has been
> discussed fairly extensively at least 3 times before.  Hopefully the
> *fourth* time will be the charm!
>

yes, let's hope so!

The big difference now is that Julian seems to be committed to actually
making it happen!

Thanks Julian!

Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.

I have strong opinions, but would still rather see any of the ideas on the
table implemented than nothing.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer  wrote:

> - round-tripping of binary data (at least with Python's encoding/decoding)
>> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
>> same bytes back. You may get garbage, but you won't get an EncodingError.
>>
>
> For a new application, it's a good thing if a text type breaks when you to
> stuff arbitrary bytes in it
>

maybe, maybe not -- the application may be new, but the data it works with
may not be.


> (see Python 2 vs Python 3 strings).
>

this is exactly why py3 strings needed to add the "surrogateescape" error
handler:

https://www.python.org/dev/peps/pep-0383

sometimes text and binary data are mixed, sometimes encoded text is broken.
It is very useful to be able to pass such data through strings losslessly.

Certainly, I would argue that nobody should write data in latin-1 unless
> they're doing so for the sake of a legacy application.
>

or you really want that 1-byte per char efficiency


> I do understand the value in having some "string" data type that could be
> used by default by loaders for legacy file formats/applications (i.e.,
> netCDF3) that support unspecified "one byte strings." Then you're a few
> short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
> if we support arbitrary encodings) or decoding (i.e.,
> np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
> proper encoding. It's not realistic to expect users to know the true
> encoding for strings from a file before they even look at the data.
>

except that you really should :-(

On the other hand, if this is the use-case, perhaps we really want an
> encoding closer to "Python 2" string, i.e, "unknown", to let this be
> signaled more explicitly. I would suggest that "text[unknown]" should
> support operations like a string if it can be decoded as ASCII, and
> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
> bytes.
>

I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really
is ascii, then it's perfect. If it really is latin-*, then you get some
extra useful stuff, and if it's corrupted somehow, you still get the ascii
text correct, and the rest won't  barf and can be passed on through.


So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
> "unknown".
>

hmm -- "unknown" should be bytes, not text. If the user needs to look at it
first, then load it as bytes, run chardet or something on it, then cast to
the right encoding.
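
(Roughly that workflow, assuming the third-party chardet package for the
guess and np.char.decode for the cast:)

    import numpy as np
    # import chardet

    raw = np.array([b'caf\xe9', b'no\xeb'], dtype='S4')  # bytes, encoding unknown
    # guess = chardet.detect(b''.join(raw))['encoding']
    guess = 'latin-1'                                    # pretend that's the guess
    text = np.char.decode(raw, guess)                    # unicode ('U') array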

The current 'S' dtype truncates silently already:
>>
>
> One advantage of a new (non-default) dtype is that we can change this
> behavior.
>

yeah -- still on the edge about that, at least with variable-size
encodings. It's hard to know when it's going to happen and it's hard to
know what to do when it does.

At least if it truncates silently, numpy can have the code to do the
truncation properly. Maybe an option?
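
(For reference, the current silent truncation with the fixed-width dtypes:)

    >>> import numpy as np
    >>> a = np.zeros(1, dtype='S3')
    >>> a[0] = b'abcdef'   # no error raised
    >>> a[0]
    b'abc'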

And the numpy numeric types truncate (Or overflow) already. Again:

If the default string handling matches expectations from python strings,
then the specialized ones can be more buyer-beware.

Also -- if utf-8 is the default -- what do you get when you create an array
>> from a python string sequence? Currently with the 'S' and 'U' dtypes, the
>> dtype is set to the longest string passed in. Are we going to pad it a bit?
>> stick with the exact number of bytes?
>>
>
> It might be better to avoid this for now, and force users to be explicit
> about encoding if they use the dtype for encoded text.
>

yup.

And we really should have a bytes type for py3 -- which we do, it's just
called 'S', which is pretty confusing :-)

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker  wrote:

> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:
>
>
>> In this case, we want something compatible with Python's string (i.e.
>>> full Unicode supporting) and I think should be as transparent as possible.
>>> Python's string has made the decision to present a character oriented API
>>> to users (despite what the manifesto says...).
>>>
>>
>> Yes, but NumPy doesn't really implement string operations, so fortunately
>> this is pretty irrelevant to us -- except for our API for specifying dtype
>> size.
>>
>
> Exactly -- the character-orientation of python strings means that people
> are used to thinking that strings have a length that is the number of
> characters in the string. I think there will be a cognitive dissonance if
> someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.
>
> When len(a_string) <= 12
>
> AND that will only  occur if there are non-ascii characters in the string,
> and maybe only if there are more than N non-ascii characters. i.e. it is
> very likely to be a run-time error that may not have shown up in tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
> they need to truncate it, and THAT is non-obvious how to do, too -- you
> don't want to truncate the encoded bytes naively, you could end up with an
> invalid bytestring, but you don't know how many characters to truncate,
> either.
>
>
>> We already have strong precedence for dtypes reflecting number of bytes
>> used for storage even when Python doesn't: consider numeric types like
>> int64 and float32 compared to the Python equivalents. It's an intrinsic
>> aspect of NumPy that users need to think about how their data is actually
>> stored.
>>
>
> sure, but a float64 is 64 bits forever and always and the defaults
> perfectly match what python is doing under its hood -- even if users don't
> think about it. So the default behaviour of numpy matches python's built-in
> types.
>
>
> Storage cost is always going to be a concern. Arguably, it's even more of
>>> a concern today than it used to be, because compute has been improving
>>> faster than storage.
>>>
>>
> sure -- but again, what is the use-case for numpy arrays with a s#$)load
> of text in them? common? I don't think so. And as you pointed out numpy
> doesn't do text processing anyway, so cache performance and all that are
> not important. So having UCS-4 as the default, but allowing folks to select
> a more compact format if they really need it is a good way to go. Just like
> numpy generally defaults to float64 and Int64 (or 32, depending on
> platform) -- users can select a smaller size if they have a reason to.
>
> I guess that's my summary -- just like with numeric values, numpy should
> default to Python-like behavior as much as possible for strings, too --
> with an option for a knowledgeable user to do something more performant.
>
>
>> I still don't understand why a latin encoding makes sense as a preferred
>> one-byte-per-char dtype. The world, including Python 3, has standardized on
>> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>>
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that your
> data are one-byte per char, then you could use ASCII, and it would be
> binary compatible with utf-8, but not sure what the point of that is in
> this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin
> languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a
> few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.
>

+1.  The key point is that there is a HUGE amount of legacy science data in
the form of FITS (astronomy-specific binary file format that has been the
primary file format for 20+ years) and HDF5 which uses a character data
type to store data which can be bytes 0-255.  Getting a decoding/encoding
error when trying to deal with these datasets is a non-starter from my
perspective.


>
> For Python use -- a pointer to a Python string would be nice.
>>>
>>
>> Yes, absolutely. If we want to be really fancy, we could consider a
>> parametric object dtype that allows for object arrays of *any* homogeneous
>> Python type. Even if NumPy itself doesn't do anything with that
>> information, there are lots of use cases for that information.
>>
>
> hmm -- that's nifty idea -- though I think strings could/should be special
> cased.
>
>
>> Then use a native flexible-encoding dtype for everything else.
>>>
>>
>> No opposition here from me. Though again, I think utf-8 alone would also be enough.

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.
>

For a new application, it's a good thing if a text type breaks when you try to
stuff arbitrary bytes in it (see Python 2 vs Python 3 strings).

Certainly, I would argue that nobody should write data in latin-1 unless
they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be
used by default by loaders for legacy file formats/applications (i.e.,
netCDF3) that support unspecified "one byte strings." Then you're a few
short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
if we support arbitrary encodings) or decoding (i.e.,
np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
proper encoding. It's not realistic to expect users to know the true
encoding for strings from a file before they even look at the data.
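Since the 'text[...]' dtype doesn't exist yet, the decode-after-the-fact path looks roughly like this with today's numpy (the sample bytes and latin-1 are only stand-ins for whatever the file actually used):

import numpy as np

# one-byte data from a legacy file, loaded as raw fixed-width bytes
raw = np.array([b'Temp \xb0C', b'\xb5m'], dtype='S8')

# once the real encoding is known, decode the whole array into a 'U' array
text = np.char.decode(raw, 'latin-1')
print(text)  # ['Temp °C' 'µm']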

On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.


> Then use a native flexible-encoding dtype for everything else.
>>>
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>>
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.
>

Indeed, it would be helpful for this discussion to know what other
encodings are actually currently used by scientific applications.

So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
"unknown".

The current 'S' dtype truncates silently already:
>

One advantage of a new (non-default) dtype is that we can change this
behavior.


> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?
>

It might be better to avoid this for now, and force users to be explicit
about encoding if they use the dtype for encoded text. We can keep
bytes/str mapped to the current choices.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:


> In this case, we want something compatible with Python's string (i.e. full
>> Unicode supporting) and I think should be as transparent as possible.
>> Python's string has made the decision to present a character oriented API
>> to users (despite what the manifesto says...).
>>
>
> Yes, but NumPy doesn't really implement string operations, so fortunately
> this is pretty irrelevant to us -- except for our API for specifying dtype
> size.
>

Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only  occur if there are non-ascii characters in the string,
and maybe only if there are more than N non-ascii characters. i.e. it is
very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit. If not, they
need to truncate it, and THAT is non-obvious how to do, too -- you don't
want to truncate the encoded bytes naively, you could end up with an
invalid bytestring, but you don't know how many characters to truncate,
either.
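For what it's worth, the workaround being described usually ends up looking something like this sketch (plain Python, not a proposed numpy API):

def fit_utf8(s, nbytes):
    """Truncate python string s so that its utf-8 encoding fits in nbytes."""
    b = s.encode('utf-8')
    if len(b) <= nbytes:
        return s
    # cut the bytes, then drop any multi-byte character split at the cut point
    return b[:nbytes].decode('utf-8', errors='ignore')

print(fit_utf8('résumé', 7))  # 'résum' -- the cut lands mid-'é', dangling byte dropped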


> We already have strong precedence for dtypes reflecting number of bytes
> used for storage even when Python doesn't: consider numeric types like
> int64 and float32 compared to the Python equivalents. It's an intrinsic
> aspect of NumPy that users need to think about how their data is actually
> stored.
>

sure, but a float64 is 64 bits forever and always and the defaults
perfectly match what python is doing under its hood -- even if users don't
think about it. So the default behaviour of numpy matches python's built-in
types.


Storage cost is always going to be a concern. Arguably, it's even more of a
>> concern today than it used to be, because compute has been improving
>> faster than storage.
>>
>
sure -- but again, what is the use-case for numpy arrays with a s#$)load of
text in them? common? I don't think so. And as you pointed out numpy
doesn't do text processing anyway, so cache performance and all that are
not important. So having UCS-4 as the default, but allowing folks to select
a more compact format if they really need it is a good way to go. Just like
numpy generally defaults to float64 and Int64 (or 32, depending on
platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should
default to Python-like behavior as much as possible for strings, too --
with an option for a knowledgeable user to do something more performant.


> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>

utf-8 is NOT a one-byte per char encoding. IF you want to assure that your
data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but not sure what the point of that is in
this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding)
-- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
same bytes back. You may get garbage, but you won't get an EncodingError.
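That round-trip property is easy to check with plain Python codecs:

blob = bytes(range(256))                                   # every possible byte value
assert blob.decode('latin-1').encode('latin-1') == blob    # always succeeds

try:
    blob.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)                                             # utf-8 rejects the same bytes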

For Python use -- a pointer to a Python string would be nice.
>>
>
> Yes, absolutely. If we want to be really fancy, we could consider a
> parametric object dtype that allows for object arrays of *any* homogeneous
> Python type. Even if NumPy itself doesn't do anything with that
> information, there are lots of use cases for that information.
>

hmm -- that's nifty idea -- though I think strings could/should be special
cased.


> Then use a native flexible-encoding dtype for everything else.
>>
>
> No opposition here from me. Though again, I think utf-8 alone would also
> be enough.
>

maybe so -- the major reason for supporting others is binary data exchange
with other libraries -- but maybe most of them have gone to utf-8 anyway.

One more note: if a user tries to assign a value to a numpy string array
>> that doesn't fit, they should get an error:
>>
>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>>
>
> I think we all agree here.
>

I'm actually having second thoughts -- see above -- if the encoding is
utf-8, then truncating is non-trivial -- mayb

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Stephan Hoyer
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker 
wrote:

> 1) Use with/from  Python -- both creating and working with numpy arrays.
>

> In this case, we want something compatible with Python's string (i.e. full
> Unicode supporting) and I think should be as transparent as possible.
> Python's string has made the decision to present a character oriented API
> to users (despite what the manifesto says...).
>

Yes, but NumPy doesn't really implement string operations, so fortunately
this is pretty irrelevant to us -- except for our API for specifying dtype
size.

We already have strong precedence for dtypes reflecting number of bytes
used for storage even when Python doesn't: consider numeric types like
int64 and float32 compared to the Python equivalents. It's an intrinsic
aspect of NumPy that users need to think about how their data is actually
stored.


> However, there is a challenge here: numpy requires fixed-number-of-bytes
> dtypes. And full unicode support with fixed number of bytes matching fixed
> number of characters is only possible with UCS-4 -- hence the current
> implementation. And this is actually just fine! I know we all want to be
> efficient with data storage, but really -- in the early days of Unicode,
> when folks thought 16 bits were enough, doubling the memory usage for
> western language storage was considered fine -- how long in computer life
> time does it take to double your memory? But now, when memory, disk space,
> bandwidth, etc, are all literally orders of magnitude larger, we can't
> handle a factor of 4 increase in "wasted" space?
>

Storage cost is always going to be a concern. Arguably, it's even more of a
concern today than it used to be, because compute has been improving
faster than storage.


> But as scientific text data often is 1-byte compatible, a
> one-byte-per-char dtype is a fine idea, too -- and we pretty much have that
> already with the existing string type -- that could simply be enhanced by
> enforcing the encoding to be latin-9 (or latin-1, if you don't want the
> Euro symbol). This would get us what scientists expect from strings in a
> way that is properly compatible with Python's string type. You'd get
> encoding errors if you tried to stuff anything else in there, and that's
> that.
>

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

So -- I think we should address the use-cases separately -- one for
> "normal" python use and simple interoperability with python strings, and
> one for interoperability at the binary level. And an easy way to convert
> between the two.
>
> For Python use -- a pointer to a Python string would be nice.
>

Yes, absolutely. If we want to be really fancy, we could consider a
parametric object dtype that allows for object arrays of *any* homogeneous
Python type. Even if NumPy itself doesn't do anything with that
information, there are lots of use cases for that information.

Then use a native flexible-encoding dtype for everything else.
>

No opposition here from me. Though again, I think utf-8 alone would also be
enough.


> Thinking out loud -- another option would be to set defaults for the
> multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
> with the python string type -- and make folks make an effort to get
> anything else.
>

The np.unicode_ type is already UCS-4 and the default for dtype=str on
Python 3. We probably shouldn't change that, but if we set any default
encoding for the new text type, I strongly believe it should be utf-8.

One more note: if a user tries to assign a value to a numpy string array
> that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.
>

I think we all agree here.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Chris Barker
I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts:

1) most of it is focused on utf-8 vs utf-16. And that is a strong argument
-- utf-16 is the worst of both worlds.

2) it isn't really addressing how to deal with fixed-size string storage as
needed by numpy.

It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python’s string O(1) code
point access. The truth, however, is that Unicode is inherently more
complicated and there is no universal definition of such thing as *Unicode
character*. We see no particular reason to favor Unicode code points over
Unicode grapheme clusters, code units or perhaps even words in a language
for that.
"""

My thoughts on that-- it's technically correct, but practicality beats
purity, and the character concept is pretty darn useful for at least some
(commonly used in the computing world) languages.

In any case, whether the top-level API is character focused doesn't really
have a bearing on the internal encoding, which is very much an
implementation detail in py 3 at least.

And Python has made its decision about that.

So what are the numpy use-cases?

I see essentially two:

1) Use with/from  Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string (i.e. full
Unicode supporting) and I think should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes
dtypes. And full unicode support with fixed number of bytes matching fixed
number of characters is only possible with UCS-4 -- hence the current
implementation. And this is actually just fine! I know we all want to be
efficient with data storage, but really -- in the early days of Unicode,
when folks thought 16 bits were enough, doubling the memory usage for
western language storage was considered fine -- how long in computer life
time does it take to double your memory? But now, when memory, disk space,
bandwidth, etc, are all literally orders of magnitude larger, we can't
handle a factor of 4 increase in "wasted" space?

Alternatively, Robert's suggestion of having essentially an object array,
where the objects were known to be python strings is a pretty nice idea --
it gives the full power of python strings, and is a perfect one-to-one
match with the python text data model.

But as scientific text data often is 1-byte compatible, a one-byte-per-char
dtype is a fine idea, too -- and we pretty much have that already with the
existing string type -- that could simply be enhanced by enforcing the
encoding to be latin-9 (or latin-1, if you don't want the Euro symbol).
This would get us what scientists expect from strings in a way that is
properly compatible with Python's string type. You'd get encoding errors if
you tried to stuff anything else in there, and that's that.
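To make the "encoding errors if you tried to stuff anything else in there" point concrete with plain Python codecs (latin-9 is the iso-8859-15 codec; nothing numpy-specific assumed):

print('50°C ±1'.encode('iso-8859-15'))  # degree sign and ± are in latin-9, so this is fine

try:
    'Ω=2π'.encode('iso-8859-15')
except UnicodeEncodeError as exc:
    print(exc)                          # greek letters are not, so you get a loud error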

Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and
forth between numpy arrays and other code, written in C, Fortran, or binary
file formats.

This is a key use-case for numpy -- I think the key to its enormous
success. But how important is it for text? Certainly any data set I've ever
worked with has had gobs of binary numerical data, and a small smattering
of text. So in that case, if, for instance, h5py had to encode/decode text
when transferring between HDF files and numpy arrays, I don't think I'd
ever see the performance hit. As for code complexity -- it would mean more
complex code in interface libs, and less complex code in numpy itself.
(though numpy could provide utilities to make it easy to write the
interface code)

If we do want to support direct binary interchange with other libs, then we
should probably simply go for it, and support any encoding that Python
supports -- as long as you are dealing with multiple encodings, why try to
decide up front which ones to support?

But how do we expose this to numpy users? I still don't like having
non-fixed-width encoding under the hood, but what can you do? Other than
that, having the encoding be a selectable part of the dtype works fine --
and in that case the number of bytes should be the "length" specifier.

This, however, creates a bit of an impedance mismatch with the
"character-focused" approach of the python string type, and requires the
user to understand something about the encoding in order to even know how
many bytes they need -- a utf-8-100 string will hold a different "length"
of string than a utf-16-100 string.
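The mismatch in concrete numbers -- the same 4-character string needs a different byte budget depending on the encoding the dtype would use:

s = 'Δv=3'  # 4 characters, one of them non-ASCII
for enc in ('utf-8', 'utf-16-le', 'utf-32-le', 'latin-1'):
    try:
        print(enc, len(s.encode(enc)), 'bytes')
    except UnicodeEncodeError:
        print(enc, 'cannot represent', repr(s))
# utf-8: 5 bytes, utf-16-le: 8, utf-32-le: 16, latin-1: error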

So -- I think we should address the use-cases separately -- one for
"normal" python use and simple interoperability with python strings, and
one for interoperability at the binary level. And an easy way to convert
between the two.

For Python use -- a pointer to a Python string would be nice.

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge  wrote:
>
> On 04/20/2017 03:17 PM, Anne Archibald wrote:
>>
>> Actually if I understood the spec, FITS header lines are 80 bytes long
and contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.
>
> FITS BINTABLE extensions can have columns containing strings, and in that
case the values are NULL terminated, except that if the string fills the
field (i.e. there's no room for a NULL), the NULL will not be written.

Ah, that's what I was thinking of, thank you.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Phil Hodge

On 04/20/2017 03:17 PM, Anne Archibald wrote:
Actually if I understood the spec, FITS header lines are 80 bytes long 
and contain ASCII with no NULLs; strings are quoted and trailing 
spaces are stripped.




FITS BINTABLE extensions can have columns containing strings, and in 
that case the values are NULL terminated, except that if the string 
fills the field (i.e. there's no room for a NULL), the NULL will not be 
written.


Phil


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer  wrote:
>
> On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern 
wrote:
>>
>> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>> >
>> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
wrote:
>> >>
>> >> I don't know of a format off-hand that works with numpy
uniform-length strings and Unicode as well. HDF5 (to my recollection)
supports arrays of NULL-terminated, uniform-length ASCII like FITS, but
only variable-length UTF8 strings.
>> >
>> >
>> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
and variable length versions:
>> > https://github.com/PyTables/PyTables/issues/499
>> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>> >
>> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.
>>
>> Ah, okay, I was interpolating from a quick perusal of the h5py docs,
which of course are also constrained by numpy's current set of dtypes. The
NULL-terminated ASCII works well enough with np.string's semantics.
>
> Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a string type, not np.string_ (which is really bytes).

"... well enough with np.string's semantics [that h5py actually used it to
pass data in and out; whether that array is fit for purpose beyond that, I
won't comment]." :-)

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Marten van Kerkwijk
> I suggest a new data type  'text[encoding]', 'T'.

I like the suggestion very much (it is even in between S and U!). The
utf-8 manifesto linked to above convinced me that the number that
should follow is the number of bytes, which is nicely consistent with
use in all numerical dtypes.

Anyway, more specifically on Julian's question: it seems to me one
has little choice but to make a new dtype (and OK if that makes
unicode obsolete). I think what exact encodings to support is a
separate question.

-- Marten


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:
>
> On 20.04.2017 20:53, Robert Kern wrote:
> > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> > mailto:jtaylor.deb...@googlemail.com>>
> > wrote:
> >
> >> Do you have comments on how to go forward, in particular in regards to
> >> new dtype vs modify np.unicode?
> >
> > Can we restate the use cases explicitly? I feel like we ended up with
> > the current sub-optimal situation because we never really laid out the
> > use cases. We just felt like we needed bytestring and unicode dtypes,
> > more out of completionism than anything, and we made a bunch of
> > assumptions just to get each one done. I think there may be broad
> > agreement that many of those assumptions are "wrong", but it would be
> > good to reference that against concretely-stated use cases.
>
> We ended up in this situation because we did not take the opportunity to
> break compatibility when python3 support was added.

Oh, the root cause I'm thinking of long predates Python 3, or even numpy
1.0. There never was an explicitly fleshed out use case for unicode arrays
other than "Python has unicode strings, so we should have a string dtype
that supports it". Hence the "we only support UCS4" implementation; it's
not like anyone *wants* UCS4 or interoperates with UCS4, but it does
represent all possible Unicode strings. The Python 3 transition merely
exacerbated the problem by making Unicode strings the primary string type
to work with. I don't really want to ameliorate the exacerbation without
addressing the root problem, which is worth solving.

I will put this down as a marker use case: Support HDF5's fixed-width UTF-8
arrays.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern  wrote:

> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
> >
> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
> wrote:
> >>
> >> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
> >
> >
> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
> and variable length versions:
> > https://github.com/PyTables/PyTables/issues/499
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
> >
> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
> storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string's semantics.
>

Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a string type, not np.string_ (which is really bytes).


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald 
wrote:
>
> On Thu, Apr 20, 2017 at 8:55 PM Robert Kern  wrote:

>> For example, to my understanding, FITS files more or less follow numpy
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
singular motivating use case for the trailing-NULL behavior of np.string.
>
> Actually if I understood the spec, FITS header lines are 80 bytes long
and contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.

Never mind, then. :-)

>> If I had to jump ahead and propose new dtypes, I might suggest this:
>>
>> * For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.
>>
>> * Acknowledge the use cases of the current NULL-terminated np.string
dtype, but perhaps add a new canonical alias, document it as being for
those specific use cases, and deprecate/de-emphasize the current name.
>>
>> * Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.
>
> How would this differ from a numpy array of bytes with one more
dimension?

The scalar in the implementation being the scalar in the use case,
immutability of the scalar, and directly working with b'' strings in and out
(and thus working with the Python codecs easily).
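Roughly, the difference with what exists today (the current 'S' dtype standing in for the proposed bytes dtype): indexing it hands back an immutable bytes-like scalar that the codecs accept directly, whereas the extra-uint8-dimension view hands back another array:

import numpy as np

a = np.array([b'abc', b'de'], dtype='S3')
print(type(a[0]), a[0].decode('ascii'))   # numpy.bytes_ scalar -> decodes directly

row = np.frombuffer(b'abc', dtype=np.uint8)
print(type(row), row)                     # plain uint8 ndarray, no .decode()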

>> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).
>
> I think there may also be a niche for fixed-byte-size null-terminated
strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would
handle too-long strings by either truncating to a valid encoding or simply
raising an exception. As with the current fixed-length strings, they'd
mostly be for communication with other code, so the necessity depends on
whether such other codes exist at all. Databases, perhaps? Custom hunks of
C that don't want to deal with variable-length packing of data? Actually
this last seems plausible - if I want to pass a great wodge of data,
including Unicode strings, to a C program, writing out a numpy array seems
maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF8, not a large list
of encodings.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:59, Anne Archibald wrote:
> On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor
> mailto:jtaylor.deb...@googlemail.com>>
> wrote:
> 
> I probably have formulated my goal with the proposal a bit better, I am
> not very interested in a repetition of which encoding to use debate.
> In the end what will be done allows any encoding via a dtype with
> metadata like datetime.
> This allows any codec (including truncated utf8) to be added easily (if
> python supports it) and allows sidestepping the debate.
> 
> My main concern is whether it should be a new dtype or modifying the
> unicode dtype. Though the backward compatibility argument is strongly in
> favour of adding a new dtype that makes the np.unicode type redundant.
> 
> 
> Creating a new dtype to handle encoded unicode, with the encoding
> specified in the dtype, sounds perfectly reasonable to me. Changing the
> behaviour of the existing unicode dtype seems like it's going to lead to
> massive headaches unless exactly nobody uses it. The only downside to a
> new type is having to find an obvious name that isn't already in use.
> (And having to actively  maintain/deprecate the old one.) 
> 
> Anne
> 

We wouldn't really be changing the behaviour of the unicode dtype. Only
programs accessing the databuffer directly and trying to decode would
need to be changed.

I assume this can happen for programs that do serialization + reencoding
of numpy string arrays at the C level (at the python level you would be
fine).
These programs would be broken, but only when they actually receive a
string array that does not have the default utf32 encoding.

I really don't like that a fully new dtype means creating more junk and
extra code paths in numpy.
But it is probably too big of a compatibility break to accept just to keep
our code clean.





Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Feng Yu
I suggest a new data type  'text[encoding]', 'T'.

1. text can be cast to python strings via decoding.

2. Conceptually, casting to python bytes first casts to a string and then
calls encode(); the current encoding in the metadata is used by
default, but the encoding can be overridden.

I slightly favour 'T16' as a fixed-size text record backed by 16
bytes. This way over-allocation is forcefully delegated to the user,
simplifying the numpy array.
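The proposed 'T' dtype doesn't exist, but the semantics described above (fixed byte budget, decode on read, encode on write, error rather than over-allocate) can be sketched on top of today's 'S' dtype to make the idea concrete:

import numpy as np

def set_text(arr, i, s, encoding='utf-8'):
    b = s.encode(encoding)
    if len(b) > arr.dtype.itemsize:
        raise ValueError('encoded text needs %d bytes, record holds %d'
                         % (len(b), arr.dtype.itemsize))
    arr[i] = b

def get_text(arr, i, encoding='utf-8'):
    return arr[i].decode(encoding)

t16 = np.zeros(3, dtype='S16')   # stand-in for the proposed 'T16'
set_text(t16, 0, 'naïve')
print(get_text(t16, 0))          # 'naïve'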


Yu

On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern  wrote:
> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>>
>> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
>> wrote:
>>>
>>> I don't know of a format off-hand that works with numpy uniform-length
>>> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
>>> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
>>> UTF8 strings.
>>
>>
>> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
>> variable length versions:
>> https://github.com/PyTables/PyTables/issues/499
>> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>>
>> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
>> storage, not the number of characters.
>
> Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
> of course are also constrained by numpy's current set of dtypes. The
> NULL-terminated ASCII works well enough with np.string's semantics.
>
> --
> Robert Kern
>


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Charles R Harris
On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern  wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
> jtaylor.deb...@googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>
> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
>
> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
>
> We should look at some of the newer formats and APIs, like Parquet and
> Arrow, and also consider the cross-language APIs with Julia and R.
>
> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
>
> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
>
>
A little history: IIRC, storing null-terminated strings in fixed byte
lengths was done in Fortran, where strings were usually stored in
integers/integer arrays.

If memory mapping of arbitrary types is not important, I'd settle for ascii
or latin-1, utf-8 fixed byte length, and arrays of fixed python object
type. Using one-byte encodings and utf-8 avoids needing to deal with
endianness.
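The endianness point, concretely:

s = 'hi'
print(s.encode('utf-8'))      # b'hi'          -- same bytes on any platform
print(s.encode('utf-16-le'))  # b'h\x00i\x00'
print(s.encode('utf-16-be'))  # b'\x00h\x00i'  -- byte order becomes part of the format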

Chuck


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:53, Robert Kern wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor
> mailto:jtaylor.deb...@googlemail.com>>
> wrote:
> 
>> Do you have comments on how to go forward, in particular in regards to
>> new dtype vs modify np.unicode?
> 
> Can we restate the use cases explicitly? I feel like we ended up with
> the current sub-optimal situation because we never really laid out the
> use cases. We just felt like we needed bytestring and unicode dtypes,
> more out of completionism than anything, and we made a bunch of
> assumptions just to get each one done. I think there may be broad
> agreement that many of those assumptions are "wrong", but it would be
> good to reference that against concretely-stated use cases.

We ended up in this situation because we did not take the opportunity to
break compatibility when python3 support was added.
We should have made the string dtype an encoded byte type (ascii or
latin1) in python3 instead of null terminated unencoded bytes which do
not make very much practical sense.

So the use case is very simple: Give users of the string dtype a
migration path that does not involve converting to full utf32 unicode.
The latin1 encoded bytes dtype would allow that.
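The storage gap behind that migration concern, in today's terms:

import numpy as np

print(np.dtype('S10').itemsize)  # 10 -- one byte per character, no declared encoding
print(np.dtype('U10').itemsize)  # 40 -- UCS-4 under the hood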

As we already have the infrastructure, this same dtype can allow more
than just latin1 with minimal effort: for the fixed-size encodings python
supports, it is literally adding an enum entry, two new switch
clauses, and a little bit of dtype string parsing and test cases.


Having some form of variable string handling would be nice. But this is
another topic all together.
Having builtin support for variable strings only seems overkill as the
string dtype is not that important and object arrays should work
reasonably well for this usecase already.
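For completeness, the object-array route that already copes with variable-length text:

import numpy as np

v = np.array(['x', 'a considerably longer string'], dtype=object)
print(v.itemsize)            # 8 on a 64-bit build -- just pointers to python str objects
print([len(s) for s in v])   # [1, 28] -- each element can be any length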





Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>
> On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
wrote:
>>
>> I don't know of a format off-hand that works with numpy uniform-length
strings and Unicode as well. HDF5 (to my recollection) supports arrays of
NULL-terminated, uniform-length ASCII like FITS, but only variable-length
UTF8 strings.
>
>
> HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
variable length versions:
> https://github.com/PyTables/PyTables/issues/499
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.

Ah, okay, I was interpolating from a quick perusal of the h5py docs, which
of course are also constrained by numpy's current set of dtypes. The
NULL-terminated ASCII works well enough with np.string's semantics.
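The np.string semantics being referred to, for anyone following along:

import numpy as np

a = np.array([b'abc'], dtype='S6')
print(a.tobytes())  # b'abc\x00\x00\x00' -- NUL-padded to the fixed width in the buffer
print(a[0])         # b'abc'             -- trailing NULs stripped again on access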

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern  wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
> jtaylor.deb...@googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>

+1


> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
>

Actually if I understood the spec, FITS header lines are 80 bytes long and
contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.

[...]

> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
>

How would this differ from a numpy array of bytes with one more dimension?


> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
>

I think there may also be a niche for fixed-byte-size null-terminated
strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would
handle too-long strings by either truncating to a valid encoding or simply
raising an exception. As with the current fixed-length strings, they'd
mostly be for communication with other code, so the necessity depends on
whether such other codes exist at all. Databases, perhaps?  Custom hunks of
C that don't want to deal with variable-length packing of data? Actually
this last seems plausible - if I want to pass a great wodge of data,
including Unicode strings, to a C program, writing out a numpy array seems
maybe the easiest.

Anne

