Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith  wrote:
>
> On Apr 26, 2017 12:09 PM, "Robert Kern"  wrote:

>> It's worthwhile enough that both major HDF5 bindings don't support
Unicode arrays, despite user requests for years. The sticking point seems
to be the difference between HDF5's view of a Unicode string array (defined
in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
string array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.
>
> I would really like to hear more from the authors of these libraries
about what exactly it is they feel they're missing. Is it that they want
numpy to enforce the length limit early, to catch errors when the array is
modified instead of when they go to write it to the file? Is it that they
really want an O(1) way to look at an array and know the maximum number of
bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
is really annoying and files that need it are rare so they haven't had the
motivation to implement it?

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/379

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer  wrote:

>
> Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
> myself have already given), but we seem to be talking past each other here.
>

yeah -- I think it's not clear what the use cases we are talking about are.


> I am still -1 on any new string encoding support unless that includes at
> least UTF-8, with length indicated by the number of bytes.
>

I've said multiple times that utf-8 support is key to any "exchange binary
data" use case (memory mapping?) -- so yes, absolutely.

I _think_ this may be some of the source for the confusion:

The name of this thread is: "proposal: smaller representation of string
arrays".

And I got the impression, maybe mistaken, that folks were suggesting that
internally encoding strings in numpy as "UTF-8, with length indicated by
the number of bytes" was THE solution to the "the 'U' dtype takes up way
too much memory, particularly for mostly-ascii data" problem.

I do not think it is a good solution to that problem.

I think a good solution to that problem is latin-1 encoding. (bear with me
here...)

But a bunch of folks have brought up that while we're messing around with
string encoding, let's solve another problem:

* Exchanging unicode text at the binary level with other systems that
generally don't use UCS-4.

For THAT -- utf-8 is critical.

But if I understand Julian's proposal -- he wants to create a parameterized
text dtype that you can set the encoding on, and then numpy will use the
encoding (and python's machinery) to encode / decode when passing to/from
python strings.

It seems this would support all our desires:

I'd get a latin-1 encoded type for compact representation of mostly-ascii
data.

Thomas would get latin-1 for binary interchange with mostly-ascii data.

The HDF-5 folks would get utf-8 for binary interchange (if we can work out
the null-padding issue).

Even folks that had weird Java or Windows-generated UTF-16 data files could
do the binary interchange thing.

I'm now lost as to what the hang-up is.

-CHB

PS: null padding is a pain; python strings seem to preserve the zeros, which
is odd -- is there a unicode code-point at \x00?

But you can use it to strip properly with the unicode sandwich (here with
text = 'some text'):

In [63]: ut16 = text.encode('utf-16') + b'\x00\x00\x00\x00\x00\x00'

In [64]: ut16.decode('utf-16')
Out[64]: 'some text\x00\x00\x00'

In [65]: ut16.decode('utf-16').strip('\x00')
Out[65]: 'some text'

In [66]: ut16.decode('utf-16').strip('\x00').encode('utf-16')
Out[66]: b'\xff\xfes\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00'

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern  wrote:

> >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> that to size arrays by character length. The advantage over UTF-32 is that
> it is easily compressible, probably by a factor of 4 in many cases.
>

isn't UTF-32 pretty compressible also? lots of zeros in there

here's an example with pure ascii  Lorem Ipsum text:

In [17]: len(text)
Out[17]: 446


In [18]: len(utf8)
Out[18]: 446

# the same -- it's pure ascii

In [20]: len(utf32)
Out[20]: 1788

# four times as big -- of course.

In [22]: len(bz2.compress(utf8))
Out[22]: 302

# so from 446 to 302, not that great -- probably it would be better for
# longer text
# -- but are we compressing whole arrays or individual strings?

In [23]: len(bz2.compress(utf32))
Out[23]: 319

# almost as good as the compressed utf-8

And I'm guessing it would be even closer with more non-ascii characters.

OK -- turns out I'm wrong -- here it is with greek -- not a lot of ascii
characters:

In [29]: len(text)
Out[29]: 672

In [30]: utf8 = text.encode("utf-8")

In [31]: len(utf8)
Out[31]: 1180

# not bad, really -- still smaller than utf-16 :-)

In [33]: len(bz2.compress(utf8))
Out[33]: 495

# pretty good then -- better than 50%

In [34]: utf32 = text.encode("utf-32")
In [35]: len(utf32)

Out[35]: 2692


In [36]: len(bz2.compress(utf32))
Out[36]: 515

# still not quite as good as utf-8, but close.

So: utf-8 compresses better than utf-32, but only by a little bit -- at
least with bz2.

But it is a lot smaller uncompressed.
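
(A self-contained version of the experiment above, in case anyone wants to
poke at it -- `text` here is just a stand-in for whatever sample you use:)

import bz2

text = u"lorem ipsum dolor sit amet " * 20   # stand-in sample text
utf8 = text.encode('utf-8')
utf32 = text.encode('utf-32')

print(len(utf8), len(utf32))                              # raw sizes
print(len(bz2.compress(utf8)), len(bz2.compress(utf32)))  # compressed sizes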

>>> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
> >>
> >> It's not just HDF5. Counting bytes is the Right Way to measure the size
> of UTF-8 encoded text:
> >> http://utf8everywhere.org/#myths
>

It's really the only way with utf-8 -- which is why it is an impedance
mismatch with python strings.


>> I also firmly believe (though clearly this is not universally agreed
> upon) that UTF-8 is the Right Way to encode strings for *non-legacy*
> applications.
>

fortunately, we don't need to agree to that to agree that:


> So if we're adding any new string encodings, it needs to be one of them.
>

Yup -- the most important one to add -- I don't think it is "The Right Way"
for all applications -- but it is "The Right Way" for text interchange.

And regardless of what any of us think -- it is widely used.

> (1) object arrays of strings. (We have these already; whether a
> strings-only specialization would permit useful things like string-oriented
> ufuncs is a question for someone who's willing to implement one.)
>

This is the right way to get variable length strings -- but I'm concerned
that it doesn't mesh well with numpy uses like npz files, raw dumping of
array data, etc. It should not be the only way to get proper Unicode
support, nor the default when you do:

array(["this", "that"])


> > (2) a dtype for fixed byte-size, specified-encoding, NULL-padded data.
> All python encodings should be permitted. An additional function to
> truncate encoded data without mangling the encoding would be handy.
>

I think it's necessary -- at least when you pass in a python string...


> I think it makes more sense for this to be NULL-padded than
> NULL-terminated but it may be necessary to support both; note that
> NULL-termination is complicated for encodings like UCS4.
>

is it, if you know it's UCS4? or if you even just know the size of the
code-unit (I think that's the term)?


> This also includes the legacy UCS4 strings as a special case.
>

what's special about them? I think the only thing should be that they are
the default.
>

> > (3) a dtype for fixed-length byte strings. This doesn't look very
> different from an array of dtype u8, but given we have the bytes type,
> accessing the data this way makes sense.
>
> The void dtype is already there for this general purpose and mostly works,
> with a few niggles.
>

I'd never noticed that! And if I had I never would have guessed I could use
it that way.


> If it worked more transparently and perhaps rigorously with `bytes`, then
> it would be quite suitable.
>

Then we should fix a bit of those things -- and call it something like
"bytes", please.

-CHB

>
> --

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg 
wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right. (I think she got around it, by weird
> trickery, inverting the endianess or so and thus putting the zero bytes
> first).
> Maybe will ask her if this discussion is interesting to her. Though I
> think it might have been something like "make everything in
> hdf5/something similar work" without any actual use case, I don't know.

I don't think that will be an issue for an encoding-parameterized dtype.
The decoding machinery of that would have access to the full-width buffer
for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes
for UTF-16). It's only if you have to hack around at a higher level with
numpy's S arrays, which return Python byte strings that strip off the
trailing NULL bytes, that you have to worry about such things. Getting a
Python scalar from the numpy S array loses information in such cases.
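
(A tiny illustration of that information loss with today's S dtype -- the
trailing NULL here is real UTF-16-LE data, not padding:)

import numpy as np

a = np.array([u'a'.encode('utf-16-le')], dtype='S2')   # stores b'a\x00'
a[0]                          # b'a' -- the trailing NULL byte is stripped
a[0].decode('utf-16-le')      # UnicodeDecodeError: truncated data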

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> Indeed,
> Most of this discussion is irrelevant to numpy.
> Numpy only really deals with the in memory storage of strings. And in
> that it is limited to fixed length strings (in bytes/codepoints).
> How you get your messy strings into numpy arrays is not very relevant to
> the discussion of a smaller representation of strings.
> You couldn't get messy strings into numpy without first sorting it out
> yourself before, you won't be able to afterwards.
> Numpy will offer a set of encodings, the user chooses which one is best
> for the use case and if the user screws it up, it is not numpy's problem.
>
> You currently only have a few ways to even construct string arrays:
> - array construction and loops
> - genfromtxt (which is again just a loop)
> - memory mapping which I seriously doubt anyone actually does for the S
> and U dtype

I fear that you decided that the discussion was irrelevant and thus did not
read it rather than reading it to decide that it was not relevant. Because
several of us have shown that, yes indeed, we do memory-map string arrays.

You can add to this list C APIs, like that of libhdf5, that need to
communicate (Unicode) string arrays.

Look, I know I can be tedious, but *please* go back and read this
discussion. We have concrete use cases outlined. We can give you more
details if you need them. We all feel the pain of the rushed, inadequate
implementation of the U dtype. But each of our pains is a little bit
different; you obviously aren't experiencing the same pains that I am.

> Having a new dtype changes nothing here. You still need to create numpy
> arrays from python strings which are well defined and clean.
> If you put something in that doesn't encode you get an encoding error.
> No oddities like surrogate escapes are needed, numpy arrays are not
> interfaces to operating systems nor does numpy need to _add_ support for
> historical oddities beyond what it already has.
> If you want to represent bytes exactly as they came in don't use a text
> dtype (which includes the S dtype, use i1).

Thomas Aldcroft has demonstrated the problem with this approach. numpy
arrays are often interfaces to files that have tons of historical oddities.

> Concerning variable sized strings, this is simply not going to happen.
> Nobody is going to rewrite numpy to support it, especially not just for
> something as unimportant as strings.
> Best you are going to get (or better already have) is object arrays. It
> makes no sense to discuss it unless someone comes up with an actual
> proposal and the willingness to code it.

No one has suggested such a thing. At most, we've talked about specializing
object arrays.

> What is a relevant discussion is whether we really need a more compact
> but limited representation of text than 4-byte utf32 at all.
> Its usecase is for the most part just for python3 porting and saving
> some memory in some ascii heavy cases, e.g. astronomy.
> It is not that significant anymore as porting to python3 has mostly
> already happened via the ugly byte workaround and memory saving is
> probably not as significant in the context of numpy which is already
> heavy on memory usage.
>
> My initial approach was to not add a new dtype but to make unicode
> parametrizable which would have meant almost no cluttering of numpys
> internals and keeping the api more or less consistent which would make
> this a relatively simple addition of minor functionality for people that
> want it.
> But adding a completely new partially redundant dtype for this usecase
> may be a too large change to the api. Having two partially redundant
> string types may confuse users more than our current status quo of our
> single string type (U).
>
> Discussing whether we want to support truncated utf8 has some merit as
> it is a decision whether to give the users an even larger gun to shot
> themselves in the foot with.
> But I'd like to focus first on the 1 byte type to add a symmetric API
> for python2 and python3.
> utf8 can always be added later should we deem it a good idea.

What is your current proposal? A string dtype parameterized with the
encoding (initially supporting the latin-1 that you desire and maybe adding
utf-8 later)? Or a latin-1-specific dtype such that we will have to add a
second utf-8 dtype at a later date?

If you're not going to support arbitrary encodings right off the bat, I'd
actually suggest implementing UTF-8 and ASCII-surrogateescape first as they
seem to knock off more use cases straight away.
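
(For reference, the surrogateescape trick in isolation -- arbitrary bytes
survive a decode/encode round trip through str without an error:)

raw = b'abc\xff\xfe'                           # not valid ASCII (or UTF-8)
s = raw.decode('ascii', 'surrogateescape')     # 'abc\udcff\udcfe'
s.encode('ascii', 'surrogateescape') == raw    # True -- lossless round trip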

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker - NOAA Federal
> > I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii, with 
> > a few extra characters" data. With all the sloppiness over the years, there 
> > are way too many files like that.
>
> That sloppiness that you mention is precisely the "unknown encoding" problem.

Exactly -- but from a practicality beats purity perspective, there is
a difference between "I have no idea whatsoever" and "I know it is
mostly ascii, and European, but there are some extra characters in
there"

Latin-1 has proven very useful for that case.

I suppose in most cases ascii with errors='replace' would be a good
choice, but I'd still rather not throw out potentially useful
information.

> Your previous advocacy has also touched on using latin-1 to decode existing 
> files with unknown encodings as well. If you want to advocate for using 
> latin-1 only for the creation of new data, maybe stop talking about existing 
> files? :-)

Yeah, I've been very unfocused in this discussion -- sorry about that.

> > Note: the primary use-case I have in mind is working with ascii text in 
> > numpy arrays efficiently-- folks have called for that. All I'm saying is 
> > use Latin-1 instead of ascii -- that buys you some useful extra characters.
>
> For that use case, the alternative in play isn't ASCII, it's UTF-8, which 
> buys you a whole bunch of useful extra characters. ;-)

UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy-to-access numpy string dtypes, and particularly the
defaults, should match it.

It's become clear in this discussion that there is a strong desire to
support a numpy dtype that stores text in particular binary formats
(i.e. encodings). Rather than choose one or two, we might as well
support all encodings supported by python.

In that case, we'll have utf-8 for those that know they want that, and
we'll have latin-1 for those that incorrectly think they want that :-)

So what remains to be decided is implementation, syntax, and defaults.

Let's keep in mind that most of us on this list, and in this
discussion, are the folks that write interface code and the like. But
most numpy users are not as tuned in to the internals. So defaults
should be set to best support the more "naive" user.

> . If all we do is add a latin-1 dtype for people to use to create new 
> in-memory data, then someone is going to use it to read existing data in 
> unknown or ambiguous encodings.

If we add every encoding known to man someone is going to use Latin-1
to read unknown encodings. Indeed, as we've all pointed out, there is
no correct encoding with which to read unknown encodings.

Frankly, if we have UTF-8 under the hood, I think people are even MORE
likely to use it inappropriately-- it's quite scary how many people
think UTF-8 == Unicode, and think all you need to do is "use utf-8",
and you don't need to change any of the rest of your code. Oh, and
once you've done that, you can use your existing ASCII-only tests and
think you have a working application :-)

-CHB


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Eric Wieser
> I think we can implement viewers for strings as ndarray subclasses. Then one
> could
> do `my_string_array.view(latin_1)`, and so on.  Essentially that just
> changes the default
> encoding of the 'S' array. That could also work for uint8 arrays if needed.
>
> Chuck

To handle structured data-types containing encoded strings, we'd also
need to subclass `np.void`.

Things would get messy when a structured dtype contains two strings in
different encodings (or more likely, one bytestring and one
textstring) - we'd need some way to specify which fields are in which
encoding, and using subclasses means that this can't be contained
within the dtype information.

So I think there's a strong argument for solving this with `dtype`s
rather than subclasses. This really doesn't seem hard though.
Something like (C-but-as-python):

def ENCSTRING_getitem(ptr, arr):  # The PyArrFuncs slot
    encoded = STRING_getitem(ptr, arr)
    return encoded.decode(arr.dtype.encoding)

def ENCSTRING_setitem(val, ptr, arr):  # The PyArrFuncs slot
    val = val.encode(arr.dtype.encoding)
    # todo: handle "safe" truncation, where safe might mean keep
    # codepoints, keep graphemes, or never allow
    STRING_setitem(val, ptr, arr)

We'd probably need to be careful to do a decode/encode dance when
copying from one encoding to another, but we [already have
bugs](https://github.com/numpy/numpy/issues/3258) in those cases
anyway.
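
(Spelled out with today's np.char helpers, and latin-1/utf-8 just as stand-in
encodings, the dance is roughly:)

import numpy as np

latin1_arr = np.array([b'caf\xe9'], dtype='S4')    # latin-1 encoded bytes
text = np.char.decode(latin1_arr, 'latin-1')       # -> unicode ('U') array
utf8_arr = np.char.encode(text, 'utf-8')           # -> utf-8 encoded 'S' array
np.char.decode(utf8_arr, 'utf-8')                  # back to text, unchanged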

Is it reasonable that the user of such an array would want to work
with plain `builtin.unicode` objects, rather than some special numpy
scalar type?

Eric


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Aldcroft, Thomas
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal <
chris.bar...@noaa.gov> wrote:

> > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:
>
> > Eh... First, on Windows and MacOS, filenames are natively Unicode.
>
> Yeah, though once they are stored in a text file -- who the heck
> knows? That may be simply unsolvable.
> > s. And then from in Python, if you want to actually work with those
> filenames you need to either have a bytestring type or else a Unicode type
> that uses surrogateescape to represent the non-ascii characters.
>
>
> > IMO if you have filenames that are arbitrary bytestrings and you need to
> represent this properly, you should just use bytestrings -- really, they're
> perfectly friendly :-).
>
> I thought the Python file (and Path) APIs all required (Unicode)
> strings? That was the whole complaint!
>
> And no, bytestrings are not perfectly friendly in py3.
>
> This got really complicated and sidetracked, but all I'm suggesting is
> that if we have a 1-byte-per-char string type, with a fixed encoding,
> that that encoding be Latin-1, rather than ASCII.
>
> That's it, really.
>

Fully agreed.


>
> Having a settable encoding would work fine, too.
>

Yup.

At a simple level, I just want the things that currently work just fine in
Py2 to start working in Py3. That includes being able to read / manipulate
/ compute and write back to legacy binary FITS and HDF5 files that include
ASCII-ish text data (not strictly ASCII).  Memory mapping such files should
be supportable.  Swapping type from bytes to a 1-byte char str should be
possible without altering data in memory.

BTW, I am saying "I want", but this functionality would definitely be
welcome in astropy.  I wrote a unicode sandwich workaround for the astropy
Table class (https://github.com/astropy/astropy/pull/5700) which should be
in the next release.  It would be way better to have this at a level lower
in numpy.

- Tom


>
> -CHB


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <
> chris.bar...@noaa.gov> wrote:
>
> >> Presumably you're getting byte strings (with unknown encoding).
> >
> > No -- thus is for creating and using mostly ascii string data with
> python and numpy.
> >
> > Unknown encoding bytes belong in byte arrays -- they are not text.
>
> You are welcome to try to convince Thomas of that. That is the status quo
> for him, but he is finding that difficult to work with.
>
> > I DO recommend Latin-1 As a default encoding ONLY for  "mostly ascii,
> with a few extra characters" data. With all the sloppiness over the years,
> there are way too many files like that.
>
> That sloppiness that you mention is precisely the "unknown encoding"
> problem. Your previous advocacy has also touched on using latin-1 to decode
> existing files with unknown encodings as well. If you want to advocate for
> using latin-1 only for the creation of new data, maybe stop talking about
> existing files? :-)
>
> > Note: the primary use-case I have in mind is working with ascii text in
> numpy arrays efficiently-- folks have called for that. All I'm saying is
> use Latin-1 instead of ascii -- that buys you some useful extra characters.
>
> For that use case, the alternative in play isn't ASCII, it's UTF-8, which
> buys you a whole bunch of useful extra characters. ;-)
>
> There are several use cases being brought forth here. Some involve file
> reading, some involve file writing, and some involve in-memory
> manipulation. Whatever change we make is going to impinge somehow on all of
> the use cases. If all we do is add a latin-1 dtype for people to use to
> create new in-memory data, then someone is going to use it to read existing
> data in unknown or ambiguous encodings.
>


The maximum length of a UTF-8 character is 4 bytes, so we could use that
to size arrays by character length. The advantage over UTF-32 is that it is
easily compressible, probably by a factor of 4 in many cases. That doesn't
solve the in-memory problem, but does have some advantages on disk as well
as making for easy display. We could compress it ourselves after encoding
by truncation.
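
(I.e. reserve 4 bytes per character and the encoded data always fits --
a quick sanity check:)

s = u'10 chars \xe9'                # 10 characters, one of them non-ascii
encoded = s.encode('utf-8')
len(s), len(encoded)                # (10, 11) -- always <= 4 * len(s) == 40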

Note that for terminal display we will want something supported by the
system, which is another problem altogether. Let me break the problem down
into four categories


   1. Storage -- hdf5, .npy, fits, etc.
   2. Display -- ?
   3. Modification -- editing
   4. Parsing -- fits, etc.

There is probably no one solution that is optimal for all of those.

Chuck


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal
 wrote:
>> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:
>
>> Eh... First, on Windows and MacOS, filenames are natively Unicode.
>
> Yeah, though once they are stored in a text file -- who the heck
> knows? That may be simply unsolvable.
>> s. And then from in Python, if you want to actually work with those 
>> filenames you need to either have a bytestring type or else a Unicode type 
>> that uses surrogateescape to represent the non-ascii characters.
>
>
>> IMO if you have filenames that are arbitrary bytestrings and you need to 
>> represent this properly, you should just use bytestrings -- really, they're 
>> perfectly friendly :-).
>
> I thought the Python file (and Path) APIs all required (Unicode)
> strings? That was the whole complaint!

No, the path APIs all accept bytestrings (and ones that return
pathnames like listdir return bytestrings if given bytestrings). Or at
least they're supposed to.

The really urgent need for surrogateescape was things like sys.argv
and os.environ where arbitrary bytes might come in (on some systems)
but the API is restricted to strs.

> And no, bytestrings are not perfectly friendly in py3.

I'm not saying you should use them everywhere or that they remove the
need for an ergonomic text dtype, but when you actually want to work
with bytes they're pretty good (esp. in modern py3).

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith  wrote:

> Eh... First, on Windows and MacOS, filenames are natively Unicode.

Yeah, though once they are stored in a text file -- who the heck
knows? That may be simply unsolvable.
> s. And then from in Python, if you want to actually work with those filenames 
> you need to either have a bytestring type or else a Unicode type that uses 
> surrogateescape to represent the non-ascii characters.


> IMO if you have filenames that are arbitrary bytestrings and you need to 
> represent this properly, you should just use bytestrings -- really, they're 
> perfectly friendly :-).

I thought the Python file (and Path) APIs all required (Unicode)
strings? That was the whole complaint!

And no, bytestrings are not perfectly friendly in py3.

This got really complicated and sidetracked, but all I'm suggesting is
that if we have a 1-byte-per-char string type, with a fixed encoding,
that that encoding be Latin-1, rather than ASCII.

That's it, really.

Having a settable encoding would work fine, too.

-CHB


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker"  wrote:


 - filenames

File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames in with our data, and thus in numpy arrays
-- we need to preserve them exactly and display them mostly right. Again,
euro-centric, but if you are euro-centric, then latin-1 is a good choice
for this.


Eh... First, on Windows and MacOS, filenames are natively Unicode. So you
don't care about preserving the bytes, only the characters. It's only Linux
and the other traditional unixes where filenames are natively bytestrings.
And then from in Python, if you want to actually work with those filenames
you need to either have a bytestring type or else a Unicode type that uses
surrogateescape to represent the non-ascii characters. I'm not seeing how
latin1 really helps anything here -- best case you still have to do
something like the wsgi "encoding dance" before you could use the
filenames. IMO if you have filenames that are arbitrary bytestrings and you
need to represent this properly, you should just use bytestrings -- really,
they're perfectly friendly :-).

-n


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern  wrote:

> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
> >
> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <
> peridot.face...@gmail.com> wrote:
>
> >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
> other packages are waiting specifically for it. But specifying this
> requires two pieces of information: What is the encoding? and How is the
> length specified? I know they're not numpy-compatible, but FITS header
> values are space-padded; does that occur elsewhere? Are there other ways
> existing data specifies string length within a fixed-size field? There are
> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
> PKCS7, etc. - but they are probably too specialized to need? We should make
> sure we can support all the ways that actually occur.
> >
> >
> > Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> Just to clarify some terminology (because it wasn't originally clear to me
> until I looked it up in reference to HDF5):
>
> * "NULL-padded" implies that, for a fixed width of N, there can be up to N
> non-NULL bytes. Any extra space left over is padded with NULLs, but no
> space needs to be reserved for NULLs.
>
> * "NULL-terminated" implies that, for a fixed width of N, there can be up
> to N-1 non-NULL bytes. There must always be space reserved for the
> terminating NULL.
>
> I'm not really sure if "NULL-padded" also specifies the behavior for
> embedded NULLs. It's certainly possible to deal with them: just strip
> trailing NULLs and leave any embedded ones alone. But I'm also sure that
> there are some implementations somewhere that interpret the requirement as
> "stop at the first NULL or the end of the fixed width, whichever comes
> first", effectively being NULL-terminated just not requiring the reserved
> space.
>

Thanks for the clarification. NULL-padded is what I meant.

I'm wondering how much of the desired functionality we could get by simply
subclassing ndarray in python. I think we mostly want to be able to view
byte strings and convert to unicode if needed.
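
(Very roughly, something like this toy sketch -- the class name and the
hardcoded latin-1 default are just for illustration:)

import numpy as np

class DecodedStringArray(np.ndarray):
    """Toy view of an 'S' array that decodes scalars on access."""
    encoding = 'latin-1'   # class-level default so sliced views still work

    def __new__(cls, s_array, encoding='latin-1'):
        obj = np.asarray(s_array).view(cls)
        obj.encoding = encoding
        return obj

    def __getitem__(self, index):
        item = super(DecodedStringArray, self).__getitem__(index)
        if isinstance(item, bytes):                 # a scalar was pulled out
            return item.decode(self.encoding)
        return item

a = DecodedStringArray(np.array([b'abc', b'\xe9t\xe9'], dtype='S3'))
a[1]    # u'été'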

Chuck


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Eric Wieser
Chuck: That sounds like something we want to deprecate, for the same reason
that python3 no longer allows str(b'123') to do the right thing.

Specifically, it seems like astype should always be forbidden to go between
unicode and byte arrays - so that would need to be written as:

In [1]: a = array([1,2,3], uint8) + 0x30

In [2]: a.view('S1')
Out[2]:
array(['1', '2', '3'],
  dtype='|S1')

In [3]: a.view('U[ascii]')
Out[3]:
array([u'1', u'2', u'3'],
  dtype='U[ascii]')

On Tue, Apr 25, 2017, Charles R Harris wrote:

On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald 
> wrote:
>
>>
>> On Tue, Apr 25, 2017 at 7:09 PM Robert Kern 
>> wrote:
>>
>>> * HDF5 supports fixed-length and variable-length string arrays encoded
>>> in ASCII and UTF-8. In all cases, these strings are NULL-terminated
>>> (despite the documentation claiming that there are more options). In
>>> practice, the ASCII strings permit high-bit characters, but the encoding is
>>> unspecified. Memory-mapping is rare (but apparently possible). The two
>>> major HDF5 bindings are waiting for a fixed-length UTF-8 numpy dtype to
>>> support that HDF5 option. Compression is supported for fixed-length string
>>> arrays but not variable-length string arrays.
>>>
>>> * FITS supports fixed-length string arrays that are NULL-padded. The
>>> strings do not have a formal encoding, but in practice, they are typically
>>> mostly ASCII characters with the occasional high-bit character from an
>>> unspecific encoding. Memory-mapping is a common practice. These arrays can
>>> be quite large even if each scalar is reasonably small.
>>>
>>> * pandas uses object arrays for flexible in-memory handling of string
>>> columns. Lengths are not fixed, and None is used as a marker for missing
>>> data. String columns must be written to and read from a variety of formats,
>>> including CSV, Excel, and HDF5, some of which are Unicode-aware and work
>>> with `unicode/str` objects instead of `bytes`.
>>>
>>> * There are a number of sometimes-poorly-documented,
>>> often-poorly-adhered-to, aging file format "standards" that include string
>>> arrays but do not specify encodings, or such specification is ignored in
>>> practice. This can make the usual "Unicode sandwich" at the I/O boundaries
>>> difficult to perform.
>>>
>>> * In Python 3 environments, `unicode/str` objects are rather more
>>> common, and simple operations like equality comparisons no longer work
>>> between `bytes` and `unicode/str`, making it difficult to work with numpy
>>> string arrays that yield `bytes` scalars.
>>>
>>
>> It seems the greatest challenge is interacting with binary data from
>> other programs and libraries. If we were living entirely in our own data
>> world, Unicode strings in object arrays would generally be pretty
>> satisfactory. So let's try to get what is needed to read and write other
>> people's formats.
>>
>> I'll note that this is numpy, so variable-width fields (e.g. CSV) don't
>> map directly to numpy arrays; we can store it however we want, as
>> conversion is necessary anyway.
>>
>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two
>> other packages are waiting specifically for it. But specifying this
>> requires two pieces of information: What is the encoding? and How is the
>> length specified? I know they're not numpy-compatible, but FITS header
>> values are space-padded; does that occur elsewhere? Are there other ways
>> existing data specifies string length within a fixed-size field? There are
>> some cryptographic length-specification tricks - ANSI X.293, ISO 10126,
>> PKCS7, etc. - but they are probably too specialized to need? We should make
>> sure we can support all the ways that actually occur.
>>
>
> Agree with the UTF-8 fixed byte length strings, although I would tend
> towards null terminated.
>
> For  byte strings, it looks like we need a parameterized type. This is for
> two uses, display and conversion to (Python) unicode. One could handle the
> display and conversion using view and astype methods. For instance, we
> already have
>
> In [1]: a = array([1,2,3], uint8) + 0x30
>
> In [2]: a.view('S1')
> Out[2]:
> array(['1', '2', '3'],
>   dtype='|S1')
>
> In [3]: a.view('S1').astype('U')
> Out[3]:
> array([u'1', u'2', u'3'],
>   dtype='<U1')
> Chuck
>
> ___
> NumPy-Discussion 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge  wrote:

> On 04/25/2017 01:34 PM, Anne Archibald wrote:
> > I know they're not numpy-compatible, but FITS header values are
> > space-padded; does that occur elsewhere?
>
> Strings in FITS headers are delimited by single quotes.  Some keywords
> (only a handful) are required to have values that are blank-padded (in
> the FITS file) if the value is less than eight characters.  Whether you
> get trailing blanks when you read the header depends on the FITS
> reader.  I use astropy.io.fits to read/write FITS files, and that
> interface strips trailing blanks from character strings:
>
> TARGPROP= 'UNKNOWN '   / Proposer's name for the target
>
>  >>> fd = fits.open("test.fits")
>  >>> s = fd[0].header['targprop']
>  >>> len(s)
> 7
>

Actually, for what it's worth, the FITS spec says that in such values
trailing spaces are not significant, see page 7:
https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf
But they're not really relevant to numpy's situation, because as here you
need to do elaborate de-quoting before they can go into a data structure.
What I was wondering was whether people have data lying around with
fixed-width fields where the strings are space-padded, so that numpy needs
to support that.

Anne


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker  wrote:

>
> This is essentially my rant about use-case (2):
>
> A compact dtype for mostly-ascii text:
>

I'm a little confused about exactly what you're trying to do. Do you need
your in-memory format for this data to be compatible with anything in
particular?

If you're not reading or writing files in this format, then it's just a
matter of storing a whole bunch of things that are already python strings
in memory. Could you use an object array? Or do you have an enormous number
so that you need a more compact, fixed-stride memory layout?

Presumably you're getting byte strings (with no NULLs) from somewhere and
need to store them in this memory structure in a way that makes them as
usable as possible in spite of their unknown encoding. Presumably the thing
to do is just copy them in there as-is and then use .astype to arrange for
python to decode them when accessed. So this is precisely the problem of
"how should I decode random byte strings?" that python has been struggling
with. My impression is that the community has established that there's no
one solution that makes everyone happy, but that most people can cope with
some combination of picking a one-byte encoding,
ascii-with-surrogateescapes, zapping bogus characters, and giving wrong
results. But I think that all the standard python alternatives are needed,
in general, and in terms of interpreting numpy arrays full of bytes.
Clearly your preferred solution is .astype("string[latin-9]"), but just as
clearly that's not going to work for everyone.

If your question is "what should numpy's default string dtype be?", well,
maybe default to object arrays; anyone who just has a bunch of python
strings to store is unlikely to be surprised by this. Someone with more
specific needs will choose a more specific - that is, not default - string
data type.

Anne


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Ambrose LI
2017-04-25 12:34 GMT-04:00 Chris Barker :
> I am totally euro-centric, but as I understand it, that is the whole point
> of the desire for a compact one-byte-per character encoding. If there is a
> strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we
> should support that. But this all started with "mostly ascii". My take on
> that is:

But Shift-JIS is not one-byte; it's two-byte (unless you allow only
half-width characters and nothing else). :-) In fact legacy CJK
encodings are all nominally two-byte (so that the width of a
character's internal representation matches that of its visual
representation).

>  - filenames
>
> File names are one of the key reasons folks struggled with the python3 data
> model (particularly on *nix) and why 'surrogateescape' was added. It's
> pretty common to store filenames in with our data, and thus in numpy arrays
> -- we need to preserve them exactly and display them mostly right. Again,
> euro-centric, but if you are euro-centric, then latin-1 is a good choice for
> this.

This I don't understand. As far as I can tell non-Western-European
filenames are not unusual. If filenames are a reason, even if you're
euro-centric (think Eastern Europe, say) I don't see how latin1 is a
good choice.

Lurker here, and I haven't touched numpy in ages. So I might be
blurting out nonsense.

-- 
Ambrose Li // http://o.gniw.ca / http://gniw.ca
If you saw this on CE-L: You do not need my permission to quote
me, only proper attribution. Always cite your sources, even if
you have to anonymize and/or cite it as "personal communication".


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer  wrote:
>
> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker 
wrote:
>>>
>>> On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
really is ascii, then it's perfect. If it really is latin-*, then you get
some extra useful stuff, and if it's corrupted somehow, you still get the
ascii text correct, and the rest won't  barf and can be passed on through.
>
> I am totally in agreement with Thomas that "We are living in a messy
world right now with messy legacy datasets that have character type data
that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
>
> My question: What are those non-ASCII characters? How often are they
truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't know that we can reasonably make that accounting relevant. Number
of such characters per byte of text? Number of files with such characters
out of all existing files?

What I can say with assurance is that every time I have decided, as a
developer, to write code that just hardcodes latin-1 for such cases, I have
regretted it. While it's just personal anecdote, I think it's at least
measuring the right thing. :-)

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:
>
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:
>
>>> In this case, we want something compatible with Python's string (i.e.
full Unicode supporting) and I think should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).
>>
>>
>> Yes, but NumPy doesn't really implement string operations, so
fortunately this is pretty irrelevant to us -- except for our API for
specifying dtype size.
>
> Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
String too long for a string[12] dtype array.

We have the freedom to make the error message not suck. :-)

> When len(a_string) <= 12
>
> AND that will only  occur if there are non-ascii characters in the
string, and maybe only if there are more than N non-ascii characters. i.e.
it is very likely to be a run-time error that may not have shown up in
tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
they need to truncate it, and THAT is non-obvious how to do, too -- you
don't want to truncate the encodes bytes naively, you could end up with an
invalid bytestring. but you don't know how many characters to truncate,
either.

If this becomes the right strategy for dealing with these problems (and I'm
not sure that it is), we can easily make a utility function that does this
for people.
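
(Something along these lines, say -- `utf8_truncate` is just an illustrative
name; it trims to a byte budget without splitting a multi-byte sequence:)

def utf8_truncate(s, max_bytes):
    """Drop trailing characters of s until its UTF-8 encoding fits max_bytes."""
    encoded = s.encode('utf-8')[:max_bytes]
    # a cut in the middle of a multi-byte character leaves an invalid tail;
    # errors='ignore' drops it instead of raising
    return encoded.decode('utf-8', 'ignore')

utf8_truncate(u'caf\xe9', 4)    # u'caf' -- the 2-byte é no longer fits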

This discussion is why I want to be sure that we have our use cases
actually mapped out. For this kind of in-memory manipulation, I'd use an
object array (a la pandas), then convert to the uniform-width string dtype
when I needed to push this out to a C API, HDF5 file, or whatever actually
requires a string-dtype array. The required width gets computed from the
data after all of the manipulations are done. Doing in-memory assignments
to a fixed-encoding, fixed-width string dtype will always have this kind of
problem. You should only put up with it if you have a requirement to write
to a format that specifies the width and the encoding. That specified
encoding is frequently not latin-1!

>> I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that
your data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but not sure what the point of that is in
this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.

But what if the format I'm working with specifies another encoding? Am I
supposed to encode all of my Unicode strings in the specified encoding,
then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
really important use case for me.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern  wrote:

> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
> aldcr...@head.cfa.harvard.edu> wrote:
> >
> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker 
> wrote:
>
> >> - round-tripping of binary data (at least with Python's
> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
> re-encoded to get the same bytes back. You may get garbage, but you won't
> get an EncodingError.
> >
> > +1.  The key point is that there is a HUGE amount of legacy science data
> in the form of FITS (astronomy-specific binary file format that has been
> the primary file format for 20+ years) and HDF5 which uses a character data
> type to store data which can be bytes 0-255.  Getting an decoding/encoding
> error when trying to deal with these datasets is a non-starter from my
> perspective.
>
> That says to me that these are properly represented by `bytes` objects,
> not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
> encoding.
>

If you could go back 30 years and get every scientist in the world to do
the right thing, then sure.  But we are living in a messy world right now
with messy legacy datasets that have character type data that are *mostly*
ASCII, but not infrequently contain non-ASCII characters.

So I would beg to actually move forward with a pragmatic solution that
addresses very real and consequential problems that we face instead of
waiting/praying for a perfect solution.

- Tom


>
> --
> Robert Kern
>


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker 
wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really
short strings. I think we are all thinking in terms of longer strings,
maybe a name field, where you might assign 32 bytes or so -- then someone
has an accented character in their name, and then get 30 or 31 characters --
no big deal.
>>
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon
to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5
character ASCII strings.  This has been a significant blocker to Python 3
adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

Unless if you have hundreds of billions of them.

>> BTW, for those new to the list or with a short memory, this topic has
been discussed fairly extensively at least 3 times before.  Hopefully the
*fourth* time will be the charm!
>
> yes, let's hope so!
>
> The big difference now is that Julian seems to be committed to actually
making it happen!
>
> Thanks Julian!
>
> Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.
>
> I have strong opinions, but would still rather see any of the ideas on
the table implemented than nothing.

FWIW, I prefer nothing to just adding a special case for latin-1. Solve the
HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone
else is willing to solve that problem. I don't think we're at the
bikeshedding stage yet; we're still disagreeing about fundamental
requirements.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker 
wrote:

> latin-1 or latin-9 buys you (over ASCII):
>
> ...
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
-- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.
>

For a new application, it's a good thing if a text type breaks when you try
to stuff arbitrary bytes in it (see Python 2 vs Python 3 strings).

Certainly, I would argue that nobody should write data in latin-1 unless
they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be
used by default by loaders for legacy file formats/applications (i.e.,
netCDF3) that support unspecified "one byte strings." Then you're a few
short calls away from viewing (i.e., array.view('text[my_real_encoding]'),
if we support arbitrary encodings) or decoding (i.e.,
np.char.decode(array.view(bytes), 'my_real_encoding') ) the data in the
proper encoding. It's not realistic to expect users to know the true
encoding for strings from a file before they even look at the data.

On the other hand, if this is the use-case, perhaps we really want an
encoding closer to "Python 2" string, i.e, "unknown", to let this be
signaled more explicitly. I would suggest that "text[unknown]" should
support operations like a string if it can be decoded as ASCII, and
otherwise error. But unlike "text[ascii]", it will let you store arbitrary
bytes.


> Then use a native flexible-encoding dtype for everything else.
>>>
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>>
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.
>

Indeed, it would be helpful for this discussion to know what other
encodings are actually currently used by scientific applications.

So far, we have real use cases for at least UTF-8, UTF-32, ASCII and
"unknown".

The current 'S' dtype truncates silently already:
>

One advantage of a new (non-default) dtype is that we can change this
behavior.


> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?
>

It might be better to avoid this for now, and force users to be explicit
about encoding if they use the dtype for encoded text. We can keep
bytes/str mapped to the current choices.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer  wrote:


> In this case, we want something compatible with Python's string (i.e. full
>> Unicode supporting) and I think should be as transparent as possible.
>> Python's string has made the decision to present a character oriented API
>> to users (despite what the manifesto says...).
>>
>
> Yes, but NumPy doesn't really implement string operations, so fortunately
> this is pretty irrelevant to us -- except for our API for specifying dtype
> size.
>

Exactly -- the character-orientation of python strings means that people
are used to thinking that strings have a length that is the number of
characters in the string. I think there will be a cognitive dissonance if
someone does:

arr[i] = a_string

Which then raises a ValueError, something like:

String too long for a string[12] dtype array.

When len(a_string) <= 12

AND that will only  occur if there are non-ascii characters in the string,
and maybe only if there are more than N non-ascii characters. i.e. it is
very likely to be a run-time error that may not have shown up in tests.

So folks need to do something like:

len(a_string.encode('utf-8')) to see if their string will fit. If not, they
need to truncate it, and THAT is non-obvious how to do, too -- you don't
want to truncate the encoded bytes naively, you could end up with an
invalid bytestring, but you don't know how many characters to truncate,
either.


> We already have strong precedence for dtypes reflecting number of bytes
> used for storage even when Python doesn't: consider numeric types like
> int64 and float32 compared to the Python equivalents. It's an intrinsic
> aspect of NumPy that users need to think about how their data is actually
> stored.
>

sure, but a float64 is 64 bits (8 bytes) forever and always, and the
defaults perfectly match what python is doing under its hood -- even if
users don't think about it. So the default behaviour of numpy matches
python's built-in types.


Storage cost is always going to be a concern. Arguably, it's even more of a
>> concern today than it used to be, because compute has been improving
>> faster than storage.
>>
>
sure -- but again, what is the use-case for numpy arrays with a s#$)load of
text in them? Common? I don't think so. And as you pointed out, numpy
doesn't do text processing anyway, so cache performance and all that are
not important. So having UCS-4 as the default, but allowing folks to select
a more compact format if they really need it, is a good way to go. Just like
numpy generally defaults to float64 and int64 (or 32, depending on
platform) -- users can select a smaller size if they have a reason to.

I guess that's my summary -- just like with numeric values, numpy should
default to Python-like behavior as much as possible for strings, too --
with an option for a knowledgeable user to do something more performant.


> I still don't understand why a latin encoding makes sense as a preferred
> one-byte-per-char dtype. The world, including Python 3, has standardized on
> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>

utf-8 is NOT a one-byte per char encoding. IF you want to assure that your
data are one-byte per char, then you could use ASCII, and it would be
binary compatible with utf-8, but not sure what the point of that is in
this context.

latin-1 or latin-9 buys you (over ASCII):

- A bunch of accented characters -- sure it only covers the latin
languages, but does cover those much better.

- A handful of other characters, including scientifically useful ones. (a
few greek characters, the degree symbol, etc...)

- round-tripping of binary data (at least with Python's encoding/decoding)
-- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
same bytes back. You may get garbage, but you won't get an EncodingError.
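
(A two-line demonstration of that round-trip property:)

    raw = bytes(range(256))                                # every possible byte value
    assert raw.decode('latin-1').encode('latin-1') == raw  # never raises
    # the same bytes are *not* valid utf-8: raw.decode('utf-8') -> UnicodeDecodeError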

For Python use -- a pointer to a Python string would be nice.
>>
>
> Yes, absolutely. If we want to be really fancy, we could consider a
> parametric object dtype that allows for object arrays of *any* homogeneous
> Python type. Even if NumPy itself doesn't do anything with that
> information, there are lots of use cases for that information.
>

hmm -- that's nifty idea -- though I think strings could/should be special
cased.


> Then use a native flexible-encoding dtype for everything else.
>>
>
> No opposition here from me. Though again, I think utf-8 alone would also
> be enough.
>

maybe so -- the major reason for supporting others is binary data exchange
with other libraries -- but maybe most of them have gone to utf-8 anyway.

One more note: if a user tries to assign a value to a numpy string array
>> that doesn't fit, they should get an error:
>>
>
>> EncodingError if it can't be encoded into the defined encoding.
>>
>> ValueError if it is too long -- it should not be silently truncated.
>>
>
> I think we all agree here.
>

I'm actually having second thoughts -- see above -- if the encoding is
utf-8, then truncating is 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Stephan Hoyer
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker 
wrote:

> 1) Use with/from  Python -- both creating and working with numpy arrays.
>

> In this case, we want something compatible with Python's string (i.e. full
> Unicode supporting) and I think should be as transparent as possible.
> Python's string has made the decision to present a character oriented API
> to users (despite what the manifesto says...).
>

Yes, but NumPy doesn't really implement string operations, so fortunately
this is pretty irrelevant to us -- except for our API for specifying dtype
size.

We already have strong precedence for dtypes reflecting number of bytes
used for storage even when Python doesn't: consider numeric types like
int64 and float32 compared to the Python equivalents. It's an intrinsic
aspect of NumPy that users need to think about how their data is actually
stored.


> However, there is a challenge here: numpy requires fixed-number-of-bytes
> dtypes. And full unicode support with fixed number of bytes matching fixed
> number of characters is only possible with UCS-4 -- hence the current
> implementation. And this is actually just fine! I know we all want to be
> efficient with data storage, but really -- in the early days of Unicode,
> when folks thought 16 bits were enough, doubling the memory usage for
> western language storage was considered fine -- how long in computer life
> time does it take to double your memory? But now, when memory, disk space,
> bandwidth, etc, are all literally orders of magnitude larger, we can't
> handle a factor of 4 increase in "wasted" space?
>

Storage cost is always going to be a concern. Arguably, it's even more of a
concern today than it used to be, because compute has been improving
faster than storage.


> But as scientific text data often is 1-byte compatible, a
> one-byte-per-char dtype is a fine idea, too -- and we pretty much have that
> already with the existing string type -- that could simply be enhanced by
> enforcing the encoding to be latin-9 (or latin-1, if you don't want the
> Euro symbol). This would get us what scientists expect from strings in a
> way that is properly compatible with Python's string type. You'd get
> encoding errors if you tried to stuff anything else in there, and that's
> that.
>

I still don't understand why a latin encoding makes sense as a preferred
one-byte-per-char dtype. The world, including Python 3, has standardized on
UTF-8, which is also one-byte-per-char for (ASCII) scientific data.

So -- I think we should address the use-cases separately -- one for
> "normal" python use and simple interoperability with python strings, and
> one for interoperability at the binary level. And an easy way to convert
> between the two.
>
> For Python use -- a pointer to a Python string would be nice.
>

Yes, absolutely. If we want to be really fancy, we could consider a
parametric object dtype that allows for object arrays of *any* homogeneous
Python type. Even if NumPy itself doesn't do anything with that
information, there are lots of use cases for that information.

Then use a native flexible-encoding dtype for everything else.
>

No opposition here from me. Though again, I think utf-8 alone would also be
enough.


> Thinking out loud -- another option would be to set defaults for the
> multiple-encoding dtype so you'd get UCS-4 -- with its full compatibility
> with the python string type -- and make folks make an effort to get
> anything else.
>

The np.unicode_ type is already UCS-4 and the default for dtype=str on
Python 3. We probably shouldn't change that, but if we set any default
encoding for the new text type, I strongly believe it should be utf-8.

One more note: if a user tries to assign a value to a numpy string array
> that doesn't fit, they should get an error:
>
> EncodingError if it can't be encoded into the defined encoding.
>
> ValueError if it is too long -- it should not be silently truncated.
>

I think we all agree here.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Chris Barker
I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts:

1) most of it is focused on utf-8 vs utf-16. And that is a strong argument
-- utf-16 is the worst of both worlds.

2) it isn't really addressing how to deal with fixed-size string storage as
needed by numpy.

It does bring up Python's current approach to Unicode:

"""
This lead to software design decisions such as Python’s string O(1) code
point access. The truth, however, is that Unicode is inherently more
complicated and there is no universal definition of such thing as *Unicode
character*. We see no particular reason to favor Unicode code points over
Unicode grapheme clusters, code units or perhaps even words in a language
for that.
"""

My thoughts on that-- it's technically correct, but practicality beats
purity, and the character concept is pretty darn useful for at least some
(commonly used in the computing world) languages.

In any case, whether the top-level API is character focused doesn't really
have a bearing on the internal encoding, which is very much an
implementation detail in py 3 at least.

And Python has made its decision about that.

So what are the numpy use-cases?

I see essentially two:

1) Use with/from  Python -- both creating and working with numpy arrays.

In this case, we want something compatible with Python's string (i.e. full
Unicode supporting) and I think should be as transparent as possible.
Python's string has made the decision to present a character oriented API
to users (despite what the manifesto says...).

However, there is a challenge here: numpy requires fixed-number-of-bytes
dtypes. And full unicode support with fixed number of bytes matching fixed
number of characters is only possible with UCS-4 -- hence the current
implementation. And this is actually just fine! I know we all want to be
efficient with data storage, but really -- in the early days of Unicode,
when folks thought 16 bits were enough, doubling the memory usage for
western language storage was considered fine -- how long in computer life
time does it take to double your memory? But now, when memory, disk space,
bandwidth, etc, are all literally orders of magnitude larger, we can't
handle a factor of 4 increase in "wasted" space?

Alternatively, Robert's suggestion of having essentially an object array,
where the objects were known to be python strings is a pretty nice idea --
it gives the full power of python strings, and is a perfect one-to-one
match with the python text data model.

But as scientific text data often is 1-byte compatible, a one-byte-per-char
dtype is a fine idea, too -- and we pretty much have that already with the
existing string type -- that could simply be enhanced by enforcing the
encoding to be latin-9 (or latin-1, if you don't want the Euro symbol).
This would get us what scientists expect from strings in a way that is
properly compatible with Python's string type. You'd get encoding errors if
you tried to stuff anything else in there, and that's that.

Yes, it would have to be a "new" dtype for backwards compatibility.

2) Interchange with other systems: passing the raw binary data back and
forth between numpy arrays and other code, written in C, Fortran, or binary
file formats.

This is a key use-case for numpy -- I think the key to its enormous
success. But how important is it for text? Certainly any data set I've ever
worked with has had gobs of binary numerical data, and a small smattering
of text. So in that case, if, for instance, h5py had to encode/decode text
when transferring between HDF files and numpy arrays, I don't think I'd
ever see the performance hit. As for code complexity -- it would mean more
complex code in interface libs, and less complex code in numpy itself.
(though numpy could provide utilities to make it easy to write the
interface code)

If we do want to support direct binary interchange with other libs, then we
should probably simply go for it, and support any encoding that Python
supports -- as long as you are dealing with multiple encodings, why try to
decide up front which ones to support?

But how do we expose this to numpy users? I still don't like having
non-fixed-width encoding under the hood, but what can you do? Other than
that, having the encoding be a selectable part of the dtype works fine --
and in that case the number of bytes should be the "length" specifier.

This, however, creates a bit of an impedance mismatch with the
"character-focused" approach of the python string type, and requires the
user to understand something about the encoding in order to even know how
many bytes they need -- a utf-8-100 string will hold a different "length"
of string than a utf-16-100 string.
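
A quick illustration of that mismatch (plain Python, nothing numpy-specific):

    ascii_text = 'x' * 100
    greek_text = '\u03b1' * 100              # 100 Greek alphas

    print(len(ascii_text.encode('utf-8')))      # 100 bytes
    print(len(ascii_text.encode('utf-16-le')))  # 200 bytes
    print(len(greek_text.encode('utf-8')))      # 200 bytes
    print(len(greek_text.encode('utf-16-le')))  # 200 bytes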

So -- I think we should address the use-cases separately -- one for
"normal" python use and simple interoperability with python strings, and
one for interoperability at the binary level. And an easy way to convert
between the two.

For Python use -- a pointer to a Python string would 

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer  wrote:
>
> On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern 
wrote:
>>
>> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer  wrote:
>> >
>> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern 
wrote:
>> >>
>> >> I don't know of a format off-hand that works with numpy
uniform-length strings and Unicode as well. HDF5 (to my recollection)
supports arrays of NULL-terminated, uniform-length ASCII like FITS, but
only variable-length UTF8 strings.
>> >
>> >
>> > HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed
and variable length versions:
>> > https://github.com/PyTables/PyTables/issues/499
>> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>> >
>> > "Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.
>>
>> Ah, okay, I was interpolating from a quick perusal of the h5py docs,
which of course are also constrained by numpy's current set of dtypes. The
NULL-terminated ASCII works well enough with np.string's semantics.
>
> Yes, except that on Python 3, "Fixed length ASCII" in HDF5 should
correspond to a string type, not np.string_ (which is really bytes).

"... well enough with np.string's semantics [that h5py actually used it to
pass data in and out; whether that array is fit for purpose beyond that, I
won't comment]." :-)

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald 
wrote:
>
> On Thu, Apr 20, 2017 at 8:55 PM Robert Kern  wrote:

>> For example, to my understanding, FITS files more or less follow numpy
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
singular motivating use case for the trailing-NULL behavior of np.string.
>
> Actually if I understood the spec, FITS header lines are 80 bytes long
and contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.

Never mind, then. :-)

>> If I had to jump ahead and propose new dtypes, I might suggest this:
>>
>> * For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.
>>
>> * Acknowledge the use cases of the current NULL-terminated np.string
dtype, but perhaps add a new canonical alias, document it as being for
those specific use cases, and deprecate/de-emphasize the current name.
>>
>> * Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.
>
> How would this differ from a numpy array of bytes with one more
dimension?

The scalar in the implementation being the scalar in the use case;
immutability of the scalar; and directly working with b'' strings in and out
(and thus working with the Python codecs easily).

>> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).
>
> I think there may also be a niche for fixed-byte-size null-terminated
strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would
handle too-long strings by either truncating to a valid encoding or simply
raising an exception. As with the current fixed-length strings, they'd
mostly be for communication with other code, so the necessity depends on
whether such other codes exist at all. Databases, perhaps? Custom hunks of
C that don't want to deal with variable-length packing of data? Actually
this last seems plausible - if I want to pass a great wodge of data,
including Unicode strings, to a C program, writing out a numpy array seems
maybe the easiest.

HDF5 seems to support this, but only for ASCII and UTF8, not a large list
of encodings.

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:53, Robert Kern wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor wrote:
> 
>> Do you have comments on how to go forward, in particular in regards to
>> new dtype vs modify np.unicode?
> 
> Can we restate the use cases explicitly? I feel like we ended up with
> the current sub-optimal situation because we never really laid out the
> use cases. We just felt like we needed bytestring and unicode dtypes,
> more out of completionism than anything, and we made a bunch of
> assumptions just to get each one done. I think there may be broad
> agreement that many of those assumptions are "wrong", but it would be
> good to reference that against concretely-stated use cases.

We ended up in this situation because we did not take the opportunity to
break compatibility when python3 support was added.
We should have made the string dtype an encoded byte type (ascii or
latin1) in python3 instead of null-terminated unencoded bytes, which do
not make very much practical sense.

So the use case is very simple: Give users of the string dtype a
migration path that does not involve converting to full utf32 unicode.
The latin1 encoded bytes dtype would allow that.

As we already have the infrastructure, this same dtype can allow more
than just latin1 with minimal effort; for the fixed-size encodings python
supports, it is literally adding an enum entry, two new switch clauses, and
a little bit of dtype string parsing and test cases.


Having some form of variable string handling would be nice. But this is
another topic altogether.
Having builtin support for variable strings only seems overkill, as the
string dtype is not that important and object arrays should work
reasonably well for this use case already.





Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern  wrote:

> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
> jtaylor.deb...@googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>

+1


> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
>

Actually if I understood the spec, FITS header lines are 80 bytes long and
contain ASCII with no NULLs; strings are quoted and trailing spaces are
stripped.

[...]

> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
>

How would this differ from a numpy array of bytes with one more dimension?


> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
>

I think there may also be a niche for fixed-byte-size null-terminated
strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would
handle too-long strings by either truncating to a valid encoding or simply
raising an exception. As with the current fixed-length strings, they'd
mostly be for communication with other code, so the necessity depends on
whether such other codes exist at all. Databases, perhaps?  Custom hunks of
C that don't want to deal with variable-length packing of data? Actually
this last seems plausible - if I want to pass a great wodge of data,
including Unicode strings, to a C program, writing out a numpy array seems
maybe the easiest.

Anne


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we
decide a new type is necessary?

Am I right in thinking that the general problem here is that it's very easy
to discard metadata when working with dtypes, and that by adding metadata
to `unicode_`, we risk existing code carelessly dropping it? Is this a
problem in both C and python, or just C?

If that's the case, can we end up with a compromise where being careless
just causes old code to promote to ucs32?

On Thu, 20 Apr 2017 at 20:09 Anne Archibald 
wrote:

> On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <
> jtaylor.deb...@googlemail.com> wrote:
>
>> I probably have formulated my goal with the proposal a bit better, I am
>> not very interested in a repetition of which encoding to use debate.
>> In the end what will be done allows any encoding via a dtype with
>> metadata like datetime.
>> This allows any codec (including truncated utf8) to be added easily (if
>> python supports it) and allows sidestepping the debate.
>>
>> My main concern is whether it should be a new dtype or modifying the
>> unicode dtype. Though the backward compatibility argument is strongly in
>> favour of adding a new dtype that makes the np.unicode type redundant.
>>
>
> Creating a new dtype to handle encoded unicode, with the encoding
> specified in the dtype, sounds perfectly reasonable to me. Changing the
> behaviour of the existing unicode dtype seems like it's going to lead to
> massive headaches unless exactly nobody uses it. The only downside to a new
> type is having to find an obvious name that isn't already in use. (And
> having to actively  maintain/deprecate the old one.)
>
> Anne


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor 
wrote:

> I probably have formulated my goal with the proposal a bit better, I am
> not very interested in a repetition of which encoding to use debate.
> In the end what will be done allows any encoding via a dtype with
> metadata like datetime.
> This allows any codec (including truncated utf8) to be added easily (if
> python supports it) and allows sidestepping the debate.
>
> My main concern is whether it should be a new dtype or modifying the
> unicode dtype. Though the backward compatibility argument is strongly in
> favour of adding a new dtype that makes the np.unicode type redundant.
>

Creating a new dtype to handle encoded unicode, with the encoding specified
in the dtype, sounds perfectly reasonable to me. Changing the behaviour of
the existing unicode dtype seems like it's going to lead to massive
headaches unless exactly nobody uses it. The only downside to a new type is
having to find an obvious name that isn't already in use. (And having to
actively  maintain/deprecate the old one.)

Anne


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern  wrote:

> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
>

HDF5 supports two character sets, ASCII and UTF-8. Both come in fixed and
variable length versions:
https://github.com/PyTables/PyTables/issues/499
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

"Fixed length UTF-8" for HDF5 refers to the number of bytes used for
storage, not the number of characters.
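
That bytes-versus-characters distinction is easy to see from Python itself:

    s = 'naïve'
    print(len(s))                  # 5 code points
    print(len(s.encode('utf-8')))  # 6 bytes -- 'ï' needs two bytes
    # so a fixed-length-6 UTF-8 field holds six ASCII characters,
    # but only five characters of this particular string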


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> Do you have comments on how to go forward, in particular in regards to
> new dtype vs modify np.unicode?

Can we restate the use cases explicitly? I feel like we ended up with the
current sub-optimal situation because we never really laid out the use
cases. We just felt like we needed bytestring and unicode dtypes, more out
of completionism than anything, and we made a bunch of assumptions just to
get each one done. I think there may be broad agreement that many of those
assumptions are "wrong", but it would be good to reference that against
concretely-stated use cases.

FWIW, if I need to work with in-memory arrays of strings in Python code,
I'm going to use dtype=object a la pandas. It has almost no arbitrary
constraints, and I can rely on Python's unicode facilities freely. There
may be some cases where it's a little less memory-efficient (e.g.
representing a column of enumerated single-character values like 'M'/'F'),
but that's never prevented me from doing anything (compare to the
uniform-length restrictions, which *have* prevented me from doing things).

So what's left? Being able to memory-map to files that have string data
conveniently laid out according to numpy assumptions (e.g. FITS). Being
able to work with C/C++/Fortran APIs that have arrays of strings laid out
according to numpy assumptions (e.g. HDF5). I think it would behoove us to
canvass the needs of these formats and APIs before making any more
assumptions.

For example, to my understanding, FITS files more or less follow numpy
assumptions for its string columns (i.e. uniform-length). But it enforces
7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
singular motivating use case for the trailing-NULL behavior of np.string.

I don't know of a format off-hand that works with numpy uniform-length
strings and Unicode as well. HDF5 (to my recollection) supports arrays of
NULL-terminated, uniform-length ASCII like FITS, but only variable-length
UTF8 strings.

We should look at some of the newer formats and APIs, like Parquet and
Arrow, and also consider the cross-language APIs with Julia and R.

If I had to jump ahead and propose new dtypes, I might suggest this:

* For the most part, treat the string dtypes as temporary communication
formats rather than the preferred in-memory working format, similar to how
we use `float16` to communicate with GPU APIs.

* Acknowledge the use cases of the current NULL-terminated np.string dtype,
but perhaps add a new canonical alias, document it as being for those
specific use cases, and deprecate/de-emphasize the current name.

* Add a dtype for holding uniform-length `bytes` strings. This would be
similar to the current `void` dtype, but work more transparently with the
`bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
like `float64` does with `float`. This would not be NULL-terminated. No
encoding would be implied.

* Maybe add a dtype similar to `object_` that only permits `unicode/str`
(2.x/3.x) strings (and maybe None to represent missing data a la pandas).
This maintains all of the flexibility of using a `dtype=object` array while
allowing code to specialize for working with strings without all kinds of
checking on every item. But most importantly, we can serialize such an
array to bytes without having to use pickle. Utility functions could be
written for en-/decoding to/from the uniform-length bytestring arrays
handling different encodings and things like NULL-termination (also working
with the legacy dtypes and handling structured arrays easily, etc.).
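
(As a rough sketch of what such utility functions could build on today --
np.char.encode/decode already do the element-wise codec work; the example
data here is made up:)

    import numpy as np

    names = np.array(['Crête', 'Zürich', 'Oslo'])   # dtype '<U6'
    fixed = np.char.encode(names, 'utf-8')          # uniform-length bytes, '|S7' here
    back = np.char.decode(fixed, 'utf-8')           # back to a unicode ('U') array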

--
Robert Kern


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Antoine Pitrou
On Thu, 20 Apr 2017 10:26:13 -0700
Stephan Hoyer  wrote:
> 
> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed sized
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
> 
> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
> 
> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.  

I think you want at least: ascii, utf8, ucs2 (aka utf16 without
surrogates), utf32.  That is, 3 common fixed width encodings and one
variable width encoding.

Regards

Antoine.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
> if you truncate a utf-8 bytestring, you may get invalid data

Note that in general truncating unicode codepoints is not a safe operation
either, as combining characters are a thing. So I don't think this is a
good argument against UTF8.

Also, is silent truncation a thing that we want to allow to happen anyway?
That sounds like something the user ought to be alerted to with an
exception.

> if you wanted to specify that a numpy element would be able to hold, say,
N characters
> ...
> It simply is not the right way to handle text if [...] you need
fixed-length storage

It seems to me that counting code points is pretty futile in unicode, due
to combining characters. The only two meaningful things to count are:
* Graphemes, as that's what the user sees visually. These can span multiple
code-points
* Bytes of encoded data, as that's the space needed to store them

So I would argue that the approach of fixed-codepoint-length storage is
itself a flawed design, and so should not be used as a constraint on numpy.

Counting graphemes is hard, so that leaves the only sensible option as a
byte count.
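
A small example (plain Python) of why code-point counts mislead:

    s = 'e\u0301'                    # 'é' built from 'e' + COMBINING ACUTE ACCENT
    print(len(s))                    # 2 code points, but 1 grapheme on screen
    print(len(s.encode('utf-8')))    # 3 bytes
    print(s == '\u00e9')             # False -- same visual character, different code points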

I don't foresee variable-length encodings being a problem
implementation-wise -- they only become one if numpy were to acquire a
vectorized substring function that is intended to return a view.

I think I'd be in favor of supporting all encodings, and falling back on
python to handle encoding/decoding them.


On Thu, 20 Apr 2017 at 18:44 Chris Barker  wrote:

> On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer  wrote:
>
>> I agree with Anne here. Variable-length encoding would be great to have,
>> but even fixed length UTF-8 (in terms of memory usage, not characters)
>> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
>> fixed size per array element, but that doesn't mean we need a fixed sized
>> per character. Each element in a UTF-8 array would be a string with a fixed
>> number of codepoints, not characters.
>>
>
> Ah, yes -- the nightmare of Unicode!
>
> No, it would not be a fixed number of codepoints -- it would be a fixed
> number of bytes (or "code units"). and an unknown number of characters.
>
> As Julian pointed out, if you wanted to specify that a numpy element would
> be able to hold, say, N characters (actually code points, combining
> characters make this even more confusing) then you would need to allocate
> N*4 bytes to make sure you could hold any string that long. Which would be
> pretty pointless -- better to use UCS-4.
>
> So Anne's suggestion that numpy truncates as needed would make sense --
> you'd specify say N characters, numpy would arbitrarily (or user specified)
> over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
> string that didn't fit. Then you'd need to make sure you truncated
> correctly, so as not to create an invalid string (that's just code, it
> could be made correct).
>
> But how much to over allocate? for english text, with an occasional
> scientific symbol, only a little. for, say, Japanese text, you'd need a
> factor 2 maybe?
>
> Anyway, the idea that "just use utf-8" solves your problems is really
> dangerous. It simply is not the right way to handle text if:
>
> you need fixed-length storage
> you care about compactness
>
> In fact, we already have this sort of distinction between element size and
>> memory usage: np.string_ uses null padding to store shorter strings in a
>> larger dtype.
>>
>
> sure -- but it is clear to the user that the dtype can hold "up to this
> many" characters.
>
>
>> The only reason I see for supporting encodings other than UTF-8 is for
>> memory-mapping arrays stored with those encodings, but that seems like a
>> lot of extra trouble for little gain.
>>
>
> I see it the other way around -- the only reason TO support utf-8 is for
> memory mapping with other systems that use it :-)
>
> On the other hand,  if we ARE going to support utf-8 -- maybe use it for
> all unicode support, rather than messing around with all the multiple
> encoding options.
>
> I think a 1-byte-per char latin-* encoded string is a good idea though --
> scientific use tend to be latin only and space constrained.
>
> All that being said, if the truncation code were carefully written, it
> would mostly "just work"
>
> -CHB
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer  wrote:

> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed sized
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of codepoints, not characters.
>

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed
number of bytes (or "code units"). and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would
be able to hold, say, N characters (actually code points, combining
characters make this even more confusing) then you would need to allocate
N*4 bytes to make sure you could hold any string that long. Which would be
pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncates as needed would make sense --
you'd specify say N characters, numpy would arbitrarily (or user specified)
over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a
string that didn't fit. Then you'd need to make sure you truncated
correctly, so as not to create an invalid string (that's just code, it
could be made correct).

But how much to over allocate? for english text, with an occasional
scientific symbol, only a little. for, say, Japanese text, you'd need a
factor 2 maybe?

Anyway, the idea that "just use utf-8" solves your problems is really
dangerous. It simply is not the right way to handle text if:

you need fixed-length storage
you care about compactness

In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>

sure -- but it is clear to the user that the dtype can hold "up to this
many" characters.


> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.
>

I see it the other way around -- the only reason TO support utf-8 is for
memory mapping with other systems that use it :-)

On the other hand,  if we ARE going to support utf-8 -- maybe use it for
all unicode support, rather than messing around with all the multiple
encoding options.

I think a 1-byte-per char latin-* encoded string is a good idea though --
scientific use tend to be latin only and space constrained.

All that being said, if the truncation code were carefully written, it
would mostly "just work"

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Neal Becker
I'm no unicode expert, but can't we truncate unicode strings so that only
valid characters are included?

On Thu, Apr 20, 2017 at 1:32 PM Chris Barker  wrote:

> On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote:
>
>> Is there any reason not to support all Unicode encodings that python
>> does, with the same names and semantics? This would surely be the simplest
>> to understand.
>>
>
> I think it should support all fixed-length encodings, but not the
> non-fixed length ones -- they just don't fit well into the numpy data model.
>
>
>> Also, if latin1 is going to be the only practical 8-bit encoding,
>> maybe check with some non-Western users to make sure it's not going to
>> wreck their lives? I'd have selected ASCII as an encoding to treat
>> specially, if any, because Unicode already does that and the consequences
>> are familiar. (I'm used to writing and reading French without accents
>> because it's passed through ASCII, for example.)
>>
>
> latin-1 (or latin-9) only makes things better than ASCII -- it buys most
> of the accented characters for the European language and some symbols that
> are nice to have (I use the degree symbol a lot...). And it is ASCII
> compatible -- so there is NO reason to choose ASCII over Latin-*
>
> Which does no good for non-latin languages -- so we need to hear from the
> community -- is there a substantial demand for a non-latin one-byte per
> character encoding?
>
>
>> Variable-length encodings, of which UTF-8 is obviously the one that makes
>> good handling essential, are indeed more complicated. But is it strictly
>> necessary that string arrays hold fixed-length *strings*, or can the
>> encoding length be fixed instead? That is, currently if you try to assign a
>> longer string than will fit, the string is truncated to the number of
>> characters in the data type.
>>
>
> we could do that, yes, but an improperly truncated "string" becomes
> invalid -- just seems like a recipe for bugs that won't be found in testing.
>
> memory is cheap, compressing is fast -- we really shouldn't get hung up on
> this!
>
> Note: if you are storing a LOT of text (though I have no idea why you would
> use numpy for that anyway), then the memory size might matter, but then
> semi-arbitrary truncation would probably matter, too.
>
> I expect most text storage in numpy arrays is things like names of
> datasets, ids, etc, etc -- not massive amounts of text -- so storage space
> really isn't critical. but having an id or something unexpectedly truncated
> could be bad.
>
> I think practical experience has shown us that people do not handle
> "mostly fixed length but once in a while not" text well -- see the nightmare
> of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so
> errors are far more likely to be found in tests (why would you use utf-8 if
> all your data are in ascii???). But still -- why invite hard-to-test-for
> errors?
>
> Final point -- as Julian suggests, one reason to support utf-8 is for
> interoperability with other systems -- but that makes errors more of an
> issue -- if it doesn't pass through the numpy truncation machinery, invalid
> data could easily get put in a numpy array.
>
> -CHB
>
>  it would allow UTF-8 to be used just the way it usually is - as an
>> encoding that's almost 8-bit.
>>
>
> ouch! that perception is the route to way too many errors! it is by no
> means almost 8-bit, unless your data are almost ascii -- in which case, use
> latin-1 for pity's sake!
>
> This highlights my point though -- if we support UTF-8, people WILL use
> it, and only test it with mostly-ascii text, and not find the bugs that
> will crop up later.
>
> All this said, it seems to me that the important use cases for string
>> arrays involve interaction with existing binary formats, so people who have
>> to deal with such data should have the final say. (My own closest approach
>> to this is the FITS format, which is restricted by the standard to ASCII.)
>>
>
> yup -- not sure we'll get much guidance here though -- netcdf does not
> solve this problem well, either.
>
> But if you are pulling, say, a utf-8 encoded string out of a netcdf file
> -- it's probably better to pull it out as bytes and pass it through the
> python decoding/encoding machinery than pasting the bytes straight to a
> numpy array and hope that the encoding and truncation are correct.
>
> -CHB
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
Julian -- thanks for taking this on. NumPy's handling of strings on Python
3 certainly needs fixing.

On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald 
wrote:

> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type. Instead, for encoded Unicode, the string could
> be truncated so that the encoding fits. Of course this is not completely
> trivial for variable-length encodings, but it should be doable, and it
> would allow UTF-8 to be used just the way it usually is - as an encoding
> that's almost 8-bit.
>

I agree with Anne here. Variable-length encoding would be great to have,
but even fixed length UTF-8 (in terms of memory usage, not characters)
would solve NumPy's Python 3 string problem. NumPy's memory model needs a
fixed size per array element, but that doesn't mean we need a fixed sized
per character. Each element in a UTF-8 array would be a string with a fixed
number of codepoints, not characters.

In fact, we already have this sort of distinction between element size and
memory usage: np.string_ uses null padding to store shorter strings in a
larger dtype.
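
(Concretely, with the current np.string_/'S' dtype:)

    import numpy as np

    a = np.array([b'ab', b'abcde'], dtype='S5')
    print(a.itemsize)    # 5 -- every element occupies 5 bytes
    print(a.tobytes())   # b'ab\x00\x00\x00abcde' -- the short value is null padded
    print(a[0])          # b'ab' -- trailing nulls are stripped on access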

The only reason I see for supporting encodings other than UTF-8 is for
memory-mapping arrays stored with those encodings, but that seems like a
lot of extra trouble for little gain.


Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor 
wrote:

> To please everyone I think we need to go with a dtype that supports
> multiple encodings via metadata, similar to how datatime supports
> multiple units.
> E.g.: 'U10[latin1]' are 10 characters in latin1 encoding
>
> Encodings we should support are:
> - latin1 (1 bytes):
> it is compatible with ascii and adds extra characters used in the
> western world.
> - utf-32 (4 bytes):
> can represent every character, equivalent with np.unicode
>
> Encodings we should maybe support:
> - utf-16 with explicitly disallowing surrogate pairs (2 bytes):
> this covers a very large range of possible characters in a reasonably
> compact representation
> - utf-8 (4 bytes):
> variable length encoding with minimum size of 1 byte, but we would need
> to assume the worst case of 4 bytes so it would not save anything
> compared to utf-32 but may allow third parties replace an encoding step
> with trailing null trimming on serialization.
>

I should say first that I've never used even non-Unicode string arrays, but
is there any reason not to support all Unicode encodings that python does,
with the same names and semantics? This would surely be the simplest to
understand.

Also, if latin1 is going to be the only practical 8-bit encoding, maybe
check with some non-Western users to make sure it's not going to wreck
their lives? I'd have selected ASCII as an encoding to treat specially, if
any, because Unicode already does that and the consequences are familiar.
(I'm used to writing and reading French without accents because it's passed
through ASCII, for example.)

Variable-length encodings, of which UTF-8 is obviously the one that makes
good handling essential, are indeed more complicated. But is it strictly
necessary that string arrays hold fixed-length *strings*, or can the
encoding length be fixed instead? That is, currently if you try to assign a
longer string than will fit, the string is truncated to the number of
characters in the data type. Instead, for encoded Unicode, the string could
be truncated so that the encoding fits. Of course this is not completely
trivial for variable-length encodings, but it should be doable, and it
would allow UTF-8 to be used just the way it usually is - as an encoding
that's almost 8-bit.

All this said, it seems to me that the important use cases for string
arrays involve interaction with existing binary formats, so people who have
to deal with such data should have the final say. (My own closest approach
to this is the FITS format, which is restricted by the standard to ASCII.)

Anne