Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge wrote: > > On 04/20/2017 03:17 PM, Anne Archibald wrote: >> >> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > > FITS BINTABLE extensions can h

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Phil Hodge
On 04/20/2017 03:17 PM, Anne Archibald wrote: > Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > FITS BINTABLE extensions can have columns containing strings, and in that case the value

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> >> >> I don't know of a format off-hand that works wi

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Marten van Kerkwijk
> I suggest a new data type 'text[encoding]', 'T'. I like the suggestion very much (it is even in between S and U!). The UTF-8 manifesto linked to above convinced me that the number that should follow is the number of bytes, which is nicely consistent with use in all numerical dtypes. Anyway, m

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > > On 20.04.2017 20:53, Robert Kern wrote: > > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > > > >> Do you have comments on how to go forward, in particu

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern > wrote: > >> > >> I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to m

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with termi

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:59, Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > > I probably should have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Feng Yu
I suggest a new data type 'text[encoding]', 'T'. 1. text can be cast to python strings via decoding. 2. Conceptually casting to python bytes first cast to a string then calls encode(); the current encoding in the meta data is used by default, but the new encoding can be overridden. I slightly f

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Charles R Harris
On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.deb...@googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explici

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:53, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > >> Do you have comments on how to go forward, in particular in regards to >> new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length AS

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.deb...@googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitl

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
Perhaps `np.encoded_str[encoding]` as the name for the new type, if we decide a new type is necessary? Am I right in thinking that the general problem here is that it's very easy to discard metadata when working with dtypes, and that by adding metadata to `unicode_`, we risk existing code careless

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor wrote: > I probably should have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. > In the end what will be done allows any encoding via a dtype with > metadata like datetime. > Thi

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: > I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to my recollection) supports arrays of > NULL-terminated, uniform-length ASCII like FITS, but only variable-length > UTF8 strings. >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Do you have comments on how to go forward, in particular in regards to > new dtype vs modify np.unicode? Can we restate the use cases explicitly? I feel like we ended up with the current sub-optimal situation

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Antoine Pitrou
On Thu, 20 Apr 2017 10:26:13 -0700 Stephan Hoyer wrote: > > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size pe

Re: [Numpy-discussion] Relaxed stride checking fixup

2017-04-20 Thread Charles R Harris
On Thu, Apr 20, 2017 at 4:21 AM, Ralf Gommers wrote: > > > On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > >> Hi All, >> >> Currently numpy master has a bogus stride that will cause an error when >> downstream projects misuse it. That is done in order to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker wrote: > On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > >> I agree with Anne here. Variable-length encoding would be great to have, >> but even fixed length UTF-8 (in terms of memory usage, not characters) >> would solve NumPy's Python 3 s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
I probably should have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Eric Wieser
> if you truncate a utf-8 bytestring, you may get invalid data Note that in general truncating unicode codepoints is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF8. Also, is silent truncation a thing that we want to allow to
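
To make the two truncation hazards mentioned in this message concrete, here is a minimal standard-library sketch. The helper name `truncate_utf8` is hypothetical and is not part of NumPy or of any proposal in this thread; it only illustrates cutting a UTF-8 byte string at a code point boundary, and it does nothing about combining characters.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Hypothetical helper: encode to UTF-8 and keep at most max_bytes,
    cutting only at a code point boundary."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    # The input was valid UTF-8, so only the final, incomplete sequence can
    # be invalid after slicing; errors="ignore" drops exactly that piece.
    return data[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")


word = "n\u00e9"                      # "né": 2 code points, 3 bytes in UTF-8
naive = word.encode("utf-8")[:2]      # cut in the middle of the 2-byte "é"

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False            # the naive byte slice is invalid UTF-8

safe = truncate_utf8(word, 2)         # backs off to the last whole code point

# Truncating by code points is always *valid*, but can still change meaning:
decomposed = "e\u0301"                # "e" followed by COMBINING ACUTE ACCENT
stripped = decomposed[:1]             # the accent is silently lost
```

Whether such a cut should happen silently at all is the separate question raised at the end of the message.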

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:36 AM, Neal Becker wrote: > I'm no unicode expert, but can't we truncate unicode strings so that only > valid characters are included? > sure -- it's just a bit fiddly -- and you need to make sure that everything gets passed through the proper mechanism. numpy is all a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer wrote: > I agree with Anne here. Variable-length encoding would be great to have, > but even fixed length UTF-8 (in terms of memory usage, not characters) > would solve NumPy's Python 3 string problem. NumPy's memory model needs a > fixed size per a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Neal Becker
I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included? On Thu, Apr 20, 2017 at 1:32 PM Chris Barker wrote: > On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald > wrote: > >> Is there any reason not to support all Unicode encodings that python >> do

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Is there any reason not to support all Unicode encodings that python does, > with the same names and semantics? This would surely be the simplest to > understand. > I think it should support all fixed-length encodings, but not the non-fixe

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing. On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald wrote: > Variable-length encodings, of which UTF-8 is obviously the one that makes > good handling essential, are indeed more complicated. But is i

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Chris Barker
Thanks so much for reviving this conversation -- we really do need to address this. My thoughts: > What people apparently want is a string type for Python3 which uses less > memory for the common science use case which rarely needs more than > latin1 encoding. > Yes -- I think there is a real dema

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor wrote: > To please everyone I think we need to go with a dtype that supports > multiple encodings via metadata, similar to how datetime supports > multiple units. > E.g.: 'U10[latin1]' are 10 characters in latin1 encoding > > Encodings we should suppo
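
As a rough illustration of what a fixed-width latin1 field such as the quoted 'U10[latin1]' would mean for storage, the sketch below uses only the standard library. The helpers `encode_fixed` and `decode_fixed` are hypothetical, and the NUL padding is an assumption borrowed from how NumPy's existing 'S' dtype pads; the proposed dtype itself does not exist in NumPy. The 40-byte figure reflects that the current 'U' dtype stores 4 bytes per character.

```python
def encode_fixed(text: str, width: int, encoding: str = "latin1") -> bytes:
    """Hypothetical helper: encode text into exactly `width` bytes, NUL-padded."""
    data = text.encode(encoding)
    if len(data) > width:
        raise ValueError(f"{text!r} needs {len(data)} bytes, field holds {width}")
    return data.ljust(width, b"\x00")


def decode_fixed(field: bytes, encoding: str = "latin1") -> str:
    """Hypothetical helper: strip the NUL padding and decode."""
    return field.rstrip(b"\x00").decode(encoding)


word = "\u00c5ngstr\u00f6m"                      # "Ångström", 8 characters

latin1_field = encode_fixed(word, 10)            # 10 bytes: 1 byte per character
# A 10-character UTF-32 field, which is what NumPy's 'U10' occupies today.
utf32_field = word.ljust(10, "\x00").encode("utf-32-le")

round_trip = decode_fixed(latin1_field)
```

For text that fits in latin1 the field is a quarter of the size, which is the memory saving the thread is after; anything outside latin1 makes `str.encode` raise `UnicodeEncodeError`.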

[Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
Hello, As you probably know, numpy does not deal well with strings in Python3. The np.string type is actually zero-terminated bytes and not a string. In Python2 this happened to work out as it treats bytes and strings the same way. But in Python3 this type is pretty hard to work with as each time yo
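
A short sketch of the behaviour described in this message, assuming NumPy is installed and running under Python 3. It uses only the existing 'S' and 'U' dtypes; the variable names are illustrative.

```python
import numpy as np

# 'S4' holds 4 raw bytes per element, padded with NULs.
names = np.array([b"spam", b"ab"], dtype="S4")

first = names[0]                      # a bytes object on Python 3, not a str
as_text = first.decode("ascii")       # explicit decode needed to get text

raw = names.tobytes()                 # b"spamab\x00\x00": padding is visible
short = names[1]                      # trailing NULs are stripped on access

# The unicode dtype avoids the decode step but costs 4 bytes per character.
bytes_itemsize = np.dtype("S4").itemsize
unicode_itemsize = np.dtype("U4").itemsize
```

The decode step on every access and the fourfold size of 'U' are the two pain points the proposal sets out to address.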

Re: [Numpy-discussion] Relaxed stride checking fixup

2017-04-20 Thread Ralf Gommers
On Thu, Apr 20, 2017 at 6:28 AM, Charles R Harris wrote: > Hi All, > > Currently numpy master has a bogus stride that will cause an error when > downstream projects misuse it. That is done in order to help smoke out > errors. Previously that bogus stride has been fixed up for releases, but > that