Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 18:18 GMT+02:00 Chris Barker : > On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > >> I remember advocating for UCS-4 adoption in the HDF5 library many years >> ago (2007?), but I had no success and UTF-8 was decided to be the best >> candidate. So, the boat with HDF5 using UTF

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Chris Barker
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote: > I remember advocating for UCS-4 adoption in the HDF5 library many years > ago (2007?), but I had no success and UTF-8 was decided to be the best > candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I > don't think the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 13:27 GMT+02:00 Neal Becker : > So while compression+ucs-4 might be OK for out-of-core representation, > what about in-core? blosc+ucs-4? I don't think that works for mmap, does > it? > Correct, the real problem is mmap for an out-of-core, HDF5 representation, I presume. For in-mem

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Neal Becker
So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it? On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote: > 2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > >> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 3:34 GMT+02:00 Stephan Hoyer : > On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > >> It's worthwhile enough that both major HDF5 bindings don't support >> Unicode arrays, despite user requests for years. The sticking point seems >> to be the difference between HDF5's view of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > It's worthwhile enough that both major HDF5 bindings don't support Unicode > arrays, despite user requests for years. The sticking point seems to be the > difference between HDF5's view of a Unicode string array (defined in size > by the b

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern wrote: > The proposal is for only latin-1 and UTF-32 to be supported at first, and > the eventual support of UTF-8 will be constrained by specification of the > width in terms of characters rather than bytes, which conflicts with the > use cases of UTF

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker wrote: > But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: > > * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. > > For THAT -- utf-8

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer wrote: > > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and > myself have already given), but we seem to be talking past each other here. > yeah -- I think it's not clear what the use cases we are talking about are. > I am s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: [...] > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory ma

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker wrote: > When a numpy user wants to put a string into a numpy array, they should > know how long a string they can fit -- with "length" defined how python > strings define it. > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern wrote: > >>> > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. > isn't UTF-32 pret

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right.
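The zero-byte concern quoted above can be reproduced with the standard library alone. This is a minimal sketch, not NumPy's actual behaviour: the 8-byte field width and the variable names are assumptions chosen for illustration.

```python
FIELD_BYTES = 8  # hypothetical fixed field width, in bytes

# "A" in UTF-16 (little endian) is two bytes: 0x41 0x00.
encoded = "A".encode("utf-16-le")
padded = encoded.ljust(FIELD_BYTES, b"\x00")

# Stripping individual zero bytes also removes the high byte of "A".
stripped = padded.rstrip(b"\x00")
try:
    stripped.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Stripping whole 2-byte code units keeps the data decodable.
end = len(padded)
while end >= 2 and padded[end - 2:end] == b"\x00\x00":
    end -= 2
safe = padded[:end].decode("utf-16-le")
```

Byte-wise stripping leaves a single byte that is no longer valid UTF-16, while stripping in units of the encoding's code-unit size recovers the original text.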

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith wrote: > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that e

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. (

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > mailto:jtaylor.deb...@googlemail.com>> > > wrote: > > > >> Indeed, > >> Most of this discussion is irrelevant to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Sebastian Berg
On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote: > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > > mailto:jtaylor.deb...@googlemail.co > > m>> > > wrote: > > > > > Indeed, > > > Most of this discussion is irrelevant to numpy. > > > Numpy only r

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread josef . pktd
On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith wrote: > On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" > wrote: > > > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" < chris.bar...@noaa.gov> wrote: UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much lo

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald wrote: > > On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: >> >> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: >>> >>> > The maximum length of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 19:08, Robert Kern wrote: > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor > mailto:jtaylor.deb...@googlemail.com>> > wrote: > >> Indeed, >> Most of this discussion is irrelevant to numpy. >> Numpy only really deals with the in memory storage of strings. And in >> that it is limited

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you g

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker - NOAA Federal
> > I DO recommend Latin-1 As a default encoding ONLY for "mostly ascii, with > > a few extra characters" data. With all the sloppiness over the years, there > > are way too many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" problem. Exactly -- but fro

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Eric Wieser
> I think we can implement viewers for strings as ndarray subclasses. Then one > could > do `my_string_array.view(latin_1)`, and so on. Essentially that just > changes the default > encoding of the 'S' array. That could also work for uint8 arrays if needed. > > Chuck To handle structured data-typ

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Charles R Harris
On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > On 26.04.2017 03:55, josef.p...@gmail.com wrote: > > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > > wrote: > >> > >> > >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern > wrote: > >>> > >>> On Tue, Apr

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Anne Archibald
On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: > On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > >> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < >> charlesr.har...@gmail.com> wrote: >> >> > The maximum length of an UTF-8 character is 4 bytes, so we could use >> that to size arr

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 03:55, josef.p...@gmail.com wrote: > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > wrote: >> >> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >>> wrote: >>> > Presumably you're getting byte stri

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Stephan Hoyer
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > > > The maximum length of an UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote: > The maximum length of an UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in
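The sizing argument quoted above (at most 4 bytes per UTF-8 character) can be checked with a few sample characters. This is an illustrative sketch only; `worst_case_utf8_bytes` is a hypothetical helper, not part of NumPy or of any proposal in the thread.

```python
# One sample character from each UTF-8 length class.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# UTF-8 uses 1 to 4 bytes per code point; UTF-32 always uses 4.
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
utf32_widths = [len(ch.encode("utf-32-le")) for ch in samples]


def worst_case_utf8_bytes(n_chars):
    """Bytes needed to guarantee room for n_chars code points in UTF-8."""
    return 4 * n_chars
```

Sizing a field by character count therefore reserves as many bytes as UTF-32 would; any saving comes only from the padding being zeros, which is the compressibility point made in the quoted message.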

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Aldcroft, Thomas
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: > > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > > > Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file -- who the heck > kno

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread josef . pktd
On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >> wrote: >> >> >> Presumably you're getting byte strings (with unknown encoding). >> > >> > No -- this is for cre

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < > chris.bar...@noaa.gov> wrote: > > >> Presumably you're getting byte strings (with unknown encoding). > > > > No -- this is for creating and using mostly ascii string data with >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal wrote: >> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > >> Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file -- who the heck > knows? That may be simply unso

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: >> Presumably you're getting byte strings (with unknown encoding). > > No -- this is for creating and using mostly ascii string data with python and numpy. > > Unknown encoding bytes belong in byte arrays

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > Eh... First, on Windows and MacOS, filenames are natively Unicode. Yeah, though once they are stored in a text file -- who the heck knows? That may be simply unsolvable. > s. And then from in Python, if you want to actually work with those

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Actually, *I* am not trying to do anything here -- I'm the one that said computers are so big and fast now that we shouldn't whine about 4 bytes for a character, but this whole conversation s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 10:13 AM, "Anne Archibald" wrote: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern > wrote: > >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < >> charlesr.har...@gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < >> peridot.face...@gma

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker" wrote: - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need t

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > > > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < > peridot.face...@gmail.com> wrote: > > >> Clearly there is a need for fixed-storage-size zero

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 11:53 AM, "Robert Kern" wrote: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. B

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Eric Wieser
Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as: In [1]: a = array
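As a side illustration of the Python 3 behaviour referred to above (this is not the truncated NumPy session from the message), the difference between calling `str()` on bytes and decoding explicitly can be sketched as follows; the variable names are illustrative.

```python
# str() on a bytes object returns its repr, not the decoded text.
as_repr = str(b"123")

# Converting between bytes and text requires naming an encoding.
as_text = b"123".decode("ascii")
back_to_bytes = as_text.encode("ascii")
```

The explicit `decode`/`encode` pair is the pattern the message argues array conversions between byte and unicode dtypes should follow as well.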

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald wrote: > > On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > >> * HDF5 supports fixed-length and variable-length string arrays encoded in >> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite >> the documentation claiming

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge wrote: > On 04/25/2017 01:34 PM, Anne Archibald wrote: > > I know they're not numpy-compatible, but FITS header values are > > space-padded; does that occur elsewhere? > > Strings in FITS headers are delimited by single quotes. Some keywords > (only a h

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker wrote: > > This is essentially my rant about use-case (2): > > A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in par

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Phil Hodge
On 04/25/2017 01:34 PM, Anne Archibald wrote: I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Strings in FITS headers are delimited by single quotes. Some keywords (only a handful) are required to have values that are blank-padded (in

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > * HDF5 supports fixed-length and variable-length string arrays encoded in > ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite > the documentation claiming that there are more options). In practice, the > ASCII strings pe

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker wrote: > > On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: >> >> 2017-04-25 12:34 GMT-04:00 Chris Barker : >> > I am totally euro-centric, > >> But Shift-JIS is not one-byte; it's two-byte (unless you allow only >> half-width characters and nothin

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I w

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
Now my proposal for the other use cases: 2) There be some way to store mostly ascii-compatible strings in a single > byte-per-character array -- so not to be wasting space for "typical > european-language-oriented data". Note: this should ALSO be compatible with > Python's character-oriented strin

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. > > So I'll

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: > 2017-04-25 12:34 GMT-04:00 Chris Barker : > > I am totally euro-centric, > > But Shift-JIS is not one-byte; it's two-byte (unless you allow only > half-width characters and nothing else). :-) bad example then -- are there other non-euro-cen
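The point about Shift-JIS widths can be checked with Python's built-in `shift_jis` codec. A small sketch, with sample characters chosen here purely for illustration:

```python
# Hiragana "a" (U+3042) takes two bytes in Shift-JIS.
full_width = "\u3042".encode("shift_jis")

# Half-width katakana "a" (U+FF71) takes one byte.
half_width = "\uff71".encode("shift_jis")

# ASCII letters also take one byte.
ascii_char = "A".encode("shift_jis")

widths = [len(full_width), len(half_width), len(ascii_char)]
```

So Shift-JIS is a variable-width encoding and does not fit a one-byte-per-character dtype, except for text restricted to ASCII and half-width katakana.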

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Ambrose LI
2017-04-25 12:34 GMT-04:00 Chris Barker : > I am totally euro-centric, but as I understand it, that is the whole point > of the desire for a compact one-byte-per character encoding. If there is a > strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we > should support that. But t

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
OK -- onto proposals: 1) The default behaviour for numpy arrays of strings is compatible with > Python3's string model: i.e. fully unicode supporting, and with a character > oriented interface. i.e. if you do:: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern wrote: > > My question: What are those non-ASCII characters? How often are they > truly latin-1/9 vs. some other text encoding vs. non-string binary data? > > I don't know that we can reasonably make that accounting relevant. Number > of such character

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
This is essentially my rant about use-case (2): A compact dtype for mostly-ascii text: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker > wrote: > >> On the other hand, if this is the use-case, perhaps we really want an >>> encoding closer

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Chris, you've mashed all of my emails together, some of them are in reply > to you, some in reply to others. Unfortunately, this dropped a lot of the > context from each of them, and appears to be creating some > misunderstandings about what e

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > > On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > > > >> That said, AFAICT what people actually want in most use cases is support > >> for arrays that can hold variable-len

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith wrote: > But also, is it important whether strings we're loading/saving to an > HDF5 file have the same in-memory representation in numpy as they > would in the file? I *know* [1] no-one is reading HDF5 files using > np.memmap :-). Of course they

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > >> That said, AFAICT what people actually want in most use cases is support >> for arrays that can hold variable-length strings, and the only place where >> the current approach is *opt

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith wrote: > That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats th

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Nathaniel Smith
On Apr 21, 2017 2:34 PM, "Stephan Hoyer" wrote: I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data. You may already know this, but

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 5:56 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 7:11 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < > aldcr...@head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern > wrote: > >> > >> I am not unfamiliar with this problem. I still work with files that > have fie

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: >>> >>> On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would sugg

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Let me make a counter-proposal for your latin-1 dtype (your #2) that might > address your, Thomas's, and Julian's use cases: > > 2) We want a single-byte-per-character, NULL-terminated string dtype that > can be used to represent mostly-ASCII

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: >> >> I am not unfamiliar with this problem. I still work with files that have fields that are supposed to be in EBCDIC but actually contain text in ASC

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker wrote: > On the other hand, if this is the use-case, perhaps we really want an >> encoding closer to "Python 2" string, i.e, "unknown", to let this be >> signaled more explicitly. I would suggest that "text[unknown]" should >> support operations like

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
Chris, you've mashed all of my emails together, some of them are in reply to you, some in reply to others. Unfortunately, this dropped a lot of the context from each of them, and appears to be creating some misunderstandings about what each person is advocating. On Mon, Apr 24, 2017 at 2:00 PM, Ch

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern wrote: > I am not unfamiliar with this problem. I still work with files that have > fields that are supposed to be in EBCDIC but actually contain text in > ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit > encodings. In that expe

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern wrote: > > I agree -- it is a VERY common case for scientific data sets. But a > one-byte-per-char encoding would handle it nicely, or UCS-4 if you want > Unicode. The wasted space is not that big a deal with short strings... > > Unless if you have hu

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:56 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: >> >> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >> > >> > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barke

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > >>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decis

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 2:47 PM, Robert Kern wrote: > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < > aldcr...@head.cfa.harvard.edu> wrote: > > > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker > wrote: > > >> - round-tripping of binary data (at least with Python's > encoding/decoding) --

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > > On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: >> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Robert Kern
On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker wrote: > > On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: >>> >>> BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings,

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas < aldcr...@head.cfa.harvard.edu> wrote: > BTW -- maybe we should keep the pathological use-case in mind: really >> short strings. I think we are all thinking in terms of longer strings, >> maybe a name field, where you might assign 32 bytes or so

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Mon, Apr 24, 2017 at 10:51 AM, Stephan Hoyer wrote: > - round-tripping of binary data (at least with Python's encoding/decoding) >> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the >> same bytes back. You may get garbage, but you won't get an EncodingError. >> > > For
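
The round-tripping claim quoted above can be checked directly in Python. This is a minimal sketch, not code from the thread: the helper name `decodes_cleanly` is made up for illustration, and the exception Python actually raises on a failed decode is `UnicodeDecodeError`.

```python
# latin-1 assigns a code point to every byte value 0..255, so decoding
# arbitrary bytes never fails and re-encoding restores them exactly.
raw = bytes(range(256))
text = raw.decode("latin-1")
roundtripped = text.encode("latin-1")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

With the same input, `decodes_cleanly(raw, "ascii")` and `decodes_cleanly(raw, "utf-8")` are both false, because bytes such as 0x80 are not valid on their own in either encoding.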

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Aldcroft, Thomas
On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker wrote: > On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > > >> In this case, we want something compatible with Python's string (i.e. >>> full Unicode supporting) and I think should be as transparent as possible. >>> Python's string has made th

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Stephan Hoyer
On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker wrote: > latin-1 or latin-9 buys you (over ASCII): > > ... > > - round-tripping of binary data (at least with Python's encoding/decoding) > -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the > same bytes back. You may get garb

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-24 Thread Chris Barker
On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer wrote: > In this case, we want something compatible with Python's string (i.e. full >> Unicode supporting) and I think should be as transparent as possible. >> Python's string has made the decision to present a character oriented API >> to users (de

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Stephan Hoyer
On Fri, Apr 21, 2017 at 11:34 AM, Chris Barker wrote: > 1) Use with/from Python -- both creating and working with numpy arrays. > > In this case, we want something compatible with Python's string (i.e. full > Unicode supporting) and I think should be as transparent as possible. > Python's strin

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-21 Thread Chris Barker
I just re-read the "Utf-8" manifesto, and it helped me clarify my thoughts: 1) most of it is focused on utf-8 vs utf-16. And that is a strong argument -- utf-16 is the worst of both worlds. 2) it isn't really addressing how to deal with fixed-size string storage as needed by numpy. It does bring

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 1:16 PM, Phil Hodge wrote: > > On 04/20/2017 03:17 PM, Anne Archibald wrote: >> >> Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. > > FITS BINTABLE extensions can h

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Phil Hodge
On 04/20/2017 03:17 PM, Anne Archibald wrote: Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped. FITS BINTABLE extensions can have columns containing strings, and in that case the value

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:51 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: >> >> On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: >> > >> > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> >> >> I don't know of a format off-hand that works wi

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Marten van Kerkwijk
> I suggest a new data type 'text[encoding]', 'T'. I like the suggestion very much (it is even in between S and U!). The utf-8 manifesto linked to above convinced me that the number that should follow is the number of bytes, which is nicely consistent with use in all numerical dtypes. Any way, m
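
If the number after 'T' counts bytes rather than characters, a UTF-8 value that is too long has to be cut at a character boundary. The sketch below illustrates that consequence only; `fit_utf8` is a hypothetical helper, not part of NumPy or of the proposal.

```python
def fit_utf8(s: str, nbytes: int) -> str:
    """Truncate `s` so that its UTF-8 encoding fits in `nbytes` bytes.

    Slicing the encoded bytes may cut a multi-byte character in half;
    decoding with errors="ignore" drops that incomplete trailing sequence.
    """
    return s.encode("utf-8")[:nbytes].decode("utf-8", errors="ignore")


# "é" takes two bytes in UTF-8, so "héllo" is 5 characters but 6 bytes.
encoded_len = len("héllo".encode("utf-8"))
```

A 2-byte field therefore holds only "h" from "héllo", while a 3-byte field holds "hé".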

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:27 PM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > > On 20.04.2017 20:53, Robert Kern wrote: > > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > > wrote: > > > >> Do you have comments on how to go forward, in particu

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Stephan Hoyer
On Thu, Apr 20, 2017 at 12:17 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern > wrote: > >> > >> I don't know of a format off-hand that works with numpy uniform-length > strings and Unicode as well. HDF5 (to m

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:17 PM, Anne Archibald wrote: > > On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: >> For example, to my understanding, FITS files more or less follow numpy assumptions for its string columns (i.e. uniform-length). But it enforces 7-bit-clean ASCII and pads with termi

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:59, Anne Archibald wrote: > On Thu, Apr 20, 2017 at 8:17 PM Julian Taylor > wrote: > > I probably have formulated my goal with the proposal a bit better, I am > not very interested in a repetition of which encoding to use debate. >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Feng Yu
I suggest a new data type 'text[encoding]', 'T'. 1. text can be cast to Python strings via decoding. 2. Conceptually, casting to Python bytes first casts to a string and then calls encode(); the encoding in the metadata is used by default, but it can be overridden. I slightly f
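
The two casting rules in this proposal can be sketched in plain Python. The `Text` class below is a hypothetical stand-in for a single element of such an array; it is an assumption made for illustration and does not correspond to any existing NumPy API.

```python
from typing import Optional


class Text:
    """Hypothetical element of a 'text[encoding]' array: raw bytes plus metadata."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data          # stored bytes
        self.encoding = encoding  # encoding recorded in the dtype metadata

    def to_str(self) -> str:
        # Rule 1: casting to a Python string decodes with the stored encoding.
        return self.data.decode(self.encoding)

    def to_bytes(self, encoding: Optional[str] = None) -> bytes:
        # Rule 2: casting to bytes goes through str, then encodes again;
        # the stored encoding is the default, and the caller may override it.
        return self.to_str().encode(encoding or self.encoding)


item = Text("café".encode("latin-1"), "latin-1")
```

Here `item.to_bytes()` gives back the stored latin-1 bytes, while `item.to_bytes("utf-8")` re-encodes the same text as UTF-8.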

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Charles R Harris
On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.deb...@googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explici

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Julian Taylor
On 20.04.2017 20:53, Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor > wrote: > >> Do you have comments on how to go forward, in particular in regards to >> new dtype vs modify np.unicode? > > Can we restate the use cases explicitly? I

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Robert Kern
On Thu, Apr 20, 2017 at 12:05 PM, Stephan Hoyer wrote: > > On Thu, Apr 20, 2017 at 11:53 AM, Robert Kern wrote: >> >> I don't know of a format off-hand that works with numpy uniform-length strings and Unicode as well. HDF5 (to my recollection) supports arrays of NULL-terminated, uniform-length AS

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-20 Thread Anne Archibald
On Thu, Apr 20, 2017 at 8:55 PM Robert Kern wrote: > On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor < > jtaylor.deb...@googlemail.com> wrote: > > > Do you have comments on how to go forward, in particular in regards to > > new dtype vs modify np.unicode? > > Can we restate the use cases explicitl
