Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Stephan Hoyer
On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > > > The maximum length of a UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote: > The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in
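The sizing argument quoted above can be checked with plain Python. This is only a sketch of the arithmetic, not anything proposed in the thread: the helper names (`utf8_width`, `utf32_width`, `worst_case_utf8_bytes`) and the sample characters are made up for illustration.

```python
# One sample character for each possible UTF-8 width (1 to 4 bytes).
# Escapes are used so the source file encoding does not matter.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_width(ch: str) -> int:
    """Number of bytes a single code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf32_width(ch: str) -> int:
    """Number of bytes a single code point occupies in UTF-32 (no BOM)."""
    return len(ch.encode("utf-32-le"))


def worst_case_utf8_bytes(n_chars: int) -> int:
    """Bytes to reserve so that ANY n_chars code points fit in UTF-8."""
    return 4 * n_chars


for ch in samples:
    # Print the code point rather than the character itself to stay
    # independent of the terminal encoding.
    print(f"U+{ord(ch):04X}: utf-8={utf8_width(ch)} utf-32={utf32_width(ch)}")
```

Sizing by character count with a 4-byte worst case reserves exactly as much space as UTF-32 would; the difference is that mostly-ASCII data leaves most of that space as padding, which is presumably what makes it compress well.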

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Aldcroft, Thomas
On Tue, Apr 25, 2017 at 7:11 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: > > On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > > > Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file -- who the heck > kno

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread josef . pktd
On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >> wrote: >> >> >> Presumably you're getting byte strings (with unknown encoding). >> > >> > No -- this is for cre

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < > chris.bar...@noaa.gov> wrote: > > >> Presumably you're getting byte strings (with unknown encoding). > > > > No -- this is for creating and using mostly ascii string data with >

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Tue, Apr 25, 2017 at 4:11 PM, Chris Barker - NOAA Federal wrote: >> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > >> Eh... First, on Windows and MacOS, filenames are natively Unicode. > > Yeah, though once they are stored in a text file -- who the heck > knows? That may be simply unso

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal < chris.bar...@noaa.gov> wrote: >> Presumably you're getting byte strings (with unknown encoding). > > No -- this is for creating and using mostly ascii string data with python and numpy. > > Unknown encoding bytes belong in byte arrays

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
> On Apr 25, 2017, at 12:38 PM, Nathaniel Smith wrote: > Eh... First, on Windows and MacOS, filenames are natively Unicode. Yeah, though once they are stored in a text file -- who the heck knows? That may be simply unsolvable. > s. And then from in Python, if you want to actually work with those

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
Actually, for what it's worth, the FITS spec says that in such values trailing spaces are not significant, see page 7: https://fits.gsfc.nasa.gov/standard40/fits_standard40draft1.pdf But they're not really relevant to numpy's situation, because as here you need to do elaborate de-quoting before the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker - NOAA Federal
A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Actually, *I* am not trying to do anything here -- I'm the one that said computers are so big and fast now that we shouldn't whine about 4 bytes for a character, but this whole conversation s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 10:13 AM, "Anne Archibald" wrote: On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 1:30 PM, Charles R Harris wrote: > > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern > wrote: > >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < >> charlesr.har...@gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < >> peridot.face...@gma

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 9:35 AM, "Chris Barker" wrote: - filenames File names are one of the key reasons folks struggled with the python3 data model (particularly on *nix) and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need t
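For readers unfamiliar with the 'surrogateescape' handler mentioned above, here is a small standard-library sketch of what it does to a filename that is not valid UTF-8. The byte string `raw_name` is an invented example, and nothing here is specific to numpy.

```python
# A filename as raw bytes, containing one byte (0xFF) that is not valid UTF-8.
raw_name = b"data_\xff.txt"

# A strict decode would raise UnicodeDecodeError; surrogateescape instead maps
# each undecodable byte 0xNN to the lone surrogate code point U+DCNN.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the original
# byte, so the filename round-trips without loss.
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))                # shows the escaped surrogate
print(round_tripped == raw_name)  # whether the bytes survived the round trip
```

The resulting `str` is not encodable with the default strict handler, which is part of why storing such names in a text-oriented array dtype is awkward.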

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 12:30 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: >> >> On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: >> > >> > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald <

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 12:52 PM, Robert Kern wrote: > On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < > charlesr.har...@gmail.com> wrote: > > > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < > peridot.face...@gmail.com> wrote: > > >> Clearly there is a need for fixed-storage-size zero

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Nathaniel Smith
On Apr 25, 2017 11:53 AM, "Robert Kern" wrote: On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote: > > On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote: >> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. B
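As a rough illustration of what "fixed-storage-size zero-padded UTF-8" means at the byte level, the following standard-library sketch packs and unpacks a single field. The function names `pack_fixed_utf8` and `unpack_fixed_utf8` are hypothetical; this is not an existing numpy API nor the design discussed in the thread.

```python
def pack_fixed_utf8(text: str, width: int) -> bytes:
    """Encode text as UTF-8 and right-pad with NUL bytes to exactly `width` bytes."""
    raw = text.encode("utf-8")
    if len(raw) > width:
        # The limit is in bytes, not characters: a short non-ASCII string
        # can overflow a field that an ASCII string of equal length fits in.
        raise ValueError(f"{len(raw)} bytes do not fit in a {width}-byte field")
    return raw.ljust(width, b"\x00")


def unpack_fixed_utf8(field: bytes) -> str:
    """Strip the NUL padding and decode the remaining bytes as UTF-8."""
    return field.rstrip(b"\x00").decode("utf-8")


field = pack_fixed_utf8("n\u00e9", 4)   # 3 bytes of UTF-8 plus 1 byte of padding
print(field, unpack_fixed_utf8(field))
```

One consequence of this layout is that a string ending in an actual NUL character cannot be distinguished from padding, and the field width bounds bytes rather than characters.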

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Eric Wieser
Chuck: That sounds like something we want to deprecate, for the same reason that python3 no longer allows str(b'123') to do the right thing. Specifically, it seems like astype should always be forbidden to go between unicode and byte arrays - so that would need to be written as: In [1]: a = array
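The snippet above is cut off before its example, so the following is not Eric's code. It is a minimal sketch, assuming NumPy is installed, of converting between unicode and byte arrays with an explicit codec through the existing `np.char.encode` and `np.char.decode` functions instead of `astype`.

```python
import numpy as np

# A unicode ('U') array; the second element contains one non-ASCII character.
a = np.array(["abc", "d\u00e9f"])

# Explicit encode: the codec is named, so there is no hidden assumption
# about how characters map to bytes.
b = np.char.encode(a, "utf-8")

# Explicit decode back to a unicode array with the same codec.
c = np.char.decode(b, "utf-8")

print(a.dtype, b.dtype, c.dtype)
```

Note that the byte array's item size is driven by the longest encoded element, not by the character count, which is the same bytes-versus-characters distinction that runs through the rest of the thread.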

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Charles R Harris
On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald wrote: > > On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > >> * HDF5 supports fixed-length and variable-length string arrays encoded in >> ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite >> the documentation claiming

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:52 PM Phil Hodge wrote: > On 04/25/2017 01:34 PM, Anne Archibald wrote: > > I know they're not numpy-compatible, but FITS header values are > > space-padded; does that occur elsewhere? > > Strings in FITS headers are delimited by single quotes. Some keywords > (only a h

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:36 PM Chris Barker wrote: > > This is essentially my rant about use-case (2): > > A compact dtype for mostly-ascii text: > I'm a little confused about exactly what you're trying to do. Do you need your in-memory format for this data to be compatible with anything in par

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Phil Hodge
On 04/25/2017 01:34 PM, Anne Archibald wrote: I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Strings in FITS headers are delimited by single quotes. Some keywords (only a handful) are required to have values that are blank-padded (in

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 7:09 PM Robert Kern wrote: > * HDF5 supports fixed-length and variable-length string arrays encoded in > ASCII and UTF-8. In all cases, these strings are NULL-terminated (despite > the documentation claiming that there are more options). In practice, the > ASCII strings pe

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 10:04 AM, Chris Barker wrote: > > On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: >> >> 2017-04-25 12:34 GMT-04:00 Chris Barker : >> > I am totally euro-centric, > >> But Shift-JIS is not one-byte; it's two-byte (unless you allow only >> half-width characters and nothin

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Anne Archibald
On Tue, Apr 25, 2017 at 6:05 PM Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with > the use-cases, so I'm not sure if there is any consensus on the use cases > -- which I think we really do need to nail down first -- as Robert has made > clear. > I w

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
Now my proposal for the other use cases: 2) There be some way to store mostly ascii-compatible strings in a single > byte-per-character array -- so not to be wasting space for "typical > european-language-oriented data". Note: this should ALSO be compatible with > Python's character-oriented strin

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Robert Kern
On Tue, Apr 25, 2017 at 9:01 AM, Chris Barker wrote: > Anyway, I think I made the mistake of mingling possible solutions in with the use-cases, so I'm not sure if there is any consensus on the use cases -- which I think we really do need to nail down first -- as Robert has made clear. > > So I'll

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Tue, Apr 25, 2017 at 9:57 AM, Ambrose LI wrote: > 2017-04-25 12:34 GMT-04:00 Chris Barker : > > I am totally euro-centric, > > But Shift-JIS is not one-byte; it's two-byte (unless you allow only > half-width characters and nothing else). :-) bad example then -- are there other non-euro-cen

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Ambrose LI
2017-04-25 12:34 GMT-04:00 Chris Barker : > I am totally euro-centric, but as I understand it, that is the whole point > of the desire for a compact one-byte-per character encoding. If there is a > strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we > should support that. But t

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
OK -- onto proposals: 1) The default behaviour for numpy arrays of strings is compatible with > Python3's string model: i.e. fully unicode supporting, and with a character > oriented interface. i.e. if you do:: > > arr = np.array(("this", "that",)) > > you get an array that can store ANY unicode

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern wrote: > > My question: What are those non-ASCII characters? How often are they > truly latin-1/9 vs. some other text encoding vs. non-string binary data? > > I don't know that we can reasonably make that accounting relevant. Number > of such character

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
This is essentially my rant about use-case (2): A compact dtype for mostly-ascii text: On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer wrote: > On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker > wrote: > >> On the other hand, if this is the use-case, perhaps we really want an >>> encoding closer

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-25 Thread Chris Barker
On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern wrote: > Chris, you've mashed all of my emails together, some of them are in reply > to you, some in reply to others. Unfortunately, this dropped a lot of the > context from each of them, and appears to be creating some > misunderstandings about what e