Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 03:55, josef.p...@gmail.com wrote: > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > wrote: >> >> >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal >>> wrote: >>> > Presumably you're getting byte stri

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Anne Archibald
On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: > On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: > >> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < >> charlesr.har...@gmail.com> wrote: >> >> > The maximum length of a UTF-8 character is 4 bytes, so we could use >> that to size arr

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Charles R Harris
On Wed, Apr 26, 2017 at 3:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > On 26.04.2017 03:55, josef.p...@gmail.com wrote: > > On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris > > wrote: > >> > >> > >> On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern > wrote: > >>> > >>> On Tue, Apr

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Eric Wieser
> I think we can implement viewers for strings as ndarray subclasses. Then one > could > do `my_string_array.view(latin_1)`, and so on. Essentially that just > changes the default > encoding of the 'S' array. That could also work for uint8 arrays if needed. > > Chuck To handle structured data-typ

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker - NOAA Federal
> > I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with > > a few extra characters" data. With all the sloppiness over the years, there > > are way too many files like that. > > That sloppiness that you mention is precisely the "unknown encoding" problem. Exactly -- but fro

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > Indeed, > Most of this discussion is irrelevant to numpy. > Numpy only really deals with the in memory storage of strings. And in > that it is limited to fixed length strings (in bytes/codepoints). > How you g

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Julian Taylor
On 26.04.2017 19:08, Robert Kern wrote: > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > >> Indeed, >> Most of this discussion is irrelevant to numpy. >> Numpy only really deals with the in memory storage of strings. And in >> that it is limited

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 3:27 AM, Anne Archibald wrote: > > On Wed, Apr 26, 2017 at 7:20 AM Stephan Hoyer wrote: >> >> On Tue, Apr 25, 2017 at 9:21 PM Robert Kern wrote: >>> >>> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris < charlesr.har...@gmail.com> wrote: >>> >>> > The maximum length of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" < chris.bar...@noaa.gov> wrote: UTF-8 does not match the character-oriented Python text model. Plenty of people argue that that isn't the "correct" model for Unicode text -- maybe so, but it is the model python 3 has chosen. I wrote a much lo

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread josef . pktd
On Wed, Apr 26, 2017 at 2:31 PM, Nathaniel Smith wrote: > On Apr 26, 2017 9:30 AM, "Chris Barker - NOAA Federal" > wrote: > > > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Sebastian Berg
On Wed, 2017-04-26 at 19:43 +0200, Julian Taylor wrote: > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > > > > > Indeed, > > > Most of this discussion is irrelevant to numpy. > > > Numpy only r

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: > > On 26.04.2017 19:08, Robert Kern wrote: > > On Wed, Apr 26, 2017 at 2:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote: > > > >> Indeed, > >> Most of this discussion is irrelevant to

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right. (
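
The zero-padding problem quoted above can be reproduced with plain Python byte strings. This is only a sketch, not anything proposed in the thread: the 8-byte field width and the helper name `strip_utf16_padding` are made up for illustration, and no numpy API is involved.

```python
# "A" encoded as UTF-16-LE is b"A\x00": the last byte of real data is zero.
text = "A"
encoded = text.encode("utf-16-le")
padded = encoded.ljust(8, b"\x00")  # zero-pad to a hypothetical 8-byte field

# Stripping every trailing zero byte also removes a byte that belongs to the data.
naive = padded.rstrip(b"\x00")
try:
    naive.decode("utf-16-le")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False  # odd number of bytes: truncated UTF-16 data


def strip_utf16_padding(raw: bytes) -> bytes:
    """Remove padding in whole 2-byte code units (hypothetical helper)."""
    end = len(raw)
    while end >= 2 and raw[end - 2:end] == b"\x00\x00":
        end -= 2
    return raw[:end]


recovered = strip_utf16_padding(padded).decode("utf-16-le")
```

Stripping in units of the encoding's code-unit size keeps the zero byte that is part of the last character. The trade-off is that a string genuinely ending in U+0000 can no longer be told apart from padding.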

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:31 AM, Nathaniel Smith wrote: > UTF-8 does not match the character-oriented Python text model. Plenty > of people argue that that isn't the "correct" model for Unicode text > -- maybe so, but it is the model python 3 has chosen. I wrote a much > longer rant about that e

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote: > I remember talking with a colleague about something like that. And > basically an annoying thing there was that if you strip the zero bytes > in a zero padded string, some encodings (UTF16) may need one of the > zero bytes to work right.

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 10:45 AM, Robert Kern wrote: > >>> > The maximum length of a UTF-8 character is 4 bytes, so we could use > that to size arrays by character length. The advantage over UTF-32 is that > it is easily compressible, probably by a factor of 4 in many cases. > isn't UTF-32 pret
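
The byte counts behind the quoted sizing argument can be checked with the standard library. This is a rough illustration only: the sample characters and the helper `max_utf8_bytes` are assumptions chosen for the example, not part of any numpy proposal.

```python
# One sample character from each UTF-8 length class.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji

utf8_lengths = [len(s.encode("utf-8")) for s in samples]
utf32_lengths = [len(s.encode("utf-32-le")) for s in samples]


def max_utf8_bytes(n_chars: int) -> int:
    """Worst-case buffer size in bytes for n_chars code points (hypothetical helper)."""
    return 4 * n_chars


# A field sized this way always fits, but mostly-ASCII data leaves it largely empty.
word = "numpy"
used = len(word.encode("utf-8"))
reserved = max_utf8_bytes(len(word))
```

Sizing by character count reserves the same number of bytes as UTF-32 (four per code point), so any saving would have to come from compressing the unused tail, not from the in-memory layout itself.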

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 3:27 PM, Chris Barker wrote: > When a numpy user wants to put a string into a numpy array, they should > know how long a string they can fit -- with "length" defined how python > strings define it. > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and
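
The disagreement over what "length" means can be made concrete with a small sketch. It assumes a hypothetical fixed-width field of 5 bytes, and the function `fits` is invented for illustration; it uses only Python's built-in `str.encode`.

```python
s = "na\u00efve"  # "naive" with a diaeresis on the i: 5 code points

n_chars = len(s)                        # length as Python strings define it
n_utf8 = len(s.encode("utf-8"))         # the accented letter takes 2 bytes
n_latin1 = len(s.encode("latin-1"))     # every character takes 1 byte


def fits(text: str, nbytes: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text fits in a field of nbytes bytes (hypothetical helper)."""
    return len(text.encode(encoding)) <= nbytes


fits_utf8 = fits(s, 5)
fits_latin1 = fits(s, 5, "latin-1")
```

With a byte-sized UTF-8 field, whether a 5-character string fits depends on its content. With a one-byte-per-character encoding such as Latin-1, or with a field sized in characters, it does not.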

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Nathaniel Smith
On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor < jtaylor.deb...@googlemail.com> wrote: [...] > I have read every mail and it has been a large waste of time, Everything > has been said already many times in the last few years. > Even if you memory ma

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 4:30 PM, Stephan Hoyer wrote: > > Sorry, I remain unconvinced (for the reasons that Robert, Nathaniel and > myself have already given), but we seem to be talking past each other here. > yeah -- I think it's not clear what the use cases we are talking about are. > I am s

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > > On Apr 26, 2017 12:09 PM, "Robert Kern" wrote: >> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Robert Kern
On Wed, Apr 26, 2017 at 5:02 PM, Chris Barker wrote: > But a bunch of folks have brought up that while we're messing around with string encoding, let's solve another problem: > > * Exchanging unicode text at the binary level with other systems that generally don't use UCS-4. > > For THAT -- utf-8

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Chris Barker
On Wed, Apr 26, 2017 at 5:17 PM, Robert Kern wrote: > The proposal is for only latin-1 and UTF-32 to be supported at first, and > the eventual support of UTF-8 will be constrained by specification of the > width in terms of characters rather than bytes, which conflicts with the > use cases of UTF

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-26 Thread Stephan Hoyer
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote: > It's worthwhile enough that both major HDF5 bindings don't support Unicode > arrays, despite user requests for years. The sticking point seems to be the > difference between HDF5's view of a Unicode string array (defined in size > by the b