On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas <aldcr...@head.cfa.harvard.edu> wrote:
>
> On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
>
>> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas <aldcr...@head.cfa.harvard.edu> wrote:
>>
>>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>>
>>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
>>>> <aldcr...@head.cfa.harvard.edu> wrote:
>>>> > The idea of a one-byte string dtype has been extensively discussed twice
>>>> > before, with a lot of good input and ideas, but no action [1, 2].
>>>> >
>>>> > tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string
>>>> > dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3
>>>> > usage in the near term?
>>>>
>>>> I think this is a good idea. I think overall it would be good for
>>>> numpy to switch to using variable-length strings in most cases (cf.
>>>> pandas), which is a different kind of change, but fixed-length 8-bit
>>>> encoded text is obviously a common on-disk format in scientific
>>>> applications, so numpy will still need some way to deal with it
>>>> conveniently. In the long run we'd like to have more flexibility (e.g.
>>>> allowing choice of character encoding), but since this proposal is a
>>>> subset of that functionality, then it won't interfere with later
>>>> improvements. I can see an argument for utf8 over latin1, but it
>>>> really doesn't matter that much so whatever, blue and purple bikesheds
>>>> are both fine.
>>>>
>>>> The tricky bit here is "just" :-). Do you want to implement this? Do
>>>> you know someone who does? It's possible but will be somewhat
>>>> annoying, since to do it directly without refactoring how dtypes work
>>>> first then you'll have to add lots of copy-paste code to all the
>>>> different ufuncs.
>>>
>>> I would be happy to have a go at this, with the caveat that someone
>>> who understands numpy would need to get me started with a minimal
>>> prototype. From there I can do the "annoying" copy-paste for ufuncs etc,
>>> writing tests and docs. I'm assuming that with a prototype then the rest
>>> can be done without any deep understanding of numpy internals (which I do
>>> not have).
>>>
>>> - Tom
>>
>> The last two new types added to numpy were float16 and datetime64. Might
>> be worth looking at the steps needed to implement those. There was also a
>> user type, `rational` that got added, that could also provide a template.
>> Maybe we need to have a way to add 'numpy certified' user data types. It
>> might also be possible to reuse the `c` data type, currently implemented as
>> `S1` IIRC, but that could cause some problems.
>
> OK I'll have a look at those.

On second thought...

Maybe I'm being naive, but I think that starting from scratch looking at
entirely new dtypes is harder than it needs to be, or at least not the most
straightforward path [EDIT: just saw email from Nathan agreeing here].

What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'. In other words 's'
  will interface with Python as a bytes (aka str) object just like 'S'.

- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in
  all operations, but interfaces with Python as a latin-1 encoded string.

So the only difference is at the interface layer with Python
(initialization, comparison, iteration, etc). As a starting point we
would want to clone 'S' to 's', then fix up the interface to Python 3.

Does that sound about right?

- Tom

> Thanks,
> Tom
>
>> Chuck
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
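
For illustration only, the proposed split between storage and interface for 's' could be sketched in plain Python roughly as below. This is not NumPy code and says nothing about how a real dtype would be implemented; the class name `Latin1FixedArray` and its `raw` method are hypothetical names made up for this sketch. The storage is fixed-width, NUL-padded bytes (the same kind of data 'S' holds), and only item access decodes to a Python 3 str via latin-1.

```python
# Hypothetical sketch, not NumPy code: fixed-width one-byte storage (like 'S')
# with a latin-1 str interface on Python 3 (the behaviour proposed for 's').

class Latin1FixedArray:
    """Store strings as fixed-width, NUL-padded bytes; expose them as str."""

    def __init__(self, values, itemsize):
        self.itemsize = itemsize
        self._buf = bytearray()
        for value in values:
            # Accept str (encoded as latin-1) or raw bytes.
            raw = value.encode("latin-1") if isinstance(value, str) else bytes(value)
            # Truncate or NUL-pad every item to the fixed width.
            self._buf += raw[:itemsize].ljust(itemsize, b"\x00")

    def __len__(self):
        return len(self._buf) // self.itemsize

    def raw(self, index):
        """Internal view: the stored bytes for one item, padding included."""
        start = index * self.itemsize
        return bytes(self._buf[start:start + self.itemsize])

    def __getitem__(self, index):
        # Interface layer: strip the padding and decode. latin-1 decoding
        # cannot fail, since every byte value maps to exactly one code point.
        return self.raw(index).rstrip(b"\x00").decode("latin-1")


names = Latin1FixedArray(["abc", b"de", "caf\xe9"], itemsize=4)
as_str = [names[i] for i in range(len(names))]
```

Here `names.raw(i)` stands in for what the array holds internally, while `names[i]` is what Python 3 code would see. Because latin-1 is a one-byte encoding, the item width in bytes equals the maximum string length in characters.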