On Mon, Jun 5, 2017 at 4:06 PM, Mikhail V <mikhail...@gmail.com> wrote:
> Likely it was about some new string array type...

yes, it was.

> Obviously there is demand. Terror of unicode touches many aspects
> of programmers life.

I don't know that I'd call it Terror, but frankly, the fact that you need up to 4 bytes for a single character is really not the big issue. Given that computer memory has grown by literally orders of magnitude since Unicode was introduced, I don't know why there is such a hang up about it.

But we're scientific programmers, we like to be efficient!

> Foremost, it comes down to the question of defining this "optimal
> 8-bit character table".
> And "Latin-1", (exactly as it is) is not that optimal table,

there is no such thing as a single "optimal" set of characters when you are limited to 256 of them...

latin-1 is pretty darn good for the, well, latin-based languages....

> But, granted, if define most accented letters as
> "optional", i.e. delete them
> then it is quite reasonable basic char table to start with.

Then you are down to ASCII, no?

but anyway, I don't think a new encoding is really the topic at hand here....

> >> I don't know what you're doing, but I don't think numpy is normally the
> >> right tool for text manipulation...
> >
> > I agree here. But if one were to add such a thing (vectorized string
> > operations) -- I'd think the thing to do would be to wrap (or port) the
> > python string methods. But it should only work for actual string dtypes,
> > of course.
> >
> > note that another part of the discussion previously suggested that we
> > have a dtype that wraps a native python string object -- then you'd get
> > all for free. This is essentially an object array with strings in it,
> > which you can do now.
>
> Well here I must admit I don't quite understand the whole idea of
> "numpy array of string type". How used? What is main benefit/feature...?

here you go -- you can do this now:

In [74]: s_arr = np.array([s, "another string"], dtype=object)

In [75]: s_arr
Out[75]: array(['012 АБВ', 'another string'], dtype=object)

In [76]: s_arr.shape
Out[76]: (2,)

You now have an array with python string objects in it -- thus access to all the string functionality:

In [81]: s_arr[1] = s_arr[1].upper()

In [82]: s_arr
Out[82]: array(['012 АБВ', 'ANOTHER STRING'], dtype=object)

and the ability to have each string be a different length.

If numpy were to know that those were string objects, rather than arbitrary python objects, it could do vectorized operations on them, etc.

You can do that now with numpy.vectorize, but it's pretty klunky.

In [87]: np_upper = np.vectorize(str.upper)

In [88]: np_upper(s_arr)
Out[88]: array(['012 АБВ', 'ANOTHER STRING'], dtype='<U14')

> Example integer array usage in context of textual data in my case:
> - holding data in a text editor (mutability+indexing/slicing)

you really want to use regular old python data structures for that...

> - filtering, transformations (e.g. table translations, cryptography, etc.)

that may be something to do with ordinals and numpy -- but then you need to work with ascii or latin-1 and uint8 dtypes, or full Unicode and uint32 dtype -- that's that.

> String type array? Will this be a string array you describe:
>
> s= "012 abc"
> arr = np.array(s)
> print ("type ", arr.dtype)
> print ("shape ", arr.shape)
> print ("my array: ", arr)
> arr = np.roll(arr[0],2)
> print ("my array: ", arr)
> ->
> type <U7
> shape ()
> my array: 012 abc
> my array: 012 abc
>
> So what it does? What's up with shape?

shape is an empty tuple, meaning this is a zero-dimensional array (effectively a numpy scalar), containing a single string

type '<U7' means little endian, unicode, 7 characters

> e.g. here I wanted to 'roll' the string.
> How would I replace chars? or delete?
> What is the general idea behind?

the numpy string type (unicode type) works with fixed length strings -- not characters, but you can reshape it and make a view (arr here is the np.array(s) from your example):

In [89]: s= "012 abc"

In [90]: arr.shape = (1,)

In [91]: arr.shape
Out[91]: (1,)

In [93]: c_arr = arr.view(dtype = '<U1')

In [97]: np.roll(c_arr, 3)
Out[97]: array(['a', 'b', 'c', '0', '1', '2', ' '], dtype='<U1')

You could also create it as a character array in the first place by unpacking it into a list first:

In [98]: c_arr = np.array(list(s))

In [99]: c_arr
Out[99]: array(['0', '1', '2', ' ', 'a', 'b', 'c'], dtype='<U1')

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
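
For reference, the character-view approach above can be sketched as a standalone script that also covers the "replace" and "delete" questions. This is only an illustration under assumptions: the names chars, codes, edited and keep are made up here, and viewing the buffer as uint32 relies on numpy storing its unicode dtype as 4-byte code points in native byte order.

```python
import numpy as np

s = "012 abc"

# A 1-element array of fixed-width unicode strings ('<U7' on little-endian).
arr = np.array([s])

# Reinterpret the same buffer as seven one-character strings (no copy).
chars = arr.view("U1")

# "Roll" the string by 3 characters and join back into a Python str.
rolled = "".join(np.roll(chars, 3))

# Reinterpret the buffer as 4-byte code points to work with ordinals.
codes = arr.view(np.uint32)

# Replace: edit the ordinals in a copy, then view them as characters again.
edited = codes.copy()
edited[edited == ord(" ")] = ord("_")
replaced = "".join(edited.view("U1"))

# Delete: drop characters with a boolean mask (here, the digits).
keep = ~np.isin(chars, list("0123456789"))
deleted = "".join(chars[keep])

# np.char provides vectorized string methods for real string dtypes.
upper = np.char.upper(arr)

print(repr(rolled))    # 'abc012 '
print(repr(replaced))  # '012_abc'
print(repr(deleted))   # ' abc'
print(upper)           # ['012 ABC']
```

The views share memory with arr, so writing into chars or codes would change the original string in place; the sketch copies before editing to keep arr intact. Deleting characters necessarily produces a new, shorter array, since the string dtype is fixed width.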