Sorry to resurrect a long-dead thread, but I've been continuing Chris Hanley's investigation of chararray at Space Telescope Science Institute (and the broader astronomical community) for a while and have some findings to report back.

What I've taken from this thread is that chararray is in need of a maintainer. I am able to devote some time to the cause, but first I would like to clarify what it will take to make its continued inclusion more comfortable.

Let me start with the use case. chararrays are extensively returned from pyfits (a tool to handle the standard astronomy data format). pyfits is the basis of many applications, and it would be impossible to audit all of that code. Most authors of those tools do not track numpy-discussion closely, which is why we don't hear from them on this list, but there is a great deal of pyfits-using code.

Doing some spot-checking on this code, a common thing I see is SQL-like queries on recarrays of objects. For instance, it is very common to have a table of objects, with a "Target" column which is a string, and do something like (where c is a chararray of the 'Target' column):

    subset = array[np.where(c.startswith('NGC'))]

Strictly speaking, this is a use case for "vectorized string operations", not necessarily for the chararray class as it presently stands. One could almost as easily do:

    subset = array[np.where([x.startswith('NGC') for x in c])]

...and the latter is even slightly faster, since chararray currently loops in Python anyway.

Even better, though, I have some experimental code to perform the loop in C, and I get a 5x speedup on a table with ~120,000 rows. If that were to be included in numpy, that's a strong argument against recommending list comprehensions in user code.

The use case suggests the continued existence of vectorized string operations in numpy -- whether that continues to be chararray, or some newer/better interface + chararray for backward compatibility, is an open question. Personally I think a less object-oriented approach and just having a namespace full of vectorized string functions might be cleaner than the current situation of needing to create a view class around an ndarray.
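
For anyone who wants to try the two existing spellings side by side, here is a minimal sketch. The table, its column names, and its values are made up for illustration and do not come from pyfits; the only library calls assumed are np.char.array and the chararray startswith method.

```python
import numpy as np

# Hypothetical catalog: column names and values are invented for this sketch.
table = np.array(
    [("NGC 1300", 10.4), ("M 31", 3.4), ("NGC 224", 3.4), ("IC 342", 9.1)],
    dtype=[("Target", "U10"), ("Mag", "f8")],
)

# Wrap the string column in a chararray to get the '.method' syntax.
c = np.char.array(table["Target"])

# Spelling 1: chararray method, which returns a boolean ndarray.
mask_method = c.startswith("NGC")

# Spelling 2: plain list comprehension over the same column.
mask_listcomp = np.array([x.startswith("NGC") for x in table["Target"]])

# Either mask selects the same rows of the table.
subset = table[mask_method]
print(subset["Target"])
```

Both masks come out identical here. A boolean mask can index the table directly, so the np.where call in the one-liners is optional for this kind of query.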
I'm suggesting something like the following, using the same example, where {STR} is some namespace we would fill with vectorized string operations:

    subset = array[np.where(np.{STR}.startswith(c, 'NGC'))]

Now on to chararray as it now stands. I view chararray as really two separable pieces of functionality:

  1) Convenience to perform vectorized string operations using '.method' syntax, or in some cases infix operators (+, *)

  2) Implicit "rstrip"ping of values

(Note that raw ndarrays truncate values at the first NULL character, like C strings, but chararrays will strip any and all whitespace characters from the end.)

Changing (2) just seems to be asking for subtle bugs. Unfortunately, there's an inconsistency between (1) and (2) in the present implementation. For example:

    In [9]: a = np.char.array(['a  '])

    In [10]: a
    Out[10]: chararray(['a'], dtype='|S3')

    In [11]: a[0] == 'a'
    Out[11]: True

    In [12]: a.endswith('a')
    Out[12]: array([False], dtype=bool)

This is *the* design wart of chararray, IMHO, and one that's difficult to fix without breaking compatibility. It might be a worthwhile experiment to remove (2) and see how much we really break, but it would be impossible to know for sure.

Now to address the concerns enumerated in this thread. Unfortunately, I don't know where this thread began before it landed on the Numpy list, so I may be missing details which would help me address them.

> 0) "it gets very little use" (an assumption you presumably dispute);
>
Certainly not true from where I stand.

> 1) "is pretty much undocumented" (less true than a week ago, but still true
> for several of the attributes, with another handful or so falling into the
> category of "poorly documented");
>
I don't quite understand this one -- 99% of the methods are wrappers around standard Python string methods. I don't think we should redocument those.
I agree it needs a better top-level docstring about its purpose (see functionalities (1) and (2) above) and its status (for backward compatibility).

> 2) "probably more buggy than most other parts of NumPy" ("probably" being a
> euphemism, IMO);
>
Trac has these bugs. Any others?

    http://projects.scipy.org/numpy/ticket/1199
    http://projects.scipy.org/numpy/ticket/1200
    http://projects.scipy.org/numpy/ticket/856
    http://projects.scipy.org/numpy/ticket/855
    http://projects.scipy.org/numpy/ticket/1231

> 3) "there is not a really good use-case for it" (a conjecture, but one that
> has yet to be challenged by counter-example);
>
See above.

> 4) it's not the first time its presence in NumPy has been questioned ("as
> Stefan pointed out when asking this same question last year")
>
Hopefully we're addressing that now.

> 5) NumPy already has a (perhaps superior) alternative ("object arrays would
> do nicely if one needs this functionality");
>
No -- that gives the problem of even slower Python-looping to do vectorized string operations.

> to which I'll add:
>
> 6) it is, on its face, "counter to the spirit" of NumPy.
>
I don't quite know what this means -- but I do find the fact that it's a view class with methods a little bit clumsy. Is that what you meant?

So here's my TODO list related to all this:

  1) Fix bugs in Trac
  2) Improve documentation (though probably not in a method-by-method way)
  3) Improve unit test coverage
  4a) Create C-based vectorized string operations
  4b) Refactor chararray in terms of those
  4c) Design and create an interface to those methods that will be the "right way" going forward

Anything else?

Mike

--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA