Michael:

First, thank you very much for your detailed and thorough analysis and recap of the situation - it sounds to me like chararray is now in good hands! :-)
On Tue, Sep 22, 2009 at 10:58 AM, Michael Droettboom <md...@stsci.edu> wrote:

> Sorry to resurrect a long-dead thread, but I've been continuing Chris
>

IMO, no apology necessary!

> Hanley's investigation of chararray at Space Telescope Science Institute
> (and the broader astronomical community) for a while and have some
> findings to report back.
>
> What I've taken from this thread is that chararray is in need of a
> maintainer. I am able to devote some time to the cause, but first would
>

Yes, thank you!

> like to clarify what it will take to make its continued inclusion more
> comfortable.
>
> Let me start with the use case. chararrays are extensively returned
> from pyfits (a tool to handle the standard astronomy data format).
> pyfits is the basis of many applications, and it would be impossible to
> audit all of that code. Most authors of those tools do not track
> numpy-discussion closely, which is why we don't hear from them on this
> list, but there is a great deal of pyfits-using code.
>
> Doing some spot-checking on this code, a common thing I see is SQL-like
> queries on recarrays of objects. For instance, it is very common to
> have a table of objects, with a "Target" column which is a string, and
> do something like (where c is a chararray of the 'Target' column):
>
>    subset = array[np.where(c.startswith('NGC'))]
>
> Strictly speaking, this is a use case for "vectorized string
> operations", not necessarily for the chararray class as it presently
> stands. One could almost as easily do:
>
>    subset = array[np.where([x.startswith('NGC') for x in c])]
>
> ...and the latter is even slightly faster, since chararray currently
> loops in Python anyway.
>
> Even better, though, I have some experimental code to perform the loop
> in C, and I get 5x speed up on a table with ~120,000 rows. If that were
> to be included in numpy, that's a strong argument against recommending
> list comprehensions in user code.
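For reference, a minimal sketch of the selection pattern quoted above. This is not pyfits code: the `targets` and `flux` arrays are made-up stand-ins for a 'Target' column and the rows of a table, and `np.char.startswith` is the element-wise function found in current NumPy releases, used here in place of the chararray method.

```python
import numpy as np

# Hypothetical stand-ins for a table: a string 'Target' column and one
# numeric column with a value per row.
targets = np.array(["NGC 1275", "M31", "NGC 4151", "IC 342"])
flux = np.array([1.5, 2.0, 0.7, 3.1])

# List-comprehension form from the message: the loop runs in Python.
mask_loop = np.array([t.startswith("NGC") for t in targets])

# Function form: applies str.startswith to every element of the array.
mask_vec = np.char.startswith(targets, "NGC")

# Both masks select the same rows.
subset = flux[np.where(mask_vec)]
print(subset)  # [1.5 0.7]
```

Both spellings give the same boolean mask; which one is faster depends on how the element-wise function is implemented in the NumPy version at hand.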
> The use case suggests the continued
> existence of vectorized string operations in numpy -- whether that
> continues to be chararray, or some newer/better interface + chararray
> for backward compatibility, is an open question. Personally I think a
> less object-oriented approach and just having a namespace full of
> vectorized string functions might be cleaner than the current situation
> of needing to create a view class around an ndarray. I'm suggesting
> something like the following, using the same example, where {STR} is
> some namespace we would fill with vectorized string operations:
>
>    subset = array[np.where(np.{STR}.startswith(c, 'NGC'))]
>
> Now on to chararray as it now stands. I view chararray as really two
> separable pieces of functionality:
>
>    1) Convenience to perform vectorized string operations using
>       '.method' syntax, or in some cases infix operators (+, *)
>    2) Implicit "rstrip"ping of values
>
> (Note that raw ndarrays truncate values at the first NULL character,
> like C strings, but chararrays will strip any and all whitespace
> characters from the end).
>
> Changing (2) just seems to be asking to be the source of subtle bugs.
> Unfortunately, there's an inconsistency between 1) and 2) in the present
> implementation. For example:
>
>    In [9]: a = np.char.array(['a  '])
>
>    In [10]: a
>    Out[10]: chararray(['a'], dtype='|S3')
>
>    In [11]: a[0] == 'a'
>    Out[11]: True
>
>    In [12]: a.endswith('a')
>    Out[12]: array([False], dtype=bool)
>
> This is *the* design wart of chararray, IMHO, and one that's difficult
> to fix without breaking compatibility. It might be a worthwhile
> experiment to remove (2) and see how much we really break, but it would
> be impossible to know for sure.
>
> Now to address the concerns iterated in this thread. Unfortunately, I
> don't know where this thread began before it landed on the Numpy list,
> so I may be missing details which would help me address them.
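The trailing-whitespace point in (2) can also be seen without chararray. A small sketch, assuming a current NumPy with the `np.char` functions; the array `a` is illustrative only. On a plain ndarray the padding is kept, so a caller who wants the chararray-style answer has to strip explicitly first.

```python
import numpy as np

# Plain ndarray of strings; the two trailing spaces are preserved.
a = np.array(["a  "])

# Without stripping, the padded value does not end with 'a'.
padded = np.char.endswith(a, "a")

# Stripping trailing whitespace first gives the answer that matches
# the a[0] == 'a' comparison shown in the quoted session.
stripped = np.char.endswith(np.char.rstrip(a), "a")

print(padded)    # [False]
print(stripped)  # [ True]
```

Making the strip an explicit step is one way to keep comparison and method results consistent; whether that is acceptable for existing chararray users is the compatibility question raised above.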
>
> > 0) "it gets very little use" (an assumption you presumably dispute);
> >
> Certainly not true from where I stand.
>

I'm convinced.

> > 1) "is pretty much undocumented" (less true than a week ago, but still
> > true for several of the attributes, with another handful or so falling into
> > the category of "poorly documented");
> >
> I don't quite understand this one -- 99% of the methods are wrappers
> around standard Python string methods. I don't think we should
> redocument those. I agree it needs a better top level docstring about
>

OK, that's what I needed to hear (and I don't believe anyone stated it
explicitly before; I'm sure I'll be corrected if I'm wrong). In that case,
finishing these off is as simple as stating that in the functions'
docstrings (albeit in a way compliant w/ the numpy docstring standard, of
course; see below).

<snip>

> > 6) it is, on its face, "counter to the spirit" of NumPy.
> >
> I don't quite know what this means -- but I do find the fact that it's a
> view class with methods a little bit clumsy. Is that what you meant?
>

The rest of the arguments effectively become moot, but I will clarify what
I meant by 6). As I understood - and understand - it, the central purpose
of numpy is to provide a fast (i.e., implemented in C) Python API for a
_numerical_ multidimensional array object. It sounds like there is a need
for a fast Python API for vectorized string operations, but IMO numpy is
not the place for it (maybe a sub-package in scipy? It could still use
numpy "under the hood," of course). That said, my primary concern
presently is getting everything that _is_ presently in numpy documented,
and now, so it shall be.

> So here's my TODO list related to all this:
>
> 1) Fix bugs in Trac
> 2) Improve documentation (though probably not in a method-by-method way)
>

So, you're volunteering to do this? Great, thanks!
(Please be sure, of course, to conform to the numpy docstring standard:

http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard

with clarification of referral practice, such as it is, at:

http://docs.scipy.org/numpy/Questions+Answers/#documenting-equivalent-functions-and-methods

)

> 3) Improve unit test coverage
> 4a) Create C-based vectorized string operations
> 4b) Refactor chararray in terms of those
> 4c) Design and create an interface to those methods that will be the
> "right way" going forward
>
> Anything else?
>

Looks great to me! With much thanks again!!!

DG

>
> Mike
>
> --
> Michael Droettboom
> Science Software Branch
> Operations and Engineering Division
> Space Telescope Science Institute
> Operated by AURA for NASA
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>