On Fri, Mar 9, 2012 at 5:48 PM, David Gowers (kampu) <00a...@gmail.com> wrote:
> Hi,
>
> On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bry...@continuum.io> wrote:
>> Hi all,
>>
>> I have started working on a NEP for adding an enumerated type to NumPy.
>> It is on my GitHub:
>>
>> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>>
>> It is still very rough, and incomplete in places. But I would like to
>> get feedback sooner rather than later in order to refine it. In
>> particular there are a few questions inline in the document that I
>> would like input on. Any comments, suggestions, questions, concerns,
>> etc. are very welcome.
>
> "t = np.dtype('enum', map=(n,v))"
>
> ^ Is this supposed to indicate 'this is an enum with values ranging
> between n and v'? It could be a bit clearer.
>
> Is it possible to partially define an enum? That is, give the maximum
> and minimum values, and only some of the enumeration value:name
> mappings? For example, an enum where 0 means 'n/a', +n means 'Type A
> Object #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want
> to map the non-scalar values, while having a way to avoid treating
> valid scalar values (e.g. +64) as out-of-range. Example of what I mean:
>
> "t = np.dtype('enum[N_A:0]', range=(-127, 127))"
> (defined values being printed as strings, undefined being printed as
> numbers.)
>
> David
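Since the NEP's enum dtype does not exist yet, the partial-mapping behavior David is asking about can be sketched in plain Python. Everything here (the `PartialEnum` class and its `label` method) is illustrative, not part of the proposal:

```python
# Sketch of a partially defined enum: only some values in the valid
# range get names; everything else stays a plain number.
# PartialEnum and label() are hypothetical names, not NEP API.

class PartialEnum:
    def __init__(self, names, lo, hi):
        self.names = dict(names)   # value -> name, deliberately partial
        self.lo, self.hi = lo, hi  # full valid range

    def label(self, v):
        if not (self.lo <= v <= self.hi):
            raise ValueError("value %d out of enum range" % v)
        # Defined values print as strings, undefined print as numbers.
        return self.names.get(v, v)

# David's example: 'enum[N_A:0]' with range (-127, 127)
t = PartialEnum({0: "N_A"}, lo=-127, hi=127)
print(t.label(0))    # prints N_A
print(t.label(64))   # prints 64 -- valid but unnamed, stays numeric
```

The point of the sketch is that range validity and name mapping are separate concerns, so a value like +64 can be in range without being enumerated.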
I'll have to think about this (a little brain dump here). I have many use cases in pandas where this would be useful, which are basically direct translations of R's factor data type. Note that R always coerces the levels (the unique values) to string type, AFAICT. However, mapping back to a well-dtyped array is important, too. So the temptation might be to do something like this:

ndarray: dtype storage type (uint32 or something)
mapping: khash with type PyObject* -> uint32

Now, one problem with this is that you want the mapping + dtype to be invertible (otherwise you're left doing some type inference). The way that I implement the mapping is to restrict the labeling to be from 0 to N - 1, which makes things easier. If we decide that having an explicit value mapping is important, that restriction would have to be relaxed.

The nice thing about this is that the same set of core algorithms can be used to fix numpy.unique. For example, you would like to be able to do:

enum_arr = np.enum(arr)

(this seems like a reasonable API to me), which is a direct equivalent of R's factor function. You need to be able to pass an explicit ordering when calling the enum/factor function. If one is not specified, you should have the option either to sort or not -- for example, suppose you convert an array of 1 million integers to an enum, but you don't particularly care about the uniques (which could be very large, up to the size of the array) being ordered (no need to pay N log N for large N).

One nice thing about khash is that it can be serialized fairly easily.

Have you looked much at how I use enum-like ideas in pandas? It would be great if I could offload some of this data-algorithmic work to NumPy. We will want the enum data type to integrate with the text file readers -- if you "factorize as you go", you can drastically reduce the memory usage of structured array (or pandas DataFrame) columns with long-ish strings and relatively few unique values.
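The codes-plus-mapping scheme Wes describes (labels restricted to 0..N-1, invertible back to the original dtype, sorting of the uniques optional) can be sketched with a plain dict standing in for the khash table. `factorize` here is an illustrative name, not an existing NumPy function:

```python
import numpy as np

def factorize(arr, sort=False):
    """Return (codes, uniques): codes are uint32 labels in 0..N-1
    indexing into uniques, so uniques[codes] reconstructs arr with its
    original dtype. A dict stands in for the khash value -> code table."""
    mapping = {}                               # value -> code
    codes = np.empty(len(arr), dtype=np.uint32)
    uniques = []
    for i, v in enumerate(arr):
        code = mapping.get(v)
        if code is None:                       # first time seeing v
            code = mapping[v] = len(uniques)
            uniques.append(v)
        codes[i] = code
    uniques = np.asarray(uniques, dtype=arr.dtype)
    if sort:                                   # optional N log N step;
        order = uniques.argsort()              # skip for large N if the
        remap = order.argsort().astype(np.uint32)  # ordering is unneeded
        codes = remap[codes]
        uniques = uniques[order]
    return codes, uniques

arr = np.array([3, 1, 3, 2, 1])
codes, uniques = factorize(arr, sort=True)
# uniques keeps arr's dtype, so the round trip needs no type inference:
assert (uniques[codes] == arr).all()
```

With `sort=False` the uniques come back in order of first appearance, which is the cheap path Wes describes for the million-integer case; `sort=True` pays the N log N cost to relabel against the sorted uniques.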
- Wes
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion