On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven <bry...@continuum.io> wrote:
> On 6/13/12 1:54 PM, Wes McKinney wrote:
>> OK, I need to spend some time on this as it will directly impact me.
>> Random thoughts here.
>>
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object (note, I'm going to change
>> the name of the Factor class to Categorical before 0.8.0 final, per
>> discussion with Nathaniel):
>
> The for-discussion prototype currently only supports strings. I had
> mentioned integral levels in the NEP but wanted to get more feedback
> first. It looks like you are using intervals as levels in things like
> qcut? That would add some complexity. I can think of a couple of
> possible approaches; I will have to try a few of them out to see what
> would make the most sense.
>
>> The API for constructing an enum/factor/categorical array from fixed
>> levels and an array of labels seems somewhat weak to me.
>> A very common
>> scenario is to need to construct a factor from an array of integers
>> with an associated array of levels:
>>
>> In [13]: labels
>> Out[13]:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>        1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>        0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>        0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>        7, 9, 7, 3, 3, 0, 4, 4])
>>
>> In [14]: levels
>> Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>
>> In [15]: Factor(labels, levels)
>> Out[15]:
>> Factor:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>        1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>        0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>        0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>        7, 9, 7, 3, 3, 0, 4, 4])
>> Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>
> I originally had a very similar interface in the NEP. I was persuaded by
> Mark that this would be redundant:
>
> In [10]: levels = np.factor(['a', 'b', 'c'])  # or levels =
> np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
> In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
> Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
>                dtype='factor({'c': 2, 'a': 0, 'b': 1})')
>
> This should also spell even more closely to your example as:
>
>     labels.astype(levels)
>
> but I have not done much with casting yet, so this currently complains.
> However, would this satisfy your needs (modulo the separate question
> about more general integral or object levels)?
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
>
> I am just using INT32_MIN at the moment.
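[The labels-plus-levels construction and the -1 NA sentinel discussed above can be sketched in plain NumPy. This is a hedged illustration, not the prototype's `np.factor` API nor pandas' actual implementation; `from_labels` and `to_values` are hypothetical helper names.]

```python
import numpy as np

# Sketch: build a factor/categorical from integer labels plus a levels
# array, with -1 as the NA sentinel (the NEP prototype uses INT32_MIN).
def from_labels(labels, levels):
    labels = np.asarray(labels, dtype=np.intp)
    levels = np.asarray(levels)
    if labels.max() >= len(levels):
        raise ValueError("label out of bounds for levels")
    return labels, levels

def to_values(labels, levels, na_value=None):
    # Materialize the dense values; -1 slots become na_value.
    out = levels.astype(object)[labels]
    out[labels == -1] = na_value
    return out

labels, levels = from_labels([0, 2, -1, 1], ['a', 'b', 'c'])
print(to_values(labels, levels))  # ['a' 'c' None 'b']
```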
>> Are the levels ordered (Nathaniel brought this up already, it looks
>> like)? It doesn't look like it. That is also necessary. You also need
>> to be
>
> They currently compare based on their value:
>
> In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
>                         np.factor({'c': 0, 'b': 1, 'a': 2}))
> In [21]: arr
> Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
>                dtype='factor({'c': 0, 'a': 2, 'b': 1})')
> In [22]: arr.sort()
> In [23]: arr
> Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
>                dtype='factor({'c': 0, 'a': 2, 'b': 1})')
>
>> able to sort the levels (which is a relabeling; I have lots of code in
>> use for this). In the context of groupby in pandas, when processing a
>> key (array of values) to a factor to be used for aggregating some
>> data, you have the option of returning an object that has the levels
>> as observed in the data, or sorting. Sorting can obviously be very
>> expensive depending on the number of groups in the data
>> (http://wesmckinney.com/blog/?p=437). Example:
>>
>> from pandas import DataFrame
>> from pandas.util.testing import rands
>> import numpy as np
>>
>> df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10,
>>                 'data' : np.random.randn(1000000)})
>>
>> In [32]: timeit df.groupby('key').sum()
>> 1 loops, best of 3: 374 ms per loop
>>
>> In [33]: timeit df.groupby('key', sort=False).sum()
>> 10 loops, best of 3: 185 ms per loop
>>
>> The "factorization time" for the `key` column dominates the runtime;
>> the factor is computed once, then reused if you keep the GroupBy
>> object around:
>>
>> In [36]: timeit grouped.sum()
>> 100 loops, best of 3: 6.05 ms per loop
>
> Just some numbers for comparison.
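[The "sorting the levels is a relabeling" point above can be sketched as follows; `sort_levels` is a hypothetical helper, and NA-sentinel handling is omitted for brevity. Sorting K levels and remapping N codes costs O(K log K + N), with no sort over the N values themselves.]

```python
import numpy as np

# Sketch: sorting a factor's levels only requires permuting the levels
# and re-mapping the integer codes; the decoded values are unchanged.
def sort_levels(labels, levels):
    order = np.argsort(levels)                  # positions of levels in sorted order
    remap = np.empty(len(levels), dtype=np.intp)
    remap[order] = np.arange(len(levels))       # old code -> new code
    return remap[labels], levels[order]

labels = np.array([2, 0, 1, 2])                 # decodes to b, c, a, b
levels = np.array(['c', 'a', 'b'])
new_labels, new_levels = sort_levels(labels, levels)
print(new_levels)                # ['a' 'b' 'c']
print(new_levels[new_labels])    # ['b' 'c' 'a' 'b'] -- same values as before
```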
> Factorization times:
>
> In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
> In [42]: levels = np.factor(lets)
> In [43]: data = [lets[int(x)] for x in np.random.randn(1000000)]
> In [44]: %timeit np.array(data, levels)
> 10 loops, best of 3: 137 ms per loop
>
> And retrieving group indices/summing:
>
> In [8]: %timeit arr=='a'
> 1000 loops, best of 3: 1.52 ms per loop
> In [10]: vals = np.random.randn(1000000)
> In [20]: inds = [arr==x for x in lets]
> In [23]: %timeit for ind in inds: vals[ind].sum()
> 10 loops, best of 3: 48.3 ms per loop
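[The complexity gap in the mask-per-level approach above can be made concrete with a small sketch (hedged, not from the thread): one boolean mask per level makes K passes over N elements (O(N*K)), while a single weighted np.bincount over the integer codes makes one pass (O(N)).]

```python
import numpy as np

# O(N*K) mask-per-level group sums vs. a single O(N) weighted bincount
# over the integer codes.
rng = np.random.default_rng(0)
K = 10
codes = rng.integers(0, K, size=100_000)
vals = rng.standard_normal(100_000)

sums_masked = np.array([vals[codes == k].sum() for k in range(K)])  # K passes
sums_binned = np.bincount(codes, weights=vals, minlength=K)         # 1 pass

print(np.allclose(sums_masked, sums_binned))  # True
```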
(FYI you're comparing an O(NK) algorithm with an O(N) algorithm for small K.)

> On my laptop your grouped.sum() took 22ms, so this is roughly off by
> about a factor of two. But we should compare it on the same hardware,
> and with the same level data types. There is no doubt room for
> improvement, though.
>
> It would not be too bad to add some groupby functionality on top of
> this. I still need to add a mechanism for accessing and iterating over
> the levels.
>
>> As another example of why ordered factors matter, consider a quantile
>> cut (google for the "cut" function in R) function I wrote recently:
>>
>> In [40]: arr = Series(np.random.randn(1000000))
>>
>> In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])
>>
>> In [43]: arr.groupby(cats).describe().unstack(0)
>> Out[43]:
>>        (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
>> count    250000.000000      250000.000000     250000.000000   250000.000000
>> mean         -1.270623          -0.323092          0.326325        1.271519
>> std           0.491317           0.193254          0.193044        0.490611
>> min          -4.839798          -0.673224          0.001992        0.677177
>> 25%          -1.533021          -0.487450          0.158736        0.888502
>> 50%          -1.150136          -0.317501          0.320352        1.150480
>> 75%          -0.887974          -0.155197          0.490456        1.534709
>> max          -0.673224           0.001990          0.677176        4.913536
>>
>> If you don't have ordered levels, then the quantiles might come out in
>> the wrong order depending on how the strings sort or fall out of the
>> hash table.
>
> We do have ordered levels. :) Now, there's currently no way to get a
> list of the levels, in order, but that should be trivial to add.
>
>> Nathaniel: my experience (see blog posting above for a bit more) is
>> that khash really crushes PyDict, for two reasons: you can use it with
>> primitive types and avoid boxing, and secondly you can preallocate.
>> Its memory footprint with large hashtables is also a fraction of
>> PyDict's. The Python memory allocator is not problematic-- if you
>> create millions of Python objects, expect the RAM usage of the Python
>> process to balloon absurdly.
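[The quantile-cut idea discussed above can be sketched in plain NumPy (a hedged illustration, not pandas' qcut): because the bin codes are integers ordered by construction, per-bin statistics come out in quantile order without relying on how interval labels sort as strings.]

```python
import numpy as np

# Quantile cut sketch: bin by empirical quartiles; the integer codes
# 0..3 are ordered low-to-high by construction.
rng = np.random.default_rng(42)
arr = rng.standard_normal(100_000)

edges = np.quantile(arr, [0.25, 0.5, 0.75])   # interior quartile edges
codes = np.digitize(arr, edges)               # bin codes 0..3, ordered
means = np.bincount(codes, weights=arr) / np.bincount(codes)

print(means)  # strictly increasing, roughly [-1.27, -0.32, 0.32, 1.27]
```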
>>
>> Anyway, this is exciting work, assuming we get the API right and hit
>> all the use cases. On top of all this I am _very_ performance
>> sensitive, so you'll have to be pretty aggressive with benchmarking
>> things. I have concerns about ceding control over critical
>> functionality that I need for pandas (which has become a large and
>> very important library these days for a lot of people), but as long as
>> the pieces in NumPy are suitably mature and robust for me to switch to
>> them eventually, that would be great.
>>
>> I'll do my best to stay involved in the discussion, though I'm
>> juggling a lot of things these days (e.g. I have the PyData book
>> deadline approaching like a freight train).
>>
>> - Wes

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion