Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-17 Thread Nathaniel Smith
On Wed, Jun 13, 2012 at 11:06 PM, Bryan Van de Ven  wrote:
> On 6/13/12 1:12 PM, Nathaniel Smith wrote:
>> Yes, of course we *could* write the code to implement these "open"
>> dtypes, and then write the documentation, examples, tutorials, etc. to
>> help people work around their limitations. Or, we could just implement
>> np.fromfile properly, which would require no workarounds and take less
>> code to boot.
>>
>> [snip]
>> So would a proper implementation of np.fromfile that normalized the
>> level ordering.
>
> My understanding of the impetus for the open type was sensitivity to the
> performance of having to make two passes over large text datasets. We'll
> have to get more feedback from users here and input from Travis, I think.

You definitely don't want to make two passes over large text datasets,
but that's not required. While reading through the data, you keep a
dict mapping levels to integer values, which you assign arbitrarily as
new levels are encountered, and an integer array holding the integer
value for each line of the file. Then at the end of the file, you sort
the levels, figure out what the proper integer value for each level
is, and do a single in-memory pass through your array, swapping each
integer value for the new correct integer value. Since your original
integer values are assigned densely, you can map the old integers to
the new integers using a single array lookup. This is going to be much
faster than any text file reader.
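
A rough sketch of that single-pass scheme in plain Python/NumPy (illustrative only; the function and variable names here are made up, not part of the proposed API):

import numpy as np

def read_factor(tokens):
    # one pass over the data: assign dense codes in order of first appearance
    seen = {}
    codes = np.empty(len(tokens), dtype=np.intp)
    for i, tok in enumerate(tokens):
        codes[i] = seen.setdefault(tok, len(seen))
    # after the pass: sort the levels and build an old-code -> new-code table
    levels = sorted(seen)
    remap = np.empty(len(levels), dtype=np.intp)
    for new_code, lvl in enumerate(levels):
        remap[seen[lvl]] = new_code
    return levels, remap[codes]       # one in-memory remapping pass

levels, codes = read_factor(['b', 'a', 'b', 'c'])
# levels == ['a', 'b', 'c'], codes -> array([1, 0, 1, 2])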

There may be some rare people who have huge data files, fast storage,
a very large number of distinct levels, and don't care about
normalizing level order. But I really think the default should be to
normalize level ordering, and then once you can do that, it's trivial
to add a "don't normalize please" option for anyone who wants it.

>>> I think I like "categorical" over "factor" but I am not sure we should
>>> ditch "enum". There are two different use cases here: I have a pile of
>>> strings (or scalars) that I want to treat as discrete things
>>> (categories), and: I have a pile of numbers that I want to give
>>> convenient or meaningful names to (enums). This latter case was the
>>> motivation for possibly adding "Natural Naming".
>> So mention the word "enum" in the documentation, so people looking for
>> that will find the categorical data support? :-)
>
> I'm not sure I follow.

So the above discussion was just about what to name things, and I was
saying that we don't need to use the word "enum" in the API itself,
whatever the design ends up looking like.

That said, I am not personally sold on the idea of using these things
in enum-like roles. There are already tons of "enum" libraries on PyPI
(I linked some of them in the last thread on this), and I don't see
how this design could handle all the basic use cases for enums. Flag
bits are one of the most common enums, after all, but red|green is
just NaL. So I'm +0 on just sticking to categorical data.
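
A quick illustration of the flag-bits point (plain Python, made-up names): combined flags are values that are not themselves any single named level.

RED, GREEN, BLUE = 1, 2, 4     # classic flag-bit enum values
flags = RED | GREEN            # == 3: a combination, not any single named level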

> Natural Naming seems like a great idea for people
> that want something like an actual enum (i.e., a way to avoid magic
> numbers). We could even imagine some nice with-hacks:
>
>     colors = enum(['red', 'green', 'blue'])
>     with colors:
>         foo.fill(red)
>         bar.fill(blue)

FYI you can't really do this with a context manager. This is the
closest I managed:
  https://gist.github.com/2347382
and you'll note that it still requires reaching up the stack and
directly rewriting the C fields of a PyFrameObject while it is in the
middle of executing... this is surprisingly less horrible than it
sounds, but that still leaves a lot of room for horribleness.
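
A tamer sketch that sidesteps the frame hacking by binding the namespace explicitly with "as" -- you lose the bare names, which was the whole point, but it shows where the limitation sits (illustrative only, not the gist above):

from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def enum_names(*names):
    # a `with` block cannot inject bare names into the caller's scope
    # without rewriting frame internals, so hand back a namespace instead
    yield SimpleNamespace(**{name: i for i, name in enumerate(names)})

with enum_names('red', 'green', 'blue') as colors:
    print(colors.red, colors.blue)    # 0 2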

>>>> I'm disturbed to see you adding special cases to the core ufunc
>>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>>> up the generic ufunc machinery so that it doesn't need special cases
>>>> to handle adding a simple type like this.
>>> This could certainly be improved, I agree.
>> I don't want to be Mr. Grumpypants here, but I do want to make sure
>> we're speaking the same language: what "-1" means is "I consider this
>> a show-stopper and will oppose merging any code that does not improve
>> on this". (Of course you also always have the option of trying to
>> change my mind. Even Mr. Grumpypants can be swayed by logic!)
> Well, a few comments. The special case in array_richcompare is due to
> the lack of string ufuncs. I think it would be great to have string
> ufuncs, but I also think it is a separate concern and outside the scope
> of this proposal. The special case in arraydescr_typename_get is for the
> same reason as the datetime special case: the need to access dtype metadata.
> I don't think you are really concerned about these two, though?
>
> That leaves the special case in
> PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit
> when I put that in. On the other hand, having dtypes with this extent of
> attached metadata, and potentially dynamic metadata, is unique in NumPy.
> It was s

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-17 Thread Nathaniel Smith
On Sun, Jun 17, 2012 at 9:04 PM, Wes McKinney  wrote:
> On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith  wrote:
>> On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney  wrote:
>>> It looks like the levels can only be strings. This is too limited for
>>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>>> the levels can be any unique Index object
>>
>> It seems like there are three obvious options, from most to least general:
>>
>> 1) Allow levels to be an arbitrary collection of hashable Python objects
>> 2) Allow levels to be a homogeneous collection of objects of any
>> arbitrary numpy dtype
>> 3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)
>>
>> I agree that (3) is a bit limiting. (1) is probably easier to
>> implement than (2). (2) is the most general, since of course
>> "arbitrary Python object" is a dtype. Is it useful to be able to
>> restrict levels to be of homogeneous type? The main difference between
>> dtypes and python types is that (most) dtype scalars can be unboxed --
>> is that substantively useful for levels?
[...]
> I'm in favor of option #2 (a lite version of what I'm doing
> currently -- I handle a few dtypes: PyObject, int64, datetime64,
> float64), though you'd have to go the code-generation route for all
> the dtypes to keep yourself sane if you do that.

Why would you do code generation? dtypes already expose a generic API
for doing boxing/unboxing/etc. Are you thinking this would just be too
slow or...?

-N


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-17 Thread Wes McKinney
On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith  wrote:
> On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney  wrote:
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object
>
> It seems like there are three obvious options, from most to least general:
>
> 1) Allow levels to be an arbitrary collection of hashable Python objects
> 2) Allow levels to be a homogeneous collection of objects of any
> arbitrary numpy dtype
> 3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)
>
> I agree that (3) is a bit limiting. (1) is probably easier to
> implement than (2). (2) is the most general, since of course
> "arbitrary Python object" is a dtype. Is it useful to be able to
> restrict levels to be of homogeneous type? The main difference between
> dtypes and python types is that (most) dtype scalars can be unboxed --
> is that substantively useful for levels?
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
>
> If we have a type restriction on levels (options (2) or (3) above),
> then how to handle out-of-bounds values is quite a problem, yeah. Once
> we have NA dtypes then I suppose we could use those, but we don't yet.
> It's tempting to just error out of any operation that encounters such
> values.
>
>> Nathaniel: my experience (see blog posting above for a bit more) is
>> that khash really crushes PyDict for two reasons: you can use it with
>> primitive types and avoid boxing, and secondly you can preallocate.
>> Its memory footprint with large hashtables is also a fraction of
>> PyDict. The Python memory allocator is not problematic-- if you create
>> millions of Python objects expect the RAM usage of the Python process
>> to balloon absurdly.
>
> Right, I saw that posting -- it's clear that khash has a lot of
> advantages as internal temporary storage for a specific operation like
> groupby on unboxed types. But I can't tell whether those arguments
> still apply now that we're talking about a long-term storage
> representation for data that has to support a variety of operations
> (many of which would require boxing/unboxing, since the API is in
> Python), might or might not use boxed types, etc. Obviously this also
> depends on which of the three options above we go with -- unboxing
> doesn't even make sense for option (1).
>
> -n

I'm in favor of option #2 (a lite version of what I'm doing
currently -- I handle a few dtypes: PyObject, int64, datetime64,
float64), though you'd have to go the code-generation route for all
the dtypes to keep yourself sane if you do that.

- Wes


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-17 Thread Nathaniel Smith
On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney  wrote:
> It looks like the levels can only be strings. This is too limited for
> my needs. Why not support all possible NumPy dtypes? In pandas world,
> the levels can be any unique Index object

It seems like there are three obvious options, from most to least general:

1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogeneous collection of objects of any
arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

I agree that (3) is a bit limiting. (1) is probably easier to
implement than (2). (2) is the most general, since of course
"arbitrary Python object" is a dtype. Is it useful to be able to
restrict levels to be of homogeneous type? The main difference between
dtypes and python types is that (most) dtype scalars can be unboxed --
is that substantively useful for levels?

> What is the story for NA values (NaL?) in a factor array? I code them
> as -1 in the labels, though you could use INT32_MAX or something. This
> is very important in the context of groupby operations.

If we have a type restriction on levels (options (2) or (3) above),
then how to handle out-of-bounds values is quite a problem, yeah. Once
we have NA dtypes then I suppose we could use those, but we don't yet.
It's tempting to just error out of any operation that encounters such
values.
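
A minimal sketch of the -1 sentinel convention Wes describes, with made-up data, skipping missing labels during a grouped sum:

import numpy as np

labels = np.array([0, 1, -1, 1, 0, -1])            # -1 marks a missing level (NaL)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
valid = labels >= 0
group_sums = np.bincount(labels[valid], weights=values[valid], minlength=2)
# group_sums -> array([6., 6.])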

> Nathaniel: my experience (see blog posting above for a bit more) is
> that khash really crushes PyDict for two reasons: you can use it with
> primitive types and avoid boxing, and secondly you can preallocate.
> Its memory footprint with large hashtables is also a fraction of
> PyDict. The Python memory allocator is not problematic-- if you create
> millions of Python objects expect the RAM usage of the Python process
> to balloon absurdly.

Right, I saw that posting -- it's clear that khash has a lot of
advantages as internal temporary storage for a specific operation like
groupby on unboxed types. But I can't tell whether those arguments
still apply now that we're talking about a long-term storage
representation for data that has to support a variety of operations
(many of which would require boxing/unboxing, since the API is in
Python), might or might not use boxed types, etc. Obviously this also
depends on which of the three options above we go with -- unboxing
doesn't even make sense for option (1).

-n


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-14 Thread Thouis (Ray) Jones
On Wed, Jun 13, 2012 at 8:54 PM, Wes McKinney  wrote:
> Nathaniel: my experience (see blog posting above for a bit more) is
> that khash really crushes PyDict for two reasons: you can use it with
> primitive types and avoid boxing, and secondly you can preallocate.
> Its memory footprint with large hashtables is also a fraction of
> PyDict. The Python memory allocator is not problematic-- if you create
> millions of Python objects expect the RAM usage of the Python process
> to balloon absurdly.

The other big reason to consider allowing khash (or some other hash
implementation) within numpy is that you can use it without the GIL.


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-14 Thread Francesc Alted
On 6/13/12 8:12 PM, Nathaniel Smith wrote:
>>> I'm also worried that I still don't see any signs that you're working
>>> with the downstream libraries that this functionality is intended to
>>> be useful for, like the various HDF5 libraries and pandas. I really
>>> don't think this functionality can be merged to numpy until we have
>>> affirmative statements from those developers that they are excited
>>> about it and will use it, and since they're busy people, it's pretty
>>> much your job to track them down and make sure that your code will
>>> solve their problems.
>> Francesc is certainly aware of this work, and I emailed Wes earlier this
>> week, I probably should have mentioned that, though. Hopefully they will
>> have time to contribute their thoughts. I also imagine Travis can speak
>> on behalf of the users he has interacted with over the last several
>> years that have requested this feature that don't happen to follow
>> mailing lists.
> I'm glad Francesc and Wes are aware of the work, but my point was that
> that isn't enough. So if I were in your position and hoping to get
> this code merged, I'd be trying to figure out how to get them more
> actively on board?

Sorry to chime in late.  Yes, I am aware of the improvements that Bryan 
(and Mark) are proposing.  My position here is that I'm very open to 
this (at least from a functional point of view; I have to recognize that 
I have not had a look into the code).

The current situation for the HDF5 wrappers (at least the PyTables ones) is 
that, due to the lack of support for enums in NumPy itself, we had to 
come up with a specific solution for this.  Our approach was pretty simple: 
basically providing an exhaustive set or list of possible, named values 
for different integers.  And although I'm not familiar with the 
implementation details (it was Ivan Vilata who implemented this part), I 
think we used an internal dictionary for doing the translation while 
PyTables presents the enums to the user.
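
Roughly, a dictionary-based translation along these lines (a sketch of the idea only, not the actual PyTables code):

class SimpleEnum:
    """Translate between a fixed set of names and integer codes."""
    def __init__(self, names):
        self._to_int = {name: i for i, name in enumerate(names)}
        self._to_name = {i: name for name, i in self._to_int.items()}

    def __call__(self, name):        # 'HIGH' -> 1
        return self._to_int[name]

    def __getitem__(self, code):     # 1 -> 'HIGH'
        return self._to_name[code]

skill = SimpleEnum(['LOW', 'HIGH'])
assert skill('HIGH') == 1 and skill[1] == 'HIGH'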

Bryan is implementing much more complete (and probably more efficient) 
support for enums in NumPy.  As this is new functionality, and PyTables 
does not rely on it, there is no immediate danger (i.e. a backward 
incompatibility) in introducing the new enums in NumPy.  But they could 
be used in future PyTables versions (and other HDF5 wrappers), which is 
a good thing indeed.

My 2 cents,

-- 
Francesc Alted



Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-14 Thread Dag Sverre Seljebotn
On 06/14/2012 12:06 AM, Bryan Van de Ven wrote:
> On 6/13/12 1:12 PM, Nathaniel Smith wrote:
>> your-branch's-base-master but not in your-repo's-master are new stuff
>> that you did on your branch. Solution is just to do
>> git push   master
>
> Fixed, thanks.
>
>> Yes, of course we *could* write the code to implement these "open"
>> dtypes, and then write the documentation, examples, tutorials, etc. to
>> help people work around their limitations. Or, we could just implement
>> np.fromfile properly, which would require no workarounds and take less
>> code to boot.
>>
>> [snip]
>> So would a proper implementation of np.fromfile that normalized the
>> level ordering.
>
> My understanding of the impetus for the open type was sensitivity to the
> performance of having to make two passes over large text datasets. We'll
> have to get more feedback from users here and input from Travis, I think.

Can't you just build up the array using uint8, collecting enum values in 
a separate dict, and then recast the array with the final enum at the end?

Or, recast the array with a new enum type every time one wants to add an 
enum value? (Similar to how you append to a tuple...)

(Yes, normalizing level ordering requires another pass through the 
parsed data array, but that's unavoidable and rather orthogonal to 
whether one has an open enum dtype API or not.)

A mutable dtype gives me the creeps. dtypes currently implements 
__hash__ and __eq__ and can be used as dict keys, which I think is very 
valuable. Making them sometimes mutable would cause confusing 
situations. There are cases for mutable objects that become immutable, 
but it should be very well motivated as it makes for a much more 
confusing API...
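
For instance, code like the following relies on dtype hashing and equality staying stable, which a mutable dtype would undermine (simplified illustration, hypothetical "kernels" table):

import numpy as np

# dtype objects hash and compare by value today, so they work as dict keys
kernels = {np.dtype('float64'): 'f8 kernel', np.dtype('int32'): 'i4 kernel'}
print(kernels[np.dtype(np.float64)])    # 'f8 kernel'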

Dag


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Bryan Van de Ven
On 6/13/12 5:11 PM, Wes McKinney wrote:
> And retrieving group indicies/summing:
>
> In [8]: %timeit arr=='a'
> 1000 loops, best of 3: 1.52 ms per loop
> In [10]: vals = np.random.randn(100)
> In [20]: inds = [arr==x for x in lets]
> In [23]: %timeit for ind in inds: vals[ind].sum()
> 10 loops, best of 3: 48.3 ms per loop
> (FYI you're comparing an O(NK) algorithm with an O(N) algorithm for small K)

I am not familiar with the details of your groupby implementation 
(evidently!); consider me appropriately chastised.

Bryan


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Bryan Van de Ven
On 6/13/12 1:12 PM, Nathaniel Smith wrote:
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
>git push  master

Fixed, thanks.

> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
>
> [snip]
> So would a proper implementation of np.fromfile that normalized the
> level ordering.

My understanding of the impetus for the open type was sensitivity to the 
performance of having to make two passes over large text datasets. We'll 
have to get more feedback from users here and input from Travis, I think.

> categories in their data, I don't know. But all your arguments here
> seem to be of the form "hey, it's not *that* bad", and it seems like
> there must be some actual affirmative advantages it has over PyDict if
> it's going to be worth using.

I should have been more specific about the performance concerns. Wes 
summed them up, though: better space efficiency, and not having to 
box/unbox native types.

>> I think I like "categorical" over "factor" but I am not sure we should
>> ditch "enum". There are two different use cases here: I have a pile of
>> strings (or scalars) that I want to treat as discrete things
>> (categories), and: I have a pile of numbers that I want to give
>> convenient or meaningful names to (enums). This latter case was the
>> motivation for possibly adding "Natural Naming".
> So mention the word "enum" in the documentation, so people looking for
> that will find the categorical data support? :-)

I'm not sure I follow. Natural Naming seems like a great idea for people 
that want something like an actual enum (i.e., a way to avoid magic 
numbers). We could even imagine some nice with-hacks:

     colors = enum(['red', 'green', 'blue'])
     with colors:
         foo.fill(red)
         bar.fill(blue)

But natural naming will not work with many category names (e.g., "VERY HIGH") 
that have spaces, etc. So, we could add a parameter to factor(...) 
that turns natural naming on and off for a dtype object when it is created:

colors = factor(['red', 'green', 'blue'], closed=True, natural_naming=False)

vs

colors = enum(['red', 'green', 'blue'])

I think the latter is better, not only because it is more parsimonious, 
but because it also expresses intent better. Or we can just not have 
natural naming at all, if no one wants it. It hasn't been implemented 
yet, so that would be a snap. :) Hopefully we'll get more feedback from 
the list.

>>> I'm disturbed to see you adding special cases to the core ufunc
>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>> up the generic ufunc machinery so that it doesn't need special cases
>>> to handle adding a simple type like this.
>> This could certainly be improved, I agree.
> I don't want to be Mr. Grumpypants here, but I do want to make sure
> we're speaking the same language: what "-1" means is "I consider this
> a show-stopper and will oppose merging any code that does not improve
> on this". (Of course you also always have the option of trying to
> change my mind. Even Mr. Grumpypants can be swayed by logic!)
Well, a few comments. The special case in array_richcompare is due to 
the lack of string ufuncs. I think it would be great to have string 
ufuncs, but I also think it is a separate concern and outside the scope 
of this proposal. The special case in arraydescr_typename_get is for the 
same reason as the datetime special case: the need to access dtype metadata. 
I don't think you are really concerned about these two, though?

That leaves the special case in 
PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit 
when I put that in. On the other hand, having dtypes with this extent of 
attached metadata, and potentially dynamic metadata, is unique in NumPy. 
It was simple and straightforward to add those few lines of code, and 
does not affect performance. How invasive will the changes to core ufunc 
machinery be to accommodate a type like this more generally? I took the 
easy way because I was new to the numpy codebase and did not feel 
confident mucking with the central ufunc code. However, maybe the 
dispatch can be accomplished easily with the casting machinery. I am not 
so sure, I will have to investigate.  Of course, I welcome input, 
suggestions, and proposals on the best way to improve this.

>> I'm glad Francesc and Wes are aware of the work, but my point was that
>> that isn't enough. So if I were in your position and hoping to get
>> this code merged, I'd be trying to figure out how to get them more
>> actively on board?

Is there some other way besides responding to and attempting to 
accommodate technical needs?

Bryan




Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Wes McKinney
On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven  wrote:
> On 6/13/12 1:54 PM, Wes McKinney wrote:
>> OK, I need to spend some time on this as it will directly impact me.
>> Random thoughts here.
>>
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object (note, I'm going to change
>> the name of the Factor class to Categorical before 0.8.0 final per
>> discussion with Nathaniel):
>
> The current for-discussion prototype only supports strings. I
> had mentioned integral levels in the NEP but wanted to get more feedback
> first. It looks like you are using intervals as levels in things like
> qcut? This would add some complexity. I can think of a couple of
> possible approaches; I will have to try a few of them out to see which
> would make the most sense.
>
>> The API for constructing an enum/factor/categorical array from fixed
>> levels and an array of labels seems somewhat weak to me. A very common
>> scenario is to need to construct a factor from an array of integers
>> with an associated array of levels:
>>
>>
>> In [13]: labels
>> Out[13]:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>         7, 9, 7, 3, 3, 0, 4, 4])
>>
>> In [14]: levels
>> Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>
>> In [15]: Factor(labels, levels)
>> Out[15]:
>> Factor:
>> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>>         7, 9, 7, 3, 3, 0, 4, 4])
>> Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>
> I originally had a very similar interface in the NEP. I was persuaded by
> Mark that this would be redundant:
>
> In [10]: levels = np.factor(['a', 'b', 'c'])   # or levels =
> np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
> In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
> Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
> dtype='factor({'c': 2, 'a': 0, 'b': 1})')
>
> This should also spell even more closely to your example as:
>
> labels.astype(levels)
>
> but I have not done much with casting yet, so this currently complains.
> However, would this satisfy your needs (modulo the separate question
> about more general integral or object levels)?
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
> I am just using INT32_MIN at the moment.
>> Are the levels ordered (Nathaniel brought this up already looks like)?
>> It doesn't look like it. That is also necessary. You also need to be
>
> They currently compare based on their value:
>
> In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
> np.factor({'c':0, 'b':1, 'a':2}))
> In [21]: arr
> Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
> dtype='factor({'c': 0, 'a': 2, 'b': 1})')
> In [22]: arr.sort()
> In [23]: arr
> Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
> dtype='factor({'c': 0, 'a': 2, 'b': 1})')
>
>
>> able to sort the levels (which is a relabeling, I have lots of code in
>> use for this). In the context of groupby in pandas, when processing a
>> key (array of values) to a factor to be used for aggregating some
>> data, you have the option of returning an object that has the levels
>> as observed in the data or sorting. Sorting can obviously be very
>> expensive depending on the number of groups in the data
>> (http://wesmckinney.com/blog/?p=437). Example:
>>
>> from pandas import DataFrame
>> from pandas.util.testing import rands
>> import numpy as np
>>
>> df = DataFrame({'key' : [rands(10) for _ in xrange(10)] * 10,
>>             'data' : np.random.randn(100)})
>>
>> In [32]: timeit df.groupby('key').sum()
>> 1 loops, best of 3: 374 ms per loop
>>
>> In [33]: timeit df.groupby('key', sort=False).sum()
>> 10 loops, best of 3: 185 ms per loop
>>
>> The "factorization time" for the `key` column dominates the runtime;
>> the factor is computed once then reused if you keep the GroupBy object
>> around:
>>
>> In [36]: timeit grouped.sum()
>> 100 loops, best of 3: 6.05 ms per loop
> Just some numbers for comparison. Factorization times:
>
> In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
> In [42]: levels = np.factor(lets)
> In [43]: data = [lets[int(x)] for x in np.random.randn(100)]
> In [44]: %timeit np.arr

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Bryan Van de Ven
On 6/13/12 1:54 PM, Wes McKinney wrote:
> OK, I need to spend some time on this as it will directly impact me.
> Random thoughts here.
>
> It looks like the levels can only be strings. This is too limited for
> my needs. Why not support all possible NumPy dtypes? In pandas world,
> the levels can be any unique Index object (note, I'm going to change
> the name of the Factor class to Categorical before 0.8.0 final per
> discussion with Nathaniel):

The current for-discussion prototype only supports strings. I 
had mentioned integral levels in the NEP but wanted to get more feedback 
first. It looks like you are using intervals as levels in things like 
qcut? This would add some complexity. I can think of a couple of 
possible approaches; I will have to try a few of them out to see which 
would make the most sense.

> The API for constructing an enum/factor/categorical array from fixed
> levels and an array of labels seems somewhat weak to me. A very common
> scenario is to need to construct a factor from an array of integers
> with an associated array of levels:
>
>
> In [13]: labels
> Out[13]:
> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
> 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
> 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
> 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
> 7, 9, 7, 3, 3, 0, 4, 4])
>
> In [14]: levels
> Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>
> In [15]: Factor(labels, levels)
> Out[15]:
> Factor:
> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
> 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
> 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
> 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
> 7, 9, 7, 3, 3, 0, 4, 4])
> Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

I originally had a very similar interface in the NEP. I was persuaded by 
Mark that this would be redundant:

In [10]: levels = np.factor(['a', 'b', 'c'])   # or levels = 
np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], 
dtype='factor({'c': 2, 'a': 0, 'b': 1})')

This should also spell even more closely to your example as:

labels.astype(levels)

but I have not done much with casting yet, so this currently complains. 
However, would this satisfy your needs (modulo the separate question 
about more general integral or object levels)?

> What is the story for NA values (NaL?) in a factor array? I code them
> as -1 in the labels, though you could use INT32_MAX or something. This
> is very important in the context of groupby operations.
I am just using INT32_MIN at the moment.
> Are the levels ordered (Nathaniel brought this up already looks like)?
> It doesn't look like it. That is also necessary. You also need to be

They currently compare based on their value:

In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], 
np.factor({'c':0, 'b':1, 'a':2}))
In [21]: arr
Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], 
dtype='factor({'c': 0, 'a': 2, 'b': 1})')
In [22]: arr.sort()
In [23]: arr
Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'], 
dtype='factor({'c': 0, 'a': 2, 'b': 1})')


> able to sort the levels (which is a relabeling, I have lots of code in
> use for this). In the context of groupby in pandas, when processing a
> key (array of values) to a factor to be used for aggregating some
> data, you have the option of returning an object that has the levels
> as observed in the data or sorting. Sorting can obviously be very
> expensive depending on the number of groups in the data
> (http://wesmckinney.com/blog/?p=437). Example:
>
> from pandas import DataFrame
> from pandas.util.testing import rands
> import numpy as np
>
> df = DataFrame({'key' : [rands(10) for _ in xrange(10)] * 10,
> 'data' : np.random.randn(100)})
>
> In [32]: timeit df.groupby('key').sum()
> 1 loops, best of 3: 374 ms per loop
>
> In [33]: timeit df.groupby('key', sort=False).sum()
> 10 loops, best of 3: 185 ms per loop
>
> The "factorization time" for the `key` column dominates the runtime;
> the factor is computed once then reused if you keep the GroupBy object
> around:
>
> In [36]: timeit grouped.sum()
> 100 loops, best of 3: 6.05 ms per loop
Just some numbers for comparison. Factorization times:

In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [42]: levels = np.factor(lets)
In [43]: data = [lets[int(x)] for x in np.random.randn(100)]
In [44]: %timeit np.array(data, levels)
10 loops, best of 3: 137 ms per loop

And retrieving group indicies/summing:

In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per loop
In [10]: vals = np.random.randn(10

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Wes McKinney
On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith  wrote:
> On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven  wrote:
>> On 6/13/12 8:33 AM, Nathaniel Smith wrote:
>>> Hi Bryan,
>>>
>>> I skimmed over the diff:
>>>     https://github.com/bryevdv/numpy/compare/master...enum
>>> It was a bit hard to read since it seems like about half the changes
>>> in that branch are datetime cleanups or something? I hope you'll
>>> separate those out -- it's much easier to review self-contained
>>> changes, and the more changes you roll together into a big lump, the
>>> more risk there is that they'll get lost all together.
>>
>> I'm not quite sure what happened there, my git skills are not advanced
>> by any measure. I think the datetime changes are a much smaller fraction
>> than fifty percent, but I will see what I can do to separate them out in
>> the near future.
>
> Looking again, it looks like a lot of it is actually because when I
> asked github to show me the diff between your branch and master, it
> showed me the diff between your branch and *your repository's* version
> of master. And your branch is actually based off a newer version of
> 'master' than you have in your repository. So, as far as git and
> github are concerned, all those changes that are included in
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
>  git push  master
>
>>>  From the updated NEP I actually understand the use case for "open
>>> types" now, so that's good :-). But I don't think they're actually
>>> workable, so that's bad :-(. The use case, as I understand it, is for
>>> when you want to extend the levels set on the fly as you read through
>>> a file. The problem with this is that it produces a non-deterministic
>>> level ordering, where level 0 is whatever was seen first in the file,
>>> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
>>> I read in:
>>>
>>>      subject,initial_skill,skill_after_training
>>>      1,LOW,HIGH
>>>      2,LOW,LOW
>>>      3,HIGH,HIGH
>>>      ...
>>>
>>> With the scheme described in the NEP, my initial_skill dtype will have
>>> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
>>> levels ["HIGH","LOW"], which means that their storage will be
>>> incompatible, comparisons won't work (or will have to go through some
>>
>> I imagine users using the same open dtype object in both fields of the
>> structure dtype used to read in the file, if both fields of the file
>> contain the same categories. If they don't contain the same categories,
>> they are incomparable in any case. I believe many users have this
>> simpler use case where each field is a separate category, and they want
>> to read them all individually, separately on the fly.  For these simple
>> cases, it would "just work". For your case example there would
>> definitely be a documentation, examples, tutorials, education issue, to
>> avoid the "gotcha" you describe.
>
> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
>
>>> nasty convert-to-string-and-back path), etc. Another situation where
>>> this will occur is if you have multiple data files in the same format;
>>> whether or not you're able to compare the data from them will depend
>>> on the order the data happens to occur in in each file. The solution
>>> is that whenever we automagically create a set of levels from some
>>> data, and the user hasn't specified any order, we should pick an order
>>> deterministically by sorting the levels. (This is also what R does.
>>> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
>>> ->  "a", "b".)
>>
>> A solution is to create the dtype object when reading in the first file,
>> and to reuse that same dtype object when reading in subsequent files.
>> Perhaps it's not ideal, but it does enable the work to be done.
>
> So would a proper implementation of np.fromfile that normalized the
> level ordering.
>
>>> Can you explain why you're using khash instead of PyDict? It seems to
>>> add a *lot* of complexity -- like it seems like you're using about as
>>> many lines of code just marshalling data into and out of the khash as
>>> I used for my old npenum.pyx prototype (not even counting all the
>>> extra work required to , and AFAICT my prototype has about the same
>>> amount of functionality as this. (Of course that's not entirely fair,
>>> because I was working in Cython... but why not work in Cython?) And
>>> you'll need to expose a Python dict interface sooner or later anyway,
>>> I'd think?
>>
>> I suppose I agree with the sentiment that the core of NumPy really ought
>> to be less dependent on the Python C API, not more. I also think the
>> khash API is pretty dead sim

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Nathaniel Smith
On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven  wrote:
> On 6/13/12 8:33 AM, Nathaniel Smith wrote:
>> Hi Bryan,
>>
>> I skimmed over the diff:
>>     https://github.com/bryevdv/numpy/compare/master...enum
>> It was a bit hard to read since it seems like about half the changes
>> in that branch are datetime cleanups or something? I hope you'll
>> separate those out -- it's much easier to review self-contained
>> changes, and the more changes you roll together into a big lump, the
>> more risk there is that they'll get lost all together.
>
> I'm not quite sure what happened there, my git skills are not advanced
> by any measure. I think the datetime changes are a much smaller fraction
> than fifty percent, but I will see what I can do to separate them out in
> the near future.

Looking again, it looks like a lot of it is actually because when I
asked github to show me the diff between your branch and master, it
showed me the diff between your branch and *your repository's* version
of master. And your branch is actually based off a newer version of
'master' than you have in your repository. So, as far as git and
github are concerned, all those changes that are included in
your-branch's-base-master but not in your-repo's-master are new stuff
that you did on your branch. Solution is just to do
  git push  master

>>  From the updated NEP I actually understand the use case for "open
>> types" now, so that's good :-). But I don't think they're actually
>> workable, so that's bad :-(. The use case, as I understand it, is for
>> when you want to extend the levels set on the fly as you read through
>> a file. The problem with this is that it produces a non-deterministic
>> level ordering, where level 0 is whatever was seen first in the file,
>> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
>> I read in:
>>
>>      subject,initial_skill,skill_after_training
>>      1,LOW,HIGH
>>      2,LOW,LOW
>>      3,HIGH,HIGH
>>      ...
>>
>> With the scheme described in the NEP, my initial_skill dtype will have
>> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
>> levels ["HIGH","LOW"], which means that their storage will be
>> incompatible, comparisons won't work (or will have to go through some
>
> I imagine users using the same open dtype object in both fields of the
> structure dtype used to read in the file, if both fields of the file
> contain the same categories. If they don't contain the same categories,
> they are incomparable in any case. I believe many users have this
> simpler use case where each field is a separate category, and they want
> to read them all individually, separately on the fly.  For these simple
> cases, it would "just work". For your case example there would
> definitely be a documentation, examples, tutorials, education issue, to
> avoid the "gotcha" you describe.

Yes, of course we *could* write the code to implement these "open"
dtypes, and then write the documentation, examples, tutorials, etc. to
help people work around their limitations. Or, we could just implement
np.fromfile properly, which would require no workarounds and take less
code to boot.

>> nasty convert-to-string-and-back path), etc. Another situation where
>> this will occur is if you have multiple data files in the same format;
>> whether or not you're able to compare the data from them will depend
>> on the order the data happens to occur in in each file. The solution
>> is that whenever we automagically create a set of levels from some
>> data, and the user hasn't specified any order, we should pick an order
>> deterministically by sorting the levels. (This is also what R does.
>> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
>> ->  "a", "b".)
>
> A solution is to create the dtype object when reading in the first file,
> and to reuse that same dtype object when reading in subsequent files.
> Perhaps it's not ideal, but it does enable the work to be done.

So would a proper implementation of np.fromfile that normalized the
level ordering.

>> Can you explain why you're using khash instead of PyDict? It seems to
>> add a *lot* of complexity -- like it seems like you're using about as
>> many lines of code just marshalling data into and out of the khash as
>> I used for my old npenum.pyx prototype (not even counting all the
>> extra work required to , and AFAICT my prototype has about the same
>> amount of functionality as this. (Of course that's not entirely fair,
>> because I was working in Cython... but why not work in Cython?) And
>> you'll need to expose a Python dict interface sooner or later anyway,
>> I'd think?
>
> I suppose I agree with the sentiment that the core of NumPy really ought
> to be less dependent on the Python C API, not more. I also think the
> khash API is pretty dead simple and straightforward, and the fact that
> it is contained in a single header is attractive.  It's also quite
> performant in time and space. But if others disagree 

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Dag Sverre Seljebotn


Nathaniel Smith  wrote:

>On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn
> wrote:
>> On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
>>> I'm inclined to say therefore that we should just drop the "open type"
>>> idea, since it adds complexity but doesn't seem to actually solve the
>>> problem it's designed for.
>>
>> If one wants to have an "open", hassle-free enum, an alternative would
>> be to cryptographically hash the enum string. I'd trust 64 bits of hash
>> for this purpose.
>>
>> The obvious disadvantage is the extra space used, but it'd be a bit more
>> hassle-free compared to regular enums; you'd never have to fix the set
>> of enum strings and they'd always be directly comparable across
>> different arrays. HDF libraries etc. could compress it at the storage
>> layer, storing the enum mapping in the metadata.
>
>You'd trust 64 bits to be collision-free for all strings ever stored
>in numpy, eternally? I wouldn't. Anyway, if the goal is to store an
>arbitrary set of strings in 64 bits apiece, then there is no downside
>to just using an object array + interning (like pandas does now), and
>this *is* guaranteed to be collision free. Maybe it would be useful to
>have a "heap string" dtype, but that'd be something different.

Heh, we've been having this discussion before :-)

The 'interned heap string dtype' may be something different, but it could 
meet the 'open enum' use cases (assuming they exist) in a better way than 
making enums complicated.

Consider it a backup strategy if one can't otherwise put the open enum idea 
to rest.

>
>AFAIK all the cases where an explicit categorical type adds value over
>this are the ones where having an explicit set of levels is useful.
>Representing HDF5 enums or R factors requires a way to specify
>arbitrary string<->integer mappings, and there are algorithms (e.g. in
>charlton) that are much more efficient if they can figure out what the
>set of possible levels is directly without scanning the whole array.

For interned strings, the set of strings present could be stored in the array 
in principle (though I guess it would be very difficult to implement in current 
numpy).

The perfect hash schemes we've explored on the Cython list lately use around 
10-20 microseconds on my 1.8 GHz machine for 64-element table rehashing (worst-case 
insertion, which happens more often than insertion in regular hash tables) and 
0.5-2 nanoseconds for a lookup in L1 (which always hits on the first try if the 
entry is in the table).

Dag



-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Bryan Van de Ven
On 6/13/12 8:33 AM, Nathaniel Smith wrote:
> Hi Bryan,
>
> I skimmed over the diff:
> https://github.com/bryevdv/numpy/compare/master...enum
> It was a bit hard to read since it seems like about half the changes
> in that branch are datetime cleanups or something? I hope you'll
> separate those out -- it's much easier to review self-contained
> changes, and the more changes you roll together into a big lump, the
> more risk there is that they'll get lost all together.

I'm not quite sure what happened there, my git skills are not advanced 
by any measure. I think the datetime changes are a much smaller fraction 
than fifty percent, but I will see what I can do to separate them out in 
the near future.

>  From the updated NEP I actually understand the use case for "open
> types" now, so that's good :-). But I don't think they're actually
> workable, so that's bad :-(. The use case, as I understand it, is for
> when you want to extend the levels set on the fly as you read through
> a file. The problem with this is that it produces a non-deterministic
> level ordering, where level 0 is whatever was seen first in the file,
> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
> I read in:
>
>  subject,initial_skill,skill_after_training
>  1,LOW,HIGH
>  2,LOW,LOW
>  3,HIGH,HIGH
>  ...
>
> With the scheme described in the NEP, my initial_skill dtype will have
> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
> levels ["HIGH","LOW"], which means that their storage will be
> incompatible, comparisons won't work (or will have to go through some

I imagine users using the same open dtype object in both fields of the 
structure dtype used to read in the file, if both fields of the file 
contain the same categories. If they don't contain the same categories, 
they are incomparable in any case. I believe many users have this 
simpler use case where each field is a separate category, and they want 
to read them all individually, separately on the fly.  For these simple 
cases, it would "just work". For your case example there would 
definitely be a documentation, examples, tutorials, education issue, to 
avoid the "gotcha" you describe.

> nasty convert-to-string-and-back path), etc. Another situation where
> this will occur is if you have multiple data files in the same format;
> whether or not you're able to compare the data from them will depend
> on the order the data happens to occur in in each file. The solution
> is that whenever we automagically create a set of levels from some
> data, and the user hasn't specified any order, we should pick an order
> deterministically by sorting the levels. (This is also what R does.
> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
> ->  "a", "b".)

A solution is to create the dtype object when reading in the first file, 
and to reuse that same dtype object when reading in subsequent files. 
Perhaps it's not ideal, but it does enable the work to be done.

> Can you explain why you're using khash instead of PyDict? It seems to
> add a *lot* of complexity -- like it seems like you're using about as
> many lines of code just marshalling data into and out of the khash as
> I used for my old npenum.pyx prototype (not even counting all the
> extra work required to , and AFAICT my prototype has about the same
> amount of functionality as this. (Of course that's not entirely fair,
> because I was working in Cython... but why not work in Cython?) And
> you'll need to expose a Python dict interface sooner or later anyway,
> I'd think?

I suppose I agree with the sentiment that the core of NumPy really ought 
to be less dependent on the Python C API, not more. I also think the 
khash API is pretty dead simple and straightforward, and the fact that 
it is contained in a single header is attractive.  It's also quite 
performant in time and space. But if others disagree strongly, all of 
its uses are hidden behind the interface in leveled_dtypes.c, so it could 
be replaced with some other mechanism easily enough.

> I can't tell if it's worth having categorical scalar types. What value
> do they provide over just using scalars of the level type?

I'm not certain they are worthwhile either, which is why I did not spend 
any time on them yet. Wes has expressed a desire for very broad 
categorical types (even more than just scalar categories), hopefully he 
can chime in with his motivations.

> Terminology: I'd like to suggest we prefer the term "categorical" for
> this data, rather than "factor" or "enum". Partly this is because it
> makes my life easier ;-):
>
> https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
> and partly because numpy has a very diverse set of users and I suspect
> that "categorical" will just be a more transparent name to those who
> aren't already familiar with the particular statistical and
> programming traditions that "factor" and "enum" come from.

I thin

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Nathaniel Smith
On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn
 wrote:
> On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
>> I'm inclined to say therefore that we should just drop the "open type"
>> idea, since it adds complexity but doesn't seem to actually solve the
>> problem it's designed for.
>
> If one wants to have an "open", hassle-free enum, an alternative would
> be to cryptographically hash the enum string. I'd trust 64 bits of hash
> for this purpose.
>
> The obvious disadvantage is the extra space used, but it'd be a bit more
> hassle-free compared to regular enums; you'd never have to fix the set
> of enum strings and they'd always be directly comparable across
> different arrays. HDF libraries etc. could compress it at the storage
> layer, storing the enum mapping in the metadata.

You'd trust 64 bits to be collision-free for all strings ever stored
in numpy, eternally? I wouldn't. Anyway, if the goal is to store an
arbitrary set of strings in 64 bits apiece, then there is no downside
to just using an object array + interning (like pandas does now), and
this *is* guaranteed to be collision free. Maybe it would be useful to
have a "heap string" dtype, but that'd be something different.

AFAIK all the cases where an explicit categorical type adds value over
this are the ones where having an explicit set of levels is useful.
Representing HDF5 enums or R factors requires a way to specify
arbitrary string<->integer mappings, and there are algorithms (e.g. in
charlton) that are much more efficient if they can figure out what the
set of possible levels is directly without scanning the whole array.

-N


Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Dag Sverre Seljebotn
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
> On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven  
> wrote:
>> Hi all,
>>
>> It has been some time, but I do have an update regarding this proposed
>> feature. I thought it would be helpful to flesh out some parts of a
>> possible implementation to learn what can be spelled reasonably in
>> NumPy. Mark Wiebe helped out greatly in navigating the NumPy code
>> codebase. Here is a link to my branch with this code;
>>
>>  https://github.com/bryevdv/numpy/tree/enum
>>
>> and the updated NEP:
>>
>>  https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>>
>> Not everything in the NEP is implemented (integral levels and natural
>> naming in particular) and some parts definitely need more fleshing out.
>> However, things currently work basically as described in the NEP, and
>> there is also a small set of tests that demonstrate current usage. A few
>> things will crash python (astype especially). More tests are needed. I
>> would appreciate as much feedback and discussion as you can provide!
>
> Hi Bryan,
>
> I skimmed over the diff:
> https://github.com/bryevdv/numpy/compare/master...enum
> It was a bit hard to read since it seems like about half the changes
> in that branch are datetime cleanups or something? I hope you'll
> separate those out -- it's much easier to review self-contained
> changes, and the more changes you roll together into a big lump, the
> more risk there is that they'll get lost all together.
>
>  From the updated NEP I actually understand the use case for "open
> types" now, so that's good :-). But I don't think they're actually
> workable, so that's bad :-(. The use case, as I understand it, is for
> when you want to extend the levels set on the fly as you read through
> a file. The problem with this is that it produces a non-deterministic
> level ordering, where level 0 is whatever was seen first in the file,
> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
> I read in:
>
>  subject,initial_skill,skill_after_training
>  1,LOW,HIGH
>  2,LOW,LOW
>  3,HIGH,HIGH
>  ...
>
> With the scheme described in the NEP, my initial_skill dtype will have
> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
> levels ["HIGH","LOW"], which means that their storage will be
> incompatible, comparisons won't work (or will have to go through some
> nasty convert-to-string-and-back path), etc. Another situation where
> this will occur is if you have multiple data files in the same format;
> whether or not you're able to compare the data from them will depend
> on the order the data happens to occur in in each file. The solution
> is that whenever we automagically create a set of levels from some
> data, and the user hasn't specified any order, we should pick an order
> deterministically by sorting the levels. (This is also what R does.
> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
> ->  "a", "b".)
>
> I'm inclined to say therefore that we should just drop the "open type"
> idea, since it adds complexity but doesn't seem to actually solve the
> problem it's designed for.

If one wants to have an "open", hassle-free enum, an alternative would 
be to cryptographically hash the enum string. I'd trust 64 bits of hash 
for this purpose.

The obvious disadvantage is the extra space used, but it'd be a bit more 
hassle-free compared to regular enums; you'd never have to fix the set 
of enum strings and they'd always be directly comparable across 
different arrays. HDF libraries etc. could compress it at the storage 
layer, storing the enum mapping in the metadata.
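
Roughly (untested sketch; SHA-1 truncated to 64 bits is an arbitrary
choice here, any decent hash would do):

    import hashlib
    import struct
    import numpy as np

    def hash_level(name):
        # First 64 bits of a cryptographic hash of the level string;
        # collisions are negligible for any realistic number of levels.
        digest = hashlib.sha1(name.encode("utf-8")).digest()
        return struct.unpack("<Q", digest[:8])[0]

    a = np.array([hash_level(s) for s in ["red", "green", "red"]],
                 dtype=np.uint64)
    b = np.array([hash_level(s) for s in ["green", "red", "red"]],
                 dtype=np.uint64)

    # Directly comparable across arrays with no shared level table; the
    # hash -> string mapping would live in metadata (e.g. HDF5 attributes).
    print(a == b)    # [False False  True]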

Just a thought.

Dag

>
> Can you explain why you're using khash instead of PyDict? It seems to
> add a *lot* of complexity -- like it seems like you're using about as
> many lines of code just marshalling data into and out of the khash as
> I used for my old npenum.pyx prototype (not even counting all the
> extra work that requires), and AFAICT my prototype has about the same
> amount of functionality as this. (Of course that's not entirely fair,
> because I was working in Cython... but why not work in Cython?) And
> you'll need to expose a Python dict interface sooner or later anyway,
> I'd think?
>
> I can't tell if it's worth having categorical scalar types. What value
> do they provide over just using scalars of the level type?
>
> Terminology: I'd like to suggest we prefer the term "categorical" for
> this data, rather than "factor" or "enum". Partly this is because it
> makes my life easier ;-):
>
> https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
> and partly because numpy has a very diverse set of users and I suspect
> that "categorical" will just be a more transparent name to those who
> aren't already familiar with the particular statistical and
> programming traditions that "factor" and "enum" come from.
>
> I'm disturbed to see you adding special cases to the core

Re: [Numpy-discussion] Enum/Factor NEP (now with code)

2012-06-13 Thread Nathaniel Smith
On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven  wrote:
> Hi all,
>
> It has been some time, but I do have an update regarding this proposed
> feature. I thought it would be helpful to flesh out some parts of a
> possible implementation to learn what can be spelled reasonably in
> NumPy. Mark Wiebe helped out greatly in navigating the NumPy codebase.
> Here is a link to my branch with this code:
>
>     https://github.com/bryevdv/numpy/tree/enum
>
> and the updated NEP:
>
>     https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> Not everything in the NEP is implemented (integral levels and natural
> naming in particular) and some parts definitely need more fleshing out.
> However, things currently work basically as described in the NEP, and
> there is also a small set of tests that demonstrate current usage. A few
> things will crash python (astype especially). More tests are needed. I
> would appreciate as much feedback and discussion as you can provide!

Hi Bryan,

I skimmed over the diff:
   https://github.com/bryevdv/numpy/compare/master...enum
It was a bit hard to read since it seems like about half the changes
in that branch are datetime cleanups or something? I hope you'll
separate those out -- it's much easier to review self-contained
changes, and the more changes you roll together into a big lump, the
more risk there is that they'll get lost altogether.

From the updated NEP I actually understand the use case for "open
types" now, so that's good :-). But I don't think they're actually
workable, so that's bad :-(. The use case, as I understand it, is for
when you want to extend the levels set on the fly as you read through
a file. The problem with this is that it produces a non-deterministic
level ordering, where level 0 is whatever was seen first in the file,
level 1 is whatever was seen second, etc. E.g., say I have a CSV file
I read in:

subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...

With the scheme described in the NEP, my initial_skill dtype will have
levels ["LOW", "HIGH"], and by skill_after_training dtype will have
levels ["HIGH","LOW"], which means that their storage will be
incompatible, comparisons won't work (or will have to go through some
nasty convert-to-string-and-back path), etc. Another situation where
this will occur is if you have multiple data files in the same format;
whether or not you're able to compare the data from them will depend
on the order the data happens to occur in each file. The solution
is that whenever we automagically create a set of levels from some
data, and the user hasn't specified any order, we should pick an order
deterministically by sorting the levels. (This is also what R does.
levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a")))
-> "a", "b".)

I'm inclined to say therefore that we should just drop the "open type"
idea, since it adds complexity but doesn't seem to actually solve the
problem it's designed for.

Can you explain why you're using khash instead of PyDict? It seems to
add a *lot* of complexity -- like it seems like you're using about as
many lines of code just marshalling data into and out of the khash as
I used for my old npenum.pyx prototype (not even counting all the
extra work that requires), and AFAICT my prototype has about the same
amount of functionality as this. (Of course that's not entirely fair,
because I was working in Cython... but why not work in Cython?) And
you'll need to expose a Python dict interface sooner or later anyway,
I'd think?
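
For comparison, the kind of bookkeeping a PyDict-backed mapping needs is
roughly this (a sketch, not the prototype's actual code):

    import numpy as np

    class LevelMap(object):
        """Sketch of a dict-backed level <-> code mapping."""

        def __init__(self, levels):
            self._levels = list(levels)
            self._codes = dict((name, i) for i, name in enumerate(self._levels))

        def encode(self, values):
            # Level names -> integer codes.
            return np.array([self._codes[v] for v in values], dtype=np.intp)

        def decode(self, codes):
            # Integer codes -> level names.
            return [self._levels[c] for c in codes]

        def as_dict(self):
            # The Python-facing dict interface comes essentially for free.
            return dict(self._codes)

    m = LevelMap(["HIGH", "LOW"])
    print(m.encode(["LOW", "HIGH", "LOW"]))   # [1 0 1]
    print(m.as_dict())                        # {'HIGH': 0, 'LOW': 1}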

I can't tell if it's worth having categorical scalar types. What value
do they provide over just using scalars of the level type?

Terminology: I'd like to suggest we prefer the term "categorical" for
this data, rather than "factor" or "enum". Partly this is because it
makes my life easier ;-):
  https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
and partly because numpy has a very diverse set of users and I suspect
that "categorical" will just be a more transparent name to those who
aren't already familiar with the particular statistical and
programming traditions that "factor" and "enum" come from.

I'm disturbed to see you adding special cases to the core ufunc
dispatch machinery for these things. I'm -1 on that. We should clean
up the generic ufunc machinery so that it doesn't need special cases
to handle adding a simple type like this.

I'm also worried that I still don't see any signs that you're working
with the downstream libraries that this functionality is intended to
be useful for, like the various HDF5 libraries and pandas. I really
don't think this functionality can be merged to numpy until we have
affirmative statements from those developers that they are excited
about it and will use it, and since they're busy people, it's pretty
much your job to track them down and make sure that your code will
solve their problems.

Hope that helps -- it's exciting to see s