On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> It looks like the levels can only be strings. This is too limited for
> my needs. Why not support all possible NumPy dtypes? In pandas world,
> the levels can be any unique Index object

It seems like there are three obvious options, from most to least general:

1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogeneous collection of objects of any arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

I agree that (3) is a bit limiting. (1) is probably easier to implement than (2). (2) is actually the most general, since of course "arbitrary Python object" is a dtype. Is it useful to be able to restrict levels to be of homogeneous type? The main difference between dtypes and python types is that (most) dtype scalars can be unboxed -- is that substantively useful for levels?

> What is the story for NA values (NaL?) in a factor array? I code them
> as -1 in the labels, though you could use INT32_MAX or something. This
> is very important in the context of groupby operations.

If we have a type restriction on levels (options (2) or (3) above), then how to handle out-of-bounds values is quite a problem, yeah. Once we have NA dtypes then I suppose we could use those, but we don't yet. It's tempting to just error out of any operation that encounters such values.

> Nathaniel: my experience (see blog posting above for a bit more) is
> that khash really crushes PyDict for two reasons: you can use it with
> primitive types and avoid boxing, and secondly you can preallocate.
> Its memory footprint with large hashtables is also a fraction of
> PyDict. The Python memory allocator is not problematic-- if you create
> millions of Python objects expect the RAM usage of the Python process
> to balloon absurdly.

Right, I saw that posting -- it's clear that khash has a lot of advantages as internal temporary storage for a specific operation like groupby on unboxed types.
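
To make the levels/labels terminology concrete, here is a minimal pure-Python sketch of option (1) with NA coded as -1 in the labels, as in the quoted suggestion. The names `encode_factor`, `decode_factor` and `NA_CODE` are made up for illustration; this is not the proposed NumPy API, nor how pandas does it internally. The labels live in an unboxed `array("i")`, while the levels stay ordinary (boxed) Python objects.

```python
from array import array

NA_CODE = -1  # hypothetical sentinel for a missing value in the labels


def encode_factor(values, na=None):
    """Encode values as (levels, labels); entries identical to `na` are missing."""
    levels = []          # unique levels, in first-seen order (boxed objects)
    code_of = {}         # level -> integer code; requires hashable levels
    labels = array("i")  # unboxed C ints, one per input value
    for v in values:
        if v is na:
            labels.append(NA_CODE)
            continue
        code = code_of.get(v)
        if code is None:
            # First time this level is seen: assign the next free code.
            code = len(levels)
            code_of[v] = code
            levels.append(v)
        labels.append(code)
    return levels, labels


def decode_factor(levels, labels, na=None):
    """Invert encode_factor, mapping NA_CODE back to `na`."""
    return [na if c == NA_CODE else levels[c] for c in labels]


# Levels of mixed type are fine under option (1), as long as they hash.
levels, labels = encode_factor(["a", 3, None, "a", (1, 2), 3])
```

Because the levels are looked up through a dict, values that compare and hash equal (for example 1 and 1.0) end up as the same level in this sketch.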
But I can't tell whether those arguments still apply now that we're talking about a long-term storage representation for data that has to support a variety of operations (many of which would require boxing/unboxing, since the API is in Python), might or might not use boxed types, etc. Obviously this also depends on which of the three options above we go with -- unboxing doesn't even make sense for option (1).

-n

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion