"Richard O'Keefe" <rao...@gmail.com> writes:

> Neither of those use cases actually works.
In practice, they do.

> Consider the following partial class hierarchy from my Smalltalk system:

That's not what I have to deal with in practice. My data comes in as a
CSV file or a JSON file, i.e. formats with very shallow data
classification. All I can expect to distinguish are strings vs.
numbers, plus a few well-defined subsets of each, such as integers,
specific constants (nil, true, false), or data recognizable by their
format, such as dates.

For internal processing, I might then want to redefine a column's data
type to be something more specific. In particular, I might want to
distinguish between "categorical" (a predefined finite set of string
values) vs. "generic string". But most of the classes in a Smalltalk
hierarchy just never occur in data frames. It's a simple data structure
for simply structured data.

> "You might want to compute an average..." But dataType is no use for
> that either, as I was at pains to explain. If you have a bunch of
> angles expressed as Numbers, you *can* compute an arithmetic mean of
> them, but you *shouldn't*, because that's not how you compute the
> average of circular measures.

I agree. There are many things you shouldn't do for any given dataset
but which no formal structure will prevent you from doing. Data science
is very much about shallow computations, with automation limited to
simple stuff that is then applied to large datasets. Tools like data
type analysis are no more than a help for the data scientist, who is
the ultimate arbiter of what is or is not OK to do.

Think of this as the data equivalent of a spell checker. A spell
checker won't recognize bad grammar, but it's still a useful tool to
have.

Cheers,
  Konrad.
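To make the "shallow data classification" point concrete: everything in
a CSV file arrives as a string, and all a tool can do is test each value
against a few well-defined subsets. A minimal sketch in Python (the
function name, the constant spellings, and the single date format are my
own assumptions, not anyone's actual DataFrame code):

```python
# Shallow type inference for CSV-style data: every value is a string,
# and we only distinguish a few recognizable subsets.
from datetime import datetime

# Assumed spellings for the special constants mentioned above.
CONSTANTS = {"nil": None, "true": True, "false": False}

def infer(value):
    """Return a (type-name, parsed-value) pair for one raw CSV field."""
    if value in CONSTANTS:
        return ("constant", CONSTANTS[value])
    try:
        return ("integer", int(value))
    except ValueError:
        pass
    try:
        return ("number", float(value))
    except ValueError:
        pass
    try:  # one illustrative date format; real data needs more
        return ("date", datetime.strptime(value, "%Y-%m-%d").date())
    except ValueError:
        pass
    return ("string", value)

for raw in ["42", "3.5", "true", "2024-01-31", "hello"]:
    print(raw, "->", infer(raw)[0])
```

Redefining a column as "categorical" is then just a second pass: check
that every value of a string column belongs to a predefined finite set.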
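And to make the circular-measures point concrete: for two compass
headings straddling north, the arithmetic mean points due south, while
the circular mean (angle of the averaged unit vectors) gives the sensible
answer. A small Python sketch of the standard construction:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def circular_mean_deg(angles_deg):
    """Mean direction in degrees: average the unit vectors for each
    angle, then take the direction of the resulting vector."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

headings = [350, 10]                  # two headings straddling north
print(arithmetic_mean(headings))      # 180.0 -- due south, nonsense
print(circular_mean_deg(headings))    # ~0 (or ~360), i.e. north
```

Nothing in a Number-valued column stops you from calling the first
function; only the data scientist knows the second one is the right one.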