"Richard O'Keefe" <rao...@gmail.com> writes:

> Neither of those use cases actually works.

In practice, they do.

> Consider the following partial class hierarchy from my Smalltalk system:

That's not what I have to deal with in practice. My data comes in as a
CSV file or a JSON file, i.e. formats with very shallow data
classification. All I can expect to distinguish are strings vs. numbers,
plus a few well-defined subsets of each, such as integers, specific
constants (nil, true, false), or data recognizable by their format, such
as dates.
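The kind of shallow classification described above could be sketched roughly as follows (an illustrative Python sketch, not code from any actual data frame library; the labels, the constant spellings, and the single date format are all assumptions):

```python
# Illustrative sketch: inferring a shallow data type for CSV fields,
# distinguishing only strings vs. numbers, plus a few well-defined
# subsets (integers, constants, dates recognizable by format).
from datetime import datetime

# Assumed spellings for the well-known constants mentioned in the text.
CONSTANTS = {"nil", "true", "false"}

def infer_value(text):
    """Classify a single CSV field into a shallow type label."""
    if text in CONSTANTS:
        return "constant"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        # One assumed date format; real tools would try several.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "string"

def infer_column(values):
    """A column's type is the label shared by all its values;
    mixed columns fall back to generic 'string'."""
    labels = {infer_value(v) for v in values}
    return labels.pop() if len(labels) == 1 else "string"
```

A column of `["1", "2", "3"]` would come out as `integer`, `["2024-01-05"]` as `date`, and anything mixed as plain `string`.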

For internal processing, I might then want to redefine a column's data
type to be something more specific. In particular, I might want to
distinguish between "categorical" (a predefined finite set of string
values) vs. "generic string". But most of the classes in a Smalltalk
hierarchy just never occur in data frames. It's a simple data structure
for simply structured data.
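Redefining a column as "categorical" amounts to attaching a predefined finite value set and rejecting anything outside it. A minimal sketch of that idea (hypothetical class, not any library's API):

```python
# Hypothetical sketch: a column narrowed from "generic string" to
# "categorical", i.e. restricted to a predefined finite set of values.
class CategoricalColumn:
    def __init__(self, values, categories=None):
        # If no category set is given, derive it from the observed values.
        self.categories = set(categories) if categories else set(values)
        bad = [v for v in values if v not in self.categories]
        if bad:
            raise ValueError(f"values outside category set: {bad}")
        self.values = list(values)

col = CategoricalColumn(["red", "green", "red"])
# col.categories is now the finite set {"red", "green"}
```

In pandas, for comparison, the analogous step would be converting a column with `astype("category")`.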

> "You might want to compute an average..."  But dataType is no use for
> that either, as I was at pains to explain.  If you have a bunch of
> angles expressed as Numbers, you *can* compute an arithmetic mean of
> them, but you *shouldn't*, because that's not how you compute the
> average of circular measures.

I agree. There are many things you shouldn't do for any given dataset
but which no formal structure will prevent you from doing. Data science
is very much about shallow computations, with automation limited to
simple stuff that is then applied to large datasets. Tools like data
type analysis are no more than a help for the data scientist, who is the
ultimate arbiter of what is or is not OK to do.
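The circular-measures point above is easy to demonstrate: the arithmetic mean of 350° and 10° is 180°, while the standard circular mean (average the unit vectors, then take the angle) gives the intuitively correct 0°. A short sketch:

```python
import math

def arithmetic_mean(angles):
    """Naive mean: well-typed for Numbers, wrong for angles near the wrap."""
    return sum(angles) / len(angles)

def circular_mean(angles):
    """Standard circular mean: average sin and cos, recover the angle."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

angles = [math.radians(350), math.radians(10)]
# arithmetic_mean(angles) is pi (180 degrees);
# circular_mean(angles) is 0.0 (0 degrees).
```

No data type check will catch this: both computations are perfectly legal on a column of numbers.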

Think of this as the data equivalent of a spell checker. A spell checker
won't recognize bad grammar, but it's still a useful tool to have.

Cheers,
  Konrad.
