Neither of those use cases actually works.
Consider the following partial class hierarchy from my Smalltalk system:
Object
 VectorSpace
  Complex
  Quaternion
 Magnitude
  MagnitudeWithAddition
   DateAndTime
   QuasiArithmetic
    Duration
    Number
     AbstractRationalNumber
      Integer
       SmallInteger

There is a whole fleet of "numeric" things like Matrix3x3 which have
some arithmetic properties
but which cannot be given a total order consistent with those
properties.  Complex is one of them.
It makes less than no sense to make Complex inherit from Magnitude, so
it cannot inherit from Number.  This means that the common superclass
of 1 and 1 - 2 i is Object.  Yet it makes perfect sense to have a
column of Gaussian integers, some of which have zero imaginary part.
So "the dataType is Object means there's an error" fails at the first
hurdle.  Conversely, the
common superclass of 1 and DateAndTime now is MagnitudeWithAddition,
which is not Object,
but the combination is probably wrong, and the dataType test fails at
the second hurdle.
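
To make the two hurdles concrete, here is a rough sketch of the
common-superclass computation that a dataType-style check boils down
to.  commonSuperclassOf is just an ad-hoc block for illustration, not
an existing method, and it assumes 2 i answers a Complex in this
system:

    "Walk up from a's class until we reach a class that b is a kind of."
    commonSuperclassOf := [:a :b |
        | c |
        c := a class.
        [b isKindOf: c] whileFalse: [c := c superclass].
        c].
    commonSuperclassOf value: 1 value: 1 - 2 i.
        "==> Object, but the column is perfectly sensible"
    commonSuperclassOf value: 1 value: DateAndTime now.
        "==> MagnitudeWithAddition, not Object, but probably wrong"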

"You might want to compute an average..."  But dataType is no use for
that either, as I was at
pains to explain.  If you have a bunch of angles expressed as Numbers,
you *can* compute an
arithmetic mean of them, but you *shouldn't*, because that's not how
you compute the
average of circular measures.  The obvious algorithm (self sum / self
size) does not work at
all for a collection of DateAndTimes, but the notion of average makes
perfect sense and a
subtly different algorithm works well.  (I wrote a technical report
about this, if anyone is interested.)
dataType will tell you you CAN take an average when you cannot or should not.
dataType will tell you you CAN'T take an average when you really honestly can.
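
In case it helps, here is a minimal sketch of the two cases.  The
circular mean below is the standard resultant-vector recipe, not
necessarily the algorithm from the report, and dateAndTimeMean just
averages offsets from an arbitrary origin; both are ad-hoc blocks,
not library methods:

    "Circular mean of angles in radians: average the unit vectors,
     then take the angle of the resultant."
    circularMean := [:angles |
        | s c |
        s := (angles collect: [:a | a sin]) sum / angles size.
        c := (angles collect: [:a | a cos]) sum / angles size.
        s arcTan: c].

    "Mean of DateAndTimes: self sum / self size fails, but averaging
     the Durations from an arbitrary origin works."
    dateAndTimeMean := [:dates |
        | origin |
        origin := dates anyOne.
        origin + ((dates
            inject: 0 seconds
            into: [:acc :d | acc + (d - origin)]) / dates size)].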

The distinctions we need to make are not the distinctions that the
class hierarchy makes.

For example, how about the distinction between *ordered* factors and
*unordered* factors?



On Mon, 9 Aug 2021 at 03:03, Konrad Hinsen <konrad.hin...@fastmail.net> wrote:
>
> "Richard O'Keefe" <rao...@gmail.com> writes:
>
> > My difficulty is that  from a statistics/data science perspective,
> > it doesn't seem terribly *useful*.
>
> There are two common use cases in my experience:
>
> 1) Error checking, most frequently right after reading in a dataset.
>    A quick look at the data types of all columns shows if it is coherent
>    with your expectations. If you have a column called "date" of data
>    type "Object", then most probably something went wrong with parsing
>    some date format.
>
> 2) Type checking for specific operations. For example, you might want to
>    compute an average over all rows for each numerical column in your
>    dataset.  That's easiest to do by selecting columns of the right data
>    type.
>
> You are completely right that data type information is not sufficient
> for checking for all possible problems, such as unit mismatch. But it
> remains a useful tool.
>
> Cheers,
>   Konrad.
