I am not quite sure what the point of the datatypes feature is.

"x ends up as the most specific class shared by all non-nil elements."
x := nil.
aSequence do: [:each |
  each ifNotNil: [
    x := x ifNil: [each class] ifNotNil: [x commonSuperclassWith: each class]]].

doesn't seem terribly complicated.
My difficulty is that, from a statistics/data science perspective,
it doesn't seem terribly *useful*.

I'm currently reading a book about geostatistics with R (based on a survey of
the Kola Peninsula).  For that task, it is ESSENTIAL to know the units in which
the items are recorded.  If Calcium is measured in mg/kg and Caesium is measured
in µg/kg, you really really need to know that.  This is not information you can
derive by looking at the representation of the data in Pharo.  Consider, for
example:
1.  mass of animals in kg
2.  maximum speed of cars in km/h
3.  volume of rain on successive dates, in mL (for fixed area)
4.  directions taken by sand-hoppers released at different times of day, in degrees
5.  region of space illuminated by light bulbs, in steradians.
These might all have the *same* representation in Pharo, but they are
*semantically* very different.  1 and 2 are linear, but cannot be negative.
3 also cannot be negative, but the variable is a *time series*, which 1 and 2
are not.  4 is a circular measure, and taking the usual arithmetic mean or
median would be an elementary blunder producing meaningless answers.  5 is
perhaps best viewed as a proportion.
(These are all actual examples, by the way.)
THIS kind of information IS valuable for analysis.  The difference between
SmallInteger and Float64 is nowhere near as interesting.
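
To make the point about item 4 concrete, here is a rough Playground sketch.
The two angles are made up for illustration, and the circular mean shown
(average the unit vectors, then take the angle of the result) is just one
common way of summarising directions, not something the DataFrame project
provides as far as I know.

```smalltalk
"Made-up headings in degrees: one just west of north, one east of north."
| degrees radians sumSin sumCos naiveMean circularMean |
degrees := #(350 30).

"Arithmetic mean: points almost due south, which is nonsense here."
naiveMean := (degrees inject: 0 into: [:acc :d | acc + d]) / degrees size.

"Circular mean: sum the unit vectors and take the angle of the sum."
radians := degrees collect: [:d | d degreesToRadians].
sumSin := radians inject: 0 into: [:acc :r | acc + r sin].
sumCos := radians inject: 0 into: [:acc :r | acc + r cos].
circularMean := (sumSin arcTan: sumCos) radiansToDegrees \\ 360.
```

Here naiveMean is 190 while circularMean comes out at about 10 degrees,
which is the sensible answer for two headings either side of north.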

There's a bunch of weather data that I'm very interested in which has things
like air temperature, soil temperature, relative humidity, wind speed and
direction (last 5 minutes), gust speed and direction (maximum in last
5 minutes), illumination in W/m^2 (visible, UVB, UVA), rainfall, and of course
date+time.
Temperatures are measured on an interval scale, so dividing them makes no sense.
Nor does adding them.  If it's 10C today and 10C tomorrow, nothing is 20C.  But
oddly enough arithmetic means DO make sense.
Humidity is bounded between 0 and 100; adding two relative humidities makes no
sense at all.  Medians make sense but means do not.
Wind speed and direction are reported as separate variables,
but they are arguably one 2D vector quantity.
Illumination is on a ratio scale.  Dividing one illumination by another makes
sense, or would if there were no moonless nights...
The total illumination over a day makes sense.
Rainfall is also on a ratio scale.  Dividing the rainfall on one day by that
on another would make sense if only the usual measurement were not 0.
Total rainfall over a day makes sense.
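
As a sketch of what treating wind as one 2D vector quantity means in
practice: the readings below are invented, the speed units are arbitrary, and
I am assuming the convention that 0 degrees is north with angles increasing
clockwise.  The real data may well use a different convention.

```smalltalk
"Invented readings, written as speed -> direction in degrees."
| readings east north meanEast meanNorth meanSpeed meanDirection |
readings := { 4 -> 0. 3 -> 90 }.

"Resolve each reading into east and north components."
east := readings collect: [:r | r key * r value degreesToRadians sin].
north := readings collect: [:r | r key * r value degreesToRadians cos].

"Average the components, not the speeds and directions separately."
meanEast := (east inject: 0 into: [:a :b | a + b]) / readings size.
meanNorth := (north inject: 0 into: [:a :b | a + b]) / readings size.

"Convert the mean vector back to a speed and a direction."
meanSpeed := ((meanEast * meanEast) + (meanNorth * meanNorth)) sqrt.
meanDirection := (meanEast arcTan: meanNorth) radiansToDegrees \\ 360.
```

The mean vector has speed about 2.5 and direction about 36.9 degrees, whereas
averaging the two columns separately would give 3.5 and 45.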

The whole problem a statistician/data scientist faces is that there is
important information, needed even to know which *basic* operations make
sense, that has already disappeared by the time Pharo stores the data and
cannot be inferred from the DataFrame.  I remember one time I was given a CSV
file with about 50 variables and it took me about 2 weeks to recover this
missing meta-information.
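
Purely as an illustration of the kind of meta-information I mean, here is a
hypothetical sketch.  The column names, the keys #unit and #scale, the 'mm'
unit for rainfall and the lists of operations are all my own inventions for
this example; nothing like this exists in the DataFrame project as far as I
know.

```smalltalk
"Hypothetical per-column metadata that the stored values cannot supply."
| columnInfo meaningful scaleOf |
columnInfo := Dictionary new.
columnInfo at: #airTemperature put: (Dictionary newFrom: { #unit -> '°C'. #scale -> #interval }).
columnInfo at: #rainfall put: (Dictionary newFrom: { #unit -> 'mm'. #scale -> #ratio }).
columnInfo at: #windDirection put: (Dictionary newFrom: { #unit -> 'degrees'. #scale -> #circular }).

"Assumed lookup of which operations make sense on each kind of scale."
meaningful := Dictionary newFrom: {
  #interval -> #(#mean #median #difference).
  #ratio -> #(#mean #median #difference #sum #ratio).
  #circular -> #(#circularMean) }.

"Helper block: the scale recorded for a column."
scaleOf := [:column | (columnInfo at: column) at: #scale].

"Dividing temperatures is rejected; totalling rainfall is allowed."
(meaningful at: (scaleOf value: #airTemperature)) includes: #ratio.
(meaningful at: (scaleOf value: #rainfall)) includes: #sum.
```

The first query answers false and the second true, which is exactly the
distinction that a report of SmallInteger versus Float cannot make.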

On Sat, 7 Aug 2021 at 04:23, Balaji G <gbalaji20002...@gmail.com> wrote:
>
> Hello Everyone,
>
> I have been working on the addition of a new feature, DataFrame >> dataTypes, 
> which briefs us about the data type of columns in dataframes we work on.
> Summarising a dataset is really important during the initial stage of any 
> Data Science and Machine Learning tasks. Knowing the data type of the 
> attribute is one major thing to begin with.
> I have tried to work with some sample datasets for a clear understanding of 
> this new feature.
> Please go through the following blog post. Any kind of suggestion or feedback 
> is welcome.
>
> Link to the Post : 
> https://balaji612141526.wordpress.com/2021/08/06/introducing-new-feature-in-dataframe-project-datatypes/
>
> Previous discussions can be found here : 
> https://lists.pharo.org/empathy/thread/BFOHPRUU72MDYVTJP3YV2DQ5LAZHXELE and 
> here :   
> https://lists.pharo.org/empathy/thread/JZXKXGHSURC3DCDA2NXA7KDWZ2EINAZ5
>
>
>
> Cheers
> Balaji G
