Hi Serge,

As I said, I do not really have the time right now to get involved in
a GSoC proposal, but I can give you my perspective. There are two
sides to the story.

The first one is complementary to SciSmalltalk: in order to analyze
data, you need to get data in first. So, one may want to read, say,
a CSV, and have a number of heuristics, such as:

- autodetection of encoding
- autodetection of quotes and delimiter
- autodetection of columns containing numbers or dates
- a way to indicate that some markers, such as "N/A", represent
missing values
- a way to indicate a replacement for missing values, such as 0, or
"", or the average or the minimum of the other values in the
column

See http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#csv-text-files
for some examples.
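
For reference, a pandas call exercising most of these heuristics
looks roughly like this (the file name, column names and NA markers
are made up for the example):

    import pandas as pd

    frame = pd.read_csv(
        "measurements.csv",
        encoding="utf-8",           # ideally autodetected by the framework
        sep=None, engine="python",  # sep=None lets pandas sniff the delimiter
        na_values=["N/A", "?"],     # markers that represent missing values
        parse_dates=["date"],       # columns to interpret as dates
    )

    # replace missing values, e.g. with 0 or with the column average
    frame["value"] = frame["value"].fillna(frame["value"].mean())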

It may be worth considering making this a sequence that is read and
processed lazily, to deal with CSV files bigger than memory.
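
In pandas this is done with chunked reading; a minimal sketch (file
and column names again made up):

    import pandas as pd

    # process the file in chunks of 100000 rows instead of loading it whole
    total = 0.0
    for chunk in pd.read_csv("measurements.csv", chunksize=100000):
        total += chunk["value"].sum()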

Once the data is finally in, the first task is usually some
processing, inspection or visualization. The Smalltalk collections are
good for processing (although some lazy variants might help), and
Roassal and the inspectors are perfect for visualization and browsing.
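
To make the point about laziness concrete, here is a rough Python
sketch (the "value" column is an assumption): each stage yields rows
on demand, so nothing is materialized until the final sum, whereas
eager select:/collect: chains build an intermediate collection at
every step.

    import csv

    with open("measurements.csv", newline="") as f:
        rows = csv.DictReader(f)                                  # streams rows lazily
        values = (float(r["value"]) for r in rows if r["value"])  # skip empty cells
        print(sum(values))                                        # one row in memory at a time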

The second part comes when one wants to run some algorithms. While
there is no need to have the fanciest ones, some of the basics should
be there, such as (see the sketch after this list):

- some form of regression (linear, logistic...)
- some form of clustering (k-means, DBSCAN, canopy...)
- SVM
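
Again purely as a reference point, the scikit-learn API for these is
essentially the following (tiny synthetic data, only to show the
shape of the calls):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # a tiny made-up dataset: two obvious groups
    X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]])
    y = np.array([0, 0, 1, 1])

    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # clustering
    logistic = LogisticRegression().fit(X, y)                  # regression
    svm = SVC(kernel="rbf").fit(X, y)                          # SVM

    print(clusters, logistic.predict([[0.05, 0.1]]), svm.predict([[3.1, 3.0]]))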

Another thing that would be useful is support for linear algebra,
leveraging native libraries such as BLAS or LAPACK.
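
This is what numpy gets essentially for free by delegating to
whatever BLAS/LAPACK it was built against, for instance:

    import numpy as np

    A = np.random.rand(500, 500)
    b = np.random.rand(500)

    x = np.linalg.solve(A, b)        # LAPACK (gesv)
    w, v = np.linalg.eigh(A @ A.T)   # LAPACK, symmetric eigenproblem
    C = A @ A                        # BLAS (gemm)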

In short: just copying R, or numpy + pandas + scikit-learn, would
already be a giant leap forward.

Actually, some of the things I have mentioned above are already (I
think) in SciSmalltalk, which brings me to the next point:
documentation. There is really no point in having all these tools if
people do not know they are there.

For this to become useful, there should be a dedicated site,
highlighting what is already available, in what state (experimental,
partial, stable...) and how to use it.

Ideally, I would also include some tutorials, for instance for
dealing with standard problems such as Kaggle competitions. Here I
think Smalltalk would have an edge, since these tutorials could be in
the form of Prof Stef. Still, it would be nice if some form of the
tutorials was also on the web, which would make them discoverable.

Best,
Andrea

2015-02-18 11:14 GMT+01:00 Serge Stinckwich <serge.stinckw...@gmail.com>:
> On Wed, Feb 18, 2015 at 11:01 AM, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>> OK, try making a proposal then, http://gsoc.pharo.org has the instructions 
>> and the current list, you probably know more about data science than I do.
>>
>>> On 18 Feb 2015, at 10:53, Andrea Ferretti <ferrettiand...@gmail.com> wrote:
>>>
>>> I am sorry if the previous messages came off as too harsh. The Neo
>>> tools are perfectly fine for their intended use.
>>>
>>> What I was trying to say is that a good idea for a SoC project would
>>> be to develop a framework for data analysis that would be useful for
>>> data scientists, and in particular this would include something to
>>> import unstructured data more freely.
>
> Sorry Andrea. I didn't see your message because I'm not on the
> pharo-users mailing-list, only on pharo-dev.
> I'm also really interested in having a GSoC project to develop a
> data analysis framework.
> Let's talk together to discuss a proposal.
>
> Regards,
> --
> Serge Stinckwich
> UCBN & UMI UMMISCO 209 (IRD/UPMC)
> Every DSL ends up being Smalltalk
> http://www.doesnotunderstand.org/
>
