17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev

Oleksandr Zaytsev Tue, 16 May 2017 10:04:11 -0700

I would love to, but to go to Lille from my country I would need a visa,
which is not that easy to acquire.
So maybe I will come to PharoDays 2018.
And I will definitely try to come to the ESUG Conference in September.


Oleks

On Tue, May 16, 2017 at 7:26 PM, <[email protected]> wrote:

>
>
> Sent from my iPhone
>
> On 11 May 2017 at 11:43, "[email protected]" <[email protected]> wrote:
>
> ---------- Forwarded message ----------
> From: "[email protected]" <[email protected]>
> Date: 11 May 2017 10:54
> Subject: Re: 11/05/17 - Tabular Data Structures for Data Analysis -
> Oleksandr Zaytsev
> To: "Nick Papoylias" <[email protected]>
> Cc:
>
>
>
> On Thu, May 11, 2017 at 10:20 AM, Nick Papoylias <[email protected]>
> wrote:
>
>>
>>
>> On Thu, May 11, 2017 at 5:24 AM, Oleksandr Zaytsev <[email protected]
>> > wrote:
>>
>>>
>>> *A. Work done*
>>>
>>>    - Downloaded the threaded VM, as suggested by Esteban Lorenzano, to
>>>    make Iceberg work. And it does! I have successfully pushed my
>>>    NeuralNetwork code to GitHub: https://github.com/olekscode/MLNeuralNetwork
>>>    - Joined the PolyMath organization on GitHub
>>>    - Created a repository for the TabularDataset project,
>>>    https://github.com/PolyMathOrg/TabularDataset, as part of the
>>>    PolyMath organization on GitHub
>>>    - Fixed PolyMath issue #25 and made a PR
>>>    - Read an article from the Wolfram Mathematica documentation regarding
>>>    Dataset. It was one of the reading suggestions sent to me by Nick
>>>    Papoylias
>>>
>>>
>>> *B. Next steps*
>>>
>>>    - Fix more PolyMath issues using Iceberg. I have to get used to
>>>    it before the coding phase starts
>>>    - Read the rest of Nick Papoylias's suggestions
>>>
>>>
>>> *C. Help needed*
>>>
>>>    - The Dataset in Wolfram, as well as Pandas in Python, has a very
>>>    advanced indexing system. Smalltalk has its own conventions for
>>>    indexing, so I think it would be great if I got familiar with them.
>>>    Could you suggest some reading on this topic (what are the indexing
>>>    conventions in Smalltalk?).
>>>    For example, in Wolfram, I can write *dataset[[-1]]* to extract the
>>>    last row. But in Pharo indexes cannot be negative. In Pharo I would say
>>>    *dataset last*. But how about *dataset[[-5]]*?
>>>
>>> This would be a good exercise for you ;) In Pharo you can easily add
>> negative indexing yourself.
>>
>> *Hint:* You know the index of the last element, since this is the size
>> of the collection, so... ;)
>>
>> No need for changes, this exists already.
>
> Use atWrap: index put: value and atWrap: index with negative indexes.
> 'hello' atWrap: -2
>
> There is a specific version for Array using a primitive.
> #( 10 20 30 40 ) atWrap: -1
>
> atWrap: 0 gives you the last item.
> atWrap: -1 gives 30.
>
> This is different from 0-based index languages.
>
> The interesting thing about atWrap: is that it uses modulo internally, so
> you do not need to care about that.
>
> ($/ split: 'abc/def/ghi/jkl') atWrap: -1
> --> 'ghi'
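
For readers who do not have a Pharo image at hand, here is a minimal sketch of the same wrap-around rule, written in Python purely for illustration. The helper name at_wrap is made up, and it assumes 1-based indexes like atWrap: in the examples above; it is not how Pharo implements the method.

```python
def at_wrap(items, index):
    """Return the element at a 1-based index, wrapping around both ends."""
    if not items:
        raise IndexError("cannot index an empty collection")
    # Shift to 0-based, wrap with modulo, then index normally.
    # Python's % always returns a non-negative result for a positive divisor.
    return items[(index - 1) % len(items)]


numbers = [10, 20, 30, 40]
print(at_wrap(numbers, 1))    # 10, the first element
print(at_wrap(numbers, 0))    # 40, wraps to the last element
print(at_wrap(numbers, -1))   # 30, one before the last
print(at_wrap("abc/def/ghi/jkl".split("/"), -1))  # ghi
```

Under this rule index 0 plays the role that -1 plays in 0-based languages, which is the difference pointed out above.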
>
> The Matrix class has a bunch of things API wise but the class is highly
> inefficient, doing copies all the time etc. It would be nice to have some
> kind of futures/copy on write style things in there.
>
> I miss cbind and rbind. These are useful. I have some half baked super
> inefficient implementations of these things for Matrix.
>
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html
>
> The ability to name columns is also nice to have.
>
> In R one does:
>
> df <- data.frame(c(1,2,3))
> df <- cbind(df, c(4,5,6))
> df <- cbind(df, c(7,8,9))
> names(df) <- c("C1", "C2", "C3")
>
> The names can be read back with:
>
> names(df)
>
> A Smalltalkish style would be welcome.
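
To make the cbind/rbind idea concrete, here is a rough sketch in Python (chosen only for illustration, not a proposal for the Pharo API). It assumes a table is just a dictionary from column name to a list of values; the function names mirror the R ones but the signatures are made up, with the column name passed at bind time instead of being assigned afterwards.

```python
def cbind(table, name, values):
    """Return a new table (dict of column name -> list) with one more column."""
    if table and len(values) != len(next(iter(table.values()))):
        raise ValueError("column length does not match the existing rows")
    if name in table:
        raise ValueError(f"duplicate column name: {name}")
    # Copy the existing columns so the input table is left untouched.
    result = {key: list(column) for key, column in table.items()}
    result[name] = list(values)
    return result


def rbind(table, row):
    """Return a new table with one row (dict of column name -> value) appended."""
    if set(row) != set(table):
        raise ValueError("row keys must match the column names")
    return {key: list(column) + [row[key]] for key, column in table.items()}


df = cbind({}, "C1", [1, 2, 3])
df = cbind(df, "C2", [4, 5, 6])
df = rbind(df, {"C1": 10, "C2": 20})
print(list(df))   # column names, in insertion order
print(df["C2"])
```

Both helpers copy on every call, so they have the same inefficiency mentioned above for Matrix; the point is only the shape of the operations.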
>
>
>
>
> Interesting! Are you coming to PharoDays? We can talk about that if we
> find time.
>
> Maybe looking at the Voyage queries can be helpful.
>
> Phil
>
>
>
>> Try adding an extension method to OrderedCollection or
>> SequenceableCollection.
>>
>> If the Pharo by Example chapter or the MOOC is not enough, read the source
>> itself in the core to see how basic methods are implemented (it is less
>> scary than it sounds).
>>
>> You can also try Chapters 9, 10, 11 of the blue book (some API changes
>> may apply):
>>
>> http://sdmeta.gforge.inria.fr/FreeBooks/BlueBook/Bluebook.pdf
>>
>>
>>>    - Or what is the best way of implementing this indexing:
>>>    *dataset[["name"]]* (extracts a named row), *dataset[[1]]* (extracts
>>>    the first row)? Should I create two separate messages: *dataset
>>>    rowNamed: 'name'* and *dataset rowAt: 1*?
>>>
>>> rowNamed:
> rowAt:
>
> Yes, looks like it.
>
> But if we want to model things like R data frames, for example, this has to
> be seen as a vectorized operation, so you can use row slices, column
> slices, and logical indexes.
>
> Check this out:
>
> http://www.r-tutor.com/r-introduction/data-frame/data-frame-row-slice
> https://www.r-bloggers.com/working-with-data-frames/
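
The three kinds of vectorized access mentioned here (row slices, column slices, logical indexes) can be sketched as follows. This is Python for illustration only; the function names and the sample data are invented, and rows are assumed to be plain dictionaries.

```python
def row_slice(rows, first, last):
    """Rows first..last, 1-based and inclusive."""
    if first < 1 or last < first:
        raise IndexError(f"invalid row range: {first}..{last}")
    return rows[first - 1:last]


def column_slice(rows, names):
    """Keep only the named columns in every row."""
    return [{name: row[name] for name in names} for row in rows]


def select(rows, mask):
    """Logical indexing: keep the rows whose mask entry is true."""
    if len(mask) != len(rows):
        raise ValueError("mask must have one entry per row")
    return [row for row, keep in zip(rows, mask) if keep]


# Made-up sample data, only to exercise the helpers.
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.7},
]
mask = [row["score"] > 0.6 for row in rows]

print(row_slice(rows, 2, 3))
print(column_slice(rows, ["id"]))
print(select(rows, mask))
```

The mask is computed from a whole column at once and then applied in one step, which is the vectorized style the R links describe.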
>
>
>
>> The internal representation of your data-structure can be anything at the
>> moment, *as long as you encapsulate it.*
>>
>> (i.e. it can be nested OrderedCollections with metadata mapping column
>> names to indexes, or a dictionary of collections, etc.).
>>
>> *If you don't expose it to the user* (i.e. return it from the public API,
>> or expect knowledge of it in argument passing),
>> we can easily change it later. So *first make it work, and we optimize
>> later ;)*
>>
>> For your case it will be a little bit trickier because *you also have
>> the notions of a) rows and b) columns*, which
>> are exposed to the user. So *you would need to create abstractions* for
>> these too.
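
One way to read the encapsulation advice, sketched in Python for illustration only: the class and method names below (TabularSketch, row_at, row_named, column_named) are hypothetical, loosely following the rowAt: and rowNamed: messages discussed earlier, and the list-of-lists storage is just one of the possible internal representations.

```python
class TabularSketch:
    """Hypothetical table that hides its storage behind row and column accessors."""

    def __init__(self, column_names, rows, row_names=None):
        # Private storage: callers never receive these lists directly,
        # so the representation can change later without breaking them.
        self._column_names = list(column_names)
        self._rows = [list(row) for row in rows]
        self._row_names = list(row_names) if row_names is not None else []

    def row_at(self, index):
        """Row at a 1-based position, returned as an immutable copy."""
        if not 1 <= index <= len(self._rows):
            raise IndexError(f"row index out of range: {index}")
        return tuple(self._rows[index - 1])

    def row_named(self, name):
        """Row with the given name; raises ValueError for an unknown name."""
        return self.row_at(self._row_names.index(name) + 1)

    def column_named(self, name):
        """All values of one column, returned as an immutable copy."""
        position = self._column_names.index(name)
        return tuple(row[position] for row in self._rows)


table = TabularSketch(
    ["height", "weight"],
    [[180, 75], [165, 60]],
    row_names=["ann", "bob"],
)
print(table.row_at(1))
print(table.row_named("bob"))
print(table.column_named("weight"))
```

Rows and columns come back as tuples here; a fuller design would probably return dedicated row and column objects, which is the extra abstraction mentioned above.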
>>
>> Cheers,
>>
>> Nick
>>
>>>
>>>
>>>
>>> If someone else is having problems with Iceberg on Linux, try
>>> downloading the threaded VM:
>>>
>>> wget -O- get.pharo.org/vmT60 | bash
>>>
>>> And use SSH (not HTTPS) remote URL.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Pharo Google Summer of Code" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/pharo-gsoc/CAEp0Uzu-8fw3dA6ezVoj-QptvLcB8cWPHvZ1tfLg1Ce8qkTqfQ%40mail.gmail.com
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>
>

Re: [Pharo-users] Fwd: Re: 11/05/17 - Tabular Data Structures for Data Analysis - Oleksandr Zaytsev
