[julia-users] Re: Are dataframes the best way to manipulate data?

Shahbaz Chaudhary Tue, 23 Feb 2016 18:49:41 -0800

After following links by Milan and Tomas, I think I can rephrase the 
discussion. I like what the Apache Spark folks are doing. They have a basic 
data structure similar to data frames. Much like relational algebra, they 
are not stuffing features into the same data structure. They are making it 
more compositional so it can be combined in various ways to increase 
expressiveness. Basically good API design. I've noticed functional 
programmers are good at such design, whereas object-oriented programmers, 
or programmers with no computer science training, often don't appreciate 
compositionality. I'm glad to see LINQ and F# mentioned since these two were 
designed by folks with deep backgrounds in programming language theory (Erik 
Meijer and Don Syme, respectively).
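
To make "compositional" concrete, here is a minimal sketch in Python (chosen only for illustration). The operator names `where`, `project`, `order_by` and `pipeline` are hypothetical and not taken from Spark, LINQ or any Julia package. Each operator does one thing and returns a function over rows, so they can be combined freely.

```python
from functools import reduce

# Each operator is a small function that returns a rows -> rows transformation.
def where(predicate):
    return lambda rows: [r for r in rows if predicate(r)]

def project(*columns):
    return lambda rows: [{c: r[c] for c in columns} for r in rows]

def order_by(column):
    return lambda rows: sorted(rows, key=lambda r: r[column])

def pipeline(*steps):
    # Apply the steps left to right to the input rows.
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

# Hypothetical sample data, invented for this sketch.
trades = [
    {"symbol": "B", "qty": 5, "price": 2.0},
    {"symbol": "A", "qty": 20, "price": 1.5},
    {"symbol": "C", "qty": 15, "price": 3.0},
]

big_trades = pipeline(
    where(lambda r: r["qty"] > 10),
    order_by("symbol"),
    project("symbol", "qty"),
)
result = big_trades(trades)
```

The point of the design is that new behaviour comes from combining operators rather than adding features to the table type itself.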

A project like JuliaDB isn't quite what I was looking for since I'm arguing 
for a basic, core interface which will be used by Julia's ecosystem of 
plotting libraries, statistical packages, ML libraries, etc. The data may 
come from a database or a CSV file, or even be generated on the fly.

Basically similar to R's dataframes, but with a cleaner design. I don't 
know what this design should be. However, relational algebra is not a bad 
start. By itself it isn't enough because relational algebra defines a set 
of primitives over relations, which are _unordered sets of tuples_. Stats 
package users, on the other hand, work mostly with data which IS ordered, 
is not a set (multiple identical rows of data have meaning) and does not 
fit a strict relation (missing values or N/A cause problems there). I think 
the solution is not to throw out relational algebra, but to use it as a base.
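
A minimal sketch of that distinction, in Python for illustration only. The sample rows and the `select` helper are hypothetical. It shows the same rows under set semantics, bag (multiset) semantics, and an ordered, dataframe-like list, with a relational-style selection that keeps order and duplicates.

```python
from collections import Counter

# Hypothetical sample: (name, score) rows with a duplicate and a missing value.
rows = [("a", 1), ("b", 2), ("a", 1), ("c", None)]

# Relational (set) view: duplicates collapse and order is unspecified.
as_relation = set(rows)

# Bag (multiset) view: duplicates are counted, but order is still ignored.
as_bag = Counter(rows)

# Dataframe-like view: an ordered list keeps both duplicates and position.
as_frame = list(rows)

def select(table, predicate):
    """Relational-style selection that preserves input order and duplicates."""
    return [row for row in table if predicate(row)]

# Missing values are handled here by an explicit predicate, not by the model.
non_missing = select(as_frame, lambda r: r[1] is not None)
```

Here the selection primitive is unchanged from relational algebra; only the underlying container differs, which is one way to read "use it as a base".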

In case you haven't noticed, this is not a practical post. Just curious 
about how Julia folks are planning on handling basic data 
processing/representation primitives.

On Tuesday, February 23, 2016 at 6:59:48 AM UTC-6, ben wrote:
>
> Dear Shahbaz,
>
> Welcome to Julia. Various data manipulation tools will cater to various 
> needs/individuals. Sounds like you might be interested in the various 
> projects developed by the JuliaDB organization:
> https://github.com/JuliaDB
> Other people need or like to use DataFrames. In my work I often find that 
> native arrays and .csv reading and writing do the job.
>
> All the best,
>
> Ben
>
> On Monday, February 22, 2016 at 10:55:05 AM UTC-5, Shahbaz Chaudhary wrote:
>>
>> I'm pretty new to Julia and only have marginally more experience with R 
>> so please excuse me if I don't understand something basic.
>>
>> According to Julia's website, the API/syntax for manipulating data 
>> has not been finalized yet, although the momentum seems to be moving 
>> towards a dataframe-style API.
>>
>> Since Julia is still a new language, doesn't it make sense to base the 
>> model on something closer to the relational algebra/sql/list 
>> comprehensions? I realize these three are not synonyms for each other, but 
>> relational algebra is supposed to have a more rigorous mathematical 
>> foundation in building the primitives used to manipulate data. SQL now has 
>> decades of use and has unarguably democratized data manipulation (I've seen 
>> lawyers and traders use sql, who would never use a full blown programming 
>> language). 
>>
>> At least R's dataframes feel extremely clunky, although I'll admit that I 
>> may be missing something fundamental since Julia/Spark/Pandas seem to be 
>> adopting this model instead of the relational model.
>>
>> A language built from the ground up to process datasets should have a 
>> more intuitive syntax.
>>
>> One potential issue is that relational algebra/sql don't handle ordered 
>> data well. I don't know enough about recent advances but surely an 
>> extension to relational primitives is more sound than adapting dataframes.
>>
>> Frankly I'm curious to learn from the more experienced here what I'm not 
>> understanding about dataframes and why they are so popular.
>>
>>
>>
