Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-05 Thread Ista Zahn
On Fri, May 5, 2017 at 1:00 PM, Antonin Klima  wrote:
> Dear Sir or Madam,
>
> I am in the 2nd year of my PhD in bioinformatics, after taking my
> Master's in computer science, and have been using R heavily during my
> PhD. As such, I have put together a list of features in R that, in my
> opinion, would be beneficial to add, or could be improved. The first two
> are already implemented in packages, but given that they are implemented
> as user-defined operators, this greatly restricts their usefulness.

Why do you think being implemented in a contributed package restricts
the usefulness of a feature?

> I hope you will find my suggestions interesting. If you find time, I
> will welcome any feedback as to whether you find the suggestions
> useful, or why you do not think they should be implemented. I will
> also welcome it if you enlighten me with any features I might be
> unaware of that might solve the issues I have pointed out below.
>
> 1) piping
> Currently available in package magrittr, piping makes the code more
> readable by having the line start at its natural starting point, and
> following with the functions that are applied, in order. The readability
> of several nested calls with a number of parameters each is almost zero;
> it's almost as if one needed to come up with the solution oneself. A
> pipeline, in comparison, is very straightforward, especially together
> with point (2).

You may be surprised to learn that not everyone thinks pipes are a
good idea. Personally I see some advantages, but there is also a big
downside, which is that they mess up the call stack and make tracking
down errors via traceback() more difficult.

There is a simple alternative to pipes already built into R that
gives you some of the advantages of %>% without messing up the call
stack.  Using Hadley's famous "little bunny foo foo" example:

foo_foo <- little_bunny()

## nesting (it is rough)
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)

## magrittr
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

## regular R assignment
foo_foo -> .
hop(., through = forest) -> .
scoop(., up = field_mice) -> .
bop(., on = head)

This is more limited than magrittr's %>%, but it gives you a lot of
the advantages without the disadvantages.
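
For instance, with base functions, a runnable sketch of the same idiom:

c(3, 1, 2) -> .
sort(.) -> .
head(., 2) -> .
.
## [1] 1 2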

>
> The package works rather well nevertheless; the shortcomings of piping
> not being native are not quite as severe as in point (2). Still, an
> intuitive symbol such as | would be helpful, and it sometimes bothers me
> that I have to parenthesize anonymous functions, which would probably
> not be required with a native pipe operator, much as it is not required
> in e.g. lapply. That is,
> 1:5 %>% function(x) x+2
> should be totally fine.

That seems pretty small-potatoes to me.

>
> 2) currying
> Currently available in package Curry. The idea is that, given a function
> such as foo = function(x, y) x+y, one would like to write, for example,
> lapply(1:5, foo(3)), and have the interpreter figure out that foo(3)
> does not produce a value result, but can still give a function result:
> a function of y. This would be most useful for the various apply
> functions, rather than writing function(x) foo(3,x).

You can already do

lapply(1:5, foo, y = 3)

(this fixes foo's "y" argument at 3, while each element of 1:5 is
passed as the first argument, x)
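
A runnable sketch, with foo as defined in your message:

foo <- function(x, y) x + y
lapply(1:5, foo, y = 3)    # extra arguments travel through '...'
## same as: lapply(1:5, function(x) foo(x, y = 3))
## both return a list holding 4, 5, 6, 7, 8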

I'm stopping here since I don't have anything useful to say about your
subsequent points.

Best,
Ista

>
> I suggest that currying would make the code easier to write and more
> readable, especially when using apply functions. One might imagine that
> there could be some confusion with such a feature, especially among
> people unfamiliar with functional programming, although R already treats
> functions as first-class values, so it could be just fine. One could
> also address it with special syntax, such as $foo(3) [or $foo(x=3)] for
> partial application. The current currying package has very limited
> usefulness: being limited by the user-defined operator framework, it can
> only rarely contribute to less code/more readability. Compare for
> yourself:
> $foo(x=3) vs foo %<% 3
> and, given goo = function(a,b,c):
> $goo(b=3) vs goo %><% list(b=3)
>
> Moreover, one would often like currying to have the highest priority.
> For example, when piping:
> data %>% foo %>% foo1 %<% 3
> if one wants to do data %>% foo %>% $foo1(x=3)
>
> 3) Code executable only when running the script itself
> Whereas the first two suggestions borrow from Haskell and the like, this
> suggestion borrows from Python. I'm building quite a complicated
> pipeline, using S4 classes. After defining the class and its methods, I
> also define how to build the class to my liking, based on my input data,
> using the now-defined methods. So I end up having a list of command-line
> arguments to process, and the way to create the class instance based on
> them. If I write it in the class file, however, I end up running the
> code whenever the file is sourced from the next step in the pipeline,
> which needs the previous class definitions.
>
> A feature such as the pythonic "if __name__ == __main__" would thus be
> useful. As it is, I had to create run scripts as separate files. Which
> is actually not so terrible, given the class and its methods often span
> a few hundred lines, but still.
>
> 4) non-exported global variables
> I also find it lacking that I seem unable to create constants that would
> not get passed to files that source the class definition. That is, if
> class1 features the global constant CONSTANT=3, then if class2 sources
> class1, it will also include the constant. This 1) clutters the
> namespace when running the code interactively,
Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-05 Thread Gabor Grothendieck
Regarding the anonymous-function-in-a-pipeline point, one can already
do the following, which does use braces, but even so it involves fewer
characters than the example shown.  Here { . * 2 } is basically a
lambda whose argument is the dot. Would this be sufficient?

  library(magrittr)

  1.5 %>% { . * 2 }
  ## [1] 3

Regarding currying note that with magrittr Ista's code could be written as:

  1:5 %>% lapply(foo, y = 3)

or at the expense of slightly more verbosity:

  1:5 %>% Map(f = . %>% foo(y = 3))
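
Note that . %>% foo(y = 3) is itself a function (a magrittr "functional
sequence"), so it can be saved and reused; a quick check, with foo as in
Ista's reply:

  foo <- function(x, y) x + y
  add3 <- . %>% foo(y = 3)
  add3(10)
  ## [1] 13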


On Fri, May 5, 2017 at 1:00 PM, Antonin Klima  wrote:
> [...]

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-08 Thread Antonin Klima
Thanks for the answers,

I'm aware of the '.' option, just wanted to give a very simple example.

But the lapply '...' argument had eluded me; thanks for enlightening me.

What do you mean by messing up the call stack? As far as I understand it,
piping should translate into the same code as deep nesting. So I see only
a tiny downside for debugging here, and no loss of time/space efficiency
or anything. Your example, by contrast, carries a chance of inadvertent
error, coming from the fact that a variable is being reused and no one
now checks for me that it is being passed between the lines, and it
requires specifying the variable every single time. For me, that solution
is clearly inferior.
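
For instance (a hypothetical slip with the dot-reassignment idiom):

c(3, 1, 2) -> .
sort(.) -> .
## head(., 2) -> .   # this line accidentally dropped; nothing complains
.
## [1] 1 2 3   (silently the full vector, not the intended 1 2)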

Too bad you didn’t find my other comments interesting though.

> Why do you think being implemented in a contributed package restricts
> the usefulness of a feature?

I guess it depends on your philosophy. It may not restrict it per se,
although it would make a lot of sense to me to reuse the bash-style '|'
and have a shorter, more readable version. One has an extra dependence on
a package for an item that fits the language so well that it should be
part of it. It is without doubt my most used operator, at least. Going
through some of my folders, I found 101 uses in 750 lines, and 132 uses
in 3303 lines. I would compare it to a computer game that is really good
with a fan-created mod, but lacking otherwise. :)

So to me, it makes sense that if there is no doubt that a feature
improves the language, and especially if people already use it
extensively through a package, it should be part of the "standard". The
question is whether it is indeed very popular, and whether you share my
view. But that's now up to you, I just wanted to point it out, I guess.

Best Regards,
Antonin

> On 05 May 2017, at 22:33, Gabor Grothendieck  wrote:
> 
> [...]
>
> On Fri, May 5, 2017 at 1:00 PM, Antonin Klima  wrote:
>> [...]

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-08 Thread Ista Zahn
On Mon, May 8, 2017 at 8:08 AM, Antonin Klima  wrote:
> Thanks for the answers,
>
> I'm aware of the '.' option, just wanted to give a very simple example.
>
> But the lapply '...' argument had eluded me; thanks for enlightening me.
>
> What do you mean by messing up the call stack? As far as I understand
> it, piping should translate into the same code as deep nesting.

Perhaps, but then magrittr is not really a pipe. Here is a simple example:

library(magrittr)
data.frame(x = 1) %>%
  subset(y == 1)
traceback()

Error in eval(e, x, parent.frame()) : object 'y' not found
12: eval(e, x, parent.frame())
11: eval(e, x, parent.frame())
10: subset.data.frame(., y == 1)
9: subset(., y == 1)
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: data.frame(x = 1) %>% subset(y == 1)

subset(data.frame(x = 1),
       y == 1)
traceback()

Error in eval(e, x, parent.frame()) : object 'y' not found
4: eval(e, x, parent.frame())
3: eval(e, x, parent.frame())
2: subset.data.frame(data.frame(x = 1), y == 1)
1: subset(data.frame(x = 1), y == 1)

It does pollute the call stack, making debugging harder.

> So I see only a tiny downside for debugging here, and no loss of
> time/space efficiency or anything. Your example, by contrast, carries a
> chance of inadvertent error, coming from the fact that a variable is
> being reused and no one now checks for me that it is being passed
> between the lines, and it requires specifying the variable every single
> time. For me, that solution is clearly inferior.

There are tradeoffs. As demonstrated above, the pipe is clearly
inferior in that it is doing a lot of complicated stuff under the
hood, and when you try to traceback() through the call stack you have
to sift through all that complicated stuff. That's a pretty big
drawback in my opinion.

>
> Too bad you didn’t find my other comments interesting though.

I did not say that.

>
>> Why do you think being implemented in a contributed package restricts
>> the usefulness of a feature?
>
> I guess it depends on your philosophy. It may not restrict it per se,
> although it would make a lot of sense to me to reuse the bash-style '|'
> and have a shorter, more readable version. One has an extra dependence
> on a package for an item that fits the language so well that it should
> be part of it. It is without doubt my most used operator, at least.
> Going through some of my folders, I found 101 uses in 750 lines, and 132
> uses in 3303 lines. I would compare it to a computer game that is really
> good with a fan-created mod, but lacking otherwise. :)

One of the key strengths of R is that packages are not akin to
"fan-created mods". They are a central and necessary part of the R system.

>
> So to me, it makes sense that if there is no doubt that a feature
> improves the language, and especially if people already use it
> extensively through a package, it should be part of the "standard". The
> question is whether it is indeed very popular, and whether you share my
> view. But that's now up to you, I just wanted to point it out, I guess.

>
> Best Regards,
> Antonin
>
>> On 05 May 2017, at 22:33, Gabor Grothendieck  wrote:
>>
>> [...]
>>
>> On Fri, May 5, 2017 at 1:00 PM, Antonin Klima  wrote:
>>> [...]

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-08 Thread Hadley Wickham
> There are tradeoffs. As demonstrated above, the pipe is clearly
> inferior in that it is doing a lot of complicated stuff under the
> hood, and when you try to traceback() through the call stack you have
> to sift through all that complicated stuff. That's a pretty big
> drawback in my opinion.

To be precise, that is a problem with the current implementation of
the pipe. It's not a limitation of the pipe per se.
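
To illustrate the distinction, here is a minimal sketch of a pipe that
splices the left-hand side into the right-hand call before evaluating
it, so traceback() stays shallow. This is an illustration only (neither
magrittr's implementation nor a concrete proposal) and assumes the
right-hand side is written as a call:

  `%|>%` <- function(lhs, rhs) {
    rhs <- substitute(rhs)
    ## build, e.g., subset(data.frame(x = 1), y == 1) and evaluate it
    call <- as.call(c(rhs[[1L]], substitute(lhs), as.list(rhs)[-1L]))
    eval(call, envir = parent.frame())
  }
  data.frame(x = 1) %|>% subset(y == 1)
  ## on error, traceback() shows roughly half the frames of magrittr's %>%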

Hadley

-- 
http://hadley.nz



Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Hilmar Berger

Hi,

On 08/05/17 16:37, Ista Zahn wrote:

> One of the key strengths of R is that packages are not akin to
> "fan-created mods". They are a central and necessary part of the R system.

I would tend to disagree here. R packages are in their majority not
maintained by the core R developers. Concepts, features and lifetime
depend mainly on the maintainers of a package (even though in theory the
GPL allows somebody to take over anytime). Several packages that are
critical for processing big data and providing "modern" visualizations
introduce concepts quite different from the legacy S/R language. I do
feel that, in a way, current core R shows its origin in S strongly, while
modern concepts (e.g. data.table, dplyr, ggplot, ...) are often only
available via extension packages. This is fine if one considers R to be
a statistical toolkit; as a programming language, however, it introduces
inconsistencies and uncertainties which could be avoided if some of the
"modern" parts (including language concepts) were more integrated into
core R.


Best regards,
Hilmar

--
Dr. Hilmar Berger, MD
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

Phone:  +49 30 28460 430
Fax:    +49 30 28460 401
E-Mail: ber...@mpiib-berlin.mpg.de
Web:    www.mpiib-berlin.mpg.de


Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Joris Meys
On Tue, May 9, 2017 at 9:47 AM, Hilmar Berger  wrote:

> Hi,
>
> On 08/05/17 16:37, Ista Zahn wrote:
>
>> One of the key strengths of R is that packages are not akin to
>> "fan-created mods". They are a central and necessary part of the R
>> system.
>
> I would tend to disagree here. R packages are in their majority not
> maintained by the core R developers. Concepts, features and lifetime
> depend mainly on the maintainers of a package (even though in theory the
> GPL allows somebody to take over anytime). Several packages that are
> critical for processing big data and providing "modern" visualizations
> introduce concepts quite different from the legacy S/R language. I do
> feel that, in a way, current core R shows its origin in S strongly,
> while modern concepts (e.g. data.table, dplyr, ggplot, ...) are often
> only available via extension packages. This is fine if one considers R
> to be a statistical toolkit; as a programming language, however, it
> introduces inconsistencies and uncertainties which could be avoided if
> some of the "modern" parts (including language concepts) were more
> integrated into core R.
>
> Best regards,
> Hilmar
>

And I would tend to disagree here. R is built upon the paradigm of a
functional programming language, and falls in the same group as Clojure,
Haskell and the like. It is a Turing-complete programming language in its
own right. That's quite a bit more than "a statistical toolkit". You can
say that about e.g. the macro language of SPSS, but not about R.

Second, there's little "modern" about the ideas behind the tidyverse.
Piping is about as old as Unix itself. The grammar of graphics, on which
ggplot is based, stems from the SYSTAT graphics system of the nineties.
Hadley and colleagues did (and do) a great job implementing these ideas
in R, but the ideas do have a respectable age.

Third, there's a lot of nonstandard evaluation going on in all these
packages. Using them inside your own functions requires serious
attention (e.g. the difference between aes() and aes_() in ggplot2).
Actually, even though I definitely see the merits of these packages in
data analysis, the tidyverse feels like a (clean and powerful) macro
language on top of R. And that's good, but that doesn't mean these parts
are essential to transform R into a programming language. Rather the
contrary, actually: relying too heavily on these packages complicates
things when you start to develop your own packages in R.
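
As a sketch of the kind of attention I mean, using ggplot2's current
API (aes_() with as.name() is one documented way to program over
aesthetics; the wrapper function is of course hypothetical):

  library(ggplot2)
  ## aes(x, y) would capture the symbols literally inside a function;
  ## aes_() lets you pass variable names in programmatically
  scatter <- function(df, xvar, yvar) {
    ggplot(df, aes_(x = as.name(xvar), y = as.name(yvar))) + geom_point()
  }
  scatter(mtcars, "wt", "mpg")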

Fourth, the tidyverse masks quite a few native R functions. Obviously
they took great care to keep the functionality as close as one would
expect, but that's not always possible. The lag() function of dplyr
masks an S3 generic from the stats package, for example. So if you work
with time series from the stats package, loading the tidyverse gives you
trouble.

Fifth, many of the tidyverse packages are at version 0.x.y: they're
still in beta development and their functionality might (and will)
change. Functions disappear, arguments are renamed, tags change, ...
Often the changes improve the packages, but they have broken older code
for me more than once. You can't expect the R core team to incorporate
something that is bound to change.

Last but not least, the tidyverse actually sometimes works against new R
users, at least those who go beyond the classic data workflow. I
literally rewrote some code (from a consultant) that abused the _ply
functions to create nested loops. Removing all that stuff and rewriting
the code with a simple list in combination with a simple for loop,
roughly as sketched below, sped up the code by a factor of 150. That has
nothing to do with dplyr itself, which is very fast; it has everything
to do with that person having a hammer and thinking everything he sees
is a nail. The tidyverse is no reason not to learn the concepts of the
language it's built upon.
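
A hypothetical reconstruction of the rewrite (not the consultant's
actual code): results are collected in a pre-allocated list by a plain
for loop and bound together once at the end:

  groups <- split(mtcars, mtcars$cyl)
  out <- vector("list", length(groups))    # pre-allocate the list
  for (i in seq_along(groups)) {
    g <- groups[[i]]
    out[[i]] <- data.frame(cyl = g$cyl[1], mean_mpg = mean(g$mpg))
  }
  result <- do.call(rbind, out)            # bind once, at the end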

The one thing I would like to see, though, is the adaptation of the
statistical toolkit so that it can work with data.table and tibble
objects directly, as opposed to having to convert to a data.frame once
you start building models. And I believe that eventually there will be a
replacement for the data.frame that increases R's performance and
lessens its memory footprint.

So all in all, I do admire the tidyverse and how it speeds up data
preparation for analysis. But the tidyverse is a powerful data toolkit,
not a programming language. And it won't turn R into a programming
language either, because R already is one.

Cheers
Joris




-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Lionel Henry
> Third, there's a lot of nonstandard evaluation going on in all these
> packages. Using them inside your own functions requires serious
> attention (e.g. the difference between aes() and aes_() in ggplot2).
> Actually, even though I definitely see the merits of these packages in
> data analysis, the tidyverse feels like a (clean and powerful) macro
> language on top of R.

That is going to change, as we have put a lot of effort into learning
how to deal with capturing functions. See the tidyeval framework, which
will enable full and flexible programmability of tidyverse grammars.

That said I agree that data analysis and package programming often
require different sets of tools.

Lionel



Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Hilmar Berger


On 09/05/17 11:22, Joris Meys wrote:
>
> On Tue, May 9, 2017 at 9:47 AM, Hilmar Berger  wrote:
>> [...]
>
> And I would tend to disagree here. R is built upon the paradigm of a
> functional programming language, and falls in the same group as Clojure,
> Haskell and the like. It is a Turing-complete programming language in
> its own right. That's quite a bit more than "a statistical toolkit". You
> can say that about e.g. the macro language of SPSS, but not about R.
>
>
My point was that inconsistencies are harder to tolerate when using R as 
a programming language as opposed to a toolkit that just has to do a job.
> Second, there's little "modern" about the ideas behind the tidyverse.
> Piping is about as old as Unix itself. The grammar of graphics, on
> which ggplot is based, stems from the SYSTAT graphics system of the
> nineties. Hadley and colleagues did (and do) a great job implementing
> these ideas in R, but the ideas do have a respectable age.

Those ideas still seem more modern than, e.g., stock R graphics,
designed probably in the seventies or eighties. The stock graphics
still do their job for lots and lots of applications; however, the fact
that many newer packages use ggplot instead of plot() forces users to
learn and use different paradigms for things as simple as drawing a
line.

I would also like to make clear that I do not advocate including the
whole tidyverse in core R. I just believe that having core concepts
well supported in core R, instead of implemented in a package, might
make things more consistent. E.g. method chaining ("%>%") is a core
language feature in many languages.
>
> The one thing I would like to see, though, is the adaptation of the
> statistical toolkit so that it can work with data.table and tibble
> objects directly, as opposed to having to convert to a data.frame once
> you start building models. And I believe that eventually there will be
> a replacement for the data.frame that increases R's performance and
> lessens its memory footprint.
>
Which is a perfect example of what I mean: improved functionality should
find its way into core R at some point, replacing or extending outdated
functionality. Otherwise, I don't know how hard it will be to develop
21st-century methods on top of a 1980s/90s language core, although I
admit that the R developers are doing a great job of making it possible.

Best,
Hilmar
