Re: [R] Trying to understand the magic of lm (Still trying)

2019-05-13 Thread Sorkin, John
Terry,


Thank you. Many years ago I took a course you taught in which you explained how 
to conduct survival analyses using S. The course was very useful, as was the 
email you sent me today. If you find a place where you can store your lecture 
notes, please send me the URL.


I believe there is a great need for someone to explain not just how to write a
package, but more generally how to write a function that checks the arguments
passed to it, that returns output telling its users how the function was
called, etc. While these steps are needed when one writes a package, they
should be taught as a matter of good coding practice whenever anyone writes a
function that will be used more than once. Many years ago, when I was a
mainframe system programmer, it was de rigueur that one learned (and used)
certain standards about saving registers at the beginning of a function and
restoring them at the end of the function. The same should be true for all R
functions; certain standardized, well-described steps should be considered a
part of writing any function. The problem, at least from my perspective, is
that there is no commonly recognized document that explains the steps clearly.
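
As a rough illustration of the kind of steps meant here, the sketch below
checks its arguments, records the call, and prints that call with the result.
The function name trimmed_summary, its arguments, and the print method are all
made up for this example; it is one possible pattern, not an established
standard.

```r
# Hypothetical example; names and checks are illustrative assumptions only.
trimmed_summary <- function(x, trim = 0.1, type = c("mean", "median")) {
  Call <- match.call()       # record how the function was called
  type <- match.arg(type)    # restrict 'type' to the listed choices
  if (!is.numeric(x)) stop("'x' must be numeric")
  if (!is.numeric(trim) || length(trim) != 1 || trim < 0 || trim >= 0.5)
    stop("'trim' must be a single number in [0, 0.5)")

  value <- switch(type,
                  mean   = mean(x, trim = trim, na.rm = TRUE),
                  median = median(x, na.rm = TRUE))
  # Return the call along with the result so printed output can show it.
  structure(list(call = Call, type = type, value = value),
            class = "trimmed_summary")
}

print.trimmed_summary <- function(x, ...) {
  cat("Call:\n")
  print(x$call)
  cat("\n", x$type, ": ", format(x$value), "\n", sep = "")
  invisible(x)
}

fit <- trimmed_summary(c(1, 2, 3, 4, 100), trim = 0.2)
fit
```

Printing fit shows the call first, much as printing an lm object does.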


Thank you,

John


John David Sorkin M.D., Ph.D.
Professor of Medicine
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology and Geriatric 
Medicine
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)





Re: [R] Trying to understand the magic of lm (Still trying)

2019-05-13 Thread Therneau, Terry M., Ph.D. via R-help
John,

  The text below is cut out of a "how to write a package" course I gave at the
R conference in Vanderbilt.   I need to find a home for the course notes,
because it had a lot of tidbits that are not well explained in the R
documentation.
Terry T.



Model frames:
One of the first tasks of any modeling routine is to construct a special data
frame containing the covariates that will be used, via a call to the
model.frame function. The code to do this is found in many routines, and can be
a little opaque on first view. The obvious code would be
\begin{verbatim}
coxph <- function(formula, data, weights, subset, na.action,
     init, control, ties= c("efron", "breslow", "exact"),
     singular.ok =TRUE, robust=FALSE,
     model=FALSE, x=FALSE, y=TRUE,  tt, method=ties, ...) {

  mf <- model.frame(formula, data, subset, weights, na.action)
\end{verbatim}
since those are the coxph arguments that are passed forward to the model.frame
routine.  However, this simple approach will fail with a ``not found'' error
message if any of the data, subset, weights, etc. arguments are missing.
Programs have to take the slightly more complicated approach of constructing a
call.
\begin{verbatim}
Call <- match.call()
indx <- match(c("formula", "data", "weights", "subset", "na.action"),
   names(Call), nomatch=0)
if (indx[1] ==0) stop("A formula argument is required")
temp <- Call[c(1,indx)]  # only keep the arguments we wanted
temp[[1]] <- as.name('model.frame')  # change the function called
mf <- eval(temp, parent.frame())

Y <- model.response(mf)
etc.
\end{verbatim}

We start with a copy of the call to the program, which we want to save anyway
as documentation in the output object. Then subscripting is used to extract
only the portions of the call that we want, saving the result in a temporary
variable. This is based on the fact that a call object can be viewed as a list
whose first element is the name of the function to call, followed by the
arguments to the call. Note the use of \code{nomatch=0}; if any arguments on
the list are missing, match returns 0 for them, a zero subscript selects
nothing, and so they will simply be missing in \code{temp}, without generating
an error message. The \code{temp} variable will contain an object of type
``call'', which is an unevaluated call to a routine.  Finally, the name of the
function to be called is changed from ``coxph'' to ``model.frame'' and the call
is evaluated.  In many of the core routines the result is stored in a variable
``m''.  This is a horribly short and non-descriptive name. (The above used mf,
which isn't much better.)  Many routines also use ``m'' for the temporary
variable, leading to \code{m <- eval(m, parent.frame())}, but I think that is
unnecessarily confusing.
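
To see the steps above run end to end, here is a minimal sketch using a toy
routine. The name toyfit, its extra argument, and the shape of the returned
list are assumptions made for this illustration only; the body follows the
match.call pattern described above and is not the actual coxph code.

```r
# 'toyfit' is a hypothetical function used only to illustrate the idiom.
toyfit <- function(formula, data, weights, subset, na.action, extra = 1, ...) {
  Call <- match.call()
  indx <- match(c("formula", "data", "weights", "subset", "na.action"),
                names(Call), nomatch = 0)
  if (indx[1] == 0) stop("A formula argument is required")
  temp <- Call[c(1, indx)]             # only keep the arguments we wanted
  temp[[1]] <- as.name("model.frame")  # change the function called
  mf <- eval(temp, parent.frame())     # evaluate in the caller's frame

  # Return the saved call, the rewritten call, and pieces of the model frame.
  list(call = Call, frame_call = temp,
       y = model.response(mf), w = model.weights(mf), n = nrow(mf))
}

d <- data.frame(y = c(5, 4, 3, 2, 1), x = 1:5, w = c(1, 1, 2, 2, 3))

# 'weights' and 'subset' are given as bare expressions evaluated inside 'd';
# 'extra' is not a model.frame argument and is dropped from the rewritten call.
fit <- toyfit(y ~ x, data = d, weights = w, subset = x > 1, extra = 7)
fit$frame_call
fit$n

# Omitting weights and subset causes no error: they are simply absent.
fit2 <- toyfit(y ~ x, data = d)
is.null(fit2$w)
```

Looking at fit$frame_call is a quick way to check which arguments were
forwarded to model.frame.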

The list of names in the match call will include all arguments that should be
evaluated within the context of the named data frame. This can include more
than the list above; the survfit routine, for instance, has an optional
argument ``id'' that names an identifying variable (several rows of the data
may represent a single subject), and this is included along with ``formula''
etc. in the list of choices in the match function.  The order of names in the
list makes no difference.  The id is later retrieved with
\code{model.extract(m, 'id')}, which will be NULL if the argument was not
supplied. At the time that coxph was written I had not caught on to this fact
and thought that all variables that came from a data frame had to be
represented in the formula somehow, thus the use of \code{cluster(id)} as part
of the formula, in order to denote a grouping variable.
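
The ``id'' mechanism can be tried out with a small sketch. The function
toysurv and the data below are hypothetical and are not taken from the survival
package; the only point is that an extra argument added to the match list is
evaluated inside the data frame and can be pulled back out with model.extract.

```r
# 'toysurv' is a hypothetical function; it only shows how an extra argument
# such as 'id' travels through model.frame.
toysurv <- function(formula, data, weights, subset, na.action, id) {
  Call <- match.call()
  indx <- match(c("formula", "data", "weights", "subset", "na.action", "id"),
                names(Call), nomatch = 0)
  if (indx[1] == 0) stop("A formula argument is required")
  temp <- Call[c(1, indx)]
  temp[[1]] <- as.name("model.frame")
  mf <- eval(temp, parent.frame())
  model.extract(mf, "id")   # NULL when 'id' was not supplied
}

# Several rows may belong to one subject; 'subject' is not in the formula.
d <- data.frame(time = c(3, 5, 2, 8), x = c(0, 1, 0, 1),
                subject = c(1, 1, 2, 3))

ids    <- toysurv(time ~ x, data = d, id = subject)
no_ids <- toysurv(time ~ x, data = d)
ids
is.null(no_ids)
```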

On 5/11/19 5:00 AM, r-help-requ...@r-project.org wrote:
> A number of people have helped me in my mission to understand how lm (and 
> other functions) are able to pass a dataframe and then refer to a specific 
> column in the dataframe. I thank everyone who has responded. I now know a bit 
> about deparse(substitute(xx)), but I still don't fully understand how it 
> works. The program below attempts to print a column of a dataframe from a 
> function whose parameters include the dataframe (df) and the column requested 
> (col). The program works fine until the last print statement, where I receive 
> an error,  Error in `[.data.frame`(df, , col) : object 'y' not found . I hope 
> someone can explain to me (1) why my code does not work, and (2) what I can 
> do to fix it.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Trying to understand the magic of lm (Still trying)

2019-05-10 Thread David Winsemius



On 5/10/19 12:53 PM, Sorkin, John wrote:

A number of people have helped me in my mission to understand how lm (and other 
functions) are able to pass a dataframe and then refer to a specific column in 
the dataframe. I thank everyone who has responded. I now know a bit about 
deparse(substitute(xx)), but I still don't fully understand how it works. The 
program below attempts to print a column of a dataframe from a function whose 
parameters include the dataframe (df) and the column requested (col). The 
program works fine until the last print statement, where I receive an error,  
Error in `[.data.frame`(df, , col) : object 'y' not found . I hope someone can 
explain to me (1) why my code does not work, and (2) what I can do to fix it.


Many thanks to everyone who tries to help lost souls like me!


Thank you,

John


data <- data.frame(x=c(1,2,3,4,5),y=c(5,4,3,2,1))
data

doit <- function(df,col){
   dfx <- deparse(substitute(df))
   colx<- deparse(substitute(col))

   cat("results of deparse substitute\n")
   print(colx)
   print (dfx)

   cat("I can print the columns using column relative reference\n")
   print(df[,1])
   print(df[,2])

   cat("I can print the entire data frame \n")
   print(df)

   cat("I can print a single column from the dataframe using a column name\n")

   # Try instead:
   print(df[ , colx])  # colx will be a character value, ... there is no `y`-object

   # print(df[,col])
}

doit(data,y)
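
For anyone who wants to try the fix in isolation, here is a minimal sketch of
two ways to look up a column passed as a bare name. The helper names get_col
and get_col_eval are made up for this illustration. The first uses the
deparse(substitute()) approach suggested above; the second evaluates the
unevaluated argument with the data frame as its context, which is closer in
spirit to what model.frame does for lm.

```r
# Hypothetical helpers; the names are assumptions for illustration only.
dat <- data.frame(x = c(1, 2, 3, 4, 5), y = c(5, 4, 3, 2, 1))

# 1. Turn the unevaluated argument into a string and index by name.
get_col <- function(df, col) {
  colx <- deparse(substitute(col))  # "y"; 'col' itself is never evaluated
  if (!colx %in% names(df)) stop("no column named '", colx, "'")
  df[[colx]]
}

# 2. Evaluate the unevaluated expression using the data frame as the context,
#    falling back to the caller's environment for anything not in 'df'.
get_col_eval <- function(df, col) {
  eval(substitute(col), df, parent.frame())
}

get_col(dat, y)
get_col_eval(dat, y)
get_col_eval(dat, x + y)  # expressions work too, since they are evaluated in 'dat'
```

The original print(df[,col]) fails because evaluating col forces R to look for
an object named y in the calling environment, where none exists.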



[R] Trying to understand the magic of lm (Still trying)

2019-05-10 Thread Sorkin, John
A number of people have helped me in my mission to understand how lm (and other 
functions) are able to pass a dataframe and then refer to a specific column in 
the dataframe. I thank everyone who has responded. I now know a bit about 
deparse(substitute(xx)), but I still don't fully understand how it works. The 
program below attempts to print a column of a dataframe from a function whose 
parameters include the dataframe (df) and the column requested (col). The 
program works fine until the last print statement, where I receive an error,  
Error in `[.data.frame`(df, , col) : object 'y' not found . I hope someone can 
explain to me (1) why my code does not work, and (2) what I can do to fix it.


Many thanks to everyone who tries to help lost souls like me!


Thank you,

John


data <- data.frame(x=c(1,2,3,4,5),y=c(5,4,3,2,1))
data

doit <- function(df,col){
  dfx <- deparse(substitute(df))
  colx<- deparse(substitute(col))

  cat("results of deparse substitute")
  print(colx)
  print (dfx)

  cat("I can print the columns using column relative reference\n")
  print(df[,1])
  print(df[,2])

  cat("I can print the entire data frame \n")
  print(df)

  cat("I can print a single column from the dataframe using a column name\n")
  print(df[,col])
}

doit(data,y)
