[R] debugging R code and dealing with dependencies

2014-12-25 Thread Mike Miller
I just wanted to put this out there.  It's just some of my observations 
about things that happen with R, or happened in this particular 
investigation.  There were definitely some lessons for me in this, and 
maybe that will be true of someone else.  The main thing I picked up is 
that it is good to put plenty of checks into our code -- if we expect 
input of a certain type or class, then I should either coerce input into 
that structure or test the input and throw an error.  If the function 
works very differently for different kinds of input, this should be 
documented.  The more people are doing this, the better things will go for 
everyone.



I was working with a CRAN package called RFGLS...

http://cran.r-project.org/web/packages/RFGLS/index.html

...and I was getting an error.  After a few rounds of testing I realized 
that the error was caused by a FAMID variable that was of character type.


The problem seemed to be that gls.batch() expected FAMID to be integers, 
but the default ought to be character type because family and individual 
IDs in nearly all genetic-analysis software are character strings (they 
might even be people's names).


This was the error:

Error in sum(blocksize) : invalid 'type' (character) of argument
Calls: gls.batch - bdsmatrix

To figure out more about it, I spent a bunch of time to go from CMD BATCH 
mode to an interactive session so that I could look at traceback().  That 
got me this additional info:



traceback()

2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id))

bdsmatrix() is from a package on which RFGLS depends:

http://cran.r-project.org/web/packages/bdsmatrix/index.html

The problem is that RFGLS's gls.batch() function is sending something to 
bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for 
bdsmatrix() and I see this:


 if (any(blocksize = 0))
 stop(Block sizes must be 0)
 if (any(as.integer(blocksize) != blocksize))
 stop(Block sizes must be integers)
 n1 - as.integer(sum(blocksize))

The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE) 
only if blocksize contains one or more noninteger numeric values.  It 
doesn't fail if blocksize is character or logical if the character strings 
are integers.  Example:



4==4

[1] TRUE

That's an interesting feature of R, but I guess that's how it works. 
Also this:



1==1

[1] TRUE

1==TRUE

[1] TRUE

1==TRUE

[1] FALSE

bdsmatrix() has no test that blocksize is numeric, so it fails when 
sum(blocksize) cannot sum character strings.


Next I had to figure out where RFGLS's gls.batch() is going wrong in 
producing sizelist.  It is created in a number of steps, but I identified 
this line as especially suspicious:


test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)

famsize was later converted to sizelist, and this line also includes 
FAMID, so this is likely where the problem originates.  Of course this is 
the big problem with debugging -- it's hard to find the source of an error 
that occurs far downstream in another function from a different package. 
I see that ave() is used, so I have to understand ave().


William Dunlap provided some guidance:

ave() uses its first argument, 'x', to set the length of its output and 
to make an initial guess at the type of its output.  The return value of 
FUN can alter the type, but only in an 'upward' direction where 
logicalintegernumericcomplexcharacterlist.  (This is the same rule 
that x[i]-newvalue uses.)


In other words, if x is of character type, the output cannot be of integer 
or numeric type even if the output of FUN is always of integer or numeric 
type.  Looking at the ave() code, I can understand that choice:


function (x, ..., FUN = mean)
{
if (missing(...))
x[] - FUN(x)
else {
g - interaction(...)
split(x, g) - lapply(split(x, g), FUN)
}
x
}

If the factor is missing an element, then the corresponding element of X 
is not changed in the output:



fact - gl(2,2)
fact[3] - NA
fact

[1] 11NA 2
Levels: 1 2

ave(1:4, fact)

[1] 1.5 1.5 3.0 4.0

That's a reasonable plan, but it isn't the documented functioning of 
ave().  From the document...


https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html

...you get next to nothing about what the function actually does.  It does 
say that x is a numeric, but the function does not throw an error when x 
is not numeric.  So if someone writes code expecting numeric x, but a user 
provides a non-numeric x, there may be trouble.


I suspect that the programmer saw that the code worked in her examples and 
she went on to other things.  I can't blame the documentation for that, 
but it is possible that if it said something about the relation between 
the type of the input and the type of the output she might have written it 
differently.  In addition, I probably would have caught it sooner and I 
would have 

Re: [R] debugging R code and dealing with dependencies

2014-12-25 Thread Uwe Ligges
This is a rather detailed analysis, thanks, but I think it should be 
send to the maintainer of the RFGLS package (CCing).


Best,
Uwe Ligges


On 25.12.2014 10:04, Mike Miller wrote:

I just wanted to put this out there.  It's just some of my observations
about things that happen with R, or happened in this particular
investigation.  There were definitely some lessons for me in this, and
maybe that will be true of someone else.  The main thing I picked up is
that it is good to put plenty of checks into our code -- if we expect
input of a certain type or class, then I should either coerce input into
that structure or test the input and throw an error.  If the function
works very differently for different kinds of input, this should be
documented.  The more people are doing this, the better things will go
for everyone.


I was working with a CRAN package called RFGLS...

http://cran.r-project.org/web/packages/RFGLS/index.html

...and I was getting an error.  After a few rounds of testing I realized
that the error was caused by a FAMID variable that was of character type.

The problem seemed to be that gls.batch() expected FAMID to be integers,
but the default ought to be character type because family and individual
IDs in nearly all genetic-analysis software are character strings (they
might even be people's names).

This was the error:

Error in sum(blocksize) : invalid 'type' (character) of argument
Calls: gls.batch - bdsmatrix

To figure out more about it, I spent a bunch of time to go from CMD
BATCH mode to an interactive session so that I could look at
traceback().  That got me this additional info:


traceback()

2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id))

bdsmatrix() is from a package on which RFGLS depends:

http://cran.r-project.org/web/packages/bdsmatrix/index.html

The problem is that RFGLS's gls.batch() function is sending something to
bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for
bdsmatrix() and I see this:

  if (any(blocksize = 0))
  stop(Block sizes must be 0)
  if (any(as.integer(blocksize) != blocksize))
  stop(Block sizes must be integers)
  n1 - as.integer(sum(blocksize))

The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
only if blocksize contains one or more noninteger numeric values.  It
doesn't fail if blocksize is character or logical if the character
strings are integers.  Example:


4==4

[1] TRUE

That's an interesting feature of R, but I guess that's how it works.
Also this:


1==1

[1] TRUE

1==TRUE

[1] TRUE

1==TRUE

[1] FALSE

bdsmatrix() has no test that blocksize is numeric, so it fails when
sum(blocksize) cannot sum character strings.

Next I had to figure out where RFGLS's gls.batch() is going wrong in
producing sizelist.  It is created in a number of steps, but I
identified this line as especially suspicious:

test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)


famsize was later converted to sizelist, and this line also includes
FAMID, so this is likely where the problem originates.  Of course this
is the big problem with debugging -- it's hard to find the source of an
error that occurs far downstream in another function from a different
package. I see that ave() is used, so I have to understand ave().

William Dunlap provided some guidance:

ave() uses its first argument, 'x', to set the length of its output and
to make an initial guess at the type of its output.  The return value of
FUN can alter the type, but only in an 'upward' direction where
logicalintegernumericcomplexcharacterlist.  (This is the same rule
that x[i]-newvalue uses.)

In other words, if x is of character type, the output cannot be of
integer or numeric type even if the output of FUN is always of integer
or numeric type.  Looking at the ave() code, I can understand that choice:

function (x, ..., FUN = mean)
{
 if (missing(...))
 x[] - FUN(x)
 else {
 g - interaction(...)
 split(x, g) - lapply(split(x, g), FUN)
 }
 x
}

If the factor is missing an element, then the corresponding element of X
is not changed in the output:


fact - gl(2,2)
fact[3] - NA
fact

[1] 11NA 2
Levels: 1 2

ave(1:4, fact)

[1] 1.5 1.5 3.0 4.0

That's a reasonable plan, but it isn't the documented functioning of
ave().  From the document...

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html

...you get next to nothing about what the function actually does.  It
does say that x is a numeric, but the function does not throw an error
when x is not numeric.  So if someone writes code expecting numeric x,
but a user provides a non-numeric x, there may be trouble.

I suspect that the programmer saw that the code worked in her examples
and she went on to other things.  I can't blame the documentation for
that, but it is possible that if it said something about the relation
between the type of the input 

Re: [R] debugging R code and dealing with dependencies

2014-12-25 Thread Mike Miller
Thanks, but I was already in touch with Rob Kirkpatrick about it.  We all 
work together at U Minnesota, or did until Rob went to VCU.


Mike


On Thu, 25 Dec 2014, Uwe Ligges wrote:

This is a rather detailed analysis, thanks, but I think it should be 
send to the maintainer of the RFGLS package (CCing).


Best,
Uwe Ligges


On 25.12.2014 10:04, Mike Miller wrote:

I just wanted to put this out there.  It's just some of my observations
about things that happen with R, or happened in this particular
investigation.  There were definitely some lessons for me in this, and
maybe that will be true of someone else.  The main thing I picked up is
that it is good to put plenty of checks into our code -- if we expect
input of a certain type or class, then I should either coerce input into
that structure or test the input and throw an error.  If the function
works very differently for different kinds of input, this should be
documented.  The more people are doing this, the better things will go
for everyone.


I was working with a CRAN package called RFGLS...

http://cran.r-project.org/web/packages/RFGLS/index.html

...and I was getting an error.  After a few rounds of testing I realized
that the error was caused by a FAMID variable that was of character type.

The problem seemed to be that gls.batch() expected FAMID to be integers,
but the default ought to be character type because family and individual
IDs in nearly all genetic-analysis software are character strings (they
might even be people's names).

This was the error:

Error in sum(blocksize) : invalid 'type' (character) of argument
Calls: gls.batch - bdsmatrix

To figure out more about it, I spent a bunch of time to go from CMD
BATCH mode to an interactive session so that I could look at
traceback().  That got me this additional info:


traceback()

2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id))

bdsmatrix() is from a package on which RFGLS depends:

http://cran.r-project.org/web/packages/bdsmatrix/index.html

The problem is that RFGLS's gls.batch() function is sending something to
bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for
bdsmatrix() and I see this:

  if (any(blocksize = 0))
  stop(Block sizes must be 0)
  if (any(as.integer(blocksize) != blocksize))
  stop(Block sizes must be integers)
  n1 - as.integer(sum(blocksize))

The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE)
only if blocksize contains one or more noninteger numeric values.  It
doesn't fail if blocksize is character or logical if the character
strings are integers.  Example:


4==4

[1] TRUE

That's an interesting feature of R, but I guess that's how it works.
Also this:


1==1

[1] TRUE

1==TRUE

[1] TRUE

1==TRUE

[1] FALSE

bdsmatrix() has no test that blocksize is numeric, so it fails when
sum(blocksize) cannot sum character strings.

Next I had to figure out where RFGLS's gls.batch() is going wrong in
producing sizelist.  It is created in a number of steps, but I
identified this line as especially suspicious:

test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)


famsize was later converted to sizelist, and this line also includes
FAMID, so this is likely where the problem originates.  Of course this
is the big problem with debugging -- it's hard to find the source of an
error that occurs far downstream in another function from a different
package. I see that ave() is used, so I have to understand ave().

William Dunlap provided some guidance:

ave() uses its first argument, 'x', to set the length of its output and
to make an initial guess at the type of its output.  The return value of
FUN can alter the type, but only in an 'upward' direction where
logicalintegernumericcomplexcharacterlist.  (This is the same rule
that x[i]-newvalue uses.)

In other words, if x is of character type, the output cannot be of
integer or numeric type even if the output of FUN is always of integer
or numeric type.  Looking at the ave() code, I can understand that choice:

function (x, ..., FUN = mean)
{
 if (missing(...))
 x[] - FUN(x)
 else {
 g - interaction(...)
 split(x, g) - lapply(split(x, g), FUN)
 }
 x
}

If the factor is missing an element, then the corresponding element of X
is not changed in the output:


fact - gl(2,2)
fact[3] - NA
fact

[1] 11NA 2
Levels: 1 2

ave(1:4, fact)

[1] 1.5 1.5 3.0 4.0

That's a reasonable plan, but it isn't the documented functioning of
ave().  From the document...

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ave.html

...you get next to nothing about what the function actually does.  It
does say that x is a numeric, but the function does not throw an error
when x is not numeric.  So if someone writes code expecting numeric x,
but a user provides a non-numeric x, there may be trouble.

I suspect that the programmer saw that the code worked 

Re: [R] debugging R code and dealing with dependencies

2014-12-25 Thread David Winsemius

 On Dec 25, 2014, at 1:04 AM, Mike Miller mbmille...@gmail.com wrote:
 
 I just wanted to put this out there.  It's just some of my observations about 
 things that happen with R, or happened in this particular investigation.  
 There were definitely some lessons for me in this, and maybe that will be 
 true of someone else.  The main thing I picked up is that it is good to put 
 plenty of checks into our code -- if we expect input of a certain type or 
 class, then I should either coerce input into that structure or test the 
 input and throw an error.  If the function works very differently for 
 different kinds of input, this should be documented.  The more people are 
 doing this, the better things will go for everyone.
 
 
 I was working with a CRAN package called RFGLS...
 
 http://cran.r-project.org/web/packages/RFGLS/index.html
 
 ...and I was getting an error.  After a few rounds of testing I realized that 
 the error was caused by a FAMID variable that was of character type.

But The Details section of the help page does say that the accepted FTYPES are 
all integers between 1 and 6 and the INDIV variables are integers in range 1:4.

 The problem seemed to be that gls.batch() expected FAMID to be integers, but 
 the default ought to be character type because family and individual IDs in 
 nearly all genetic-analysis software are character strings (they might even 
 be people's names).

You are making up rules that were not in accord with the documentation.

 This was the error:
 
 Error in sum(blocksize) : invalid 'type' (character) of argument
 Calls: gls.batch - bdsmatrix
 
 To figure out more about it, I spent a bunch of time to go from CMD BATCH 
 mode to an interactive session so that I could look at traceback().  

Generally the first thing to check is the help page. And if there is a worked 
example to look at its data:

 data(pedigree, package=RFGLS)
 str(pedigree)
'data.frame':   4050 obs. of  5 variables:
 $ FAMID: int  10 10 10 10 20 20 20 20 30 30 ...
 $ ID   : int  11 12 13 14 21 22 23 24 31 32 ...
 $ PID  : int  14 14 0 0 24 24 0 0 34 34 ...
 $ MID  : int  13 13 0 0 23 23 0 0 33 33 ...
 $ SEX  : num  1 1 2 1 2 2 2 1 2 2 …

— 
David.

 That got me this additional info:
 
 traceback()
 2: bdsmatrix(sizelist, lme.out$sigma@blocks, dimnames = list(id, id))
 
 bdsmatrix() is from a package on which RFGLS depends:
 
 http://cran.r-project.org/web/packages/bdsmatrix/index.html
 
 The problem is that RFGLS's gls.batch() function is sending something to 
 bdsmatrix's bdsmatrix() that it can't handle.  So I look at the code for 
 bdsmatrix() and I see this:
 
 if (any(blocksize = 0))
 stop(Block sizes must be 0)
 if (any(as.integer(blocksize) != blocksize))
 stop(Block sizes must be integers)
 n1 - as.integer(sum(blocksize))
 
 The condition any(as.integer(blocksize) != blocksize)) fails (is TRUE) only 
 if blocksize contains one or more noninteger numeric values.  It doesn't fail 
 if blocksize is character or logical if the character strings are integers.  
 Example:
 
 4==4
 [1] TRUE
 
 That's an interesting feature of R, but I guess that's how it works. Also 
 this:
 
 1==1
 [1] TRUE
 1==TRUE
 [1] TRUE
 1==TRUE
 [1] FALSE
 
 bdsmatrix() has no test that blocksize is numeric, so it fails when 
 sum(blocksize) cannot sum character strings.
 
 Next I had to figure out where RFGLS's gls.batch() is going wrong in 
 producing sizelist.  It is created in a number of steps, but I identified 
 this line as especially suspicious:
 
 test.dat$famsize[test.dat$FTYPE!=6]=ave(test.dat$FAMID[test.dat$FTYPE!=6],test.dat$FAMID[test.dat$FTYPE!=6],FUN=length)
 
 famsize was later converted to sizelist, and this line also includes FAMID, 
 so this is likely where the problem originates.  Of course this is the big 
 problem with debugging -- it's hard to find the source of an error that 
 occurs far downstream in another function from a different package. I see 
 that ave() is used, so I have to understand ave().
 
 William Dunlap provided some guidance:
 
 ave() uses its first argument, 'x', to set the length of its output and to 
 make an initial guess at the type of its output.  The return value of FUN can 
 alter the type, but only in an 'upward' direction where 
 logicalintegernumericcomplexcharacterlist.  (This is the same rule that 
 x[i]-newvalue uses.)
 
 In other words, if x is of character type, the output cannot be of integer or 
 numeric type even if the output of FUN is always of integer or numeric type.  
 Looking at the ave() code, I can understand that choice:
 
 function (x, ..., FUN = mean)
 {
if (missing(...))
x[] - FUN(x)
else {
g - interaction(...)
split(x, g) - lapply(split(x, g), FUN)
}
x
 }
 
 If the factor is missing an element, then the corresponding element of X is 
 not changed in the output:
 
 fact - gl(2,2)
 fact[3] - NA
 fact
 [1] 11NA 2
 Levels: 1 2
 ave(1:4, fact)
 [1] 1.5 1.5 3.0 4.0
 
 That's a 

Re: [R] debugging R code and dealing with dependencies

2014-12-25 Thread Mike Miller

On Thu, 25 Dec 2014, David Winsemius wrote:


On Dec 25, 2014, at 1:04 AM, Mike Miller mbmille...@gmail.com wrote:

I just wanted to put this out there.  It's just some of my observations 
about things that happen with R, or happened in this particular 
investigation.  There were definitely some lessons for me in this, and 
maybe that will be true of someone else.  The main thing I picked up is 
that it is good to put plenty of checks into our code -- if we expect 
input of a certain type or class, then I should either coerce input 
into that structure or test the input and throw an error.  If the 
function works very differently for different kinds of input, this 
should be documented.  The more people are doing this, the better 
things will go for everyone.



I was working with a CRAN package called RFGLS...

http://cran.r-project.org/web/packages/RFGLS/index.html

...and I was getting an error.  After a few rounds of testing I 
realized that the error was caused by a FAMID variable that was of 
character type.


But The Details section of the help page does say that the accepted 
FTYPES are all integers between 1 and 6 and the INDIV variables are 
integers in range 1:4.


But FAMID and FTYPE are different variables, both required.


The problem seemed to be that gls.batch() expected FAMID to be 
integers, but the default ought to be character type because family and 
individual IDs in nearly all genetic-analysis software are character 
strings (they might even be people's names).


You are making up rules that were not in accord with the documentation.


I think you are confusing FTYPE with FAMID.



This was the error:

Error in sum(blocksize) : invalid 'type' (character) of argument
Calls: gls.batch - bdsmatrix

To figure out more about it, I spent a bunch of time to go from CMD 
BATCH mode to an interactive session so that I could look at 
traceback().


Generally the first thing to check is the help page. And if there is a 
worked example to look at its data:



data(pedigree, package=RFGLS)
str(pedigree)

'data.frame':   4050 obs. of  5 variables:
$ FAMID: int  10 10 10 10 20 20 20 20 30 30 ...
$ ID   : int  11 12 13 14 21 22 23 24 31 32 ...
$ PID  : int  14 14 0 0 24 24 0 0 34 34 ...
$ MID  : int  13 13 0 0 23 23 0 0 33 33 ...
$ SEX  : num  1 1 2 1 2 2 2 1 2 2 …


Thanks for your efforts, but you are mistaken.  Before I wrote anything 
here I had already worked through this with Rob Kirkpatrick, we had run 
the data() examples, confirmed the error there, and more.


I was a coauthor of the Human Heredity paper that introduced this software 
and it was based on other work I had done.  I'm pretty sure I'm the #1 
user of this package.


FTYPE != FAMID

Everything I said was correct.

Mike
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.