Re: [Rd] Difficult debug - follow-up

2024-02-19 Thread Therneau, Terry M., Ph.D. via R-devel
I want to thank Ivan and Bill for useful advice.   I eventually found my memory 
mistake, 
which was of the 'obvious once you see it' variety.  Ivan's note that it 
appeared to be in 
a small allocation was correct.
>   I've hit a roadblock debugging a new update to the survival package. I do 
> debugging in
> a developement envinment, i.e. I don't create and load a package but rather  
> source all
> the .R files and dyn.load an .so file, which makes things a bit easier.  
> etc...
>
Terry T.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Difficult debug

2024-02-07 Thread Bill Dunlap
| >>> I often run R under valgrind with the command line:
| >>>   $ R --quiet --no-save --debugger=valgrind
| >>> --debugger-args="--track-origins=yes --vgdb=full --vgdb-error=0"
| >>> then in another window start a debugger process with
| >>>   $ gdb RHOME/bin/exec/R
| >>>   ...
| >>>(gdb) target remote | vgdb
| >>>(gdb) cont
| >>> You may need to supply a complete path to vgdb, valgrind's interface to
| >>> gdb.
| >>> Then gdb will set breakpoints where valgrind reports memory misuse and
| >>> you can inspect variables, etc., where an error occurs.
| >>> You can check for memory leaks from gdb with
| >>>(gdb) monitor leak_check full
| >>>
| >>> -Bill


You can use ^C to get the gdb prompt, to, say, set a breakpoint on
Rf_error or something in your code, and 'cont' or 'continue' to resume
running R.


You can get information about the status and history of a memory
location, say 0x12345678, with


(gdb) monitor vinfo location 0x12345678


More information is in https://valgrind.org/docs/manual/manual-core-adv.html.


On Wed, Feb 7, 2024 at 3:46 PM Bill Dunlap  wrote:

> I haven't done any R memory debugging lately, but
>
> https://www.mail-archive.com/rcpp-devel@lists.r-forge.r-project.org/msg10289.html
> shows how I used to have gdb break where valgrind finds a problem so you
> could examine the details.
>
> Also, running your code after running gctorture(TRUE) can help track down
> memory problems.
>
> -Bill
>
>
> On Wed, Feb 7, 2024 at 12:03 PM Therneau, Terry M., Ph.D. via R-devel <
> r-devel@r-project.org> wrote:
>
>>   I've hit a roadblock debugging a new update to the survival package.
>> I do debugging in
>> a developement envinment, i.e. I don't create and load a package but
>> rather  source all
>> the .R files and dyn.load an .so file, which makes things a bit easier.
>>
>>Running with R -d "valgrind --tool=memcheck --leak-check=full" one of
>> my test files
>> crashes in simple R code a dozen lines after the  likely culprit has been
>> called, i.e, the
>> survfit function for an Aalen-Johansen, containing a .Call to the new C
>> code. The
>> valgrind approach had already allowed me to find a few other (mostly
>> dumb) errors that led
>> to an out of bounds access, e.g., the wrong endpoint variable in a for( )
>> loop.What
>> would others advise as a next step?
>>
>> Here is the last part of the screen
>>  > fit2 <- coxph(list(Surv(tstart, tstop, bstat) ~ 1,
>> +c(1:4):5 ~ age / common + shared), id= id,
>> istate=bili4,
>> +   data=pbc2, ties='breslow', x=TRUE)
>>  > surv2 <- survfit(fit2, newdata=list(age=50), p0=c(.4, .3, .2, .1, 0))
>>  > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50)
>> ==31730== Invalid read of size 8
>> ==31730==at 0x298A07: Rf_allocVector3 (memory.c:2861)
>> ==31730==by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595)
>> ==31730==by 0x299B2C: R_alloc (memory.c:2330)
>> ==31730==by 0x3243C6: do_which (summary.c:1152)
>> ==31730==by 0x23D8EF: bcEval (eval.c:7724)
>> ==31730==by 0x25731F: Rf_eval (eval.c:1152)
>> ==31730==by 0x25927D: R_execClosure (eval.c:2362)
>> ==31730==by 0x25A35A: R_execMethod (eval.c:2535)
>> ==31730==by 0x887E93F: R_dispatchGeneric
>> (methods_list_dispatch.c:1151)
>> ==31730==by 0x2A0E72: do_standardGeneric (objects.c:1344)
>> ==31730==by 0x2577E7: Rf_eval (eval.c:1254)
>> ==31730==by 0x25927D: R_execClosure (eval.c:2362)
>> ==31730==by 0x25A01C: applyClosure_core (eval.c:2250)
>> ==31730==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
>> ==31730==
>>
>>   *** caught segfault ***
>> address 0x10, cause 'memory not mapped'
>>
>> Traceback:
>>   1: which(smap == j)
>>   2: which(smap == j)
>>   3: mysurv(fit2, pbc2$bili4, p0 = 4:0/10, fit2, x0 = 50)
>>
>> The offending call is amost certainly the one to survfit; mysurv() is a
>> local function
>> that caculates some things 'by hand'.   It does nothing complex: counts,
>> loops, etc, the
>> only non-base action is a call to Matrix::exp near the end, but the
>> which() failure is
>> well before that.
>>
>> The session info just before the offending material:
>>
>>  > sessionInfo()
>> R Under development (unstable) (2024-02-07 r85873)
>> Platform: x86_64-pc-linux-gnu
>> Running under: Ubuntu 22.04.3 LTS
>>
>> Matrix products: default
>> BLAS:   /usr/local/src/R-devel/lib/libRblas.so
>> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8LC_COLLATE=C
>>   [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>>   [9] LC_ADDRESS=C   LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> time zone: America/Chicago
>> tzcode source: system (glibc)
>>
>> attached base packages:
>> [1] splines   stats graphics  grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:

Re: [Rd] Difficult debug

2024-02-07 Thread Bill Dunlap
I haven't done any R memory debugging lately, but

https://www.mail-archive.com/rcpp-devel@lists.r-forge.r-project.org/msg10289.html
shows how I used to have gdb break where valgrind finds a problem so you
could examine the details.

Also, running your code after running gctorture(TRUE) can help track down
memory problems.

-Bill


On Wed, Feb 7, 2024 at 12:03 PM Therneau, Terry M., Ph.D. via R-devel <
r-devel@r-project.org> wrote:

>   I've hit a roadblock debugging a new update to the survival package.   I
> do debugging in
> a developement envinment, i.e. I don't create and load a package but
> rather  source all
> the .R files and dyn.load an .so file, which makes things a bit easier.
>
>Running with R -d "valgrind --tool=memcheck --leak-check=full" one of
> my test files
> crashes in simple R code a dozen lines after the  likely culprit has been
> called, i.e, the
> survfit function for an Aalen-Johansen, containing a .Call to the new C
> code. The
> valgrind approach had already allowed me to find a few other (mostly dumb)
> errors that led
> to an out of bounds access, e.g., the wrong endpoint variable in a for( )
> loop.What
> would others advise as a next step?
>
> Here is the last part of the screen
>  > fit2 <- coxph(list(Surv(tstart, tstop, bstat) ~ 1,
> +c(1:4):5 ~ age / common + shared), id= id,
> istate=bili4,
> +   data=pbc2, ties='breslow', x=TRUE)
>  > surv2 <- survfit(fit2, newdata=list(age=50), p0=c(.4, .3, .2, .1, 0))
>  > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50)
> ==31730== Invalid read of size 8
> ==31730==at 0x298A07: Rf_allocVector3 (memory.c:2861)
> ==31730==by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595)
> ==31730==by 0x299B2C: R_alloc (memory.c:2330)
> ==31730==by 0x3243C6: do_which (summary.c:1152)
> ==31730==by 0x23D8EF: bcEval (eval.c:7724)
> ==31730==by 0x25731F: Rf_eval (eval.c:1152)
> ==31730==by 0x25927D: R_execClosure (eval.c:2362)
> ==31730==by 0x25A35A: R_execMethod (eval.c:2535)
> ==31730==by 0x887E93F: R_dispatchGeneric (methods_list_dispatch.c:1151)
> ==31730==by 0x2A0E72: do_standardGeneric (objects.c:1344)
> ==31730==by 0x2577E7: Rf_eval (eval.c:1254)
> ==31730==by 0x25927D: R_execClosure (eval.c:2362)
> ==31730==by 0x25A01C: applyClosure_core (eval.c:2250)
> ==31730==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
> ==31730==
>
>   *** caught segfault ***
> address 0x10, cause 'memory not mapped'
>
> Traceback:
>   1: which(smap == j)
>   2: which(smap == j)
>   3: mysurv(fit2, pbc2$bili4, p0 = 4:0/10, fit2, x0 = 50)
>
> The offending call is amost certainly the one to survfit; mysurv() is a
> local function
> that caculates some things 'by hand'.   It does nothing complex: counts,
> loops, etc, the
> only non-base action is a call to Matrix::exp near the end, but the
> which() failure is
> well before that.
>
> The session info just before the offending material:
>
>  > sessionInfo()
> R Under development (unstable) (2024-02-07 r85873)
> Platform: x86_64-pc-linux-gnu
> Running under: Ubuntu 22.04.3 LTS
>
> Matrix products: default
> BLAS:   /usr/local/src/R-devel/lib/libRblas.so
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8LC_COLLATE=C
>   [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>   [9] LC_ADDRESS=C   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> time zone: America/Chicago
> tzcode source: system (glibc)
>
> attached base packages:
> [1] splines   stats graphics  grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] Matrix_1.6-0
>
> loaded via a namespace (and not attached):
> [1] compiler_4.4.0 tools_4.4.0grid_4.4.0 lattice_0.22-5
>
>
> ---
> Footnote.  The impetus for this is realizing that the robust variance for
> an
> Aalen-Johansen was incorrect when there are case weights for a subject
> that vary over
> time;  a rare case but will occur with time dependent IPC weights.
> Carefully figuring
> this out has been all I did for the last week, leading to a new routine
> survfitaj.c and
> approx 14 pages of derivation and explanation in the methods.Rnw
> vignette.   Subjects who
> "change horses in midstream", i.e., swap from one curve to another
> mid-followup make the
> code more complex.   This arises out of the "extended Kaplan-Meier"; I am
> not a fan of
> this statistically, but some will use it and expect my code to work.
>
> --
> Terry M Therneau, PhD
> Department of Quantitative Health Sciences
> Mayo Clinic
> thern...@mayo.edu
>
> "TERR-ree THUR-noh"
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version 

Re: [Rd] Difficult debug

2024-02-07 Thread Ivan Krylov via R-devel
On Wed, 07 Feb 2024 14:01:44 -0600
"Therneau, Terry M., Ph.D. via R-devel"  wrote:

>  > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50)  
> ==31730== Invalid read of size 8
> ==31730==    at 0x298A07: Rf_allocVector3 (memory.c:2861)
> ==31730==    by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595)
> ==31730==    by 0x299B2C: R_alloc (memory.c:2330)
> ==31730==    by 0x3243C6: do_which (summary.c:1152)
<...>
> ==31730==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
<...>
>   *** caught segfault ***
> address 0x10, cause 'memory not mapped'

An unrelated allocation function suddenly dereferencing a null pointer
is likely indication of heap corruption. Valgrind may be silent about
it because the C heap (that it knows how to override and track) is still
intact, but the R memory management metadata got corrupted (which looks
like a valid memory access to Valgrind).

An easy solution could be brought by more instrumentation.

R can tell Valgrind to consider some memory accesses invalid if you
configure it using --with-valgrind-instrumentation [*], but I'm not
sure it will be able to trap overwriting GC metadata, so let's set it
aside for now.

If you compile your own R, you can configure it with -fsanitize=address
added to the compiler and linker flags [**]. I'm not sure whether the
bounds checks performed by AddressSanitizer would be sufficient to
catch the problem, but it's worth a try. Instead of compiling R with
sanitizers, it should be also possible to use the container image
docker.io/rocker/r-devel-san.

The hard option is left if no instrumentation lets you pinpoint the
error. Since the first (as far as Valgrind is concerned) memory error
already happens to result in a SIGSEGV, you can run R in a regular
debugger and try to work backwards from the local variables at the
location of the crash. Maybe there's a way to identify the block
containing the pointer that gets overwritten and set a watchpoint on
it for the next run of R. Maybe you can read the overwritten value as
double and guess where the number came from. If your processor is
sufficiently new, you can try `rr`, the time-travelling debugger [***],
to rewind the process execution back to the point where the pointer gets
overwritten.

-- 
Best regards,
Ivan

[*]
https://cran.r-project.org/doc/manuals/R-exts.html#Using-valgrind

[**]
https://cran.r-project.org/doc/manuals/R-exts.html#Using-Address-Sanitizer

[***]
https://rr-project.org
Judging by the domain name, it's practically designed to fix troublesome
bugs in R packages!

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Difficult debug

2024-02-07 Thread Therneau, Terry M., Ph.D. via R-devel
  I've hit a roadblock debugging a new update to the survival package.   I do 
debugging in 
a developement envinment, i.e. I don't create and load a package but rather  
source all 
the .R files and dyn.load an .so file, which makes things a bit easier.

   Running with R -d "valgrind --tool=memcheck --leak-check=full" one of my 
test files 
crashes in simple R code a dozen lines after the  likely culprit has been 
called, i.e, the 
survfit function for an Aalen-Johansen, containing a .Call to the new C code.   
  The 
valgrind approach had already allowed me to find a few other (mostly dumb) 
errors that led 
to an out of bounds access, e.g., the wrong endpoint variable in a for( ) loop. 
   What 
would others advise as a next step?

Here is the last part of the screen
 > fit2 <- coxph(list(Surv(tstart, tstop, bstat) ~ 1,
+    c(1:4):5 ~ age / common + shared), id= id, istate=bili4,
+   data=pbc2, ties='breslow', x=TRUE)
 > surv2 <- survfit(fit2, newdata=list(age=50), p0=c(.4, .3, .2, .1, 0))
 > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50)
==31730== Invalid read of size 8
==31730==    at 0x298A07: Rf_allocVector3 (memory.c:2861)
==31730==    by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595)
==31730==    by 0x299B2C: R_alloc (memory.c:2330)
==31730==    by 0x3243C6: do_which (summary.c:1152)
==31730==    by 0x23D8EF: bcEval (eval.c:7724)
==31730==    by 0x25731F: Rf_eval (eval.c:1152)
==31730==    by 0x25927D: R_execClosure (eval.c:2362)
==31730==    by 0x25A35A: R_execMethod (eval.c:2535)
==31730==    by 0x887E93F: R_dispatchGeneric (methods_list_dispatch.c:1151)
==31730==    by 0x2A0E72: do_standardGeneric (objects.c:1344)
==31730==    by 0x2577E7: Rf_eval (eval.c:1254)
==31730==    by 0x25927D: R_execClosure (eval.c:2362)
==31730==    by 0x25A01C: applyClosure_core (eval.c:2250)
==31730==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==31730==

  *** caught segfault ***
address 0x10, cause 'memory not mapped'

Traceback:
  1: which(smap == j)
  2: which(smap == j)
  3: mysurv(fit2, pbc2$bili4, p0 = 4:0/10, fit2, x0 = 50)

The offending call is amost certainly the one to survfit; mysurv() is a local 
function 
that caculates some things 'by hand'.   It does nothing complex: counts, loops, 
etc, the 
only non-base action is a call to Matrix::exp near the end, but the which() 
failure is 
well before that.

The session info just before the offending material:

 > sessionInfo()
R Under development (unstable) (2024-02-07 r85873)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/local/src/R-devel/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8    LC_COLLATE=C
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Chicago
tzcode source: system (glibc)

attached base packages:
[1] splines   stats graphics  grDevices utils datasets methods
[8] base

other attached packages:
[1] Matrix_1.6-0

loaded via a namespace (and not attached):
[1] compiler_4.4.0 tools_4.4.0    grid_4.4.0 lattice_0.22-5


---
Footnote.  The impetus for this is realizing that the robust variance for an 
Aalen-Johansen was incorrect when there are case weights for a subject that 
vary over 
time;  a rare case but will occur with time dependent IPC weights.  Carefully 
figuring 
this out has been all I did for the last week, leading to a new routine 
survfitaj.c and 
approx 14 pages of derivation and explanation in the methods.Rnw vignette.   
Subjects who 
"change horses in midstream", i.e., swap from one curve to another mid-followup 
make the 
code more complex.   This arises out of the "extended Kaplan-Meier"; I am not a 
fan of 
this statistically, but some will use it and expect my code to work.

-- 
Terry M Therneau, PhD
Department of Quantitative Health Sciences
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel