awesome, thank you! looks like folks on bugzilla have also reproduced and submitted a patch, so i am happy. thanks all
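for anyone finding this thread later, here is a scaled-down sketch of the kind of file discussed in the quoted messages below: printable bytes followed by a run of trailing nul bytes. It is only about 1.5 kB, so it is not expected to reproduce the segfault. It only shows how one might count the nul bytes with readBin() and read the text with skipNul = TRUE, the call reported below as working. The helper name `count_nuls`, the chunk size, and the file sizes are illustrative assumptions, not taken from the bug report.

```r
# Hypothetical helper (not part of base R): count nul bytes in a file,
# reading it in binary chunks so the whole file never sits in memory.
count_nuls <- function(path, chunk_size = 2^20) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # end of file reached
    total <- total + sum(chunk == as.raw(0L))
  }
  total
}

# Small stand-in file: one 1000-byte text line, then 500 trailing nul bytes.
tf <- tempfile()
con <- file(tf, "wb")
writeBin(rep(as.raw(32:126), length.out = 1000L), con)
writeBin(as.raw(10L), con)                  # newline ends the text line
writeBin(rep(as.raw(0L), 500L), con)
close(con)

file.size(tf)     # 1000 + 1 + 500 bytes
count_nuls(tf)    # counts the nul bytes written above

# skipNul = TRUE drops nul bytes while reading; warnings are muffled here
# only to keep the example quiet.
x <- suppressWarnings(readLines(tf, skipNul = TRUE))
nchar(x[1])
```

Running count_nuls() on a suspect file before calling readLines() is one cheap way to check whether embedded nuls are present at all.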
On Mon, Jul 17, 2017 at 11:36 AM, William Dunlap <wdun...@tibco.com> wrote:

> The original file had a lot of trailing null bytes so I tried making a
> similar file with:
>
>   tf <- tempfile(); file <- file(tf, "wb")
>   for(i in 1:(2^15-1))writeBin(rep(as.raw(32:127), len=2^16), file)
>   for(i in 1:(2^15-1))writeBin(rep(as.raw(0L), len=2^16), file)
>   close(file)
>   log2(file.size(tf))
>   #[1] 31.99996
>
> Reading this with readLines() caused R-3.4.0 to segfault in
> Rf_con_pushback with the same gdb traceback I saw when reading the original
> file.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Jul 15, 2017 at 4:28 PM, William Dunlap <wdun...@tibco.com> wrote:
>
>> I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
>> debugging but gdb gives some information when I attach the debugger after
>> the 'R..has stopped working' popup appears.  I don't know how reliable it
>> is:
>>
>> (gdb) info threads
>>   Id   Target Id         Frame
>> * 4    Thread 11848.0x1500 0x00007ffe38dc8861 in ntdll!DbgBreakPoint ()
>>    from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   3    Thread 11848.0x2e90 0x00007ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
>>    from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   2    Thread 11848.0x3618 0x00007ffe38dc5154 in ntdll!ZwWaitForSingleObject ()
>>    from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>>   1    Thread 11848.0x1808 0x000000006c77de3b in Rf_con_pushback ()
>>    from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> (gdb) thread 1
>> [Switching to thread 1 (Thread 11848.0x1808)]
>> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> (gdb) where
>> #0  0x000000006c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #1  0x000000006c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #2  0x000000006c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #3  0x000000006c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #4  0x000000006c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #5  0x000000006c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #6  0x000000006c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #7  0x000000006c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #8  0x000000006c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #9  0x000000006c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #10 0x000000006c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
>> #11 0x000000000040171c in ?? ()
>> #12 0x000000000040155a in ?? ()
>> #13 0x00000000004013e8 in ?? ()
>> #14 0x000000000040151b in ?? ()
>> #15 0x00007ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
>> #16 0x00007ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>> #17 0x0000000000000000 in ?? ()
>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>> (gdb)
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>
>>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>>
>>> ##########################
>>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>>> ##
>>> ## Matrix products: default
>>> ##
>>> ## locale:
>>> ## [1] LC_COLLATE=English_United States.1252
>>> ## [2] LC_CTYPE=English_United States.1252
>>> ## [3] LC_MONETARY=English_United States.1252
>>> ## [4] LC_NUMERIC=C
>>> ## [5] LC_TIME=English_United States.1252
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ##             d:/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>>
>>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>>
>>> I am not able to reproduce this on a Linux platform:
>>>>
>>>> ##########################
>>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>>> sessionInfo()
>>>> ## R version 3.4.1 (2017-06-30)
>>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>>> ## Running under: Ubuntu 14.04.5 LTS
>>>> ##
>>>> ## Matrix products: default
>>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>>> ##
>>>> ## locale:
>>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>> ##
>>>> ## attached base packages:
>>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> ##
>>>> ## loaded via a namespace (and not attached):
>>>> ## [1] compiler_3.4.1
>>>> tools::md5sum( fn1 )
>>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>>> dat <- readLines( fn1 )
>>>> length( dat )
>>>> ## [1] 4148721
>>>>
>>>> No segfault occurs.
>>>>
>>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>>
>>>> hi, i realized that the segfault happens on the text file in a new R
>>>>> session.  so, creating the segfault-generating text file requires a
>>>>> contributed package, but prompting the actual segfault does not --
>>>>> pretty sure that means this is a base R bug?  submitted here:
>>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully
>>>>> i am not doing something remarkably stupid.  the text file itself is
>>>>> 4GB so cannot upload it to bugzilla, and from the R_AllocStringBuffer
>>>>> error in the previous message, i think most or all of it needs to be
>>>>> there to trigger the segfault.  thanks!
>>>>>
>>>>>
>>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:
>>>>>
>>>>> hi, thanks Dr. Murdoch
>>>>>>
>>>>>>
>>>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
>>>>>> believe the segfault occurs because there's a single line with 4GB
>>>>>> and also embedded nuls, but i am not sure how to artificially
>>>>>> construct that?
>>>>>>
>>>>>>
>>>>>> the lodown package can be removed from my example.. it is just for
>>>>>> file download caching, so `lodown::cachaca` can be replaced with
>>>>>> `download.file`.  my current example requires a huge download, so sort
>>>>>> of painful to repeat but i'm pretty confident that's not the issue.
>>>>>>
>>>>>>
>>>>>> the archive::archive_extract() function unzips a (probably corrupt)
>>>>>> .RAR file and creates a text file with 80,937 lines.  this file is 4GB:
>>>>>>
>>>>>> > file.size(infile)
>>>>>> [1] 4078192743
>>>>>>
>>>>>>
>>>>>> i am pretty sure that nearly all of that 4GB is contained on a single
>>>>>> line in the file.  here's what happens when i create a file connection
>>>>>> and scan through..
>>>>>>
>>>>>> > file_con <- file( infile , 'r' )
>>>>>> >
>>>>>> > first_80936_lines <- readLines( file_con , n = 80936 )
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "1000023930632009"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "36F2924009PAULO"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "AFONSO"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "BA11"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "00000"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "00"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "2924009PAULO"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "AFONSO"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "BA1111"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "467.20"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "346.10"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Read 1 item
>>>>>> [1] "414.40"
>>>>>> > scan( w , n = 1 , what = character() )
>>>>>> Error in scan(w, n = 1, what = character()) :
>>>>>>   could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>>
>>>>>>
>>>>>>
>>>>>> making a huge single-line file does not reproduce the problem, i
>>>>>> think the embedded nuls have something to do with it--
>>>>>>
>>>>>>
>>>>>> # WARNING do not run with less than 64GB RAM
>>>>>> tf <- tempfile()
>>>>>> a <- rep( "a" , 1000000000 )
>>>>>> b <- paste( a , collapse = '' )
>>>>>> writeLines( b , tf ) ; rm( b ) ; gc()
>>>>>> d <- readLines( tf )
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>>>
>>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>>
>>>>>>> hello, the last line of the code below causes a segfault for me on
>>>>>>>> 3.4.1.  i think i should submit to https://bugs.r-project.org/
>>>>>>>> unless others have advice?  thanks
>>>>>>>>
>>>>>>>>
>>>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>>>> imagine you can do this by uploading the file you downloaded, or
>>>>>>> enough of a subset of it to trigger the segfault.  If you can't do
>>>>>>> that, then likely the bug is with one of those packages, not with R.
>>>>>>>
>>>>>>> Duncan Murdoch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> install.packages( "devtools" )
>>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>>
>>>>>>>>
>>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>>
>>>>>>>> tf <- tempfile()
>>>>>>>>
>>>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>>
>>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>>
>>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>>
>>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>>
>>>>>>>> # works
>>>>>>>> R.utils::countLines( infile )
>>>>>>>>
>>>>>>>> # works with warning
>>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>>
>>>>>>>> # crash
>>>>>>>> my_file <- readLines( infile )
>>>>>>>>
>>>>>>>>
>>>>>>>> # run just before crash
>>>>>>>> sessionInfo()
>>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>>
>>>>>>>> # Matrix products: default
>>>>>>>>
>>>>>>>> # locale:
>>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>>> # [4] LC_NUMERIC=C
>>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>>
>>>>>>>> # attached base packages:
>>>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>
>>>>>>>> # loaded via a namespace (and not attached):
>>>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>>
>>>>>>>>         [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>
>>>>
>>>> ---------------------------------------------------------------------------
>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k