hi, i realized that the segfault happens on the text file in a new R session. so creating the segfault-generating text file requires a contributed package, but triggering the actual segfault does not -- pretty sure that means this is a base R bug? submitted here: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am not doing something remarkably stupid. the text file itself is 4GB so i cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the previous message, i think most or all of it needs to be there to trigger the segfault. thanks!
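for anyone who wants to poke at the embedded-nul part without the 4GB download, here is a minimal sketch of how a small text file with embedded nuls could be built in base R and read back. the file contents and sizes are made up for illustration, and i would not expect a file this small to segfault; it only shows the difference between `readLines()` with and without `skipNul = TRUE`.

```r
# tiny stand-in for the real file: two ordinary lines plus one line
# that contains embedded nul bytes (contents are arbitrary, for illustration)
tf <- tempfile( fileext = ".txt" )

# "abc", two nul bytes, then "def"
nul_line <- c( charToRaw( "abc" ) , as.raw( c( 0 , 0 ) ) , charToRaw( "def" ) )

payload <- c(
	charToRaw( "first line\n" ) ,
	nul_line , charToRaw( "\n" ) ,
	charToRaw( "last line\n" )
)

# writeBin() writes the raw bytes as-is, nuls included
writeBin( payload , tf )

# skipNul = TRUE drops the nul bytes, so the middle line comes back whole
with_skip <- readLines( tf , skipNul = TRUE )

# per ?readLines, an embedded nul terminates the line being read, with a
# warning; the warning is suppressed here only to keep the output quiet
without_skip <- suppressWarnings( readLines( tf ) )

print( with_skip )
print( without_skip )
```

scaling the nul-containing line up toward the 2GB+ range would be the obvious next experiment, but that is only a guess at what matters here and it needs a lot of RAM.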

On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:

> hi, thanks Dr. Murdoch
>
> i'd appreciate if anyone on r-help could help me narrow this down? i
> believe the segfault occurs because there's a single line with 4GB and also
> embedded nuls, but i am not sure how to artificially construct that?
>
> the lodown package can be removed from my example.. it is just for file
> download cacheing, so `lodown::cachaca` can be replaced with
> `download.file` my current example requires a huge download, so sort of
> painful to repeat but i'm pretty confident that's not the issue.
>
> the archive::archive_extract() function unzips a (probably corrupt) .RAR
> file and creates a text file with 80,937 lines. this file is 4GB:
>
> file.size(infile)
> [1] 4078192743
>
> i am pretty sure that nearly all of that 4GB is contained on a single line
> in the file. here's what happens when i create a file connection and scan
> through..
>
> file_con <- file( infile , 'r' )
>
> > first_80936_lines <- readLines( file_con , n = 80936 )
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "1000023930632009"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "36F2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA11"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "00000"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "00"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA1111"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "467.20"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "346.10"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "414.40"
> > scan( w , n = 1 , what = character() )
> Error in scan(w, n = 1, what = character()) :
>   could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>
> making a huge single-line file does not reproduce the problem, i think the
> embedded nuls have something to do with it--
>
> # WARNING do not run with less than 64GB RAM
> tf <- tempfile()
> a <- rep( "a" , 1000000000 )
> b <- paste( a , collapse = '' )
> writeLines( b , tf ) ; rm( b ) ; gc()
> d <- readLines( tf )
>
> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>
>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>
>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>> i think i should submit to https://bugs.r-project.org/ unless others have
>>> advice? thanks
>>
>> Segfaults are usually worth reporting as bugs. Try to come up with a
>> self-contained example, not using the lodown and archive packages. I
>> imagine you can do this by uploading the file you downloaded, or enough of
>> a subset of it to trigger the segfault. If you can't do that, then likely
>> the bug is with one of those packages, not with R.
>>
>> Duncan Murdoch
>>
>>> install.packages( "devtools" )
>>> devtools::install_github("ajdamico/lodown")
>>> devtools::install_github("jimhester/archive")
>>>
>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>
>>> tf <- tempfile()
>>>
>>> # large download! cachaca saves on your local disk if already downloaded
>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>
>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>
>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>
>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>
>>> # works
>>> R.utils::countLines( infile )
>>>
>>> # works with warning
>>> my_file <- readLines( infile , skipNul = TRUE )
>>>
>>> # crash
>>> my_file <- readLines( infile )
>>>
>>> # run just before crash
>>> sessionInfo()
>>> # R version 3.4.1 (2017-06-30)
>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> # Running under: Windows 10 x64 (build 15063)
>>>
>>> # Matrix products: default
>>>
>>> # locale:
>>> # [1] LC_COLLATE=English_United States.1252
>>> # [2] LC_CTYPE=English_United States.1252
>>> # [3] LC_MONETARY=English_United States.1252
>>> # [4] LC_NUMERIC=C
>>> # [5] LC_TIME=English_United States.1252
>>>
>>> # attached base packages:
>>> # [1] stats graphics grDevices utils datasets methods base
>>>
>>> # loaded via a namespace (and not attached):
>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>> # [17] archive_0.0.0.9000
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.