hi, thanks again for taking the time. since corrupted decompression is what produced the segfault-triggering file for me in the first place, i've just posted the text file as-is. it's a 2.4GB file, so best avoided on a metered internet connection. i've updated the bugzilla report at https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more relevant info. the lines of code below crash both windows R 3.4.1 and linux R 3.3.3 for me. thanks again

# consider changing `tempfile()` to a permanent location
# so you don't lose the large downloaded file after the crash
tf <- tempfile()
download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )

sessionInfo()

x <- readLines( tf )


On Sun, Jul 16, 2017 at 2:22 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

> I am stuck. The archive package won't compile for me on Ubuntu, and the
> CRANextra repo seems to be down so I cannot install packages on Windows
> right now. Perhaps you can zip the corrupt text file and put it online
> somewhere? Don't use the archive package to pack it since there seem to be
> issues with that tool on your machine.
>
> I would discourage you from harassing the Brazilian government about their
> RAR file because the RAR file seems fine (no NUL characters appear in the
> text file) when extracted using the file-roller archive tool on Ubuntu.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 9:37:17 AM PDT, Anthony Damico <ajdam...@gmail.com>
> wrote:
> >hi, yep, there are two problems -- but i think only the segfault is within
> >the scope of a base R issue? i need to look closer at the corrupted
> >decompression and figure out whether i should talk to the brazilian
> >government agency that creates that .rar file or open an issue with the
> >archive package maintainer. my goal in this thread is only to figure out
> >how to replicate the goofy text file so the r team can turn it into an
> >error instead of a segfault.
> >
> >the original example i sent stores the .txt file somewhere inside the
> >tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
> >gives the same result. thanks again for looking at this
> >
> > > tools::md5sum(infile)
> >
> >C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >                           "30beb57419486108e98d42ec7a2f8b19"
> >
> > > tools::md5sum( "S:/temp/crash.txt" )
> >                   S:/temp/crash.txt
> > "30beb57419486108e98d42ec7a2f8b19"
> >
> >
> >On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >
> >> So you are saying there are two problems... one that produces a corrupt
> >> file from a valid compressed file, and one that segfaults when presented
> >> with that corrupt file? Can you please confirm the file name and run md5sum
> >> on it and share the result so we can tell when the file problem has been
> >> reproduced?
> >> --
> >> Sent from my phone. Please excuse my brevity.
> >>
> >> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdam...@gmail.com>
> >> wrote:
> >> >hi, thank you for attempting this. it looks like your unix machine unzipped
> >> >the txt file without corruption -- if you copied over the same txt file to
> >> >windows 7, i don't think that would reproduce the problem? i think it
> >> >needs to be the corrupted text file where R.utils::countLines( txtfile )
> >> >gives 809367. i am able to reproduce on two distinct windows machines
> >> >but no guarantee i'm not doing something dumb
> >> >
> >> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> >> >
> >> >> I am not able to reproduce your segfault on a Windows 7 platform either:
> >> >>
> >> >> ##########################
> >> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
> >> >> sessionInfo()
> >> >> ## R version 3.4.1 (2017-06-30)
> >> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> >> >> ##
> >> >> ## Matrix products: default
> >> >> ##
> >> >> ## locale:
> >> >> ## [1] LC_COLLATE=English_United States.1252
> >> >> ## [2] LC_CTYPE=English_United States.1252
> >> >> ## [3] LC_MONETARY=English_United States.1252
> >> >> ## [4] LC_NUMERIC=C
> >> >> ## [5] LC_TIME=English_United States.1252
> >> >> ##
> >> >> ## attached base packages:
> >> >> ## [1] stats graphics grDevices utils datasets methods base
> >> >> ##
> >> >> ## loaded via a namespace (and not attached):
> >> >> ## [1] compiler_3.4.1
> >> >> tools::md5sum( fn1 )
> >> >> ## d:/DADOS_ENEM_2009.txt
> >> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> >> dat <- readLines( fn1 )
> >> >> length( dat )
> >> >> ## [1] 4148721
> >> >>
> >> >>
> >> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
> >> >>
> >> >> I am not able to reproduce this on a Linux platform:
> >> >>>
> >> >>> #######################3
> >> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> >> >>> sessionInfo()
> >> >>> ## R version 3.4.1 (2017-06-30)
> >> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
> >> >>> ## Running under: Ubuntu 14.04.5 LTS
> >> >>> ##
> >> >>> ## Matrix products: default
> >> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> >> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> >> >>> ##
> >> >>> ## locale:
> >> >>> ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> >>> ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> >> >>> ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> >> >>> ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> >> >>> ## [9] LC_ADDRESS=C LC_TELEPHONE=C
> >> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >> >>> ##
> >> >>> ## attached base packages:
> >> >>> ## [1] stats graphics grDevices utils datasets methods base
> >> >>> ##
> >> >>> ## loaded via a namespace (and not attached):
> >> >>> ## [1] compiler_3.4.1
> >> >>> tools::md5sum( fn1 )
> >> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> >>> dat <- readLines( fn1 )
> >> >>> length( dat )
> >> >>> ## [1] 4148721
> >> >>>
> >> >>> No segfault occurs.
> >> >>>
> >> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
> >> >>>
> >> >>> hi, i realized that the segfault happens on the text file in a new R
> >> >>>> session. so, creating the segfault-generating text file requires a
> >> >>>> contributed package, but prompting the actual segfault does not -- pretty
> >> >>>> sure that means this is a base R bug? submitted here:
> >> >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am
> >> >>>> not doing something remarkably stupid. the text file itself is 4GB so
> >> >>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
> >> >>>> previous message, i think most or all of it needs to be there to trigger
> >> >>>> the segfault. thanks!
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:
> >> >>>>
> >> >>>> hi, thanks Dr. Murdoch
> >> >>>>>
> >> >>>>>
> >> >>>>> i'd appreciate if anyone on r-help could help me narrow this down? i
> >> >>>>> believe the segfault occurs because there's a single line with 4GB and also
> >> >>>>> embedded nuls, but i am not sure how to artificially construct that?
> >> >>>>>
> >> >>>>>
> >> >>>>> the lodown package can be removed from my example.. it is just for file
> >> >>>>> download caching, so `lodown::cachaca` can be replaced with
> >> >>>>> `download.file` my current example requires a huge download, so sort of
> >> >>>>> painful to repeat but i'm pretty confident that's not the issue.
> >> >>>>>
> >> >>>>>
> >> >>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
> >> >>>>> file and creates a text file with 80,937 lines. this file is 4GB:
> >> >>>>>
> >> >>>>> > file.size(infile)
> >> >>>>> [1] 4078192743
> >> >>>>>
> >> >>>>>
> >> >>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
> >> >>>>> in the file. here's what happens when i create a file connection and scan
> >> >>>>> through..
> >> >>>>>
> >> >>>>> > file_con <- file( infile , 'r' )
> >> >>>>> >
> >> >>>>> > first_80936_lines <- readLines( file_con , n = 80936 )
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "1000023930632009"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "36F2924009PAULO"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "AFONSO"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "BA11"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "00000"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "00"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "2924009PAULO"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "AFONSO"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "BA1111"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "467.20"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "346.10"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Read 1 item
> >> >>>>> [1] "414.40"
> >> >>>>> > scan( w , n = 1 , what = character() )
> >> >>>>> Error in scan(w, n = 1, what = character()) :
> >> >>>>>   could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
> >> >>>>>
> >> >>>>>
> >> >>>>> making a huge single-line file does not reproduce the problem, i think the
> >> >>>>> embedded nuls have something to do with it--
> >> >>>>>
> >> >>>>>
> >> >>>>> # WARNING do not run with less than 64GB RAM
> >> >>>>> tf <- tempfile()
> >> >>>>> a <- rep( "a" , 1000000000 )
> >> >>>>> b <- paste( a , collapse = '' )
> >> >>>>> writeLines( b , tf ) ; rm( b ) ; gc()
> >> >>>>> d <- readLines( tf )
> >> >>>>>
> >> >>>>>
> >> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> >> >>>>>
> >> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
> >> >>>>>>
> >> >>>>>> hello, the last line of the code below causes a segfault for me on
> >> >>>>>>> 3.4.1. i think i should submit to https://bugs.r-project.org/ unless others
> >> >>>>>>> have advice? thanks
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>> Segfaults are usually worth reporting as bugs. Try to come up with a
> >> >>>>>> self-contained example, not using the lodown and archive packages. I
> >> >>>>>> imagine you can do this by uploading the file you downloaded, or enough of
> >> >>>>>> a subset of it to trigger the segfault. If you can't do that, then likely
> >> >>>>>> the bug is with one of those packages, not with R.
> >> >>>>>>
> >> >>>>>> Duncan Murdoch
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>> install.packages( "devtools" )
> >> >>>>>>> devtools::install_github("ajdamico/lodown")
> >> >>>>>>> devtools::install_github("jimhester/archive")
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
> >> >>>>>>>
> >> >>>>>>> tf <- tempfile()
> >> >>>>>>>
> >> >>>>>>> # large download! cachaca saves on your local disk if already downloaded
> >> >>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
> >> >>>>>>>
> >> >>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
> >> >>>>>>>
> >> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
> >> >>>>>>>
> >> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
> >> >>>>>>>
> >> >>>>>>> # works
> >> >>>>>>> R.utils::countLines( infile )
> >> >>>>>>>
> >> >>>>>>> # works with warning
> >> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
> >> >>>>>>>
> >> >>>>>>> # crash
> >> >>>>>>> my_file <- readLines( infile )
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> # run just before crash
> >> >>>>>>> sessionInfo()
> >> >>>>>>> # R version 3.4.1 (2017-06-30)
> >> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> >>>>>>> # Running under: Windows 10 x64 (build 15063)
> >> >>>>>>>
> >> >>>>>>> # Matrix products: default
> >> >>>>>>>
> >> >>>>>>> # locale:
> >> >>>>>>> # [1] LC_COLLATE=English_United States.1252
> >> >>>>>>> # [2] LC_CTYPE=English_United States.1252
> >> >>>>>>> # [3] LC_MONETARY=English_United States.1252
> >> >>>>>>> # [4] LC_NUMERIC=C
> >> >>>>>>> # [5] LC_TIME=English_United States.1252
> >> >>>>>>>
> >> >>>>>>> # attached base packages:
> >> >>>>>>> # [1] stats graphics grDevices utils datasets methods base
> >> >>>>>>>
> >> >>>>>>> # loaded via a namespace (and not attached):
> >> >>>>>>> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
> >> >>>>>>> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
> >> >>>>>>> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
> >> >>>>>>> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
> >> >>>>>>> # [17] archive_0.0.0.9000
> >> >>>>>>>
> >> >>>>>>> [[alternative HTML version deleted]]
> >> >>>>>>>
> >> >>>>>>> ______________________________________________
> >> >>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>> ---------------------------------------------------------------------------
> >> >>> Jeff Newmiller                        The     .....       .....  Go Live...
> >> >>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >> >>>                                       Live:   OO#.. Dead: OO#..  Playing
> >> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >> >>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >> >>>
> >> >> ---------------------------------------------------------------------------
> >> >> Jeff Newmiller                        The     .....       .....  Go Live...
> >> >> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >> >>                                       Live:   OO#.. Dead: OO#..  Playing
> >> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >> >> ---------------------------------------------------------------------------
> >> >>
> >> >
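
The thread asks how a text file with embedded nuls might be constructed artificially. The following is only a small-scale sketch of that idea and makes no claim to reproduce the segfault in bug 17311, which appears to need the multi-gigabyte line as well. The helper name `count_nuls`, its `chunk_size` argument, and the three-line test content are all made up for illustration; only base R calls (`writeBin`, `readBin`, `readLines`) are used.

```r
# hypothetical small-scale sketch, NOT a reproduction of the reported crash:
# build a tiny text file whose last line contains two embedded nul bytes
tf <- tempfile( fileext = ".txt" )

bytes <- c(
	charToRaw( "line one\nline two\n" ) ,
	charToRaw( "abc" ) , as.raw( c( 0 , 0 ) ) , charToRaw( "def\n" )
)
writeBin( bytes , tf )

# count nul bytes by reading fixed-size raw chunks, so a file with one
# enormous line never has to fit into a single character string
# ( `count_nuls` and `chunk_size` are illustrative names, not an existing API )
count_nuls <- function( path , chunk_size = 1e6 ) {
	con <- file( path , "rb" )
	on.exit( close( con ) )
	total <- 0
	repeat {
		chunk <- readBin( con , what = "raw" , n = chunk_size )
		if ( length( chunk ) == 0 ) break
		total <- total + sum( chunk == as.raw( 0 ) )
	}
	total
}

n_nuls <- count_nuls( tf )

# skipNul = TRUE drops the nul bytes rather than cutting the line short
x <- readLines( tf , skipNul = TRUE )
```

Running `count_nuls()` on the real file before calling `readLines()` would be one way to confirm whether a given extraction contains nuls at all, which is the difference reported between the Windows and Ubuntu extractions above. Whether scaling this construction up to several gigabytes triggers the crash is untested here.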