I'll pass. Just because some non-CRAN "archive" package has bugs, or your disk storage is flaky, does not mean that one of the dozens or hundreds of other compression tools (e.g. the built-in Windows "Send to compressed folder" pop-up menu) won't get it right, and if one did fail we would know from the md5sum.
--
Sent from my phone. Please excuse my brevity.
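The md5sum check mentioned above can be sketched in base R. This is only an illustration on a tiny stand-in file: `gzfile()` is an assumption standing in for whichever compression tool is actually used, and the file names are temporary placeholders, not the real data file.

```r
# Sketch: confirm that a compress/extract round trip leaves a file unchanged
# by comparing md5 checksums before and after. Small stand-in file only.
src <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), src)
md5_before <- unname(tools::md5sum(src))

# Compress the raw bytes with gzip (stand-in for any archive tool).
gz <- paste0(src, ".gz")
con <- gzfile(gz, open = "wb")
writeBin(readBin(src, what = "raw", n = file.size(src)), con)
close(con)

# Extract again into a new file.
restored <- tempfile(fileext = ".txt")
con <- gzfile(gz, open = "rb")
writeBin(readBin(con, what = "raw", n = file.size(src)), restored)
close(con)

md5_after <- unname(tools::md5sum(restored))
round_trip_ok <- identical(md5_before, md5_after)
```

If `round_trip_ok` is `FALSE`, the tool (or the disk) altered the file. For a multi-gigabyte file the bytes would need to be copied in chunks rather than in one `readBin()` call.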
On July 17, 2017 5:00:48 AM PDT, Anthony Damico <ajdam...@gmail.com> wrote:
> hi, thanks again for taking the time. since corrupted compression prompted
> the segfault for me in the first place, i've just posted the text file
> as-is. it's a 2.4GB file so to be avoided on a metered internet
> connection. i've updated the bugzilla report at
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
> relevant info. these lines of code crash both windows R 3.4.1 and also
> linux R 3.3.3 for me. thanks again
>
>     # consider changing `tempfile()` to a permanent location
>     # so you don't lose the large downloaded file after the crash
>     tf <- tempfile()
>     download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
>     sessionInfo()
>     x <- readLines( tf )
>
> On Sun, Jul 16, 2017 at 2:22 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>
>> I am stuck. The archive package won't compile for me on Ubuntu, and the
>> CRANextra repo seems to be down so I cannot install packages on Windows
>> right now. Perhaps you can zip the corrupt text file and put it online
>> somewhere? Don't use the archive package to pack it since there seem to be
>> issues with that tool on your machine.
>>
>> I would discourage you from harassing the Brazilian government about their
>> RAR file because the RAR file seems fine (no NUL characters appear in the
>> text file) when extracted using the file-roller archive tool on Ubuntu.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 16, 2017 9:37:17 AM PDT, Anthony Damico <ajdam...@gmail.com> wrote:
>>> hi, yep, there are two problems -- but i think only the segfault is within
>>> the scope of a base R issue? i need to look closer at the corrupted
>>> decompression and figure out whether i should talk to the brazilian
>>> government agency that creates that .rar file or open an issue with the
>>> archive package maintainer. my goal in this thread is only to figure out
>>> how to replicate the goofy text file so the r team can turn it into an
>>> error instead of a segfault.
>>>
>>> the original example i sent stores the .txt file somewhere inside the
>>> tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
>>> gives the same result. thanks again for looking at this
>>>
>>>     > tools::md5sum(infile)
>>>     C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>>     "30beb57419486108e98d42ec7a2f8b19"
>>>
>>>     > tools::md5sum( "S:/temp/crash.txt" )
>>>     S:/temp/crash.txt
>>>     "30beb57419486108e98d42ec7a2f8b19"
>>>
>>> On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>>
>>>> So you are saying there are two problems... one that produces a corrupt
>>>> file from a valid compressed file, and one that segfaults when presented
>>>> with that corrupt file? Can you please confirm the file name and run
>>>> md5sum on it and share the result so we can tell when the file problem
>>>> has been reproduced?
>>>> --
>>>> Sent from my phone. Please excuse my brevity.
>>>>
>>>> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdam...@gmail.com> wrote:
>>>>> hi, thank you for attempting this. it looks like your unix machine
>>>>> unzipped the txt file without corruption -- if you copied over the same
>>>>> txt file to windows 7, i don't think that would reproduce the problem?
>>>>> i think it needs to be the corrupted text file where
>>>>> R.utils::countLines( txtfile ) gives 809367. i am able to reproduce on
>>>>> two distinct windows machines but no guarantee i'm not doing something
>>>>> dumb
>>>>>
>>>>> On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>>>>
>>>>>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>>>>>
>>>>>>     ##########################
>>>>>>     fn1 <- "d:/DADOS_ENEM_2009.txt"
>>>>>>     sessionInfo()
>>>>>>     ## R version 3.4.1 (2017-06-30)
>>>>>>     ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>     ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>>>>>>     ##
>>>>>>     ## Matrix products: default
>>>>>>     ##
>>>>>>     ## locale:
>>>>>>     ## [1] LC_COLLATE=English_United States.1252
>>>>>>     ## [2] LC_CTYPE=English_United States.1252
>>>>>>     ## [3] LC_MONETARY=English_United States.1252
>>>>>>     ## [4] LC_NUMERIC=C
>>>>>>     ## [5] LC_TIME=English_United States.1252
>>>>>>     ##
>>>>>>     ## attached base packages:
>>>>>>     ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>     ##
>>>>>>     ## loaded via a namespace (and not attached):
>>>>>>     ## [1] compiler_3.4.1
>>>>>>     tools::md5sum( fn1 )
>>>>>>     ## d:/DADOS_ENEM_2009.txt
>>>>>>     ## "83e61c96092285b60d7bf6b0dbc7072e"
>>>>>>     dat <- readLines( fn1 )
>>>>>>     length( dat )
>>>>>>     ## [1] 4148721
>>>>>>
>>>>>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>>>>>
>>>>>>> I am not able to reproduce this on a Linux platform:
>>>>>>>
>>>>>>>     ########################
>>>>>>>     fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>>>>>>     sessionInfo()
>>>>>>>     ## R version 3.4.1 (2017-06-30)
>>>>>>>     ## Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>>>     ## Running under: Ubuntu 14.04.5 LTS
>>>>>>>     ##
>>>>>>>     ## Matrix products: default
>>>>>>>     ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>>>>>>     ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>>>>>>     ##
>>>>>>>     ## locale:
>>>>>>>     ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>     ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>     ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>>>     ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>>     ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>     ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>     ##
>>>>>>>     ## attached base packages:
>>>>>>>     ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>     ##
>>>>>>>     ## loaded via a namespace (and not attached):
>>>>>>>     ## [1] compiler_3.4.1
>>>>>>>     tools::md5sum( fn1 )
>>>>>>>     ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>>>>>>     ## "83e61c96092285b60d7bf6b0dbc7072e"
>>>>>>>     dat <- readLines( fn1 )
>>>>>>>     length( dat )
>>>>>>>     ## [1] 4148721
>>>>>>>
>>>>>>> No segfault occurs.
>>>>>>>
>>>>>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>>>>>
>>>>>>>> hi, i realized that the segfault happens on the text file in a new R
>>>>>>>> session. so, creating the segfault-generating text file requires a
>>>>>>>> contributed package, but prompting the actual segfault does not --
>>>>>>>> pretty sure that means this is a base R bug? submitted here:
>>>>>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>>>>>>>> hopefully i am not doing something remarkably stupid. the text file
>>>>>>>> itself is 4GB so cannot upload it to bugzilla, and from the
>>>>>>>> R_AllocStringBuffer error in the previous message, i think most or
>>>>>>>> all of it needs to be there to trigger the segfault. thanks!
>>>>>>>>
>>>>>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdam...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> hi, thanks Dr. Murdoch
>>>>>>>>>
>>>>>>>>> i'd appreciate if anyone on r-help could help me narrow this down?
>>>>>>>>> i believe the segfault occurs because there's a single line with
>>>>>>>>> 4GB and also embedded nuls, but i am not sure how to artificially
>>>>>>>>> construct that?
>>>>>>>>>
>>>>>>>>> the lodown package can be removed from my example.. it is just for
>>>>>>>>> file download caching, so `lodown::cachaca` can be replaced with
>>>>>>>>> `download.file`. my current example requires a huge download, so
>>>>>>>>> sort of painful to repeat but i'm pretty confident that's not the
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> the archive::archive_extract() function unzips a (probably corrupt)
>>>>>>>>> .RAR file and creates a text file with 80,937 lines. this file is
>>>>>>>>> 4GB:
>>>>>>>>>
>>>>>>>>>     > file.size(infile)
>>>>>>>>>     [1] 4078192743
>>>>>>>>>
>>>>>>>>> i am pretty sure that nearly all of that 4GB is contained on a
>>>>>>>>> single line in the file. here's what happens when i create a file
>>>>>>>>> connection and scan through..
>>>>>>>>>
>>>>>>>>>     > w <- file( infile , 'r' )
>>>>>>>>>     >
>>>>>>>>>     > first_80936_lines <- readLines( w , n = 80936 )
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "1000023930632009"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "36F2924009PAULO"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "AFONSO"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "BA11"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "00000"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "00"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "2924009PAULO"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "AFONSO"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "BA1111"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "467.20"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "346.10"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Read 1 item
>>>>>>>>>     [1] "414.40"
>>>>>>>>>     > scan( w , n = 1 , what = character() )
>>>>>>>>>     Error in scan(w, n = 1, what = character()) :
>>>>>>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>>>>>>
>>>>>>>>> making a huge single-line file does not reproduce the problem, i
>>>>>>>>> think the embedded nuls have something to do with it--
>>>>>>>>>
>>>>>>>>>     # WARNING do not run with less than 64GB RAM
>>>>>>>>>     tf <- tempfile()
>>>>>>>>>     a <- rep( "a" , 1000000000 )
>>>>>>>>>     b <- paste( a , collapse = '' )
>>>>>>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
>>>>>>>>>     d <- readLines( tf )
>>>>>>>>>
>>>>>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>>>>>
>>>>>>>>>>> hello, the last line of the code below causes a segfault for me
>>>>>>>>>>> on 3.4.1. i think i should submit to https://bugs.r-project.org/
>>>>>>>>>>> unless others have advice? thanks
>>>>>>>>>>
>>>>>>>>>> Segfaults are usually worth reporting as bugs. Try to come up with
>>>>>>>>>> a self-contained example, not using the lodown and archive
>>>>>>>>>> packages. I imagine you can do this by uploading the file you
>>>>>>>>>> downloaded, or enough of a subset of it to trigger the segfault.
>>>>>>>>>> If you can't do that, then likely the bug is with one of those
>>>>>>>>>> packages, not with R.
>>>>>>>>>>
>>>>>>>>>> Duncan Murdoch
>>>>>>>>>>
>>>>>>>>>>>     install.packages( "devtools" )
>>>>>>>>>>>     devtools::install_github("ajdamico/lodown")
>>>>>>>>>>>     devtools::install_github("jimhester/archive")
>>>>>>>>>>>
>>>>>>>>>>>     file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>>>>>
>>>>>>>>>>>     tf <- tempfile()
>>>>>>>>>>>
>>>>>>>>>>>     # large download! cachaca saves on your local disk if already downloaded
>>>>>>>>>>>     lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>>>>>
>>>>>>>>>>>     archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>>>>>
>>>>>>>>>>>     unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>>>>>
>>>>>>>>>>>     infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>>>>>
>>>>>>>>>>>     # works
>>>>>>>>>>>     R.utils::countLines( infile )
>>>>>>>>>>>
>>>>>>>>>>>     # works with warning
>>>>>>>>>>>     my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>>>>>
>>>>>>>>>>>     # crash
>>>>>>>>>>>     my_file <- readLines( infile )
>>>>>>>>>>>
>>>>>>>>>>>     # run just before crash
>>>>>>>>>>>     sessionInfo()
>>>>>>>>>>>     # R version 3.4.1 (2017-06-30)
>>>>>>>>>>>     # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>>>>>>     # Running under: Windows 10 x64 (build 15063)
>>>>>>>>>>>
>>>>>>>>>>>     # Matrix products: default
>>>>>>>>>>>
>>>>>>>>>>>     # locale:
>>>>>>>>>>>     # [1] LC_COLLATE=English_United States.1252
>>>>>>>>>>>     # [2] LC_CTYPE=English_United States.1252
>>>>>>>>>>>     # [3] LC_MONETARY=English_United States.1252
>>>>>>>>>>>     # [4] LC_NUMERIC=C
>>>>>>>>>>>     # [5] LC_TIME=English_United States.1252
>>>>>>>>>>>
>>>>>>>>>>>     # attached base packages:
>>>>>>>>>>>     # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>>>>
>>>>>>>>>>>     # loaded via a namespace (and not attached):
>>>>>>>>>>>     #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>>>>>>     #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>>>>>>     #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>>>>>>     # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>>>>>>     # [17] archive_0.0.0.9000
>>>>>>>>>>>
>>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> ______________________________________________
>>>>>>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------
>>>>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>>>> ---------------------------------------------------------------------------
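The thread turns on how `readLines()` treats embedded NUL bytes, so a tiny stand-in file may help anyone trying to narrow the problem down. This is a sketch only: the nine bytes below are made up for illustration and are nowhere near the multi-gigabyte line in the real file, so it shows the `skipNul` behaviour and a NUL count, not the segfault itself.

```r
# Sketch: build a small text file with one embedded NUL byte, count the NULs
# from the raw bytes, then read the file with skipNul = TRUE.
tf <- tempfile(fileext = ".txt")

# Bytes: "ab", NUL, "cd", newline, "ef", newline (illustrative content only).
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a, 0x65, 0x66, 0x0a)), tf)

# Count NUL bytes without going through readLines() at all.
bytes <- readBin(tf, what = "raw", n = file.size(tf))
n_nul <- sum(bytes == as.raw(0))

# skipNul = TRUE drops the NUL bytes and keeps the rest of each line.
clean <- readLines(tf, skipNul = TRUE)
```

Counting NULs from the raw bytes is a cheap way to confirm whether an extracted file is the corrupted variant before handing it to `readLines()`; for a very large file the bytes would need to be read in chunks.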