Re: [R] readLines without skipNul=TRUE causes crash

2017-07-18 Thread Martin Maechler
> Anthony Damico 
> on Sun, 16 Jul 2017 06:40:38 -0400 writes:

> hi, the text file that prompts the segfault is 4gb but only 80,937 lines
>> file.info( "S:/temp/crash.txt")
>                         size isdir mode               mtime               ctime               atime exe
> S:/temp/crash.txt 4078192743 FALSE  666 2017-07-15 17:24:35 2017-07-15 17:19:47 2017-07-15 17:19:47  no


> On Sun, Jul 16, 2017 at 6:34 AM, Duncan Murdoch 
> wrote:

>> On 16/07/2017 6:17 AM, Anthony Damico wrote:
>> 
>>> thank you for taking the time to write this.  i set it running last
>>> night and it's still going -- if it doesn't finish by tomorrow, i will
>>> try to find a site to host the problem file and add that link to the bug
>>> report so the archive package can be avoided at least.  i'm sorry for
>>> the bother
>>> 
>>> 
>> How big is that text file?  I wouldn't expect my script to take more than
>> a few minutes even on a huge file.
>> 
>> My script might have a bug...
>> 
>> Duncan Murdoch
>> 
>>> On Sat, Jul 15, 2017 at 4:14 PM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>> 
>>> On 15/07/2017 11:33 AM, Anthony Damico wrote:
>>> 
>>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not --
>>> pretty sure that means this is a base R bug?  submitted here:
>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>>>
>>> hopefully i am not doing something remarkably stupid.  the text file
>>> itself is 4GB so cannot upload it to bugzilla, and from the
>>> R_AllocStringBuffer error in the previous message, i think most or all
>>> of it needs to be there to trigger the segfault.  thanks!

In the meantime, discussion has continued on the Bugzilla bug tracker
(https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311), and
as you can read there, the bug is now fixed, thanks in part to an
initial patch proposal by Hannes Mühleisen.

Martin Maechler
ETH Zurich (and R Core)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-17 Thread Anthony Damico
awesome, thank you! looks like folks on bugzilla have also reproduced and
submitted a patch, so i am happy. thanks all


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-17 Thread William Dunlap via R-help
The original file had a lot of trailing null bytes so I tried making a
similar file with:

tf <- tempfile(); file <- file(tf, "wb")
for(i in 1:(2^15-1))writeBin(rep(as.raw(32:127), len=2^16), file)
for(i in 1:(2^15-1))writeBin(rep(as.raw(0L), len=2^16), file)
close(file)
log2(file.size(tf))
#[1] 31.6

Reading this with readLines() caused R-3.4.0 to segfault in Rf_con_pushback
with the same gdb traceback I saw when reading the original file.
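
A scaled-down sketch of the same idea, using only base R: it writes a little
text followed by trailing NUL bytes, counts the NULs, and reads the file back
with skipNul = TRUE. The file contents and the names con, n_nul and txt are
made up for illustration, and a file this small is not expected to reproduce
the segfault.

```r
# Write two text lines followed by 64 trailing NUL bytes.
tf <- tempfile()
con <- file(tf, "wb")
writeBin(charToRaw("first line\nsecond line\n"), con)
writeBin(as.raw(rep(0L, 64)), con)
close(con)

# Count the NUL bytes by reading the file as raw bytes.
bytes <- readBin(tf, what = "raw", n = file.size(tf))
n_nul <- sum(bytes == as.raw(0L))

# skipNul = TRUE tells readLines() to drop NUL bytes while reading.
txt <- readLines(tf, skipNul = TRUE, warn = FALSE)
unlink(tf)
```

On a file laid out like this, only the two text lines should come back, since
the trailing NULs are skipped rather than treated as line content.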


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Jul 15, 2017 at 4:28 PM, William Dunlap  wrote:

> I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
> debugging but gdb gives some information when I attach the debugger after
> the 'R..has stopped working' popup appears.  I don't know how reliable it
> is:
>
> (gdb) info threads
>   Id   Target Id         Frame
> * 4    Thread 11848.0x1500 0x7ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   3    Thread 11848.0x2e90 0x7ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   2    Thread 11848.0x3618 0x7ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
>   1    Thread 11848.0x1808 0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> (gdb) thread 1
> [Switching to thread 1 (Thread 11848.0x1808)]
> #0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> (gdb) where
> #0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #1  0x6c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #2  0x6c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #3  0x6c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #4  0x6c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #5  0x6c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #6  0x6c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #7  0x6c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #8  0x6c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #9  0x6c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #10 0x6c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
> #11 0x0040171c in ?? ()
> #12 0x0040155a in ?? ()
> #13 0x004013e8 in ?? ()
> #14 0x0040151b in ?? ()
> #15 0x7ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
> #16 0x7ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
> #17 0x in ?? ()
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> (gdb)
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-17 Thread Jeff Newmiller
I'll pass. Just because some non-CRAN "archive" package has bugs or your disk 
storage is flaky does not mean that any of dozens or hundreds of other 
compression tools (e.g. the built-in Windows "Send to compressed folder" pop-up 
menu) won't get it right, and we would know if it did fail because of the 
md5sum.
-- 
Sent from my phone. Please excuse my brevity.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-17 Thread Anthony Damico
hi, thanks again for taking the time.  since corrupted compression prompted
the segfault for me in the first place, i've just posted the text file
as-is.  it's a 2.4GB file so to be avoided on a metered internet
connection.  i've updated the bugzilla report at
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
relevant info.  these lines of code crash both windows R 3.4.1 and also
linux R 3.3.3 for me.  thanks again


# consider changing `tempfile()` to a permanent location
# so you don't lose the large downloaded file after the crash
tf <- tempfile()
download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" , tf , mode = 'wb' )
sessionInfo()
x <- readLines( tf )





Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Jeff Newmiller
I am stuck. The archive package won't compile for me on Ubuntu, and the 
CRANextra repo seems to be down so I cannot install packages on Windows right 
now. Perhaps you can zip the corrupt text file and put it online somewhere? 
Don't use the archive package to pack it since there seem to be issues with 
that tool on your machine. 

I would discourage you from harassing the Brazilian government about their RAR 
file because the RAR file seems fine (no NUL characters appear in the text 
file) when extracted using the file-roller archive tool on Ubuntu.
-- 
Sent from my phone. Please excuse my brevity.
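
One way to check a large extracted file for NUL characters without loading it
whole is to scan it in chunks. The sketch below uses only base R; the helper
name scan_bytes and its chunk_size argument are hypothetical, not something
used in this thread, and the newline count is only a rough stand-in for a
line count.

```r
# Hypothetical helper: count NUL bytes and newline bytes in a file,
# reading `chunk_size` bytes at a time so memory use stays bounded.
scan_bytes <- function(path, chunk_size = 1024L * 1024L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n_nul <- 0      # doubles, so totals above 2^31 - 1 do not overflow
  n_newline <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break            # end of file
    n_nul <- n_nul + sum(chunk == as.raw(0L))
    n_newline <- n_newline + sum(chunk == as.raw(10L))
  }
  c(nul = n_nul, newline = n_newline)
}

# Small self-check: two newlines followed by three NUL bytes.
tf <- tempfile()
writeBin(c(charToRaw("a\nb\n"), as.raw(c(0, 0, 0))), tf)
res <- scan_bytes(tf, chunk_size = 2L)
unlink(tf)
```

A non-zero nul count would suggest the extraction produced a file like the
corrupt one discussed here, and that skipNul = TRUE is worth trying.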


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, yep, there are two problems -- but i think only the segfault is within
the scope of a base R issue?  i need to look closer at the corrupted
decompression and figure out whether i should talk to the brazilian
government agency that creates that .rar file or open an issue with the
archive package maintainer.  my goal in this thread is only to figure out
how to replicate the goofy text file so the r team can turn it into an
error instead of a segfault.

the original example i sent stores the .txt file somewhere inside the
tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
gives the same result.  thanks again for looking at this
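
The checksum comparison described here can be sketched in base R roughly as
follows. The temporary files and the names src, dst, same_file and changed are
placeholders for illustration, not the actual paths from this thread.

```r
# Create a small file and copy it, as a stand-in for copying the large
# text file out of tempdir() to a permanent location.
src <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta"), src)
dst <- tempfile(fileext = ".txt")
file.copy(src, dst)

# md5sum() returns a character vector named by file path, so drop the
# names before comparing the two checksums.
sums <- unname(tools::md5sum(c(src, dst)))
same_file <- sums[1] == sums[2]

# Any change to the copy should change its checksum.
cat("gamma\n", file = dst, append = TRUE)
changed <- unname(tools::md5sum(dst)) != sums[1]
unlink(c(src, dst))
```

Matching checksums indicate the two paths hold the same bytes, which is what
lets others confirm they have reproduced the same problem file.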

> tools::md5sum(infile)

C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados
ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
"30beb57419486108e98d42ec7a2f8b19"


> tools::md5sum( "S:/temp/crash.txt" )
 S:/temp/crash.txt
"30beb57419486108e98d42ec7a2f8b19"




On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller 
wrote:

> So you are saying there are two problems... one that produces a corrupt
> file from a valid compressed file, and one that segfaults when presented
> with that corrupt file? Can you please confirm the file name and run md5sum
> on it and share the result so we can tell when the file problem has been
> reproduced?
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 3:21:21 AM PDT, Anthony Damico 
> wrote:
> >hi, thank you for attempting this. it looks like your unix machine unzipped
> >the txt file without corruption -- if you copied over the same txt file to
> >windows 7, i don't think that would reproduce the problem?  i think it
> >needs to be the corrupted text file where   R.utils::countLines( txtfile )
> >gives 809367.  i am able to reproduce on two distinct windows machines
> >but no guarantee i'm not doing something dumb
> >
> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller wrote:
> >
> >> I am not able to reproduce your segfault on a Windows 7 platform either:
> >>
> >> ##
> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
> >> sessionInfo()
> >> ## R version 3.4.1 (2017-06-30)
> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> >> ##
> >> ## Matrix products: default
> >> ##
> >> ## locale:
> >> ## [1] LC_COLLATE=English_United States.1252
> >> ## [2] LC_CTYPE=English_United States.1252
> >> ## [3] LC_MONETARY=English_United States.1252
> >> ## [4] LC_NUMERIC=C
> >> ## [5] LC_TIME=English_United States.1252
> >> ##
> >> ## attached base packages:
> >> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >> ##
> >> ## loaded via a namespace (and not attached):
> >> ## [1] compiler_3.4.1
> >> tools::md5sum( fn1 )
> >> ## d:/DADOS_ENEM_2009.txt
> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> dat <- readLines( fn1 )
> >> length( dat )
> >> ## [1] 4148721
> >>
> >>
> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
> >>
> >> I am not able to reproduce this on a Linux platform:
> >>>
> >>> ###3
> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> >>> sessionInfo()
> >>> ## R version 3.4.1 (2017-06-30)
> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
> >>> ## Running under: Ubuntu 14.04.5 LTS
> >>> ##
> >>> ## Matrix products: default
> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> >>> ##
> >>> ## locale:
> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>> ##
> >>> ## attached base packages:
> >>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >>> ##
> >>> ## loaded via a namespace (and not attached):
> >>> ## [1] compiler_3.4.1
> >>> tools::md5sum( fn1 )
> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >>> dat <- readLines( fn1 )
> >>> length( dat )
> >>> ## [1] 4148721
> >>>
> >>> No segfault occurs.
> >>>

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Jeff Newmiller
So you are saying there are two problems... one that produces a corrupt file 
from a valid compressed file, and one that segfaults when presented with that 
corrupt file? Can you please confirm the file name and run md5sum on it and 
share the result so we can tell when the file problem has been reproduced?
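
One way to share such a fingerprint is sketched below. It is a minimal illustration, not something posted in the thread: `file_fingerprint` is a hypothetical helper name, and the two temporary files stand in for the copies on two different machines.

```r
# Hypothetical helper (not from the thread): summarise a file so that two
# people can check whether they are looking at the same bytes.
file_fingerprint <- function(path) {
  data.frame(
    file  = basename(path),
    bytes = file.size(path),
    md5   = unname(tools::md5sum(path)),
    stringsAsFactors = FALSE
  )
}

# Demonstration on two small temporary files with identical contents.
a <- tempfile(fileext = ".txt")
b <- tempfile(fileext = ".txt")
writeBin(charToRaw("line one\nline two\n"), a)
writeBin(charToRaw("line one\nline two\n"), b)

fa <- file_fingerprint(a)
fb <- file_fingerprint(b)

# Same size and same checksum means the two copies match byte for byte
# (up to the usual md5 caveats).
same_file <- fa$bytes == fb$bytes && fa$md5 == fb$md5
print(fa)
print(same_file)
```

Comparing the byte count as well as the checksum makes it easy to see at a glance whether an extraction produced a file of a different length.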
-- 
Sent from my phone. Please excuse my brevity.

On July 16, 2017 3:21:21 AM PDT, Anthony Damico  wrote:
>hi, thank you for attempting this. it looks like your unix machine unzipped
>the txt file without corruption -- if you copied over the same txt file to
>windows 7, i don't think that would reproduce the problem?  i think it
>needs to be the corrupted text file where   R.utils::countLines( txtfile )
>gives 809367.  i am able to reproduce on two distinct windows machines
>but no guarantee i'm not doing something dumb

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, the text file that prompts the segfault is 4gb but only 80,937 lines

> file.info( "S:/temp/crash.txt")
                        size isdir mode               mtime               ctime               atime exe
S:/temp/crash.txt 4078192743 FALSE  666 2017-07-15 17:24:35 2017-07-15 17:19:47 2017-07-15 17:19:47  no




On Sun, Jul 16, 2017 at 6:34 AM, Duncan Murdoch wrote:

> On 16/07/2017 6:17 AM, Anthony Damico wrote:
>
>> thank you for taking the time to write this.  i set it running last
>> night and it's still going -- if it doesn't finish by tomorrow, i will
>> try to find a site to host the problem file and add that link to the bug
>> report so the archive package can be avoided at least.  i'm sorry for
>> the bother
>>
>>
> How big is that text file?  I wouldn't expect my script to take more than
> a few minutes even on a huge file.
>
> My script might have a bug...
>
> Duncan Murdoch

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Duncan Murdoch

On 16/07/2017 6:17 AM, Anthony Damico wrote:

> thank you for taking the time to write this.  i set it running last
> night and it's still going -- if it doesn't finish by tomorrow, i will
> try to find a site to host the problem file and add that link to the bug
> report so the archive package can be avoided at least.  i'm sorry for
> the bother



How big is that text file?  I wouldn't expect my script to take more 
than a few minutes even on a huge file.


My script might have a bug...

Duncan Murdoch




Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
sorry, typo, 80937 not 809367

On Sun, Jul 16, 2017 at 6:21 AM, Anthony Damico  wrote:

> hi, thank you for attempting this. it looks like your unix machine
> unzipped the txt file without corruption -- if you copied over the same txt
> file to windows 7, i don't think that would reproduce the problem?  i think
> it needs to be the corrupted text file where   R.utils::countLines( txtfile
> )   gives 809367.  i am able to reproduce on two distinct windows machines
> but no guarantee i'm not doing something dumb

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, thank you for attempting this. it looks like your unix machine unzipped
the txt file without corruption -- if you copied over the same txt file to
windows 7, i don't think that would reproduce the problem?  i think it
needs to be the corrupted text file where   R.utils::countLines( txtfile
)   gives 809367.  i am able to reproduce on two distinct windows machines
but no guarantee i'm not doing something dumb


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
thank you for taking the time to write this.  i set it running last night
and it's still going -- if it doesn't finish by tomorrow, i will try to
find a site to host the problem file and add that link to the bug report so
the archive package can be avoided at least.  i'm sorry for the bother



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread William Dunlap via R-help
I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
debugging but gdb gives some information when I attach the debugger after
the 'R..has stopped working' popup appears.  I don't know how reliable it
is:

(gdb) info threads
  Id   Target Id           Frame
* 4    Thread 11848.0x1500 0x7ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  3    Thread 11848.0x2e90 0x7ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  2    Thread 11848.0x3618 0x7ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  1    Thread 11848.0x1808 0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) thread 1
[Switching to thread 1 (Thread 11848.0x1808)]
#0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) where
#0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#1  0x6c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#2  0x6c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#3  0x6c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#4  0x6c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#5  0x6c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#6  0x6c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#7  0x6c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#8  0x6c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#9  0x6c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#10 0x6c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#11 0x0040171c in ?? ()
#12 0x0040155a in ?? ()
#13 0x004013e8 in ?? ()
#14 0x0040151b in ?? ()
#15 0x7ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
#16 0x7ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#17 0x in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Jeff Newmiller

I am not able to reproduce your segfault on a Windows 7 platform either:

##
fn1 <- "d:/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics  grDevices utils datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## d:/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721


On Sat, 15 Jul 2017, Jeff Newmiller wrote:


I am not able to reproduce this on a Linux platform:

###3
fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721

No segfault occurs.

On Sat, 15 Jul 2017, Anthony Damico wrote:


hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico  
wrote:



hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?


the lodown package can be removed from my example..  it is just for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`.  my current example requires a huge download, so sort of
painful to repeat but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

   > file.size(infile)
[1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

   > w <- file( infile , 'r' )
   >
   > first_80936_lines <- readLines( w , n = 80936 )
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
   > scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'



making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something to do with it--
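
For anyone who wants to see how embedded nuls interact with readLines on a small scale, the following is a minimal sketch. It is an illustration added here, not code from the thread, and a file this small is not expected to reproduce the segfault, which seems to need a very long line as well. The file contents are made up.

```r
# Write a tiny file: "abc", a nul byte, "def", a newline, then a second line.
f <- tempfile()
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\nsecond line\n")), f)

# Default behaviour: R warns that line 1 appears to contain an embedded nul.
# The warning is suppressed here only to keep the demonstration quiet.
default_read <- suppressWarnings(readLines(f))

# With skipNul = TRUE the nul bytes are dropped while reading.
skipped_read <- readLines(f, skipNul = TRUE)

print(default_read)
print(skipped_read)
```

The larger file needed to probe the crash could be built the same way with writeBin, writing long runs of ordinary bytes with a few nul bytes mixed in.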


   

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 11:33 AM, Anthony Damico wrote:

> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!


I don't want to download the big file or install the archive package. 
Could you run the code below on the bad file?  If you're right and it's 
only nulls that matter, this might allow me to create a file that 
triggers the bug.


f <-  # put the filename of the bad file here

con <- file(f, open="rb")
zeros <- numeric()
count <- 0  # running total of bytes read so far
repeat {
  bytes <- readBin(con, "int", 100, size=1)
  zeros <- c(zeros, count + which(bytes == 0))
  count <- count + length(bytes)
  if (length(bytes) < 100) break
}
close(con)
cat("File length=", count, "\n")
cat("Nulls:\n")
zeros
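
A possibly faster variant of the scan above, as a sketch only: the loop reads 100 bytes per call and grows `zeros` on every pass, which may be slow on a 4GB file. The version below reads larger raw chunks and combines the positions once at the end. The function name `find_nuls` and the `chunk_size` argument are made up for illustration and are not part of the thread.

```r
# Hypothetical helper (not from the thread): report the file length and the
# 1-based positions of all nul bytes, reading the file in large raw chunks.
find_nuls <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offset <- 0      # bytes consumed before the current chunk (double, so > 2^31 is fine)
  hits <- list()   # per-chunk positions, combined once at the end
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    hits[[length(hits) + 1]] <- offset + which(bytes == as.raw(0))
    offset <- offset + length(bytes)
  }
  list(size = offset, nuls = as.numeric(unlist(hits)))
}
```

With this, `res <- find_nuls(f)` would give `res$size` and `res$nuls` in place of `count` and `zeros`. If most of the file turns out to be nuls, the position vector itself will be large.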

Here's some code to recreate a file of the same length with nulls in the 
same places, and spaces everywhere else:


size <- count
f2 <- tempfile()
con <- file(f2, open="wb")
count <- 0
while (count < size) {
  nonzeros <- min(c(size - count, 100, zeros - 1))
  if (nonzeros) {
    writeBin(rep(32L, nonzeros), con, size = 1)
    count <- count + nonzeros
  }
  zeros <- zeros - nonzeros
  if (length(zeros) && min(zeros) == 1) {
    writeBin(0L, con, size = 1)
    count <- count + 1
    zeros <- zeros[-1] - 1
  }
}
close(con)

Duncan Murdoch

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Jeff Newmiller

I am not able to reproduce this on a Linux platform:

fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
##                                                "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721

No segfault occurs.
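
To rule out a difference in the extracted file itself, both machines could compare a size and checksum. The following is only a sketch: `file_fingerprint` is a hypothetical helper name, while `file.size()` and `tools::md5sum()` are the existing functions already used in this thread.

```r
# Hypothetical helper (not from the thread): summarise a file so that two
# people can check whether they are reading byte-identical input.
file_fingerprint <- function(path) {
  data.frame(
    size_bytes = file.size(path),
    md5 = unname(tools::md5sum(path)),
    stringsAsFactors = FALSE
  )
}
```

Running `file_fingerprint( fn1 )` on each machine and comparing the two rows would show whether the extraction step produced the same bytes in both places.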

On Sat, 15 Jul 2017, Anthony Damico wrote:


> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not -- pretty
> sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
> not doing something remarkably stupid.  the text file itself is 4GB so
> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
> previous message, i think most or all of it needs to be there to trigger
> the segfault.  thanks!



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 11:33 AM, Anthony Damico wrote:

> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!


Hopefully someone can debug it with the info you provided.

Duncan Murdoch




Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?
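
For what it's worth, a tiny file with an embedded nul can be built with `writeBin()`. This is only a sketch of the default versus `skipNul = TRUE` behaviour on a few bytes; it is not expected to reproduce the segfault, which seems to need the very long line as well. The variable names are made up for illustration.

```r
# Write six raw bytes: "ab", a nul, "cd", then a newline.
small_file <- tempfile()
writeBin(as.raw(c(0x61, 0x62, 0x00, 0x63, 0x64, 0x0a)), small_file)

# Default: the embedded nul terminates the line being read and R warns
# that the line appears to contain an embedded nul.
truncated <- suppressWarnings(readLines(small_file))

# skipNul = TRUE: the nul byte is skipped and the rest of the line is kept.
skipped <- readLines(small_file, skipNul = TRUE)

print(nchar(truncated))
print(nchar(skipped))
```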


the lodown package can be removed from my example..  it is just for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`  my current example requires a huge download, so sort of
painful to repeat but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

> file.size(infile)
[1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

> file_con <- file( infile , 'r' )
>
> first_80936_lines <- readLines( file_con , n = 80936 )
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
> scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'



making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something to do with it--


# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 10 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )
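
The snippet above builds the whole line in memory with `paste()` before writing it. A sketch of an alternative that streams one long line to disk in chunks, and can plant nul bytes at chosen positions, is below. `write_long_line` and its arguments are hypothetical names, and whether a file made this way triggers the crash is untested.

```r
# Hypothetical helper (not from the thread): write a single line of n_bytes
# "a" characters, with nul bytes at the 1-based positions in nul_at,
# followed by a newline, without holding the whole line in memory.
write_long_line <- function(path, n_bytes, nul_at = numeric(0), chunk = 1e6) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  written <- 0
  while (written < n_bytes) {
    n <- min(chunk, n_bytes - written)
    bytes <- rep(as.raw(0x61), n)   # a chunk of "a" bytes
    # positions of requested nuls that fall inside this chunk
    hit <- nul_at[nul_at > written & nul_at <= written + n] - written
    bytes[hit] <- as.raw(0)
    writeBin(bytes, con)
    written <- written + n
  }
  writeBin(as.raw(0x0a), con)       # terminating newline
  invisible(path)
}
```

For example, `write_long_line( tf , 4e9 , nul_at = 2e9 )` would write roughly 4GB on one line with a single nul in the middle, given enough disk space.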




Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 7:35 AM, Anthony Damico wrote:

> hello, the last line of the code below causes a segfault for me on 3.4.1.
> i think i should submit to https://bugs.r-project.org/  unless others have
> advice?  thanks


Segfaults are usually worth reporting as bugs.  Try to come up with a 
self-contained example, not using the lodown and archive packages.  I 
imagine you can do this by uploading the file you downloaded, or enough 
of a subset of it to trigger the segfault.  If you can't do that, then 
likely the bug is with one of those packages, not with R.


Duncan Murdoch









[R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hello, the last line of the code below causes a segfault for me on 3.4.1.
i think i should submit to https://bugs.r-project.org/  unless others have
advice?  thanks





install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")


file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download!  cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

archive::archive_extract( tf , dir = normalizePath( file_folder ) )

unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )

infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

# works
R.utils::countLines( infile )

# works with warning
my_file <- readLines( infile , skipNul = TRUE )

# crash
my_file <- readLines( infile )


# run just before crash
sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 15063)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base

# loaded via a namespace (and not attached):
#  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
#  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
#  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
# [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
# [17] archive_0.0.0.9000
