[R] About doing figures

2017-07-15 Thread lily li
Hi R users,

I still have a problem with plotting. I want to put the datasets on one
figure, where the x-axis represents values of B, the y-axis represents
values of C, and different colors label column A. Each record is drawn as
a circle: hollow circles represent DF=1 and solid circles represent DF=2.
My code is below, but the A labels do not correspond to the true records,
and I don't know what the problem is. Thanks for your help.

dfm
dfm1= subset(dfm, DF==1)
dfm2= subset(dfm, DF==2)
plot(c(15:30),seq(from=0,to=60,by=4),pch=19,col=NULL,xlab='Value B',ylab='Value C')
Color = as.factor(dfm1$A)
colordist = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(),
invert = T)] # for unique colors
Color.unq = sample(colordist,length(Color))

points(dfm1[,3],dfm1[,4],col=Color.unq,pch=1)
points(dfm2[,3],dfm2[,4],col=Color.unq,pch=19)
legend('bottom',as.character(Color.unq),col=Color.unq,lwd=rep(2,length(Color.unq)),cex=.6,ncol=5)
legend('bottom',as.character(Color),col=Color.unq,lwd=3,cex=.6,ncol=5,text.width=c(9.55,9.6,9.55))

dfm is the dataframe below.

DF   A  B  C
1 65 21 54
1 66 23 55
1 54 24 56
1 44 23 53
1 67 22 52
1 66 21 50
1 45 20 51
1 56 19 57
1 40 25 58
1 39 24 53
2 65 25 52
2 66 20 50
2 54 21 48
2 44 30 49
2 67 27 50
2 66 20 30
2 45 25 56
2 56 14 51
2 40 29 48
2 39 29 23
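
One possible way to keep the colors tied to the values of A is sketched
below. It assumes dfm has exactly the columns DF, A, B, C shown above, and
it builds a single palette indexed by the distinct values of A so that both
DF subsets share the same mapping. The palette (rainbow) and the names lev,
pal, row_col and row_pch are illustrative choices, not part of the original
post.

```r
# Rebuild the example data from the post
dfm <- data.frame(
  DF = rep(1:2, each = 10),
  A  = rep(c(65, 66, 54, 44, 67, 66, 45, 56, 40, 39), 2),
  B  = c(21, 23, 24, 23, 22, 21, 20, 19, 25, 24,
         25, 20, 21, 30, 27, 20, 25, 14, 29, 29),
  C  = c(54, 55, 56, 53, 52, 50, 51, 57, 58, 53,
         52, 50, 48, 49, 50, 30, 56, 51, 48, 23)
)

# One color per distinct value of A (assumption: any distinct palette will do)
lev <- sort(unique(dfm$A))
pal <- setNames(rainbow(length(lev)), lev)

# Look up each row's color by its A value;
# hollow circle (pch 1) for DF=1, solid circle (pch 19) for DF=2
row_col <- pal[as.character(dfm$A)]
row_pch <- ifelse(dfm$DF == 1, 1, 19)

plot(dfm$B, dfm$C, col = row_col, pch = row_pch,
     xlab = "Value B", ylab = "Value C")
legend("bottomleft", legend = lev, col = pal, pch = 19,
       cex = 0.6, ncol = 3, title = "A")
```

Because the legend is built from the same named vector that colors the
points, a legend entry and the points with that A value cannot drift apart.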


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread William Dunlap via R-help
I see the problem on Windows 10, R-3.4.0, R.exe.  It is not compiled for
debugging but gdb gives some information when I attach the debugger after
the 'R..has stopped working' popup appears.  I don't know how reliable it
is:

(gdb) info threads
  Id   Target Id         Frame
* 4    Thread 11848.0x1500 0x7ffe38dc8861 in ntdll!DbgBreakPoint () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  3    Thread 11848.0x2e90 0x7ffe38dc87e4 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  2    Thread 11848.0x3618 0x7ffe38dc5154 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
  1    Thread 11848.0x1808 0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) thread 1
[Switching to thread 1 (Thread 11848.0x1808)]
#0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
(gdb) where
#0  0x6c77de3b in Rf_con_pushback () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#1  0x6c7d8919 in R_initAssignSymbols () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#2  0x6c7ef961 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#3  0x6c7f1b70 in R_cmpfun1 () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#4  0x6c7f1ef2 in Rf_applyClosure () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#5  0x6c7efaf7 in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#6  0x6c7f3816 in R_execMethod () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#7  0x6c7efcdf in Rf_eval () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#8  0x6c81053c in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#9  0x6c810902 in Rf_ReplIteration () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#10 0x6c810992 in run_Rmainloop () from /cygdrive/c/R/R-3.4.0/bin/x64/R.dll
#11 0x0040171c in ?? ()
#12 0x0040155a in ?? ()
#13 0x004013e8 in ?? ()
#14 0x0040151b in ?? ()
#15 0x7ffe37868102 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/WINDOWS/system32/KERNEL32.DLL
#16 0x7ffe38d7c5b4 in ntdll!RtlUserThreadStart () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#17 0x in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Jul 15, 2017 at 3:29 PM, Jeff Newmiller 
wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ## d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
> I am not able to reproduce this on a Linux platform:
>>
>> ###3
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>
>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>
>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not -- pretty
>>> sure that means this is a base R bug?  submitted here:
>>> 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Jeff Newmiller

I am not able to reproduce your segfault on a Windows 7 platform either:

##
fn1 <- "d:/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## d:/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721


On Sat, 15 Jul 2017, Jeff Newmiller wrote:


I am not able to reproduce this on a Linux platform:

###3
fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721

No segfault occurs.

On Sat, 15 Jul 2017, Anthony Damico wrote:


hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico  
wrote:



hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?


the lodown package can be removed from my example..  it is just for file
download cacheing, so `lodown::cachaca` can be replaced with
`download.file`  my current example requires a huge download, so sort of
painful to repeat but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

   > file.size(infile)
[1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

   > w <- file( infile , 'r' )
   >
   > first_80936_lines <- readLines( w , n = 80936 )
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
   > scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'



making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something 

Re: [R] select from data frame

2017-07-15 Thread Bert Gunter
...
and here is a slightly cleaner and more transparent way of doing the
same thing (setdiff() does the matching)

> with(df, setdiff(ID,ID[samples %in% c("B","C") ]))
[1] 3
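
A self-contained sketch comparing the two expressions from this thread is
given here, using the data frame from the original question. The
intermediate names excluded, by_index and by_setdiff are illustrative only.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                 samples = c("A", "B", "C", "A", "C", "A", "D",
                             "C", "B", "A", "C"))

# IDs that occur with "B" or "C" on at least one row
excluded <- unique(df$ID[df$samples %in% c("B", "C")])

# Logical-index version: returns one entry per remaining row,
# so unique() is needed to collapse the repeated 3
by_index <- unique(df$ID[!df$ID %in% excluded])

# Set-difference version: setdiff() already drops duplicates
by_setdiff <- setdiff(df$ID, excluded)

stopifnot(identical(by_index, by_setdiff))
by_setdiff
```

Both forms give the single value 3 for this data; setdiff() simply folds
the de-duplication into the same call.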

-- Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Jul 15, 2017 at 9:23 AM, Bert Gunter  wrote:
> If I understand correctly, no looping (ave(), for()) or type casting
> (as.character()) is needed -- indexing and matching suffice:
>
>> with(df, ID[!ID %in% unique(ID[samples %in% c("B","C") ])])
> [1] 3 3
>
>
>
> Cheers,
>
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sat, Jul 15, 2017 at 8:54 AM, David Winsemius  
> wrote:
>>
>>> On Jul 15, 2017, at 4:01 AM, Andras Farkas via R-help 
>>>  wrote:
>>>
>>> Dear All,
>>>
>>> wonder if you could please assist with the following
>>>
>>> df<-data.frame(ID=c(1,1,1,2,2,3,3,4,4,5,5),samples=c("A","B","C","A","C","A","D","C","B","A","C"))
>>>
>>> from this data frame the goal is to extract the value of 3 from the ID 
>>> column based on the logic that the ID=3 in the data frame has NO row that 
>>> would pair 3 with either "B", AND/OR "C" in the samples column...
>>>
>>
>> This returns a vector that determines if either of those characters are in 
>> the character values of that factor column you created. Coercing to 
>> character is needed because leaving samples as a factor generated an invalid 
>> factor level warning and gave useless results.
>>
>>  with( df, ave( as.character(samples), ID, FUN=function(x) {!any(x %in% 
>> c("B","C"))}))
>>  [1] "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
>> [10] "FALSE" "FALSE"
>>
>> You can then use it to extract and consolidate to a single value (although 
>> wrapping with as.logical was needed because `ave` returned character class 
>> values):
>>
>>  unique( df$ID[ as.logical(   # fails without this since "FALSE" != FALSE
>>            with( df,
>>                  ave( as.character(samples), ID,
>>                       FUN=function(x) {!any(x %in% c("B","C"))}))
>>          ) ] )
>> #[1] 3
>>
>> The same sort of logic could also be constructed with a for-loop:
>>
>>> for (x in unique(df$ID) ) { if ( !any( df$samples[df$ID==x] %in% c("B","C")) ) print(x) }
>> [1] 3
>>
>> Although you are warned that for-loops do not return values and you might 
>> need to make an assignment rather than just printing.
>>
>> --
>>
>> David Winsemius
>> Alameda, CA, USA
>>


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 11:33 AM, Anthony Damico wrote:

hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not --
pretty sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
am not doing something remarkably stupid.  the text file itself is 4GB
so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
in the previous message, i think most or all of it needs to be there to
trigger the segfault.  thanks!


I don't want to download the big file or install the archive package. 
Could you run the code below on the bad file?  If you're right and it's 
only nulls that matter, this might allow me to create a file that 
triggers the bug.


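f <-  # put the filename of the bad file here

con <- file(f, open="rb")
zeros <- numeric()
count <- 0  # bytes read so far; must be initialised before the loop
repeat {
  # read up to 100 single-byte integers per iteration
  bytes <- readBin(con, "int", 100, size=1)
  # record the absolute (1-based) positions of the nul bytes
  zeros <- c(zeros, count + which(bytes == 0))
  count <- count + length(bytes)
  if (length(bytes) < 100) break
}
close(con)
cat("File length=", count, "\n")
cat("Nulls:\n")
zeros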

Here's some code to recreate a file of the same length with nulls in the 
same places, and spaces everywhere else:


size <- count
f2 <- tempfile()
con <- file(f2, open="wb")
count <- 0
while (count < size) {
  nonzeros <- min(c(size - count, 100, zeros - 1))
  if (nonzeros) {
    writeBin(rep(32L, nonzeros), con, size = 1)
    count <- count + nonzeros
  }
  zeros <- zeros - nonzeros
  if (length(zeros) && min(zeros) == 1) {
    writeBin(0L, con, size = 1)
    count <- count + 1
    zeros <- zeros[-1] - 1
  }
}
close(con)
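
The two steps above can be tried end to end on a tiny stand-in file. The
following is only a sketch: the 10-byte content, the chunk size of 4, and
the helper name find_nuls are made up for illustration, and it reads raw
bytes rather than one-byte integers. The rebuild step is vectorised, so it
suits only files that fit in memory, unlike the chunked loop above.

```r
# Tiny stand-in for the bad file: 10 bytes with nuls at positions 3 and 7
f <- tempfile()
writeBin(as.raw(c(65, 66, 0, 67, 68, 69, 0, 70, 71, 10)), f)

# Return the total length and the 1-based positions of nul bytes,
# reading the file in fixed-size chunks
find_nuls <- function(path, chunk = 4L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  zeros <- numeric()
  count <- 0
  repeat {
    bytes <- readBin(con, "raw", chunk)
    zeros <- c(zeros, count + which(bytes == as.raw(0)))
    count <- count + length(bytes)
    if (length(bytes) < chunk) break
  }
  list(size = count, zeros = zeros)
}

info <- find_nuls(f)

# Recreate a same-length file: nuls in the same places, spaces elsewhere
f2 <- tempfile()
filler <- rep(as.raw(32), info$size)
filler[info$zeros] <- as.raw(0)
writeBin(filler, f2)

# The rebuilt file should report the same length and nul positions
stopifnot(identical(find_nuls(f2), info))
```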

Duncan Murdoch



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Jeff Newmiller

I am not able to reproduce this on a Linux platform:

###3
fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721

No segfault occurs.

On Sat, 15 Jul 2017, Anthony Damico wrote:


hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico  wrote:


hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?


the lodown package can be removed from my example..  it is just for file
download cacheing, so `lodown::cachaca` can be replaced with
`download.file`  my current example requires a huge download, so sort of
painful to repeat but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

   > file.size(infile)
[1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

   > w <- file( infile , 'r' )
   >
   > first_80936_lines <- readLines( w , n = 80936 )
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
   > scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
   > scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'



making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something to do with it--


# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 10 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )
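
For reference, the documented effect of skipNul can be seen on a very small
file. This sketch does not reproduce the segfault and makes no claim about
its cause; the four-byte content and the names default_read and
skipped_read are made up for illustration.

```r
tf <- tempfile()
# bytes for "a", nul, "b", newline: one short line with an embedded nul
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a)), tf)

# Default: the nul terminates the line being read (R also warns;
# the warning is silenced here)
default_read <- suppressWarnings(readLines(tf))

# skipNul = TRUE drops the nul and keeps the rest of the line
skipped_read <- readLines(tf, skipNul = TRUE)

default_read   # "a"
skipped_read   # "ab"
```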



On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch 
wrote:


On 15/07/2017 7:35 AM, Anthony Damico wrote:


hello, the last line of the code below causes a segfault for me on 3.4.1.
i think i should submit to https://bugs.r-project.org/  unless others
have
advice?  thanks



Segfaults are usually worth reporting as bugs.  Try to come up with a
self-contained example, not using the lodown and archive packages.  I
imagine you can do this by uploading the file you downloaded, or enough of
a subset of it to trigger the segfault.  If you can't do that, then likely
the bug is with one of those packages, not with R.

Duncan Murdoch








Re: [R] select from data frame

2017-07-15 Thread Bert Gunter
If I understand correctly, no looping (ave(), for()) or type casting
(as.character()) is needed -- indexing and matching suffice:

> with(df, ID[!ID %in% unique(ID[samples %in% c("B","C") ])])
[1] 3 3



Cheers,

Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Jul 15, 2017 at 8:54 AM, David Winsemius  wrote:
>
>> On Jul 15, 2017, at 4:01 AM, Andras Farkas via R-help  
>> wrote:
>>
>> Dear All,
>>
>> wonder if you could please assist with the following
>>
>> df<-data.frame(ID=c(1,1,1,2,2,3,3,4,4,5,5),samples=c("A","B","C","A","C","A","D","C","B","A","C"))
>>
>> from this data frame the goal is to extract the value of 3 from the ID 
>> column based on the logic that the ID=3 in the data frame has NO row that 
>> would pair 3 with either "B", AND/OR "C" in the samples column...
>>
>
> This returns a vector that determines if either of those characters are in 
> the character values of that factor column you created. Coercing to character 
> is needed because leaving samples as a factor generated an invalid factor 
> level warning and gave useless results.
>
>  with( df, ave( as.character(samples), ID, FUN=function(x) {!any(x %in% 
> c("B","C"))}))
>  [1] "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
> [10] "FALSE" "FALSE"
>
> You can then use it to extract and consolidate to a single value (although 
> wrapping with as.logical was needed because `ave` returned character class 
> values):
>
>  unique( df$ID[ as.logical(   # fails without this since "FALSE" != FALSE
>            with( df,
>                  ave( as.character(samples), ID,
>                       FUN=function(x) {!any(x %in% c("B","C"))}))
>          ) ] )
> #[1] 3
>
> The same sort of logic could also be constructed with a for-loop:
>
>> for (x in unique(df$ID) ) { if ( !any( df$samples[df$ID==x] %in% c("B","C")) ) print(x) }
> [1] 3
>
> Although you are warned that for-loops do not return values and you might 
> need to make an assignment rather than just printing.
>
> --
>
> David Winsemius
> Alameda, CA, USA
>


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 11:33 AM, Anthony Damico wrote:

hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not --
pretty sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
am not doing something remarkably stupid.  the text file itself is 4GB
so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
in the previous message, i think most or all of it needs to be there to
trigger the segfault.  thanks!


Hopefully someone can debug it with the info you provided.

Duncan Murdoch



On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico wrote:

hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?
i believe the segfault occurs because there's a single line with 4GB
and also embedded nuls, but i am not sure how to artificially
construct that?


the lodown package can be removed from my example..  it is just for
file download cacheing, so `lodown::cachaca` can be replaced with
`download.file`  my current example requires a huge download, so
sort of painful to repeat but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt)
.RAR file and creates a text file with 80,937 lines.  this file is 4GB:

> file.size(infile)
[1] 4078192743 


i am pretty sure that nearly all of that 4GB is contained on a
single line in the file.  here's what happens when i create a file
connection and scan through..

> w <- file( infile , 'r' )
>
> first_80936_lines <- readLines( w , n = 80936 )
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
> scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'



making a huge single-line file does not reproduce the problem, i
think the embedded nuls have something to do with it--


# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 10 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )



On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch wrote:

On 15/07/2017 7:35 AM, Anthony Damico wrote:

hello, the last line of the code below causes a segfault for
me on 3.4.1.
i think i should submit to https://bugs.r-project.org/
unless others have
advice?  thanks


Segfaults are usually worth reporting as bugs.  Try to come up
with a self-contained example, not using the lodown and archive
packages.  I imagine you can do this by uploading the file you
downloaded, or enough of a subset of it to trigger the
segfault.  If you can't do that, then likely the bug is with one
of those packages, not with R.

Duncan Murdoch






install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")


file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download!  cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )

archive::archive_extract( tf , dir = 

Re: [R] select from data frame

2017-07-15 Thread David Winsemius

> On Jul 15, 2017, at 4:01 AM, Andras Farkas via R-help  
> wrote:
> 
> Dear All,
> 
> wonder if you could please assist with the following 
> 
> df<-data.frame(ID=c(1,1,1,2,2,3,3,4,4,5,5),samples=c("A","B","C","A","C","A","D","C","B","A","C"))
> 
> from this data frame the goal is to extract the value of 3 from the ID column 
> based on the logic that the ID=3 in the data frame has NO row that would pair 
> 3 with either "B", AND/OR "C" in the samples column...
> 

This returns a vector that determines if either of those characters are in the 
character values of that factor column you created. Coercing to character is 
needed because leaving samples as a factor generated an invalid factor level 
warning and gave useless results.

 with( df, ave( as.character(samples), ID, FUN=function(x) {!any(x %in% 
c("B","C"))}))
 [1] "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
[10] "FALSE" "FALSE"

You can then use it to extract and consolidate to a single value (although 
wrapping with as.logical was needed because `ave` returned character class 
values):

 unique( df$ID[ as.logical(   # fails without this since "FALSE" != FALSE
           with( df,
                 ave( as.character(samples), ID,
                      FUN=function(x) {!any(x %in% c("B","C"))}))
         ) ] )
#[1] 3

The same sort of logic could also be constructed with a for-loop:

> for (x in unique(df$ID) ) { if ( !any( df$samples[df$ID==x] %in% c("B","C")) ) print(x) }
[1] 3

Although you are warned that for-loops do not return values and you might need 
to make an assignment rather than just printing.
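
The remark about assignment can be sketched as follows: the loop appends
each qualifying ID to a vector instead of printing it. The accumulator name
keep is hypothetical, and the data frame is the one from the question.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                 samples = c("A", "B", "C", "A", "C", "A", "D",
                             "C", "B", "A", "C"))

keep <- c()  # collects IDs that never pair with "B" or "C"
for (x in unique(df$ID)) {
  if (!any(df$samples[df$ID == x] %in% c("B", "C"))) {
    keep <- c(keep, x)  # assign instead of print()
  }
}
keep
```

After the loop, keep holds the result and can be used in later
computations, which print() alone does not allow.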

-- 

David Winsemius
Alameda, CA, USA



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!


On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico  wrote:

> hi, thanks Dr. Murdoch
>
>
> i'd appreciate if anyone on r-help could help me narrow this down?  i
> believe the segfault occurs because there's a single line with 4GB and also
> embedded nuls, but i am not sure how to artificially construct that?
>
>
> the lodown package can be removed from my example..  it is just for file
> download cacheing, so `lodown::cachaca` can be replaced with
> `download.file`  my current example requires a huge download, so sort of
> painful to repeat but i'm pretty confident that's not the issue.
>
>
> the archive::archive_extract() function unzips a (probably corrupt) .RAR
> file and creates a text file with 80,937 lines.  this file is 4GB:
>
> > file.size(infile)
> [1] 4078192743
>
>
> i am pretty sure that nearly all of that 4GB is contained on a single line
> in the file.  here's what happens when i create a file connection and scan
> through..
>
> > w <- file( infile , 'r' )
> >
> > first_80936_lines <- readLines( w , n = 80936 )
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "123930632009"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "36F2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA11"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "0"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "00"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "467.20"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "346.10"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "414.40"
> > scan( w , n = 1 , what = character() )
> Error in scan(w, n = 1, what = character()) :
>   could not allocate memory (2048 Mb) in C function
> 'R_AllocStringBuffer'
>
>
>
> making a huge single-line file does not reproduce the problem, i think the
> embedded nuls have something to do with it--
>
>
> # WARNING do not run with less than 64GB RAM
> tf <- tempfile()
> a <- rep( "a" , 10 )
> b <- paste( a , collapse = '' )
> writeLines( b , tf ) ; rm( b ) ; gc()
> d <- readLines( tf )
>
>
>
> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch 
> wrote:
>
>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>
>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>> i think i should submit to https://bugs.r-project.org/  unless others
>>> have
>>> advice?  thanks
>>>
>>
>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>> self-contained example, not using the lodown and archive packages.  I
>> imagine you can do this by uploading the file you downloaded, or enough of
>> a subset of it to trigger the segfault.  If you can't do that, then likely
>> the bug is with one of those packages, not with R.
>>
>> Duncan Murdoch
>>
>>
>>>
>>>
>>>
>>>
>>> install.packages( "devtools" )
>>> devtools::install_github("ajdamico/lodown")
>>> devtools::install_github("jimhester/archive")
>>>
>>>
>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>
>>> tf <- tempfile()
>>>
>>> # large download!  cachaca saves on your local disk if already downloaded
>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>>>   tf , mode = 'wb' )
>>>
>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>
>>> unzipped_files <- list.files( file_folder , recursive = TRUE ,
>>> full.names =
>>> TRUE  )
>>>
>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>
>>> # works
>>> R.utils::countLines( infile )
>>>
>>> # works with warning
>>> my_file <- readLines( infile , skipNul = TRUE )
>>>
>>> # crash
>>> my_file <- readLines( infile )
>>>
>>>
>>> # run just before crash
>>> sessionInfo()
>>> # R version 3.4.1 (2017-06-30)
>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> # 

[R] select from data frame

2017-07-15 Thread Andras Farkas via R-help
Dear All,

wonder if you could please assist with the following 

df<-data.frame(ID=c(1,1,1,2,2,3,3,4,4,5,5),samples=c("A","B","C","A","C","A","D","C","B","A","C"))

from this data frame the goal is to extract the value 3 from the ID column, 
because ID=3 is the only ID that has NO row pairing it with "B" and/or "C" 
in the samples column...


much appreciate your help...

thanks,
 Andras
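
One possible approach in base R, as a minimal sketch: collect the IDs that appear with "B" or "C", then take the IDs that are not in that set. The helper name ids_with_bc is made up for illustration, and the data frame is the one from the question.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  samples = c("A", "B", "C", "A", "C", "A", "D", "C", "B", "A", "C")
)

# IDs that have at least one row with "B" or "C" (hypothetical helper name)
ids_with_bc <- unique(df$ID[df$samples %in% c("B", "C")])

# IDs that never appear with "B" or "C"
result <- setdiff(unique(df$ID), ids_with_bc)
result
```

For this data the only ID left is 3, since every other ID has at least one "B" or "C" row.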

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hi, thanks Dr. Murdoch


i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?
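
One way to put embedded nuls on a line artificially, as a minimal sketch: write raw bytes with writeBin() and read them back with readLines(). The byte content and variable names below are made up for illustration. This tiny file is not expected to reproduce the segfault by itself; it only shows how nuls can be placed inside a line.

```r
# small test file containing one line with two embedded nul bytes
tf <- tempfile(fileext = ".txt")

# bytes for "abc", two nuls, "def", then a newline
bytes <- c(charToRaw("abc"), as.raw(c(0, 0)), charToRaw("def"), charToRaw("\n"))
writeBin(bytes, tf)

# default behaviour: readLines warns about the embedded nul,
# and text after the first nul may be lost
with_nul <- suppressWarnings(readLines(tf))

# skipNul = TRUE drops the nul bytes and keeps the rest of the line
without_nul <- readLines(tf, skipNul = TRUE)

file.size(tf)
without_nul
```

Scaling the same idea up (many more bytes before the newline) might get closer to the real file, but that is an untested assumption.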


the lodown package can be removed from my example.  it is just for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`.  my current example requires a huge download, so it is
sort of painful to repeat, but i'm pretty confident that's not the issue.


the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:

> file.size(infile)
[1] 4078192743


i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..

> file_con <- file( infile , 'r' )
>
> first_80936_lines <- readLines( file_con , n = 80936 )
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "123930632009"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "0"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
> scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
  could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'
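
To check the "one huge line plus embedded nuls" guess without loading the whole file into a string, a rough sketch is to read the file in binary chunks and count nul bytes, newlines, and the longest run of bytes between newlines. The function name count_bytes and the chunk size are assumptions for illustration; the demo runs on a tiny made-up file, not the 4GB one.

```r
count_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  nul <- 0
  newline <- 0
  longest <- 0      # longest line seen so far, in bytes
  current <- 0      # bytes in the line that is still open
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    nul <- nul + sum(chunk == as.raw(0))
    nl_pos <- which(chunk == as.raw(10))
    newline <- newline + length(nl_pos)
    if (length(nl_pos) == 0L) {
      current <- current + length(chunk)
    } else {
      # the first newline closes the line carried over from earlier chunks
      first <- current + nl_pos[1] - 1
      longest <- max(longest, first, diff(nl_pos) - 1)
      current <- length(chunk) - nl_pos[length(nl_pos)]
    }
  }
  c(nul = nul, newline = newline, longest_line = max(longest, current))
}

# tiny demo file: "abc", two nuls, "def", newline, "xy", newline
demo <- tempfile()
writeBin(c(charToRaw("abc"), as.raw(c(0, 0)), charToRaw("def\nxy\n")), demo)
counts <- count_bytes(demo, chunk_size = 4)
counts
```

On the real file, a longest_line value close to file.size(infile) together with a non-zero nul count would support the guess above.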



making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something to do with it--


# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 10 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )



On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch 
wrote:

> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>
>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>> i think i should submit to https://bugs.r-project.org/  unless others
>> have
>> advice?  thanks
>>
>
> Segfaults are usually worth reporting as bugs.  Try to come up with a
> self-contained example, not using the lodown and archive packages.  I
> imagine you can do this by uploading the file you downloaded, or enough of
> a subset of it to trigger the segfault.  If you can't do that, then likely
> the bug is with one of those packages, not with R.
>
> Duncan Murdoch
>
>
>>
>>
>>
>>
>> install.packages( "devtools" )
>> devtools::install_github("ajdamico/lodown")
>> devtools::install_github("jimhester/archive")
>>
>>
>> file_folder <- file.path( tempdir() , "file_folder" )
>>
>> tf <- tempfile()
>>
>> # large download!  cachaca saves on your local disk if already downloaded
>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>>   tf , mode = 'wb' )
>>
>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>
>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names
>> =
>> TRUE  )
>>
>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>
>> # works
>> R.utils::countLines( infile )
>>
>> # works with warning
>> my_file <- readLines( infile , skipNul = TRUE )
>>
>> # crash
>> my_file <- readLines( infile )
>>
>>
>> # run just before crash
>> sessionInfo()
>> # R version 3.4.1 (2017-06-30)
>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> # Running under: Windows 10 x64 (build 15063)
>>
>> # Matrix products: default
>>
>> # locale:
>> # [1] LC_COLLATE=English_United States.1252
>> # [2] LC_CTYPE=English_United States.1252
>> # [3] LC_MONETARY=English_United States.1252
>> # [4] LC_NUMERIC=C
>> # [5] LC_TIME=English_United States.1252
>>
>> # attached base packages:
>> # [1] stats graphics  grDevices utils datasets  methods   base
>>
>> # loaded via a namespace (and not attached):
>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>> # [17] archive_0.0.0.9000
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Duncan Murdoch

On 15/07/2017 7:35 AM, Anthony Damico wrote:

hello, the last line of the code below causes a segfault for me on 3.4.1.
i think i should submit to https://bugs.r-project.org/  unless others have
advice?  thanks


Segfaults are usually worth reporting as bugs.  Try to come up with a 
self-contained example, not using the lodown and archive packages.  I 
imagine you can do this by uploading the file you downloaded, or enough 
of a subset of it to trigger the segfault.  If you can't do that, then 
likely the bug is with one of those packages, not with R.


Duncan Murdoch
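
A rough sketch of how a subset of a large file could be cut out for a bug report: skip a number of bytes, then copy the next block into a new file. The function name copy_byte_range, its arguments, and the demo content are assumptions for illustration; whether a given subset still triggers the segfault would have to be tested.

```r
copy_byte_range <- function(src, dest, start = 0, n_bytes = 1e6,
                            chunk_size = 1e6) {
  con <- file(src, "rb")
  on.exit(close(con))
  # skip `start` bytes by reading and discarding them in chunks
  remaining <- start
  while (remaining > 0) {
    skipped <- readBin(con, what = "raw", n = min(remaining, chunk_size))
    if (length(skipped) == 0L) break
    remaining <- remaining - length(skipped)
  }
  # copy the next n_bytes (or fewer, at end of file) to the new file
  bytes <- readBin(con, what = "raw", n = n_bytes)
  writeBin(bytes, dest)
  invisible(length(bytes))
}

# tiny demo: copy 5 bytes starting after the first 6
src <- tempfile()
writeBin(charToRaw("hello world\n"), src)
dest <- tempfile()
copy_byte_range(src, dest, start = 6, n_bytes = 5)
copied <- rawToChar(readBin(dest, what = "raw", n = 10))
copied
```

Reading in binary mode keeps any nul bytes intact in the copy, which a text-mode read might not.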







install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")


file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download!  cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
  tf , mode = 'wb' )

archive::archive_extract( tf , dir = normalizePath( file_folder ) )

unzipped_files <- list.files( file_folder , recursive = TRUE , full.names =
TRUE  )

infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

# works
R.utils::countLines( infile )

# works with warning
my_file <- readLines( infile , skipNul = TRUE )

# crash
my_file <- readLines( infile )


# run just before crash
sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 15063)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252

# attached base packages:
# [1] stats graphics  grDevices utils datasets  methods   base

# loaded via a namespace (and not attached):
#  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
#  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
#  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
# [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
# [17] archive_0.0.0.9000

[[alternative HTML version deleted]]



[R] readLines without skipNul=TRUE causes crash

2017-07-15 Thread Anthony Damico
hello, the last line of the code below causes a segfault for me on 3.4.1.
i think i should submit to https://bugs.r-project.org/  unless others have
advice?  thanks





install.packages( "devtools" )
devtools::install_github("ajdamico/lodown")
devtools::install_github("jimhester/archive")


file_folder <- file.path( tempdir() , "file_folder" )

tf <- tempfile()

# large download!  cachaca saves on your local disk if already downloaded
lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
  tf , mode = 'wb' )

archive::archive_extract( tf , dir = normalizePath( file_folder ) )

unzipped_files <- list.files( file_folder , recursive = TRUE , full.names =
TRUE  )

infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )

# works
R.utils::countLines( infile )

# works with warning
my_file <- readLines( infile , skipNul = TRUE )

# crash
my_file <- readLines( infile )


# run just before crash
sessionInfo()
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 15063)

# Matrix products: default

# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252

# attached base packages:
# [1] stats graphics  grDevices utils datasets  methods   base

# loaded via a namespace (and not attached):
#  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
#  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
#  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
# [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
# [17] archive_0.0.0.9000

[[alternative HTML version deleted]]
