Re: [R] Matrix logical operator

2017-07-16 Thread Berend Hasselman

> On 17 Jul 2017, at 07:27, Jeremie Juste  wrote:
> 
> 
> Hello,
> 
> I have some trouble understanding why !b && TRUE is TRUE. Do you have an idea?
> 
> 
>> b <- matrix(c(0,1,1,0,1,0),2)
> 
>> !b
>  [,1]  [,2]  [,3]
> [1,]  TRUE FALSE FALSE
> [2,] FALSE  TRUE  TRUE
>> !b && TRUE
> [1] TRUE
> 

Read the help for &&. You can see it like this: ?`&&`
Try

!b[1] && TRUE

and

!b[2] && TRUE
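
A minimal sketch of the difference, assuming the same matrix b as in the
question. `&` works element by element and returns a matrix, while `&&` is
meant for a single TRUE/FALSE value. Older versions of R silently looked only
at the first element of a longer argument, which is why a whole matrix could
give a single TRUE; since R 4.3.0 a longer argument is an error. The variable
names below are only for illustration.

```r
b <- matrix(c(0, 1, 1, 0, 1, 0), 2)

# Element-wise AND: the result keeps the 2 x 3 shape of !b
elementwise <- !b & TRUE

# Scalar AND on single elements, as suggested above
first_only  <- !b[1] && TRUE   # b[1] is 0, so !b[1] is TRUE
second_only <- !b[2] && TRUE   # b[2] is 1, so !b[2] is FALSE

# If one TRUE/FALSE summary of the whole matrix is wanted, say so explicitly
all_true <- all(!b)   # FALSE: not every element of !b is TRUE
any_true <- any(!b)   # TRUE: at least one element of !b is TRUE
```

Using all() or any() states which summary is intended and behaves the same
way in old and new versions of R.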


Berend Hasselman

> Best regards,
> 
> Jeremie
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



[R] Matrix logical operator

2017-07-16 Thread Jeremie Juste

Hello,

I have some trouble understanding why !b && TRUE is TRUE. Do you have an idea?


> b <- matrix(c(0,1,1,0,1,0),2)

> !b
  [,1]  [,2]  [,3]
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
> !b && TRUE
[1] TRUE


Best regards,

Jeremie



Re: [R] About doing figures

2017-07-16 Thread Jim Lemon
Hi lily,
To answer your questions more or less in order:

rainbow() is an easy way to get a small number of distinct colors. You
only had nine unique values in "dfm$A". Obviously it gets harder to
distinguish colors when you have more of them, so just increasing the
number that rainbow() returns will eventually make some of the colors
hard to distinguish. You can use any method of getting distinct colors
that you want, but simply indexing into colors() does not guarantee
you will get distinct colors.

As you had used the "ncol" argument in your example, I thought you
knew how to stretch out the legend horizontally. I only added the
second legend as a suggestion. How does the viewer know about the
symbols and DF?
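
A rough sketch of the points above, using the data from the original post.
It assumes one color per unique value of A (looked up by name, so the same A
always gets the same color), hollow/solid symbols for DF, and a legend pushed
below the plot with a negative inset. The names a_levels, a_colors, point_col
and point_pch are made up for this example, and the margin and inset values
are guesses that may need adjusting for your device. If rainbow() colors get
too similar with more categories, hcl.colors() (R >= 3.6.0) is another option.

```r
dfm <- data.frame(
  DF = rep(1:2, each = 10),
  A  = rep(c(65, 66, 54, 44, 67, 66, 45, 56, 40, 39), 2),
  B  = c(21, 23, 24, 23, 22, 21, 20, 19, 25, 24,
         25, 20, 21, 30, 27, 20, 25, 14, 29, 29),
  C  = c(54, 55, 56, 53, 52, 50, 51, 57, 58, 53,
         52, 50, 48, 49, 50, 30, 56, 51, 48, 23)
)

# One color per unique A value, stored as a named lookup table
a_levels <- sort(unique(dfm$A))
a_colors <- setNames(rainbow(length(a_levels)), a_levels)

# Look colors up by A value; hollow circle for DF == 1, solid for DF == 2
point_col <- a_colors[as.character(dfm$A)]
point_pch <- ifelse(dfm$DF == 1, 1, 19)

# Leave extra room at the bottom for the legend
op <- par(mar = c(9, 4, 2, 1))
plot(dfm$B, dfm$C, col = point_col, pch = point_pch,
     xlab = "Value B", ylab = "Value C")

# Color legend below the plot region (xpd = TRUE allows drawing in the margin)
legend("bottom", legend = a_levels, col = a_colors, pch = 19,
       ncol = 5, inset = c(0, -0.55), xpd = TRUE, title = "A")

# Small key for the symbols so the viewer can tell DF = 1 from DF = 2
legend("topleft", legend = c("DF=1", "DF=2"), pch = c(1, 19))
par(op)
```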

Jim


On Mon, Jul 17, 2017 at 1:30 AM, lily li  wrote:
> For more than 10 records, how to reformat the colors? Also, how to show the
> first legend only, but at the bottom, while the second legend in your code
> is not necessary? In all, the same A values have the same color, but
> different symbols in DF==1 and DF==2.
> Thanks for your help.
>
> On Sun, Jul 16, 2017 at 9:28 AM, lily li  wrote:
>>
>> Hi Jim,
>>
>> For true color, I meant that the points in the figure do not correspond to
>> the values from the dataframe. Also, why to use rainbow(9) here? And the
>> legend is straight in the middle, is it possible to reformat it to the very
>> bottom? Thanks again.
>>
>> On Sun, Jul 16, 2017 at 2:50 AM, Jim Lemon  wrote:
>>>
>>> Hi lily,
>>> As I have no idea of what the "true record" is, I can only guess.
>>> Maybe this will help:
>>>
>>> # get some fairly distinct colors
>>> rainbow_colors<-rainbow(9)
>>> # this should sort the numbers in dfm$A
>>> dfm$Acolor<-factor(dfm$A)
>>> plot(dfm$B,dfm$C,pch=ifelse(dfm$DF==1,1,19),
>>>  col=rainbow_colors[as.numeric(dfm$Acolor)])
>>> legend("bottom",legend=sort(unique(dfm$A)),
>>>  fill=rainbow_colors)
>>> legend(25,35,c("DF=1","DF=2"),pch=c(1,19))
>>>
>>> Jim
>>>
>>>
>>> On Sun, Jul 16, 2017 at 3:43 PM, lily li  wrote:
>>> > Hi R users,
>>> >
>>> > I still have the problem about plotting. I wanted to put the datasets
>>> > on
>>> > one figure, x-axis represents values B, y-axis represents values C,
>>> > while
>>> > different colors label column A. Each record uses a circle on the
>>> > figure,
>>> > while hollow circles represent DF=1 and solid circles represent DF=2. I
>>> > put
>>> > my code below, but the A labels do not correspond to the true record,
>>> > so I
>>> > don't know what is the problem. Thanks for your help.
>>> >
>>> > dfm
>>> > dfm1= subset(dfm, DF==1)
>>> > dfm2= subset(dfm, DF==2)
>>> > plot(c(15:30),seq(from=0,to=60,by=4),pch=19,col=NULL,xlab='Value
>>> > B',ylab='Value C')
>>> > Color = as.factor(dfm1$A)
>>> > colordist = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(),
>>> > invert = T)] # for unique colors
>>> > Color.unq = sample(colordist,length(Color))
>>> >
>>> > points(dfm1[,3],dfm1[,4],col=Color.unq,pch=1)
>>> > points(dfm2[,3],dfm2[,4],col=Color.unq,pch=19)
>>> >
>>> > legend('bottom',as.character(Color.unq),col=Color.unq,lwd=rep(2,length(Color.unq)),cex=.6,ncol=5)
>>> >
>>> > legend('bottom',as.character(Color),col=Color.unq,lwd=3,cex=.6,ncol=5,text.width=c(9.55,9.6,9.55))
>>> >
>>> > dfm is the dataframe below.
>>> >
>>> > DF   A  B  C
>>> > 1 65 21 54
>>> > 1 66 23 55
>>> > 1 54 24 56
>>> > 1 44 23 53
>>> > 1 67 22 52
>>> > 1 66 21 50
>>> > 1 45 20 51
>>> > 1 56 19 57
>>> > 1 40 25 58
>>> > 1 39 24 53
>>> > 2 65 25 52
>>> > 2 66 20 50
>>> > 2 54 21 48
>>> > 2 44 30 49
>>> > 2 67 27 50
>>> > 2 66 20 30
>>> > 2 45 25 56
>>> > 2 56 14 51
>>> > 2 40 29 48
>>> > 2 39 29 23
>>> >
>>
>>
>



Re: [R] Arranging column data to create plots

2017-07-16 Thread Jeff Newmiller

Correction at the end.

On Sun, 16 Jul 2017, Jeff Newmiller wrote:


On Sat, 15 Jul 2017, Michael Reed via R-help wrote:


Dear All,

I need some help arranging data that was imported.


It would be helpful if you were to use dput to give us the sample data since 
you say you have already imported it.


The imported data frame looks something like this (the actual file is huge, 
so this is example data)


DF:
IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
Name1  21  15  25  10
Name2  15  18  35  24  27  45
Name3  17  21  30  22  15  40  32  55


That data is missing in X3 etc, but would be NA in an actual data frame, so I 
don't know if my workaround was the same as your workaround. dput would have
clarified the starting point.


I would like to create a new data frame with the following

NewDF:
IDKey   X   Y
Name1  21  15
Name1  25  10
Name2  15  18
Name2  35  24
Name2  27  45
Name3  17  21
Name3  30  22
Name3  15  40
Name3  32  55

With the data like this I think I can do the following

ggplot(NewDF, aes(x=X, y=Y, color=IDKey) + geom_line


You are missing parentheses. If you use the reprex library to test your 
examples before posting them, you can be sure your simple errors don't send 
us off on wild goose chases.



and get 3 lines with the various number of points.

The point is that each of the XY pairs is a data point tied to NameX. I 
would like to rearrange the data so I can plot the points/lines by the 
IDKey.  There will be at least 2 points, but the number of points for each 
IDKey can be as many as 4.


I have tried using the gather() function from the tidyverse package, but


The tidyverse package is a virtual package that pulls in many packages.

I can't make it work.  The issue is that I believe I need two separate 
gather statements (one for X, another for Y) to consolidate the data. This 
causes the pairs to not stay together and the data becomes jumbled.


No, what you need is a gather-spread.

##
library(dplyr)
library(tidyr)

DF <- read.table( text=
"IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
Name1   21  15  25  10  NA  NA  NA  NA
Name2   15  18  35  24  27  45  NA  NA
Name3   17  21  30  22  15  40  32  55
", header=TRUE, as.is=TRUE )

NewDF <- (   dta
%>% gather( XY, value, -IDKey )
%>% separate( XY, c( "Coord", "Num" ), 1 )
%>% spread( Coord, value )
%>% filter( !is.na( X ) & !is.na( Y ) )
)
##


Sorry, should have practiced what I preached...

##
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)

DF <- structure(list(IDKey = c("Name1", "Name2", "Name3"), X1 = c(21L, 
15L, 17L), Y1 = c(15L, 18L, 21L), X2 = c(25L, 35L, 30L), Y2 = c(10L, 24L, 
22L), X3 = c(NA, 27L, 15L), Y3 = c(NA, 45L, 40L), X4 = c(NA, NA, 32L), Y4 
= c(NA, NA, 55L)), .Names = c("IDKey", "X1", "Y1", "X2", "Y2", "X3", "Y3", 
"X4", "Y4"), class = "data.frame", row.names = c(NA, -3L))


NewDF <- (   DF
 %>% gather( XY, value, -IDKey )
 %>% separate( XY, c( "Coord", "Num" ), 1 )
 %>% spread( Coord, value )
 %>% filter( !is.na( X ) & !is.na( Y ) )
 )
NewDF
#>   IDKey Num  X  Y
#> 1 Name1   1 21 15
#> 2 Name1   2 25 10
#> 3 Name2   1 15 18
#> 4 Name2   2 35 24
#> 5 Name2   3 27 45
#> 6 Name3   1 17 21
#> 7 Name3   2 30 22
#> 8 Name3   3 15 40
#> 9 Name3   4 32 55
##
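
For comparison only, here is a sketch of the same rearrangement using base R's
reshape() instead of tidyr. This is not the approach used above, just an
alternative that needs no contributed packages; the object name long_df is
made up for the example.

```r
DF <- data.frame(
  IDKey = c("Name1", "Name2", "Name3"),
  X1 = c(21L, 15L, 17L), Y1 = c(15L, 18L, 21L),
  X2 = c(25L, 35L, 30L), Y2 = c(10L, 24L, 22L),
  X3 = c(NA, 27L, 15L),  Y3 = c(NA, 45L, 40L),
  X4 = c(NA, NA, 32L),   Y4 = c(NA, NA, 55L),
  stringsAsFactors = FALSE
)

# Stack X1..X4 into X and Y1..Y4 into Y, keeping each pair together
long_df <- reshape(
  DF,
  direction = "long",
  varying   = list(paste0("X", 1:4), paste0("Y", 1:4)),
  v.names   = c("X", "Y"),
  timevar   = "Num",
  times     = 1:4,
  idvar     = "IDKey"
)

# Drop the padding rows and put the points in IDKey/Num order
long_df <- long_df[!is.na(long_df$X) & !is.na(long_df$Y), ]
long_df <- long_df[order(long_df$IDKey, long_df$Num), ]
rownames(long_df) <- NULL
long_df
```

With the data in this shape, the plotting call from the question, with its
parentheses balanced, would be
ggplot(long_df, aes(x = X, y = Y, color = IDKey)) + geom_line().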

---
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)



Re: [R] Arranging column data to create plots

2017-07-16 Thread Jeff Newmiller

On Sat, 15 Jul 2017, Michael Reed via R-help wrote:


Dear All,

I need some help arranging data that was imported.


It would be helpful if you were to use dput to give us the sample data 
since you say you have already imported it.


The imported data frame looks something like this (the actual file is 
huge, so this is example data)


DF:
IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
Name1  21  15  25  10
Name2  15  18  35  24  27  45
Name3  17  21  30  22  15  40  32  55


That data is missing in X3 etc, but would be NA in an actual data frame, 
so I don't know if my workaround was the same as your workaround. dput would
have clarified the starting point.


I would like to create a new data frame with the following

NewDF:
IDKey   X   Y
Name1  21  15
Name1  25  10
Name2  15  18
Name2  35  24
Name2  27  45
Name3  17  21
Name3  30  22
Name3  15  40
Name3  32  55

With the data like this I think I can do the following

ggplot(NewDF, aes(x=X, y=Y, color=IDKey) + geom_line


You are missing parentheses. If you use the reprex library to test your 
examples before posting them, you can be sure your simple errors don't 
send us off on wild goose chases.



and get 3 lines with the various number of points.

The point is that each of the XY pairs is a data point tied to NameX. 
I would like to rearrange the data so I can plot the points/lines by the 
IDKey.  There will be at least 2 points, but the number of points for 
each IDKey can be as many as 4.


I have tried using the gather() function from the tidyverse package, but


The tidyverse package is a virtual package that pulls in many packages.

I can't make it work.  The issue is that I believe I need two separate 
gather statements (one for X, another for Y) to consolidate the data. 
This causes the pairs to not stay together and the data becomes jumbled.


No, what you need is a gather-spread.

##
library(dplyr)
library(tidyr)

DF <- read.table( text=
"IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
Name1   21  15  25  10  NA  NA  NA  NA
Name2   15  18  35  24  27  45  NA  NA
Name3   17  21  30  22  15  40  32  55
", header=TRUE, as.is=TRUE )

NewDF <- (   DF
 %>% gather( XY, value, -IDKey )
 %>% separate( XY, c( "Coord", "Num" ), 1 )
 %>% spread( Coord, value )
 %>% filter( !is.na( X ) & !is.na( Y ) )
 )
##

---
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)



Re: [R] label sunflower point

2017-07-16 Thread David Winsemius

> On Jul 16, 2017, at 6:36 AM, Nada Gh  wrote:
> 
> Hi,
> 
> I create a plot using sunflowerplot, I need to highlight one point to show
> its importance. What suggestion you have to accomplish this?
> 
> Thanks,
> Aden
> 
>   [[alternative HTML version deleted]]

Please read the Posting Guide. You are expected to provide example data and 
code. Also, Rhelp is a plain text mailing list and you are sending 
HTML-formatted messages.
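
In the absence of your data, here is a generic sketch with made-up values
(x, y and the choice of which point to highlight are all assumptions). Since
sunflowerplot() draws with base graphics, one possible approach is to
overplot the point of interest with points() and label it with text():

```r
set.seed(1)
x <- sample(1:5, 50, replace = TRUE)
y <- sample(1:5, 50, replace = TRUE)

# sunflowerplot() invisibly returns the unique locations and their counts
sf <- sunflowerplot(x, y)

# For illustration, highlight the location with the most observations
i <- which.max(sf$number)
points(sf$x[i], sf$y[i], pch = 1, cex = 4, lwd = 2, col = "red")
text(sf$x[i], sf$y[i], labels = "important", pos = 3, offset = 1.2,
     col = "red", xpd = TRUE)
```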
> 

David Winsemius
Alameda, CA, USA



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Jeff Newmiller
I am stuck. The archive package won't compile for me on Ubuntu, and the 
CRANextra repo seems to be down so I cannot install packages on Windows right 
now. Perhaps you can zip the corrupt text file and put it online somewhere? 
Don't use the archive package to pack it since there seem to be issues with 
that tool on your machine. 

I would discourage you from harassing the Brazilian government about their RAR 
file because the RAR file seems fine (no NUL characters appear in the text 
file) when extracted using the file-roller archive tool on Ubuntu.
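
For anyone who wants to check a small file for NUL characters, here is a
sketch using a made-up temporary file (not the ENEM data). It counts zero
bytes with readBin(), reads the lines with skipNul = TRUE, and computes the
checksum with tools::md5sum(). Reading every byte into memory like this is
only an assumption that suits small files; a 4GB file would need to be
scanned in chunks.

```r
tmp <- tempfile(fileext = ".txt")

# Write "abc", one NUL byte, then "def" and a second line "ghi"
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\nghi\n")), tmp)

# Count NUL bytes in the raw contents
raw_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
nul_count <- sum(raw_bytes == as.raw(0))

# skipNul = TRUE drops embedded NULs instead of truncating the line
lines_read <- readLines(tmp, skipNul = TRUE)

# Checksum for comparing copies of the file between machines
md5 <- unname(tools::md5sum(tmp))
```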
-- 
Sent from my phone. Please excuse my brevity.

On July 16, 2017 9:37:17 AM PDT, Anthony Damico  wrote:
>hi, yep, there are two problems -- but i think only the segfault is
>within
>the scope of a base R issue?  i need to look closer at the corrupted
>decompression and figure out whether i should talk to the brazilian
>government agency that creates that .rar file or open an issue with the
>archive package maintainer.  my goal in this thread is only to figure
>out
>how to replicate the goofy text file so the r team can turn it into an
>error instead of a segfault.
>
>the original example i sent stores the .txt file somewhere inside the
>tempdir(), but when i copy it over elsewhere on my machine, the
>md5sum()
>gives the same result.  thanks again for looking at this
>
>> tools::md5sum(infile)
>
>C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados
>ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>"30beb57419486108e98d42ec7a2f8b19"
>
>
>> tools::md5sum( "S:/temp/crash.txt" )
> S:/temp/crash.txt
>"30beb57419486108e98d42ec7a2f8b19"
>
>
>
>
>On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller
>
>wrote:
>
>> So you are saying there are two problems... one that produces a
>corrupt
>> file from a valid compressed file, and one that segfaults when
>presented
>> with that corrupt file? Can you please confirm the file name and run
>md5sum
>> on it and share the result so we can tell when the file problem has
>been
>> reproduced?
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 16, 2017 3:21:21 AM PDT, Anthony Damico 
>> wrote:
>> >hi, thank you for attempting this. it looks like your unix machine
>> >unzipped
>> >the txt file without corruption -- if you copied over the same txt
>file
>> >to
>> >windows 7, i don't think that would reproduce the problem?  i think
>it
>> >needs to be the corrupted text file where   R.utils::countLines(
>> >txtfile
>> >)   gives 809367.  i am able to reproduce on two distinct windows
>> >machines
>> >but no guarantee i'm not doing something dumb
>> >
>> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller
>> >
>> >wrote:
>> >
>> >> I am not able to reproduce your segfault on a Windows 7 platform
>> >either:
>> >>
>> >> ##
>> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> >> sessionInfo()
>> >> ## R version 3.4.1 (2017-06-30)
>> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> >> ##
>> >> ## Matrix products: default
>> >> ##
>> >> ## locale:
>> >> ## [1] LC_COLLATE=English_United States.1252
>> >> ## [2] LC_CTYPE=English_United States.1252
>> >> ## [3] LC_MONETARY=English_United States.1252
>> >> ## [4] LC_NUMERIC=C
>> >> ## [5] LC_TIME=English_United States.1252
>> >> ##
>> >> ## attached base packages:
>> >> ## [1] stats graphics  grDevices utils datasets  methods
>> >base
>> >> ##
>> >> ## loaded via a namespace (and not attached):
>> >> ## [1] compiler_3.4.1
>> >> tools::md5sum( fn1 )
>> >> ## d:/DADOS_ENEM_2009.txt
>> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> >> dat <- readLines( fn1 )
>> >> length( dat )
>> >> ## [1] 4148721
>> >>
>> >>
>> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>> >>
>> >> I am not able to reproduce this on a Linux platform:
>> >>>
>> >>> ###3
>> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>> >>> 2009/DADOS_ENEM_2009.txt"
>> >>> sessionInfo()
>> >>> ## R version 3.4.1 (2017-06-30)
>> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> >>> ## Running under: Ubuntu 14.04.5 LTS
>> >>> ##
>> >>> ## Matrix products: default
>> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> >>> ##
>> >>> ## locale:
>> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >>> ##
>> >>> ## attached base packages:
>> >>> ## [1] stats graphics  grDevices utils datasets  methods
>> >base
>> >>> ##
>> >>> ## loaded via a namespace (and not attached):
>> >>> ## [1] 

[R] label sunflower point

2017-07-16 Thread Nada Gh
Hi,

I create a plot using sunflowerplot, I need to highlight one point to show
its importance. What suggestion you have to accomplish this?

Thanks,
Aden



Re: [R] select from data frame

2017-07-16 Thread Andras Farkas via R-help
thank you David and Bert, these solutions will work for me... Andras  

On Saturday, July 15, 2017 6:05 PM, Bert Gunter  
wrote:
 

 ...
and here is a slightly cleaner and more transparent way of doing the
same thing (setdiff() does the matching)

> with(df, setdiff(ID,ID[samples %in% c("B","C") ]))
[1] 3
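
As a self-contained check, here is the same idea run on the data frame from
the original question, next to a per-ID version using tapply(). The names
ids_setdiff, has_bc and ids_tapply are only for illustration.

```r
df <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  samples = c("A", "B", "C", "A", "C", "A", "D", "C", "B", "A", "C")
)

# IDs that never appear together with "B" or "C"
ids_setdiff <- with(df, setdiff(ID, ID[samples %in% c("B", "C")]))

# Same answer via a per-ID flag: does this ID have any "B" or "C" row?
has_bc     <- tapply(df$samples %in% c("B", "C"), df$ID, any)
ids_tapply <- as.numeric(names(has_bc)[!has_bc])
```

The %in% test works the same whether samples is stored as character or as a
factor, so no as.character() is needed here.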

-- Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Jul 15, 2017 at 9:23 AM, Bert Gunter  wrote:
> If I understand correctly, no looping (ave(), for()) or type casting
> (as.character()) is needed -- indexing and matching suffice:
>
>> with(df, ID[!ID %in% unique(ID[samples %in% c("B","C") ])])
> [1] 3 3
>
>
>
> Cheers,
>
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sat, Jul 15, 2017 at 8:54 AM, David Winsemius  
> wrote:
>>
>>> On Jul 15, 2017, at 4:01 AM, Andras Farkas via R-help 
>>>  wrote:
>>>
>>> Dear All,
>>>
>>> wonder if you could please assist with the following
>>>
>>> df<-data.frame(ID=c(1,1,1,2,2,3,3,4,4,5,5),samples=c("A","B","C","A","C","A","D","C","B","A","C"))
>>>
>>> from this data frame the goal is to extract the value of 3 from the ID 
>>> column based on the logic that the ID=3 in the data frame has NO row that 
>>> would pair 3 with either "B", AND/OR "C" in the samples column...
>>>
>>
>> This returns a vector that determines if either of those characters are in 
>> the character values of that factor column you created. Coercing to 
>> character is needed because leaving samples as a factor generated an invalid 
>> factor level warning and gave useless results.
>>
>>  with( df, ave( as.character(samples), ID, FUN=function(x) {!any(x %in% 
>>c("B","C"))}))
>>  [1] "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
>> [10] "FALSE" "FALSE"
>>
>> You can then use it to extract and consolidate to a single value (although 
>> wrapping with as.logical was needed because `ave` returned character class 
>> values):
>>
>>  unique( df$ID[ as.logical(  # fails without this since "FALSE" != FALSE
>>                    with( df,
>>                        ave( as.character(samples), ID, FUN=function(x) 
>>{!any(x %in% c("B","C"))})))
>>              ] )
>> #[1] 3
>>
>> The same sort of logic could also be constructed with a for-loop:
>>
>>> for (x in unique(df$ID) ) { if ( !any( df$samples[df$ID==x] %in% 
>>> c("B","C")) ) print(x) }
>> [1] 3
>>
>> Although you are warned that for-loops do not return values and you might 
>> need to make an assignment rather than just printing.
>>
>> --
>>
>> David Winsemius
>> Alameda, CA, USA
>>

   

Re: [R] How to formulate quadratic function with interaction terms for the PLS fitting model?

2017-07-16 Thread David Winsemius

> On Jul 16, 2017, at 8:47 AM, Bert Gunter  wrote:
> 
> ??
> If I haven't misunderstood, they are completely different!
> 
> 1) NIR must be a matrix, or poly(NIR,...) will fail.
> 2) Due to the previously identified bug in poly, degree must be
> explicitly given as poly(NIR, degree =2,raw = TRUE).
> 
> Now consider the following example:
> 
>> df <-matrix(runif(60),ncol=3)
>> y <- runif(20)
>> mdl1 <-lm(y~df*I(df^2))
>> mdl2 <-lm(y~df*poly(df,degree=2,raw=TRUE))
>> length(coef(mdl1))
> [1] 16
>> length(coef(mdl2))
> [1] 40
> 
> Explanation:
> In mdl1, I(df^2) gives the squared values of the 3 columns of df. The
> formula df*I(df^2) gives the 3 (linear) terms of df, the 3 pure
> quadratics of I(df^2), the 9 cubic terms obtained by crossing these,
> and the constant coefficient = 16 coefs.
> 
> In mdl2,  the poly() expression gives 9 variables: 3 linear, 3 pure
> quadratic, 3 interactions (1.2, 1.3, 2.3) of these.  The df*poly()
> term would then give the 3 linear terms of df, the 9 terms of poly(),
> the crossings between these, and the constant coef = 40 coefs. Many of
> these will be NA since terms are repeated (e.g. the 3 linear terms of
> poly() and df) and therefore cannot be estimated.
> 
> Have I totally misunderstood what you meant or committed some other blunder?


I was thinking about different model specifications, but clearly I had failed 
to test my assumptions and will need to do more study:

> df <-matrix(runif(60),ncol=3)
> y <- runif(20)
> mdl1 <-lm(y~ df +I(df^2) )
> mdl2 <-lm(y~  poly(df,degree=2,raw=TRUE))
> mdl1

Call:
lm(formula = y ~ df + I(df^2))

Coefficients:
(Intercept)          df1          df2          df3     I(df^2)1     I(df^2)2
     1.3382      -1.1431      -1.7894      -1.2675       0.9686       1.6605
   I(df^2)3
     1.0411

> mdl2

Call:
lm(formula = y ~ poly(df, degree = 2, raw = TRUE))

Coefficients:
                          (Intercept)  poly(df, degree = 2, raw = TRUE)1.0.0
                              1.28217                               -0.98032
poly(df, degree = 2, raw = TRUE)2.0.0  poly(df, degree = 2, raw = TRUE)0.1.0
                              0.89955                               -1.89019
poly(df, degree = 2, raw = TRUE)1.1.0  poly(df, degree = 2, raw = TRUE)0.2.0
                              0.09528                                1.63065
poly(df, degree = 2, raw = TRUE)0.0.1  poly(df, degree = 2, raw = TRUE)1.0.1
                             -1.03744                               -0.29368
poly(df, degree = 2, raw = TRUE)0.1.1  poly(df, degree = 2, raw = TRUE)0.0.2
                              0.09357                                0.93400

> length(coef(mdl2))
[1] 10

I had been reasoning from my experience with atomic vectors. Clearly poly() 
handles matrices differently than the combination of `+.formula` with `I`. 
Thanks for furthering my education:

> df <-data.frame(y = runif(20), x=runif(20) )
> 
> (mdl1 <-lm(y~x +I(x^2) ,data=df) )

Call:
lm(formula = y ~ x + I(x^2), data = df)

Coefficients:
(Intercept)            x       I(x^2)
     0.6435      -1.4477       1.8282

> (mdl2 <-lm(y~  poly(x,degree=2,raw=TRUE), data=df) )

Call:
lm(formula = y ~ poly(x, degree = 2, raw = TRUE), data = df)

Coefficients:
                     (Intercept)  poly(x, degree = 2, raw = TRUE)1
                          0.6435                           -1.4477
poly(x, degree = 2, raw = TRUE)2
                          1.8282
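
On the raw = TRUE question for a single vector, here is a small sketch with
simulated data (the seed and the object names are arbitrary choices). It
compares the I(x^2) form, the raw polynomial, and the default orthogonal
polynomial. The first two give the same coefficients; the orthogonal version
has different coefficients but the same fitted values, since all three
describe the same quadratic in x.

```r
set.seed(42)
dat <- data.frame(x = runif(20), y = runif(20))

m_I    <- lm(y ~ x + I(x^2), data = dat)
m_raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = dat)
m_orth <- lm(y ~ poly(x, degree = 2), data = dat)   # orthogonal basis

# Raw polynomial reproduces the x + I(x^2) coefficients
same_coef <- all.equal(unname(coef(m_I)), unname(coef(m_raw)))

# Orthogonal basis changes the coefficients but not the fit
same_fit  <- all.equal(fitted(m_raw), fitted(m_orth))
```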


Best;
David.


> 
> Cheers,
> Bert
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Sun, Jul 16, 2017 at 7:36 AM, David Winsemius  
> wrote:
>> 
>>> On Jul 13, 2017, at 7:43 AM, Bert Gunter  wrote:
>>> 
>>> Below.
>>> 
>>> -- Bert
>>> Bert Gunter
>>> 
>>> 
>>> 
>>> On Thu, Jul 13, 2017 at 3:07 AM, Luigi Biagini  
>>> wrote:
>>>> I have two ideas about it.
>>>>
>>>> 1-
>>>> i) Entering variables in quadratic form is done with the command
>>>> I(variable^2):
>>>> plsr(octane ~ NIR + I(NIR^2), ncomp = 10, data = gasTrain,
>>>>      validation = "LOO")
>>>> You could also use a new variable NIR_sq <- (NIR)^2
>>>>
>>>> ii) To insert a square variable, use syntax I(x^2) - it is very
>>>> important to insert I before the parentheses.
>>> 
>>> True, but better I believe: see ?poly.
>>> e.g. poly(cbind(x1,x2,x3), degree = 2, raw = TRUE) is a full quadratic
>>> polynomial in x1,x2,x3 .
>>> 
>> 
>> Is there any real difference between
>> 
>> octane ~ NIR * I(NIR^2)
>> octane ~ NIR * poly(NIR, degree=2, raw=TRUE)
>> 
>> ?
>> (I thought that adding raw = TRUE prevented the beneficial process of 
>> centering the second degree terms.)
>> __
>> David
>>> 
 
>>>> iii) If you want to make the interaction between x and x^2 use the
>>>> command ":" -> x:I(x^2)
 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, yep, there are two problems -- but i think only the segfault is within
the scope of a base R issue?  i need to look closer at the corrupted
decompression and figure out whether i should talk to the brazilian
government agency that creates that .rar file or open an issue with the
archive package maintainer.  my goal in this thread is only to figure out
how to replicate the goofy text file so the r team can turn it into an
error instead of a segfault.

the original example i sent stores the .txt file somewhere inside the
tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
gives the same result.  thanks again for looking at this

> tools::md5sum(infile)

C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados
ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
"30beb57419486108e98d42ec7a2f8b19"


> tools::md5sum( "S:/temp/crash.txt" )
 S:/temp/crash.txt
"30beb57419486108e98d42ec7a2f8b19"




On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller 
wrote:

> So you are saying there are two problems... one that produces a corrupt
> file from a valid compressed file, and one that segfaults when presented
> with that corrupt file? Can you please confirm the file name and run md5sum
> on it and share the result so we can tell when the file problem has been
> reproduced?
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 3:21:21 AM PDT, Anthony Damico 
> wrote:
> >hi, thank you for attempting this. it looks like your unix machine
> >unzipped
> >the txt file without corruption -- if you copied over the same txt file
> >to
> >windows 7, i don't think that would reproduce the problem?  i think it
> >needs to be the corrupted text file where   R.utils::countLines(
> >txtfile
> >)   gives 809367.  i am able to reproduce on two distinct windows
> >machines
> >but no guarantee i'm not doing something dumb
> >
> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller
> >
> >wrote:
> >
> >> I am not able to reproduce your segfault on a Windows 7 platform
> >either:
> >>
> >> ##
> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
> >> sessionInfo()
> >> ## R version 3.4.1 (2017-06-30)
> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> >> ##
> >> ## Matrix products: default
> >> ##
> >> ## locale:
> >> ## [1] LC_COLLATE=English_United States.1252
> >> ## [2] LC_CTYPE=English_United States.1252
> >> ## [3] LC_MONETARY=English_United States.1252
> >> ## [4] LC_NUMERIC=C
> >> ## [5] LC_TIME=English_United States.1252
> >> ##
> >> ## attached base packages:
> >> ## [1] stats graphics  grDevices utils datasets  methods
> >base
> >> ##
> >> ## loaded via a namespace (and not attached):
> >> ## [1] compiler_3.4.1
> >> tools::md5sum( fn1 )
> >> ## d:/DADOS_ENEM_2009.txt
> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> dat <- readLines( fn1 )
> >> length( dat )
> >> ## [1] 4148721
> >>
> >>
> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
> >>
> >> I am not able to reproduce this on a Linux platform:
> >>>
> >>> ###3
> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
> >>> 2009/DADOS_ENEM_2009.txt"
> >>> sessionInfo()
> >>> ## R version 3.4.1 (2017-06-30)
> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
> >>> ## Running under: Ubuntu 14.04.5 LTS
> >>> ##
> >>> ## Matrix products: default
> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> >>> ##
> >>> ## locale:
> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>> ##
> >>> ## attached base packages:
> >>> ## [1] stats graphics  grDevices utils datasets  methods
> >base
> >>> ##
> >>> ## loaded via a namespace (and not attached):
> >>> ## [1] compiler_3.4.1
> >>> tools::md5sum( fn1 )
> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
> >>> 2009/DADOS_ENEM_2009.txt
> >>> ##
> >>> "83e61c96092285b60d7bf6b0dbc7072e"
> >>> dat <- readLines( fn1 )
> >>> length( dat )
> >>> ## [1] 4148721
> >>>
> >>> No segfault occurs.
> >>>
> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
> >>>
> >>> hi, i realized that the segfault happens on the text file in a new R
>  session.  so, creating the segfault-generating text file requires a
>  contributed package, but prompting the actual segfault does not --
> >pretty
>  sure that means this is a base R bug?  submitted here:
>  https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
> >hopefully i
>  am
>  not doing something remarkably stupid.  the text file itself is 4GB
> >so
> 

Re: [R] How to formulate quadratic function with interaction terms for the PLS fitting model?

2017-07-16 Thread Bert Gunter
??
If I haven't misunderstood, they are completely different!

1) NIR must be a matrix, or poly(NIR,...) will fail.
2) Due to the previously identified bug in poly, degree must be
explicitly given as poly(NIR, degree =2,raw = TRUE).
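
As a rough check before fitting anything, the number of columns that poly() generates can be counted up front. The sketch below is an illustration only: the helper n_quadratic_terms() is a made-up name, and the small random matrix X merely stands in for NIR.

```r
# Hypothetical helper (not from the thread): number of non-constant terms in a
# full quadratic polynomial in p variables, i.e. p linear terms + p squares +
# choose(p, 2) pairwise interactions.
n_quadratic_terms <- function(p) choose(p + 2, 2) - 1

# Small stand-in for NIR (assumption): 10 observations of 4 variables.
X <- matrix(runif(40), ncol = 4)
P <- poly(X, degree = 2, raw = TRUE)   # degree is named explicitly

ncol(P)                  # 14 columns: 4 linear + 4 squares + 6 interactions
n_quadratic_terms(4)     # 14, matches ncol(P)
n_quadratic_terms(401)   # 81002 columns for the 401 NIR wavelengths
```

With only 50 training samples, a term count of that size is worth keeping in mind before passing the full expansion to plsr().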

Now consider the following example:

> df <-matrix(runif(60),ncol=3)
> y <- runif(20)
> mdl1 <-lm(y~df*I(df^2))
> mdl2 <-lm(y~df*poly(df,degree=2,raw=TRUE))
> length(coef(mdl1))
[1] 16
> length(coef(mdl2))
[1] 40

Explanation:
In mdl1, I(df^2) gives the squared values of the 3 columns of df. The
formula df*I(df^2) gives the 3 (linear) terms of df, the 3 pure
quadratics of I(df^2), the 9 cubic terms obtained by crossing these,
and the constant coefficient = 16 coefs.

In mdl2, the poly() expression gives 9 variables: 3 linear, 3 pure
quadratic, 3 interactions (1.2, 1.3, 2.3) of these.  The df*poly()
term would then give the 3 linear terms of df, the 9 terms of poly(),
the 27 crossings between these, and the constant coef = 40 coefs. Many of
these will be NA since terms are repeated (e.g. the 3 linear terms of
poly() and df) and therefore cannot be estimated.
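
To see how many of the coefficients are actually estimable, the NA entries can be counted directly. This is only a sketch on the same kind of simulated data; the seed and the object names X, P, n_coef and n_na are my own choices.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
X <- matrix(runif(60), ncol = 3)   # 20 observations, 3 predictors
y <- runif(20)

P <- poly(X, degree = 2, raw = TRUE)
colnames(P)   # column names encode the power of each variable, e.g. "1.1.0"

mdl1 <- lm(y ~ X * I(X^2))
mdl2 <- lm(y ~ X * P)

# Number of coefficients, and how many of them could not be estimated.
n_coef <- c(mdl1 = length(coef(mdl1)), mdl2 = length(coef(mdl2)))
n_na   <- c(mdl1 = sum(is.na(coef(mdl1))), mdl2 = sum(is.na(coef(mdl2))))
n_coef
n_na
```

With 20 rows the design matrix can have rank at most 20, so at least 20 of the 40 coefficients in mdl2 come back NA, whatever the duplicated terms contribute.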

Have I totally misunderstood what you meant or committed some other blunder?


Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jul 16, 2017 at 7:36 AM, David Winsemius  wrote:
>
>> On Jul 13, 2017, at 7:43 AM, Bert Gunter  wrote:
>>
>> Below.
>>
>> -- Bert
>> Bert Gunter
>>
>>
>>
>> On Thu, Jul 13, 2017 at 3:07 AM, Luigi Biagini  
>> wrote:
>>> I have two ideas about it.
>>>
>>> 1-
>>> i) Entering variables in quadratic form is done with the command I
>>> (variable ^ 2) -
>>> plsr (octane ~ NIR + I (NIR ^ 2), ncomp = 10, data = gasTrain, validation =
>>> "LOO")
>>> You could also use a new variable NIR_sq <- (NIR) ^ 2
>>>
>>> ii) To insert a square variable, use syntax I (x ^ 2) - it is very
>>> important to insert I before the parentheses.
>>
>> True, but better I believe: see ?poly.
>> e.g. poly(cbind(x1,x2,x3), degree = 2, raw = TRUE) is a full quadratic
>> polynomial in x1,x2,x3 .
>>
>
> Is there any real difference between
>
> octane ~ NIR * I(NIR^2)
> octane ~ NIR * poly(NIR, degree=2, raw=TRUE)
>
> ?
> (I thought that adding raw = TRUE prevented the beneficial process of
> centering the second degree terms.)
> __
> David
>>
>>>
>>> iii) If you want to make the interaction between x and x ^ 2 use the
>>> command ":" -> x: I(x ^ 2)
>>>
>>> iv) For multiple interactions between x and x ^ 2 use the command "*" -> x
>>> *I (x ^ 2)
>>>
>>> i) plsr (octane ~ NIR + NIR_sq, ncomp = 10, data = gasTrain, validation =
>>> "LOO") I (x ^ 2)
>>> ii) plsr (octane ~ NIR + I(NIR^2), ncomp = 10, data = gasTrain, validation
>>> = "LOO") I (x ^ 2)
>>> iii) plsr (octane ~ NIR : I(NIR^2), ncomp = 10, data = gasTrain, validation
>>> = "LOO") I (x ^ 2)
>>> iv) plsr (octane ~ NIR * I(NIR^2), ncomp = 10, data = gasTrain, validation
>>> = "LOO") I (x ^ 2)
>>>
>>> 2 - For your regression, did you plan to use MARS instead of PLS?
>>>
>>>
>>>
>>>
>>> Dear all,
>>>> I am using the pls package of R to perform partial least square on a set of
>>>> multivariate data.  Instead of fitting a linear model, I want to fit my
>>>> data with a quadratic function with interaction terms.  But I am not sure
>>>> how.  I will use an example to illustrate my problem:
>>>> Following the example in the PLS manual:
>>>> ## Read data
>>>> data(gasoline)
>>>> gasTrain <- gasoline[1:50,]
>>>> ## Perform PLS
>>>> gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
>>>> where octane ~ NIR is the model that this example is fitting with.
>>>> NIR is a collective of variables, i.e. NIR spectra consists of 401 diffuse
>>>> reflectance measurements from 900 to 1700 nm.
>>>> Instead of fitting with octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ...
>>>> I want to fit the data with:
>>>> octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ... +
>>>> b[0]*NIR[0,i]*NIR[0,i] + b[1] * NIR[0,i]*NIR[1,i] + ...
>>>> i.e. quadratic with interaction terms.
>>>> But I don't know how to formulate this.
>>>> May I have some help please?
>>>> Thanks,
>>>> Kelvin

Re: [R] About doing figures

2017-07-16 Thread lily li
With more than 10 records, how should the colors be chosen? Also, how can I
show only the first legend, placed at the bottom? The second legend in your
code is not necessary. Overall, records with the same A value should have the
same color, but different symbols for DF==1 and DF==2.
Thanks for your help.
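
One way the request above could be approached with base graphics is sketched here. The data frame is copied from the original post; the names a_levels, a_colors and point_col, the margin size and the inset value are my own assumptions and may need tuning for a given device size.

```r
# Data copied from the original post.
dfm <- data.frame(
  DF = rep(1:2, each = 10),
  A  = rep(c(65, 66, 54, 44, 67, 66, 45, 56, 40, 39), times = 2),
  B  = c(21, 23, 24, 23, 22, 21, 20, 19, 25, 24,
         25, 20, 21, 30, 27, 20, 25, 14, 29, 29),
  C  = c(54, 55, 56, 53, 52, 50, 51, 57, 58, 53,
         52, 50, 48, 49, 50, 30, 56, 51, 48, 23)
)

# One colour per distinct A value, however many distinct values there are.
a_levels  <- sort(unique(dfm$A))
a_colors  <- setNames(rainbow(length(a_levels)), a_levels)
point_col <- a_colors[as.character(dfm$A)]   # look up each record's colour by A

# Extra bottom margin leaves room for a legend below the x-axis label.
op <- par(mar = c(9, 4, 2, 2))
plot(dfm$B, dfm$C, xlab = "Value B", ylab = "Value C",
     pch = ifelse(dfm$DF == 1, 1, 19),   # hollow for DF==1, solid for DF==2
     col = point_col)
# A negative inset with xpd = TRUE moves the legend outside the plot region.
legend("bottom", legend = a_levels, col = a_colors, pch = 15,
       ncol = 5, cex = 0.8, inset = c(0, -0.35), xpd = TRUE, title = "A")
par(op)
```

Looking colours up by the value of A, rather than by row position, is what keeps equal A values on the same colour in both DF groups.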

On Sun, Jul 16, 2017 at 9:28 AM, lily li  wrote:

> Hi Jim,
>
> For true color, I meant that the points in the figure do not correspond to
> the values from the dataframe. Also, why to use rainbow(9) here? And the
> legend is straight in the middle, is it possible to reformat it to the very
> bottom? Thanks again.
>
> On Sun, Jul 16, 2017 at 2:50 AM, Jim Lemon  wrote:
>
>> Hi lily,
>> As I have no idea of what the "true record" is, I can only guess.
>> Maybe this will help:
>>
>> # get some fairly distinct colors
>> rainbow_colors<-rainbow(9)
>> # this should sort the numbers in dfm$A
>> dfm$Acolor<-factor(dfm$A)
>> plot(dfm$B,dfm$C,pch=ifelse(dfm$DF==1,1,19),
>>  col=rainbow_colors[as.numeric(dfm$Acolor)])
>> legend("bottom",legend=sort(unique(dfm$A)),
>>  fill=rainbow_colors)
>> legend(25,35,c("DF=1","DF=2"),pch=c(1,19))
>>
>> Jim
>>
>>
>> On Sun, Jul 16, 2017 at 3:43 PM, lily li  wrote:
>> > Hi R users,
>> >
>> > I still have the problem about plotting. I wanted to put the datasets on
>> > one figure, x-axis represents values B, y-axis represents values C, while
>> > different colors label column A. Each record uses a circle on the figure,
>> > while hollow circles represent DF=1 and solid circles represent DF=2. I put
>> > my code below, but the A labels do not correspond to the true record, so I
>> > don't know what is the problem. Thanks for your help.
>> >
>> > dfm
>> > dfm1= subset(dfm, DF==1)
>> > dfm2= subset(dfm, DF==2)
>> > plot(c(15:30),seq(from=0,to=60,by=4),pch=19,col=NULL,xlab='Value B',ylab='Value C')
>> > Color = as.factor(dfm1$A)
>> > colordist = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)] # for unique colors
>> > Color.unq = sample(colordist,length(Color))
>> >
>> > points(dfm1[,3],dfm1[,4],col=Color.unq,pch=1)
>> > points(dfm2[,3],dfm2[,4],col=Color.unq,pch=19)
>> > legend('bottom',as.character(Color.unq),col=Color.unq,lwd=rep(2,length(Color.unq)),cex=.6,ncol=5)
>> > legend('bottom',as.character(Color),col=Color.unq,lwd=3,cex=.6,ncol=5,text.width=c(9.55,9.6,9.55))
>> >
>> > dfm is the dataframe below.
>> >
>> > DF   A  B  C
>> > 1 65 21 54
>> > 1 66 23 55
>> > 1 54 24 56
>> > 1 44 23 53
>> > 1 67 22 52
>> > 1 66 21 50
>> > 1 45 20 51
>> > 1 56 19 57
>> > 1 40 25 58
>> > 1 39 24 53
>> > 2 65 25 52
>> > 2 66 20 50
>> > 2 54 21 48
>> > 2 44 30 49
>> > 2 67 27 50
>> > 2 66 20 30
>> > 2 45 25 56
>> > 2 56 14 51
>> > 2 40 29 48
>> > 2 39 29 23
>> >
>>
>
>



Re: [R] About doing figures

2017-07-16 Thread lily li
Hi Jim,

By true color, I meant that the points in the figure do not correspond to
the values from the dataframe. Also, why use rainbow(9) here? And the
legend sits right in the middle; is it possible to move it to the very
bottom? Thanks again.

On Sun, Jul 16, 2017 at 2:50 AM, Jim Lemon  wrote:

> Hi lily,
> As I have no idea of what the "true record" is, I can only guess.
> Maybe this will help:
>
> # get some fairly distinct colors
> rainbow_colors<-rainbow(9)
> # this should sort the numbers in dfm$A
> dfm$Acolor<-factor(dfm$A)
> plot(dfm$B,dfm$C,pch=ifelse(dfm$DF==1,1,19),
>  col=rainbow_colors[as.numeric(dfm$Acolor)])
> legend("bottom",legend=sort(unique(dfm$A)),
>  fill=rainbow_colors)
> legend(25,35,c("DF=1","DF=2"),pch=c(1,19))
>
> Jim
>
>
> On Sun, Jul 16, 2017 at 3:43 PM, lily li  wrote:
> > Hi R users,
> >
> > I still have the problem about plotting. I wanted to put the datasets on
> > one figure, x-axis represents values B, y-axis represents values C, while
> > different colors label column A. Each record uses a circle on the figure,
> > while hollow circles represent DF=1 and solid circles represent DF=2. I put
> > my code below, but the A labels do not correspond to the true record, so I
> > don't know what is the problem. Thanks for your help.
> >
> > dfm
> > dfm1= subset(dfm, DF==1)
> > dfm2= subset(dfm, DF==2)
> > plot(c(15:30),seq(from=0,to=60,by=4),pch=19,col=NULL,xlab='Value B',ylab='Value C')
> > Color = as.factor(dfm1$A)
> > colordist = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)] # for unique colors
> > Color.unq = sample(colordist,length(Color))
> >
> > points(dfm1[,3],dfm1[,4],col=Color.unq,pch=1)
> > points(dfm2[,3],dfm2[,4],col=Color.unq,pch=19)
> > legend('bottom',as.character(Color.unq),col=Color.unq,lwd=rep(2,length(Color.unq)),cex=.6,ncol=5)
> > legend('bottom',as.character(Color),col=Color.unq,lwd=3,cex=.6,ncol=5,text.width=c(9.55,9.6,9.55))
> >
> > dfm is the dataframe below.
> >
> > DF   A  B  C
> > 1 65 21 54
> > 1 66 23 55
> > 1 54 24 56
> > 1 44 23 53
> > 1 67 22 52
> > 1 66 21 50
> > 1 45 20 51
> > 1 56 19 57
> > 1 40 25 58
> > 1 39 24 53
> > 2 65 25 52
> > 2 66 20 50
> > 2 54 21 48
> > 2 44 30 49
> > 2 67 27 50
> > 2 66 20 30
> > 2 45 25 56
> > 2 56 14 51
> > 2 40 29 48
> > 2 39 29 23
> >
>



Re: [R] How to formulate quadratic function with interaction terms for the PLS fitting model?

2017-07-16 Thread David Winsemius

> On Jul 13, 2017, at 7:43 AM, Bert Gunter  wrote:
> 
> Below.
> 
> -- Bert
> Bert Gunter
> 
> 
> 
> On Thu, Jul 13, 2017 at 3:07 AM, Luigi Biagini  
> wrote:
>> I have two ideas about it.
>> 
>> 1-
>> i) Entering variables in quadratic form is done with the command I
>> (variable ^ 2) -
>> plsr (octane ~ NIR + I (NIR ^ 2), ncomp = 10, data = gasTrain, validation =
>> "LOO")
>> You could also use a new variable NIR_sq <- (NIR) ^ 2
>> 
>> ii) To insert a square variable, use syntax I (x ^ 2) - it is very
>> important to insert I before the parentheses.
> 
> True, but better I believe: see ?poly.
> e.g. poly(cbind(x1,x2,x3), degree = 2, raw = TRUE) is a full quadratic
> polynomial in x1,x2,x3 .
> 

Is there any real difference between 

octane ~ NIR * I(NIR^2)
octane ~ NIR * poly(NIR, degree=2, raw=TRUE)

?
(I thought that adding raw = TRUE prevented the beneficial process of centering
the second degree terms.)
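
A small base-R sketch of what raw = TRUE changes, on simulated data (the seed, the variable names and the response are assumptions made for illustration; this uses lm(), not plsr()):

```r
set.seed(42)                                     # arbitrary seed
x <- runif(30)
y <- 1 + 2 * x - 3 * x^2 + rnorm(30, sd = 0.1)   # simulated response

P_raw  <- poly(x, degree = 2, raw = TRUE)   # plain x and x^2
P_orth <- poly(x, degree = 2)               # orthogonal polynomials (default)

# Correlation between the linear and quadratic columns in each basis.
cor_raw  <- cor(P_raw[, 1],  P_raw[, 2])    # close to 1 for x in (0, 1)
cor_orth <- cor(P_orth[, 1], P_orth[, 2])   # zero up to rounding error

# Both bases span the same column space, so the least-squares fits agree.
fit_raw  <- lm(y ~ P_raw)
fit_orth <- lm(y ~ P_orth)
same_fit <- isTRUE(all.equal(fitted(fit_raw), fitted(fit_orth)))
same_fit
```

For ordinary least squares the choice changes the coefficients and their conditioning but not the fitted values. Whether that carries over to plsr(), which builds components from the predictor matrix, is not something this sketch checks.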
__ 
David
> 
>> 
>> iii) If you want to make the interaction between x and x ^ 2 use the
>> command ":" -> x: I(x ^ 2)
>> 
>> iv) For multiple interactions between x and x ^ 2 use the command "*" -> x
>> *I (x ^ 2)
>> 
>> i) plsr (octane ~ NIR + NIR_sq, ncomp = 10, data = gasTrain, validation =
>> "LOO") I (x ^ 2)
>> ii) plsr (octane ~ NIR + I(NIR^2), ncomp = 10, data = gasTrain, validation
>> = "LOO") I (x ^ 2)
>> iii) plsr (octane ~ NIR : I(NIR^2), ncomp = 10, data = gasTrain, validation
>> = "LOO") I (x ^ 2)
>> iv) plsr (octane ~ NIR * I(NIR^2), ncomp = 10, data = gasTrain, validation
>> = "LOO") I (x ^ 2)
>> 
>> 2 - For your regression, did you plan to use MARS instead of PLS?
>> 
>> 
>> 
>> 
>> Dear all,
>>> I am using the pls package of R to perform partial least square on a set of
>>> multivariate data.  Instead of fitting a linear model, I want to fit my
>>> data with a quadratic function with interaction terms.  But I am not sure
>>> how.  I will use an example to illustrate my problem:
>>> Following the example in the PLS manual:
>>> ## Read data
>>> data(gasoline)
>>> gasTrain <- gasoline[1:50,]
>>> ## Perform PLS
>>> gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
>>> where octane ~ NIR is the model that this example is fitting with.
>>> NIR is a collective of variables, i.e. NIR spectra consists of 401 diffuse
>>> reflectance measurements from 900 to 1700 nm.
>>> Instead of fitting with octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ...
>>> I want to fit the data with:
>>> octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ... +
>>> b[0]*NIR[0,i]*NIR[0,i] + b[1] * NIR[0,i]*NIR[1,i] + ...
>>> i.e. quadratic with interaction terms.
>>> But I don't know how to formulate this.
>>> May I have some help please?
>>> Thanks,
>>> Kelvin
>> 

David Winsemius
Alameda, CA, USA



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Jeff Newmiller
So you are saying there are two problems... one that produces a corrupt file 
from a valid compressed file, and one that segfaults when presented with that 
corrupt file? Can you please confirm the file name and run md5sum on it and 
share the result so we can tell when the file problem has been reproduced?
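
The checks asked for above might look roughly like this. Since the real file name is not known here, the sketch runs on a tiny stand-in file with one embedded nul; tmp and the other object names are assumptions.

```r
# Tiny stand-in file (assumption): "ab", one nul byte, then "cd\nef\n".
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("cd\nef\n")), tmp)

size_bytes <- file.size(tmp)               # 9 bytes in this stand-in file
checksum   <- unname(tools::md5sum(tmp))   # 32-character hex digest
n_lines    <- length(readLines(tmp, skipNul = TRUE))   # nuls skipped on read

size_bytes
checksum
n_lines
```

Comparing size, checksum and line count between machines should show whether two people are looking at the same bytes.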
-- 
Sent from my phone. Please excuse my brevity.

On July 16, 2017 3:21:21 AM PDT, Anthony Damico  wrote:
>hi, thank you for attempting this. it looks like your unix machine unzipped
>the txt file without corruption -- if you copied over the same txt file to
>windows 7, i don't think that would reproduce the problem?  i think it
>needs to be the corrupted text file where   R.utils::countLines( txtfile )
>gives 809367.  i am able to reproduce on two distinct windows machines
>but no guarantee i'm not doing something dumb
>
>On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller
>
>wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform
>either:
>>
>> ##
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>> I am not able to reproduce this on a Linux platform:
>>>
>>> ###3
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>>> 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>>> 2009/DADOS_ENEM_2009.txt
>>> ##
>>> "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>> hi, i realized that the segfault happens on the text file in a new R
 session.  so, creating the segfault-generating text file requires a
 contributed package, but prompting the actual segfault does not --
>pretty
 sure that means this is a base R bug?  submitted here:
 https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 
>hopefully i
 am
 not doing something remarkably stupid.  the text file itself is 4GB
>so
 cannot upload it to bugzilla, and from the R_AllocStringBuffer
>error in
 the
 previous message, i think most or all of it needs to be there to
>trigger
 the segfault.  thanks!


 On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico
>
 wrote:

 hi, thanks Dr. Murdoch
>
>
> i'd appreciate if anyone on r-help could help me narrow this down?
> i
> believe the segfault occurs because there's a single line with 4GB
>and
> also
> embedded nuls, but i am not sure how to artificially construct
>that?
>
>
> the lodown package can be removed from my example..  it is just
>for file
> download cacheing, so `lodown::cachaca` can be replaced with
> `download.file`  my current example requires a huge download, so
>sort of
> painful to repeat but i'm pretty confident that's not the issue.
>
>
> the archive::archive_extract() function unzips a (probably
>corrupt) .RAR
> file and creates a text file with 80,937 lines.  this file is 4GB:
>
>> file.size(infile)
> [1] 4078192743
>
>
> i am pretty sure that nearly all of that 4GB is contained on a
>single
> line
> in the file.  here's what happens 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, the text file that prompts the segfault is 4gb but only 80,937 lines

> file.info( "S:/temp/crash.txt")
                        size isdir mode               mtime               ctime               atime exe
S:/temp/crash.txt 4078192743 FALSE  666 2017-07-15 17:24:35 2017-07-15 17:19:47 2017-07-15 17:19:47  no




On Sun, Jul 16, 2017 at 6:34 AM, Duncan Murdoch 
wrote:

> On 16/07/2017 6:17 AM, Anthony Damico wrote:
>
>> thank you for taking the time to write this.  i set it running last
>> night and it's still going -- if it doesn't finish by tomorrow, i will
>> try to find a site to host the problem file and add that link to the bug
>> report so the archive package can be avoided at least.  i'm sorry for
>> the bother
>>
>>
> How big is that text file?  I wouldn't expect my script to take more than
> a few minutes even on a huge file.
>
> My script might have a bug...
>
> Duncan Murdoch
>
> On Sat, Jul 15, 2017 at 4:14 PM, Duncan Murdoch
>> > wrote:
>>
>> On 15/07/2017 11:33 AM, Anthony Damico wrote:
>>
>> hi, i realized that the segfault happens on the text file in a
>> new R
>> session.  so, creating the segfault-generating text file requires
>> a
>> contributed package, but prompting the actual segfault does not --
>> pretty sure that means this is a base R bug?  submitted here:
>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>> 
>> hopefully i
>> am not doing something remarkably stupid.  the text file itself
>> is 4GB
>> so cannot upload it to bugzilla, and from the
>> R_AllocStringBuffer error
>> in the previous message, i think most or all of it needs to be
>> there to
>> trigger the segfault.  thanks!
>>
>>
>> I don't want to download the big file or install the archive
>> package. Could you run the code below on the bad file?  If you're
>> right and it's only nulls that matter, this might allow me to create
>> a file that triggers the bug.
>>
>> f <-  # put the filename of the bad file here
>>
>> con <- file(f, open="rb")
>> count <- 0  # bytes read so far
>> zeros <- numeric()
>> repeat {
>>   bytes <- readBin(con, "int", 100, size=1)
>>   zeros <- c(zeros, count + which(bytes == 0))
>>   count <- count + length(bytes)
>>   if (length(bytes) < 100) break
>> }
>> close(con)
>> cat("File length=", count, "\n")
>> cat("Nulls:\n")
>> zeros
>>
>> Here's some code to recreate a file of the same length with nulls in
>> the same places, and spaces everywhere else:
>>
>> size <- count
>> f2 <- tempfile()
>> con <- file(f2, open="wb")
>> count <- 0
>> while (count < size) {
>>   nonzeros <- min(c(size - count, 100, zeros - 1))
>>   if (nonzeros) {
>>     writeBin(rep(32L, nonzeros), con, size = 1)
>>     count <- count + nonzeros
>>   }
>>   zeros <- zeros - nonzeros
>>   if (length(zeros) && min(zeros) == 1) {
>>     writeBin(0L, con, size = 1)
>>     count <- count + 1
>>     zeros <- zeros[-1] - 1
>>   }
>> }
>> close(con)
>>
>> Duncan Murdoch
>>
>>
>>
>>
>>
>



Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Duncan Murdoch

On 16/07/2017 6:17 AM, Anthony Damico wrote:

thank you for taking the time to write this.  i set it running last
night and it's still going -- if it doesn't finish by tomorrow, i will
try to find a site to host the problem file and add that link to the bug
report so the archive package can be avoided at least.  i'm sorry for
the bother



How big is that text file?  I wouldn't expect my script to take more 
than a few minutes even on a huge file.


My script might have a bug...

Duncan Murdoch


On Sat, Jul 15, 2017 at 4:14 PM, Duncan Murdoch
> wrote:

On 15/07/2017 11:33 AM, Anthony Damico wrote:

hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not --
pretty sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311

hopefully i
am not doing something remarkably stupid.  the text file itself
is 4GB
so cannot upload it to bugzilla, and from the
R_AllocStringBuffer error
in the previous message, i think most or all of it needs to be
there to
trigger the segfault.  thanks!


I don't want to download the big file or install the archive
package. Could you run the code below on the bad file?  If you're
right and it's only nulls that matter, this might allow me to create
a file that triggers the bug.

f <-  # put the filename of the bad file here

con <- file(f, open="rb")
count <- 0  # bytes read so far
zeros <- numeric()
repeat {
  bytes <- readBin(con, "int", 100, size=1)
  zeros <- c(zeros, count + which(bytes == 0))
  count <- count + length(bytes)
  if (length(bytes) < 100) break
}
close(con)
cat("File length=", count, "\n")
cat("Nulls:\n")
zeros

Here's some code to recreate a file of the same length with nulls in
the same places, and spaces everywhere else:

size <- count
f2 <- tempfile()
con <- file(f2, open="wb")
count <- 0
while (count < size) {
  nonzeros <- min(c(size - count, 100, zeros - 1))
  if (nonzeros) {
    writeBin(rep(32L, nonzeros), con, size = 1)
    count <- count + nonzeros
  }
  zeros <- zeros - nonzeros
  if (length(zeros) && min(zeros) == 1) {
    writeBin(0L, con, size = 1)
    count <- count + 1
    zeros <- zeros[-1] - 1
  }
}
close(con)
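
A possible variant of the first script, sketched as a function: it starts count at 0 before the loop, reads raw bytes in larger chunks, and collects the nul positions in a list instead of growing a vector 100 bytes at a time. The name find_nuls, the chunk size and the small demo file are my own assumptions, not part of the original script.

```r
# Hypothetical helper: report the file length and the 1-based byte positions
# of all nul bytes in the file at `path`.
find_nuls <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  count <- 0        # bytes read so far; must be set before the loop
  zeros <- list()   # nul positions, one list element per chunk
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    zeros[[length(zeros) + 1]] <- count + which(bytes == as.raw(0))
    count <- count + length(bytes)
  }
  list(size = count, nuls = unlist(zeros))
}

# Demo on a 6-byte file with nuls at positions 2 and 5; the small chunk size
# forces the scan to cross a chunk boundary.
demo_file <- tempfile()
writeBin(as.raw(c(65, 0, 66, 67, 0, 10)), demo_file)
res <- find_nuls(demo_file, chunk_size = 4)
res$size   # 6
res$nuls   # 2 5
```

On a 4GB file, reading 100 bytes per iteration means roughly 40 million iterations, which may be part of the reason for a very long run time.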

Duncan Murdoch








Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
sorry, typo, 80937 not 809367

On Sun, Jul 16, 2017 at 6:21 AM, Anthony Damico  wrote:

> hi, thank you for attempting this. it looks like your unix machine
> unzipped the txt file without corruption -- if you copied over the same txt
> file to windows 7, i don't think that would reproduce the problem?  i think
> it needs to be the corrupted text file where   R.utils::countLines( txtfile
> )   gives 809367.  i am able to reproduce on two distinct windows machines
> but no guarantee i'm not doing something dumb
>
> On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller 
> wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>
>> ##
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats graphics  grDevices utils datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>> I am not able to reproduce this on a Linux platform:
>>>
>>> ###3
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>>> 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats graphics  grDevices utils datasets  methods   base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>>> 2009/DADOS_ENEM_2009.txt
>>> ##
>>> "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>> hi, i realized that the segfault happens on the text file in a new R
 session.  so, creating the segfault-generating text file requires a
 contributed package, but prompting the actual segfault does not --
 pretty
 sure that means this is a base R bug?  submitted here:
 https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully
 i am
 not doing something remarkably stupid.  the text file itself is 4GB so
 cannot upload it to bugzilla, and from the R_AllocStringBuffer error in
 the
 previous message, i think most or all of it needs to be there to trigger
 the segfault.  thanks!


 On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico 
 wrote:

 hi, thanks Dr. Murdoch
>
>
> i'd appreciate if anyone on r-help could help me narrow this down?  i
> believe the segfault occurs because there's a single line with 4GB and
> also
> embedded nuls, but i am not sure how to artificially construct that?
>
>
> the lodown package can be removed from my example..  it is just for
> file
> download cacheing, so `lodown::cachaca` can be replaced with
> `download.file`  my current example requires a huge download, so sort
> of
> painful to repeat but i'm pretty confident that's not the issue.
>
>
> the archive::archive_extract() function unzips a (probably corrupt)
> .RAR
> file and creates a text file with 80,937 lines.  this file is 4GB:
>
>> file.size(infile)
> [1] 4078192743
>
>
> i am pretty sure that nearly all of that 4GB is contained on a single
> line
> in the file.  here's what happens when i create a file connection and
> scan
> through..
>
>> file_con <- file( infile , 'r' )
>>
>> first_80936_lines <- readLines( file_con , n = 80936 )
>> scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "123930632009"
>> scan( w , n = 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
hi, thank you for attempting this. it looks like your unix machine unzipped
the txt file without corruption -- if you copied over the same txt file to
windows 7, i don't think that would reproduce the problem?  i think it
needs to be the corrupted text file where   R.utils::countLines( txtfile
)   gives 809367.  i am able to reproduce on two distinct windows machines
but no guarantee i'm not doing something dumb

On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller 
wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats graphics  grDevices utils datasets  methods   base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ## d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
> I am not able to reproduce this on a Linux platform:
>>
>> ###3
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>> 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ##  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>> ##  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>> ##  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
>> ##  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>> ##  [9] LC_ADDRESS=C   LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats graphics  grDevices utils datasets  methods   base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem
>> 2009/DADOS_ENEM_2009.txt
>> ##
>> "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>
>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>
>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not -- pretty
>>> sure that means this is a base R bug?  submitted here:
>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
>>> am
>>> not doing something remarkably stupid.  the text file itself is 4GB so
>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in
>>> the
>>> previous message, i think most or all of it needs to be there to trigger
>>> the segfault.  thanks!
>>>
>>>
>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico 
>>> wrote:
>>>
>>> hi, thanks Dr. Murdoch


 i'd appreciate if anyone on r-help could help me narrow this down?  i
 believe the segfault occurs because there's a single line with 4GB and
 also
 embedded nuls, but i am not sure how to artificially construct that?
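
 A small file with an embedded nul can be put together with writeBin().
 This is only a sketch of the construction on a tiny temporary file (the
 contents are made up); it is not expected to reproduce the segfault, which
 seems to need the multi-gigabyte line.

```r
# Build a tiny text file whose first line contains an embedded nul byte.
f <- tempfile(fileext = ".txt")
con <- file(f, open = "wb")
writeBin(c(charToRaw("abc"), as.raw(0), charToRaw("def\n"),
           charToRaw("second line\n")), con)
close(con)

# With skipNul = TRUE the nul is dropped and the line is read whole.
lines_skip <- readLines(f, skipNul = TRUE)

# Count the nul bytes directly from the raw contents of the file.
bytes <- readBin(f, what = "raw", n = file.size(f))
n_nul <- sum(bytes == as.raw(0))
```

 The same writeBin() calls could in principle be repeated in a loop to grow
 the first line, but whether that triggers the crash is untested.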


 the lodown package can be removed from my example..  it is just for file
 download cacheing, so `lodown::cachaca` can be replaced with
 `download.file`  my current example requires a huge download, so sort of
 painful to repeat but i'm pretty confident that's not the issue.


 the archive::archive_extract() function unzips a (probably corrupt) .RAR
 file and creates a text file with 80,937 lines.  this file is 4GB:

> file.size(infile)
 [1] 4078192743


 i am pretty sure that nearly all of that 4GB is contained on a single
 line
 in the file.  here's what happens when i create a file connection and
 scan
 through..

> file_con <- file( infile , 'r' )
>
> first_80936_lines <- readLines( file_con , n = 80936 )
> scan( w , n = 1 , what = character() )
 Read 1 item
 [1] "123930632009"
> scan( w , n = 1 , what = character() )
 Read 1 item
 [1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
 Read 1 item
 [1] "AFONSO"
> scan( w , n = 1 , what = character() )
 Read 1 item
 [1] "BA11"
> 

Re: [R] readLines without skipNul=TRUE causes crash

2017-07-16 Thread Anthony Damico
thank you for taking the time to write this.  i set it running last night
and it's still going -- if it doesn't finish by tomorrow, i will try to
find a site to host the problem file and add that link to the bug report so
the archive package can be avoided at least.  i'm sorry for the bother

On Sat, Jul 15, 2017 at 4:14 PM, Duncan Murdoch 
wrote:

> On 15/07/2017 11:33 AM, Anthony Damico wrote:
>
>> hi, i realized that the segfault happens on the text file in a new R
>> session.  so, creating the segfault-generating text file requires a
>> contributed package, but prompting the actual segfault does not --
>> pretty sure that means this is a base R bug?  submitted here:
>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
>> am not doing something remarkably stupid.  the text file itself is 4GB
>> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
>> in the previous message, i think most or all of it needs to be there to
>> trigger the segfault.  thanks!
>>
>
> I don't want to download the big file or install the archive package.
> Could you run the code below on the bad file?  If you're right and it's
> only nulls that matter, this might allow me to create a file that triggers
> the bug.
>
> f <-  # put the filename of the bad file here
>
> con <- file(f, open="rb")
> zeros <- numeric()
> count <- 0  # running byte offset; must exist before the loop uses it
> repeat {
>   bytes <- readBin(con, "int", 100, size=1)
>   zeros <- c(zeros, count + which(bytes == 0))
>   count <- count + length(bytes)
>   if (length(bytes) < 100) break
> }
> close(con)
> cat("File length=", count, "\n")
> cat("Nulls:\n")
> zeros
>
> Here's some code to recreate a file of the same length with nulls in the
> same places, and spaces everywhere else:
>
> size <- count
> f2 <- tempfile()
> con <- file(f2, open="wb")
> count <- 0
> while (count < size) {
>   nonzeros <- min(c(size - count, 100, zeros - 1))
>   if (nonzeros) {
>     writeBin(rep(32L, nonzeros), con, size = 1)
>     count <- count + nonzeros
>   }
>   zeros <- zeros - nonzeros
>   if (length(zeros) && min(zeros) == 1) {
>     writeBin(0L, con, size = 1)
>     count <- count + 1
>     zeros <- zeros[-1] - 1
>   }
> }
> close(con)
>
> Duncan Murdoch
>
>
>
>



Re: [R] About doing figures

2017-07-16 Thread Jim Lemon
Hi lily,
As I have no idea of what the "true record" is, I can only guess.
Maybe this will help:

# get some fairly distinct colors
rainbow_colors<-rainbow(9)
# this should sort the numbers in dfm$A
dfm$Acolor<-factor(dfm$A)
plot(dfm$B,dfm$C,pch=ifelse(dfm$DF==1,1,19),
 col=rainbow_colors[as.numeric(dfm$Acolor)])
legend("bottom",legend=sort(unique(dfm$A)),
 fill=rainbow_colors)
legend(25,35,c("DF=1","DF=2"),pch=c(1,19))
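
To run the lines above, dfm can be rebuilt from the table in the original
post (quoted below). This is just a sketch; the last line only illustrates
the mapping used for col=, where each distinct value of A gets one colour
index so that equal A values share a colour.

```r
# Data frame retyped from the table in the original post.
dfm <- data.frame(
  DF = rep(1:2, each = 10),
  A  = rep(c(65, 66, 54, 44, 67, 66, 45, 56, 40, 39), times = 2),
  B  = c(21, 23, 24, 23, 22, 21, 20, 19, 25, 24,
         25, 20, 21, 30, 27, 20, 25, 14, 29, 29),
  C  = c(54, 55, 56, 53, 52, 50, 51, 57, 58, 53,
         52, 50, 48, 49, 50, 30, 56, 51, 48, 23)
)

# One colour index per distinct value of A (levels are sorted ascending).
idx <- as.numeric(factor(dfm$A))
```

Sampling one random colour per row, as in the original code, ties colours to
row positions rather than to the values of A, which is probably why the
labels did not match.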

Jim


On Sun, Jul 16, 2017 at 3:43 PM, lily li  wrote:
> Hi R users,
>
> I still have the problem about plotting. I wanted to put the datasets on
> one figure, x-axis represents values B, y-axis represents values C, while
> different colors label column A. Each record uses a circle on the figure,
> while hollow circles represent DF=1 and solid circles represent DF=2. I put
> my code below, but the A labels do not correspond to the true record, so I
> don't know what is the problem. Thanks for your help.
>
> dfm
> dfm1= subset(dfm, DF==1)
> dfm2= subset(dfm, DF==2)
> plot(c(15:30),seq(from=0,to=60,by=4),pch=19,col=NULL,
>      xlab='Value B',ylab='Value C')
> Color = as.factor(dfm1$A)
> colordist = grDevices::colors()[grep('gr(a|e)y', grDevices::colors(),
>                                      invert = T)] # for unique colors
> Color.unq = sample(colordist,length(Color))
>
> points(dfm1[,3],dfm1[,4],col=Color.unq,pch=1)
> points(dfm2[,3],dfm2[,4],col=Color.unq,pch=19)
> legend('bottom',as.character(Color.unq),col=Color.unq,lwd=rep(2,length(Color.unq)),cex=.6,ncol=5)
> legend('bottom',as.character(Color),col=Color.unq,lwd=3,cex=.6,ncol=5,text.width=c(9.55,9.6,9.55))
>
> dfm is the dataframe below.
>
> DF   A  B  C
> 1 65 21 54
> 1 66 23 55
> 1 54 24 56
> 1 44 23 53
> 1 67 22 52
> 1 66 21 50
> 1 45 20 51
> 1 56 19 57
> 1 40 25 58
> 1 39 24 53
> 2 65 25 52
> 2 66 20 50
> 2 54 21 48
> 2 44 30 49
> 2 67 27 50
> 2 66 20 30
> 2 45 25 56
> 2 56 14 51
> 2 40 29 48
> 2 39 29 23
>



Re: [R] Arranging column data to create plots

2017-07-16 Thread Ulrik Stervbo
Hi Michael,

Try gather from the tidyr package
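
If tidyr is not at hand, one way to keep each X/Y pair together is base R's
reshape(), which stacks both sets of columns in a single step. This is a
minimal sketch that swaps in reshape() for gather(); it assumes the empty
cells were imported as NA, and DF is retyped from the example in the
question. The column name "point" is made up for illustration.

```r
# Example data from the question, with missing pairs as NA (an assumption).
DF <- data.frame(
  IDKey = c("Name1", "Name2", "Name3"),
  X1 = c(21, 15, 17), Y1 = c(15, 18, 21),
  X2 = c(25, 35, 30), Y2 = c(10, 24, 22),
  X3 = c(NA, 27, 15), Y3 = c(NA, 45, 40),
  X4 = c(NA, NA, 32), Y4 = c(NA, NA, 55)
)

# Stack X1..X4 into X and Y1..Y4 into Y at the same time,
# so each (X, Y) pair stays on one row.
long <- reshape(
  DF,
  direction = "long",
  varying   = list(paste0("X", 1:4), paste0("Y", 1:4)),
  v.names   = c("X", "Y"),
  timevar   = "point",
  times     = 1:4,
  idvar     = "IDKey"
)

# Drop the empty pairs, then order by IDKey and point number.
NewDF <- long[!is.na(long$X) & !is.na(long$Y), ]
NewDF <- NewDF[order(NewDF$IDKey, NewDF$point), c("IDKey", "X", "Y")]
rownames(NewDF) <- NULL
```

NewDF then has one row per point and can go straight into ggplot() with
color = IDKey.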

HTH
Ulrik

Michael Reed via R-help  wrote on Sun, 16 July
2017, 10:19:

> Dear All,
>
> I need some help arranging data that was imported.
>
> The imported data frame looks something like this (the actual file is
> huge, so this is example data)
>
> DF:
> IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
> Name1  21  15  25  10
> Name2  15  18  35  24  27  45
> Name3  17  21  30  22  15  40  32  55
>
> I would like to create a new data frame with the following
>
> NewDF:
> IDKey   X   Y
> Name1  21  15
> Name1  25  10
> Name2  15  18
> Name2  35  24
> Name2  27  45
> Name3  17  21
> Name3  30  22
> Name3  15  40
> Name3  32  55
>
> With the data like this I think I can do the following
>
> ggplot(NewDF, aes(x=X, y=Y, color=IDKey)) + geom_line()
>
> and get 3 lines with the various number of points.
>
> The point is that each of the XY pairs is a data point tied to NameX.  I
> would like to rearrange the data so I can plot the points/lines by the
> IDKey.  There will be at least 2 points, but the number of points for each
> IDKey can be as many as 4.
>
> I have tried using the gather() function from the tidyverse package, but I
> can't make it work.  The issue is that I believe I need two separate gather
> statements (one for X, another for Y) to consolidate the data.  This causes
> the pairs to not stay together and the data becomes jumbled.
>
> Thoughts
> Thanks for your help
>
> Michael E. Reed
>
>
>



Re: [R] How to formulate quadratic function with interaction terms for the PLS fitting model?

2017-07-16 Thread Ng, Kelvin Sai-cheong
I see.  Thank you for the help.


On Thu, Jul 13, 2017 at 10:43 PM, Bert Gunter 
wrote:

> Below.
>
> -- Bert
> Bert Gunter
>
>
>
> On Thu, Jul 13, 2017 at 3:07 AM, Luigi Biagini 
> wrote:
> > I have two ideas about it.
> >
> > 1-
> > i) A variable is entered in quadratic form with I(variable^2):
> > plsr(octane ~ NIR + I(NIR^2), ncomp = 10, data = gasTrain,
> >      validation = "LOO")
> > You could also use a new variable NIR_sq <- NIR^2
> >
> > ii) To insert a squared variable, use the syntax I(x^2). It is very
> > important to wrap the expression in I().
>
> True, but better I believe: see ?poly.
> e.g. poly(cbind(x1,x2,x3), degree = 2, raw = TRUE) is a full quadratic
> polynomial in x1,x2,x3 .
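
As a small illustration of what poly() builds here (x1, x2, x3 below are
made-up vectors, not the gasoline data): with raw = TRUE the columns are the
plain monomials, and each column name lists the exponents of x1, x2, x3.

```r
set.seed(1)
x1 <- rnorm(6); x2 <- rnorm(6); x3 <- rnorm(6)

# Full quadratic in three variables: linear, squared and cross-product terms.
P <- poly(cbind(x1, x2, x3), degree = 2, raw = TRUE)

colnames(P)  # names give the exponents, e.g. "1.1.0" is x1*x2
ncol(P)      # 3 linear + 3 squared + 3 cross-product terms
```

For p variables this gives p + p(p+1)/2 columns, so the 401 NIR wavelengths
would expand to 81002 terms, which is probably not practical to build this
way.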
>
>
> >
> > iii) For the interaction between x and x^2 alone, use the ":"
> > operator -> x:I(x^2)
> >
> > iv) For both main effects plus their interaction, use the "*"
> > operator -> x*I(x^2)
> >
> > i) plsr(octane ~ NIR + NIR_sq, ncomp = 10, data = gasTrain,
> >    validation = "LOO")
> > ii) plsr(octane ~ NIR + I(NIR^2), ncomp = 10, data = gasTrain,
> >    validation = "LOO")
> > iii) plsr(octane ~ NIR:I(NIR^2), ncomp = 10, data = gasTrain,
> >    validation = "LOO")
> > iv) plsr(octane ~ NIR*I(NIR^2), ncomp = 10, data = gasTrain,
> >    validation = "LOO")
> >
> > 2 - For your regression, did you plan to use MARS instead of PLS?
> >
> >
> >
> >
> > Dear all,
> >> I am using the pls package of R to perform partial least squares on a
> >> set of multivariate data.  Instead of fitting a linear model, I want to
> >> fit my data with a quadratic function with interaction terms.  But I am
> >> not sure how.  I will use an example to illustrate my problem:
> >> Following the example in the PLS manual:
> >> ## Read data
> >> data(gasoline)
> >> gasTrain <- gasoline[1:50,]
> >> ## Perform PLS
> >> gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain,
> >>              validation = "LOO")
> >> where octane ~ NIR is the model that this example is fitting with.
> >> NIR is a collection of variables, i.e. the NIR spectra consist of 401
> >> diffuse reflectance measurements from 900 to 1700 nm.
> >> Instead of fitting with octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ...
> >> I want to fit the data with:
> >> octane[i] = a[0] * NIR[0,i] + a[1] * NIR[1,i] + ... +
> >> b[0]*NIR[0,i]*NIR[0,i] + b[1] * NIR[0,i]*NIR[1,i] + ...
> >> i.e. quadratic with interaction terms.
> >> But I don't know how to formulate this.
> >> May I have some help please?
> >> Thanks,
> >> Kelvin
> >
>
>



[R] Arranging column data to create plots

2017-07-16 Thread Michael Reed via R-help
Dear All,

I need some help arranging data that was imported.

The imported data frame looks something like this (the actual file is huge, so 
this is example data)

DF:
IDKey  X1  Y1  X2  Y2  X3  Y3  X4  Y4
Name1  21  15  25  10
Name2  15  18  35  24  27  45
Name3  17  21  30  22  15  40  32  55 

I would like to create a new data frame with the following

NewDF:
IDKey   X   Y
Name1  21  15
Name1  25  10
Name2  15  18
Name2  35  24
Name2  27  45
Name3  17  21
Name3  30  22
Name3  15  40
Name3  32  55

With the data like this I think I can do the following

ggplot(NewDF, aes(x=X, y=Y, color=IDKey)) + geom_line()

and get 3 lines with the various number of points.

The point is that each of the XY pairs is a data point tied to NameX.  I would 
like to rearrange the data so I can plot the points/lines by the IDKey.  There 
will be at least 2 points, but the number of points for each IDKey can be as 
many as 4.

I have tried using the gather() function from the tidyverse package, but I 
can't make it work.  The issue is that I believe I need two separate gather 
statements (one for X, another for Y) to consolidate the data.  This causes the 
pairs to not stay together and the data becomes jumbled.

Thoughts
Thanks for your help

Michael E. Reed

