[R] Logical statements and subseting data...
Hi,

I'm scratching my head as to why I can't use the subset() command to remove one line of data from a data frame. There is just one row (out of 45840) that I'd like to remove, and it can be identified using:

    dim(raw.all.clean)
    [1] 45840    10
    subset(raw.all.clean, Height.1 == 0 & Height.2 == 0)
          Sample.Name Well       SNP Allele.1 Allele.2 Size.1 Size.2 Height.1
    47068      CA0153  O02 rs2106776                       NA     NA        0
          Height.2 Pool
    47068        0    3

(Note that the row index of 47068, which is higher than the number of rows reported by dim(), is simply because I have already removed a number of rows.)

So I want to remove this one instance where Height.1 == 0 & Height.2 == 0. I'd have thought that a logical expression where Height.1 != 0 & Height.2 != 0 would have achieved this, but it doesn't seem to correctly drop out this one observation; instead it's dropping out far more observations...

    t <- subset(raw.all.clean, Height.1 != 0 & Height.2 != 0)
    dim(t)
    [1] 38150    10

Thus 7690 rows have been removed. It seems to me that the '&' operator is being interpreted as an 'OR' (|) since...

    dim(subset(raw.all.clean, Height.1 != 0))
    [1] 42152    10
    dim(subset(raw.all.clean, Height.2 != 0))
    [1] 41837    10

...and...

    dim(raw.all.clean) - dim(subset(raw.all.clean, Height.1 != 0))
    [1] 3688    0
    dim(raw.all.clean) - dim(subset(raw.all.clean, Height.2 != 0))
    [1] 4003    0
    3688 + 4003
    [1] 7691

(This is one more than the number of rows being removed, but given that there is one sample where both Height.1 and Height.2 are '0', that's fine.)

I thought I understood how logical expressions are constructed, and have gone back and read the entries on precedence, but can't work out why the above is happening. What's particularly perplexing (to me) is that the test for exact equality works, but not the test for inequality. I feel like I'm missing something blatantly obvious, but can't work out what it is.
Cheers,

Neil

--
Email - [EMAIL PROTECTED] / [EMAIL PROTECTED]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Logical statements and subseting data...
Thanks Thierry, they do both leave me with what I expected.

On Mon, Feb 25, 2008 at 2:28 PM, ONKELINX, Thierry [EMAIL PROTECTED] wrote:
> The negation of Height.1 == 0 & Height.2 == 0 was incorrect. Use
>
>     subset(raw.all.clean, !(Height.1 == 0 & Height.2 == 0))

I can see clearly how this expression works (negating the whole test), but...

> or
>
>     subset(raw.all.clean, Height.1 != 0 | Height.2 != 0)

...not how this works, since the above to me is saying Height.1 is NOT zero OR Height.2 is NOT zero, which to my mind would pick out samples where either one or the other is not equal to zero (and of course those instances where both are equal to zero)? It seems to me that & (AND) and | (OR) are used the wrong way round in this case, since the intersection of the two tests for inequality is what is required?

Neil
Re: [R] Logical statements and subseting data...
The negation of Height.1 == 0 & Height.2 == 0 was incorrect. Use

    subset(raw.all.clean, !(Height.1 == 0 & Height.2 == 0))

or

    subset(raw.all.clean, Height.1 != 0 | Height.2 != 0)

HTH,

Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
[EMAIL PROTECTED]
www.inbo.be

Do not put your faith in what statistics say until you have carefully considered what they do not say. ~William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. ~M.J.Moroney

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Neil Shephard
Sent: Monday 25 February 2008 15:21
To: r-help
Subject: [R] Logical statements and subseting data...
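The two corrected forms are equivalent by De Morgan's law: not (A and B) is the same as (not A) or (not B). A minimal sketch on made-up data may help; only the column names Height.1 and Height.2 come from the thread, while the `id` column and all values are assumptions for illustration.

```r
# Toy data: column names follow the thread, the values are invented.
raw <- data.frame(
  id       = 1:5,
  Height.1 = c(0, 0, 12, 7, 3),
  Height.2 = c(0, 5,  0, 9, 4)
)

# The row to drop: both heights are zero (only id 1).
both_zero <- subset(raw, Height.1 == 0 & Height.2 == 0)

# Two correct negations, equivalent by De Morgan's law.
keep_not <- subset(raw, !(Height.1 == 0 & Height.2 == 0))
keep_or  <- subset(raw, Height.1 != 0 | Height.2 != 0)

# The mistaken negation: it also drops rows where only one height is zero.
keep_and <- subset(raw, Height.1 != 0 & Height.2 != 0)

nrow(both_zero)               # 1
identical(keep_not, keep_or)  # TRUE
keep_or$id                    # 2 3 4 5
keep_and$id                   # 4 5
```

Swapping `&` for `|` when pushing the negation inside is what keeps ids 2 and 3, where exactly one height is zero.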
Re: [R] Logical statements and subseting data...
Neil,

Maybe this example will make things more clear to you.

    DF <- expand.grid(A = 0:1, B = 0:1)
    cbind(DF, DF$A != 0, DF$B != 0, DF$A != 0 & DF$B != 0, DF$A != 0 | DF$B != 0)
      A B DF$A != 0 DF$B != 0 DF$A != 0 & DF$B != 0 DF$A != 0 | DF$B != 0
    1 0 0     FALSE     FALSE                 FALSE                 FALSE
    2 1 0      TRUE     FALSE                 FALSE                  TRUE
    3 0 1     FALSE      TRUE                 FALSE                  TRUE
    4 1 1      TRUE      TRUE                  TRUE                  TRUE

Thierry

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Neil Shephard
Sent: Monday 25 February 2008 15:36
To: ONKELINX, Thierry
CC: r-help
Subject: Re: [R] Logical statements and subseting data...
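The counts in the original question are consistent with `&` behaving as a plain AND. The rows removed by Height.1 != 0 & Height.2 != 0 are those where at least one height is zero, and by inclusion-exclusion that is (rows with Height.1 zero) + (rows with Height.2 zero) - (rows with both zero). A small sketch on made-up data, assuming only the column names from the thread:

```r
# Toy data: values are invented, only the column names come from the thread.
raw <- data.frame(
  Height.1 = c(0, 0, 0, 12, 7, 3, 8),
  Height.2 = c(0, 5, 2,  0, 9, 4, 1)
)

z1 <- raw$Height.1 == 0   # TRUE for 3 rows
z2 <- raw$Height.2 == 0   # TRUE for 2 rows

# Rows removed by the AND-of-inequalities filter.
removed <- nrow(raw) - nrow(subset(raw, Height.1 != 0 & Height.2 != 0))

# Inclusion-exclusion: the row where both are zero is counted once, not twice.
expected <- sum(z1) + sum(z2) - sum(z1 & z2)

removed == expected   # TRUE

# The same arithmetic with the figures reported in the thread:
3688 + 4003 - 1       # 7690
```

This matches the 7690 rows reported as removed, one fewer than 3688 + 4003 because the single all-zero row appears in both counts.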