[R] Randomly drop a percent of data from a data.frame

2013-08-16 Thread Christopher Desjardins
Hi,
I have the following data.

 set.seed(6245)
 data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
 round(data,digits=3)
      x1     x2     x3     x4
1  0.482  1.320 -0.859 -0.142
2 -0.753 -0.041 -0.063  0.886
3  0.028 -0.256 -0.069  0.354
4 -0.086  0.475  0.244  0.781
5  0.690 -0.181  1.274  1.633

What I would like to do is drop 20% of the data, but I want that 20% to
come only from x3 and x4. The missing values don't have to be spread evenly,
i.e. I don't need to drop 2 from x3 and 2 from x4, or to make sure that an
observation has missing data on only one variable. I just want to drop 20%
of the data through x3 and x4 only. In other words,

      x1     x2     x3     x4
1  0.482  1.320 -0.859     NA
2 -0.753 -0.041 -0.063  0.886
3  0.028 -0.256     NA  0.354
4 -0.086  0.475     NA  0.781
5  0.690 -0.181     NA  1.633

OR

      x1     x2     x3     x4
1  0.482  1.320     NA -0.142
2 -0.753 -0.041 -0.063  0.886
3  0.028 -0.256     NA     NA
4 -0.086  0.475  0.244     NA
5  0.690 -0.181  1.274  1.633

OR

      x1     x2     x3     x4
1  0.482  1.320 -0.859 -0.142
2 -0.753 -0.041 -0.063     NA
3  0.028 -0.256 -0.069     NA
4 -0.086  0.475  0.244     NA
5  0.690 -0.181  1.274     NA

ETC. are all fine.

Any ideas how I can do this?
Chris

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Randomly drop a percent of data from a data.frame

2013-08-16 Thread arun
Hi,
Maybe this helps:
# data1 (changed `data` to `data1`)
set.seed(6245)
data1 <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
data1 <- round(data1,digits=3)

data2 <- data1

# for each of x3 and x4, keep only the values that appear in a random
# 80% sample of the pooled x3/x4 values; everything else becomes NA
data1[,3:4] <- lapply(data1[,3:4], function(x) {
  x1 <- match(x, sample(unlist(data1[,3:4]), round(0.8*length(unlist(data1[,3:4])))))
  x[is.na(x1)] <- NA
  x
})
data1
#      x1     x2     x3     x4
#1  0.482  1.320     NA -0.142
#2 -0.753 -0.041 -0.063  0.886
#3  0.028 -0.256 -0.069  0.354
#4 -0.086  0.475  0.244  0.781
#5  0.690 -0.181  1.274  1.633


#or
data2[,3:4] <- lapply(data2[,3:4], function(x) {
  x1 <- match(x, sample(unlist(data2[,3:4]), round(0.8*length(unlist(data2[,3:4])))))
  x[is.na(x1)] <- NA
  x
})
data2
#      x1     x2     x3     x4
#1  0.482  1.320 -0.859 -0.142
#2 -0.753 -0.041     NA     NA
#3  0.028 -0.256 -0.069  0.354
#4 -0.086  0.475  0.244  0.781
#5  0.690 -0.181  1.274  1.633
A.K.
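
A note on the lapply() approach in this message: the two runs quoted blank one and two values, not the four that 20% of 20 cells would call for. Because sample() is drawn separately for each column, the number of NAs appears to vary from run to run. The sketch below is one way to check that; count_nas() is a hypothetical wrapper written only for this check, and base is just a copy of the example data.

```r
set.seed(6245)
base <- round(data.frame(x1 = rnorm(5), x2 = rnorm(5),
                         x3 = rnorm(5), x4 = rnorm(5)), digits = 3)

# count_nas() is a hypothetical helper: it applies the per-column
# sample()/match() step once and returns how many cells became NA.
count_nas <- function(d) {
  d[, 3:4] <- lapply(d[, 3:4], function(x) {
    # a fresh 80% sample of the pooled x3/x4 values for this column
    keep <- sample(unlist(d[, 3:4]), round(0.8 * length(unlist(d[, 3:4]))))
    x[is.na(match(x, keep))] <- NA
    x
  })
  sum(is.na(d))
}

counts <- replicate(1000, count_nas(base))
table(counts)   # how often each NA count occurred over 1000 runs
range(counts)   # each column can lose at most 2 values, so at most 4 in total
```

If a fixed number of missing values is required, sampling the cell positions once across both columns, as in the logical-matrix replies in this thread, avoids this variation.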



Re: [R] Randomly drop a percent of data from a data.frame

2013-08-16 Thread arun


Hi,
Suppose the dataset had an odd number of columns:
set.seed(6458)
data2 <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5))
n <- prod(dim(data2))
n
#[1] 15
dummy <- rep(F,n/2)
dummy[sample(1:(n/2),n*.2)] <- T
dummy
#[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

data2[,c("x2", "x3")][matrix(dummy, nc = 2)] <- NA
#Error in `[<-.data.frame`(`*tmp*`, matrix(dummy, nc = 2), value = NA) :
#  unsupported matrix index in replacement
#In addition: Warning message:
#In matrix(dummy, nc = 2) :
#  data length [7] is not a sub-multiple or multiple of the number of rows [4]

I might do:
n1 <- 2*nrow(data2) ##for 2 columns
dummy <- rep(FALSE,n1)
dummy[sample(1:n1,n1*.2)] <- TRUE
data2[,c("x2","x3")][matrix(dummy,nc=2)] <- NA
data2
#           x1         x2         x3
#1 -0.55899744  0.6622481 -0.3305958
#2  0.12776368         NA         NA
#3 -1.09734838  0.2069539 -0.6997853
#4  0.75919499 -0.5683809  0.4752002
#5 -0.03063141 -0.7549605  2.6038635


A.K.
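
The fix above sizes the mask from the selected columns rather than from n/2. A minimal sketch of the same idea wrapped in a function follows. drop_cells() and its arguments (cols, frac, whole) are hypothetical names chosen for illustration, not part of base R or any package. With whole = TRUE the fraction is taken of all cells in the data frame, as in the original question; with whole = FALSE it is taken of the selected columns only, as in the code above.

```r
# drop_cells() is a hypothetical helper, written only for this note.
drop_cells <- function(df, cols, frac, whole = TRUE) {
  n.cand <- nrow(df) * length(cols)                   # cells allowed to become NA
  n.base <- if (whole) prod(dim(df)) else n.cand      # what frac is a fraction of
  n.drop <- round(frac * n.base)
  stopifnot(n.drop <= n.cand)                         # cannot blank more cells than exist
  mask <- matrix(FALSE, nrow = nrow(df), ncol = length(cols))
  mask[sample(n.cand, n.drop)] <- TRUE                # exactly n.drop TRUE entries
  df[, cols][mask] <- NA
  df
}

set.seed(6458)
data2 <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))

out <- drop_cells(data2, c("x2", "x3"), 0.2)          # 20% of all 15 cells
sum(is.na(out))
# [1] 3

out2 <- drop_cells(data2, c("x2", "x3"), 0.2, whole = FALSE)  # 20% of 10 cells
sum(is.na(out2))
# [1] 2
```

Building the mask with explicit nrow and ncol means its shape always matches the selected columns, so the "unsupported matrix index" error shown above should not arise whatever the number of columns.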


Re: [R] Randomly drop a percent of data from a data.frame

2013-08-16 Thread Christopher Desjardins
Hi,
Thanks for the help. What I actually ended up doing was writing a couple of
for loops, and I ended up with something that works.
Thanks.
Chris



Re: [R] Randomly drop a percent of data from a data.frame

2013-08-16 Thread Richard Kwock
Try this:

data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
data <- round(data,digits=3)

#get the total counts
n = prod(dim(data))

#set up a dummy array/matrix
dummy <- rep(F, n/2)
dummy[sample(1:(n/2), n*.2)] <- T

# 5x2 dummy matrix with T and F
matrix(dummy, nc = 2)

#subset the T indices in x3 and x4 and replace with NAs
data[,c("x3", "x4")][matrix(dummy, nc = 2)] <- NA

data

#      x1     x2     x3     x4
#1 -1.310  0.659     NA  0.510
#2 -3.003 -0.004     NA     NA
#3  0.584  0.310     NA -0.087
#4  1.644 -2.792 -0.390 -0.382
#5 -1.791  0.840  1.137  0.820

Richard
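
The replacement step above relies on indexing a data frame with a logical matrix of the same shape as the selected columns. A small deterministic sketch may make the mechanics easier to follow; the data frame df and the hand-written mask are made up for illustration and are not from the thread.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# matrix() fills column by column: the first three values go with
# column b, the last three with column c
mask <- matrix(c(TRUE,  FALSE, FALSE,
                 FALSE, FALSE, TRUE), ncol = 2)

# TRUE entries of the mask mark the cells of b and c to blank
df[, c("b", "c")][mask] <- NA
df
#   a  b  c
# 1 1 NA  7
# 2 2  5  8
# 3 3  6 NA
```

Column a is never touched because it is not part of the selection, which is the same reason x1 and x2 stay complete in the example above.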

