Re: [R] sampling rows with values never sampled before

2015-06-23 Thread Jon Skoien
If df is the data.frame with values and you want nn samples, then this 
is a slightly different approach:


# example data.frame:
df = data.frame(a1 = sample(1:20, 50, replace = TRUE),
                a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE),
                a3 = sample(seq(0.3, 20, length.out = 20), 50, replace = TRUE))

nrow = dim(df)[1] # 50
ncol = dim(df)[2]  # 3

# start by randomizing the order in your data.frame
randomOrder = sample(1:nrow, nrow, replace = FALSE)
dff = df[randomOrder,]

# find and remove all duplicates from all columns. With this you will
# only keep the first instance of any unique value:

rem = NULL
for (ic in 1:ncol) rem = c(rem, which(duplicated(dff[, ic])))
if (length(rem) > 0) dff = dff[-unique(rem),]

# Reduce to the length you need
if (dim(dff)[1] > nn)  res = dff[1:nn,] else res = dff
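
The same steps can be wrapped in a small function. This is only a sketch: the name sample_unique_rows, the seed and the two-column example data are made up for illustration, and the logic is the shuffle / drop duplicates / truncate sequence from the code above.

```r
# Hypothetical helper (name is illustrative) wrapping the steps above
sample_unique_rows <- function(df, nn) {
  # shuffle the rows so that the "first instance" of a value is a random one
  dff <- df[sample(nrow(df)), , drop = FALSE]
  # flag every row that repeats, in any column, a value seen in an earlier row
  dup <- Reduce(`|`, lapply(dff, duplicated))
  dff <- dff[!dup, , drop = FALSE]
  # keep at most nn rows
  head(dff, nn)
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(a1 = sample(1:20, 50, replace = TRUE),
                 a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE))
res <- sample_unique_rows(df, nn = 5)
res
```

Note that the duplicates are flagged before any row is removed, so a row can be dropped because it shares a value with an earlier row that is itself dropped. The result is always free of repeated values, but it may contain fewer rows than the largest possible sample.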

I am not sure how this scales if you have really big data, or whether 
you could run into FAQ 7.31 problems (floating-point values that look 
equal but are not), depending on how you fill your data.frame.
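
To illustrate the FAQ 7.31 concern: duplicated() compares doubles exactly, so two values that print the same but were computed differently are not treated as duplicates. The sketch below uses made-up numbers, and rounding to 10 digits is just one possible workaround, not something taken from the code above.

```r
# two values that both print as 0.3 but differ in the last bits
x <- c(0.3, 0.1 + 0.2)

same_exact   <- x[1] == x[2]             # exact comparison of the two doubles
dup_exact    <- duplicated(x)            # duplicated() also compares exactly
dup_rounded  <- duplicated(round(x, 10)) # compare after rounding instead

same_exact
dup_exact
dup_rounded
```

If the columns are filled by sampling from one fixed vector, as in the example data.frame, identical values are bit-for-bit identical and this does not arise.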


Cheers,
Jon

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Jon Olav Skøien
Joint Research Centre - European Commission
Institute for Environment and Sustainability (IES)
Climate Risk Management Unit

Via Fermi 2749, TP 100-01,  I-21027 Ispra (VA), ITALY

jon.sko...@jrc.ec.europa.eu
Tel:  +39 0332 789205

Disclaimer: Views expressed in this email are those of the individual and do 
not necessarily represent official views of the European Commission.


Re: [R] sampling rows with values never sampled before

2015-06-22 Thread Adams, Jean
Mike,

There may be a more efficient way to do this, but this works on your
example.

# mix up the order of the rows
mix <- dat[order(runif(dim(dat)[1])), ]

# get rid of duplicate x1s and x2s
sub <- mix[!duplicated(mix[, "x1"]) & !duplicated(mix[, "x2"]), ]
sub
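
A quick way to check the result on the example data is sketched below. The seed is arbitrary, sample() is used for the shuffle instead of order(runif()), and drop = FALSE is added so that a single surviving row is still a matrix.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each = 5))

# mix up the order of the rows
mix <- dat[sample(nrow(dat)), ]

# keep a row only if both its x1 and its x2 appear for the first time
keep <- !duplicated(mix[, "x1"]) & !duplicated(mix[, "x2"])
sub <- mix[keep, , drop = FALSE]

# no value may be repeated within either column
anyDuplicated(sub[, "x1"]) == 0
anyDuplicated(sub[, "x2"]) == 0
```

The first shuffled row is always kept, so sub has at least one row. On this data it has at most three rows (the number of unique x2 values), and depending on the shuffle it can have fewer.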

Jean



Re: [R] sampling rows with values never sampled before

2015-06-22 Thread C W
Hi Jean,

Thanks!

Daniel,
Yes, you are absolutely right.  I want sampled vectors to be as different
as possible.

I added a little more to the earlier data set.
      x1  x2  x3
 [1,]  1 3.7 2.1
 [2,]  2 3.7 5.3
 [3,]  3 3.7 6.2
 [4,]  4 3.7 8.9
 [5,]  5 3.7 4.1
 [6,]  1 2.9 2.1
 [7,]  2 2.9 5.3
 [8,]  3 2.9 6.2
 [9,]  4 2.9 8.9
[10,]  5 2.9 4.1
[11,]  1 5.2 2.1
[12,]  2 5.2 5.3
[13,]  3 5.2 6.2
[14,]  4 5.2 8.9
[15,]  5 5.2 4.1

If I sampled rows 1, 6, and 11, solving the system of equations would not
be possible.  So, I am avoiding similar vectors.
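
The problem with rows 1, 6 and 11 can be seen from the rank of the sampled matrix. The sketch below rebuilds the table above with rep(), which is an assumption about how the data were generated, and the "good" row choice 1, 7, 13 is just one illustrative alternative.

```r
# rebuild the 15 x 3 table shown above (construction via rep() is assumed)
dat <- cbind(x1 = rep(1:5, 3),
             x2 = rep(c(3.7, 2.9, 5.2), each = 5),
             x3 = rep(c(2.1, 5.3, 6.2, 8.9, 4.1), 3))

bad  <- dat[c(1, 6, 11), ]   # x1 and x3 are constant across these rows
good <- dat[c(1, 7, 13), ]   # no value repeated within any column

rank_bad  <- qr(bad)$rank    # below 3: the matrix cannot be inverted
rank_good <- qr(good)$rank   # full rank

rank_bad
rank_good
```

In the "bad" selection the x1 and x3 columns are proportional to each other, which is what makes the system unsolvable.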

Thanks,

Mike




Re: [R] sampling rows with values never sampled before

2015-06-22 Thread Daniel Nordlund

On 6/22/2015 9:42 AM, C W wrote:

Hello R list,

I have a question about sampling unique coordinate values.

Here's what my data looks like:


dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each=5))
dat

      x1  x2
 [1,]  1 3.7
 [2,]  2 3.7
 [3,]  3 3.7
 [4,]  4 3.7
 [5,]  5 3.7
 [6,]  1 2.9
 [7,]  2 2.9
 [8,]  3 2.9
 [9,]  4 2.9
[10,]  5 2.9
[11,]  1 5.2
[12,]  2 5.2
[13,]  3 5.2
[14,]  4 5.2
[15,]  5 5.2


If I sampled (1, 3.7), then, I don't want (1, 2.9) or (2, 3.7).

I want to avoid repeating either the first or the second coordinate,
because that leads to an undefined matrix inversion.

I thought of using sampling(), but I am not sure how to apply it to a
data frame.

Thanks in advance,

Mike



I am not sure you gave us enough information to solve your real world 
problem.  But I have a few comments and a potential solution.


1. In your example the unique values in x1 are completely crossed 
with the unique values in x2.
2. Since you don't want duplicates of either number, the maximum 
number of samples that you can take is the minimum number of unique 
values in either vector, x1 or x2 (in this case x2 with 3 unique values).
3. Sample without replacement from the smallest set of unique values first.
4. Sample without replacement from the larger set second.

x <- 1:5
xx <- c(3.7, 2.9, 5.2)
s2 <- sample(xx,2, replace=FALSE)
s1 <- sample(x,2, replace=FALSE)
samp <- cbind(s1,s2)

samp
     s1  s2
[1,]  5 3.7
[2,]  1 5.2
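
The same idea can be pushed to the largest possible sample by computing the limit from point 2 instead of hard-coding 2. This is only a sketch for the completely crossed case; n_max and the seed are illustrative names and values.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
x  <- 1:5
xx <- c(3.7, 2.9, 5.2)

# the largest sample without repeats is limited by the smaller set
n_max <- min(length(unique(x)), length(unique(xx)))

s2 <- sample(xx, n_max, replace = FALSE)  # smaller set first
s1 <- sample(x,  n_max, replace = FALSE)  # larger set second
samp <- cbind(s1, s2)
samp
```

This only returns rows that exist in the data when every x1 value occurs together with every x2 value.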


Your actual data is probably larger, and the unique values in each 
vector may not be completely crossed, in which case the task is a little 
harder.  In that case, you could remove values from your data as you 
sample.  This may not be efficient, but it will work.


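smpl <- function(dat, size){
  mysamp <- numeric(0)
  for(i in 1:size) {
    # draw one row at random from the rows still available
    s <- dat[sample(nrow(dat),1),]
    mysamp <- rbind(mysamp,s, deparse.level=0)
    # drop every row that shares a value with the sampled row;
    # drop=FALSE keeps dat a matrix even if a single row remains
    dat <- dat[!(dat[,1]==s[1] | dat[,2]==s[2]), , drop=FALSE]
  }
  mysamp
}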


This is just an example of how you might approach your real world 
problem.  There is no error checking, and for large samples it may not 
scale well.
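
For what it is worth, a variant with some basic error checking might look like the sketch below. The name smpl_checked, the warning text and the seed are made up for illustration. It stops early once no eligible rows remain instead of failing.

```r
# Hypothetical variant of smpl() with basic checks (illustrative only)
smpl_checked <- function(dat, size) {
  stopifnot(is.matrix(dat), ncol(dat) >= 2, size >= 1)
  picked <- dat[0, , drop = FALSE]          # empty matrix with the same columns
  for (i in seq_len(size)) {
    if (nrow(dat) == 0) {
      warning("ran out of eligible rows after ", nrow(picked), " draws")
      break
    }
    # draw one of the remaining rows
    s <- dat[sample.int(nrow(dat), 1), , drop = FALSE]
    picked <- rbind(picked, s)
    # remove every row sharing its first or second value with the draw
    dat <- dat[!(dat[, 1] == s[1, 1] | dat[, 2] == s[1, 2]), , drop = FALSE]
  }
  picked
}

set.seed(123)  # arbitrary seed, only for reproducibility
dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each = 5))
res <- smpl_checked(dat, 3)
res
```

Asking for more rows than the data allow, for example smpl_checked(dat, 5) on this data, returns the three rows that can be drawn and issues a warning.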



Hope this is helpful,

Dan

--
Daniel Nordlund
Bothell, WA USA
