[R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread Arnaud Gaboury
Dear list,

I am still struggling with something that should be easy: I compare two data 
frames with a lot of common rows and want to keep only rows that are NOT in 
both data frames, unique.

Here are an example of these data frame.

reported -
structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L), 
.Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), class = factor), 
Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 
1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), 
.Names = c(Product, Price, Nbr.Lots), row.names = c(1L, 2L, 3L, 4L, 6L, 
7L, 5L, 10L, 8L, 9L), class = data.frame)

exportfile -
structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, 
GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 
17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 
1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(NA, 9L), class = 
data.frame)

I can rbind() them, thus resulting in one data frame with duplicated row, but I 
have no idea how to delete duplicated rows. I have tried plyaing with unique(), 
duplicated with no success

v-rbind(exportfile,reported)
v -
structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, 
Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa, 
Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, 
ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 
24.81, 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45, 
17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 
40, -1, -1, -1, 1, -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = 
c(Product, 
Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5, 
6, 7, 8, 9, 11, 21, 31, 41, 61, 71, 51, 
10, 81, 91), class = data.frame)


TY for your help

Arnaud Gaboury
 
A2CT2 Ltd.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread jim holtman
is this what you want:

 v - rbind(reported, exportfile)
 v[!duplicated(v), ]
   ProductPrice Nbr.Lots
1Cocoa  2331.00  -61
2Cocoa  2356.00  -61
3Cocoa  2440.005
4Cocoa  2450.001
6 Coffee C   204.55   40
7 Coffee C   205.45   40
5   GC 17792.00   -1
10 Sugar No 1124.81   -1
8   ZS  1273.50   -1
9   ZS  1276.251
13   Cocoa  2440.006



On Mon, Feb 27, 2012 at 12:36 PM, Arnaud Gaboury
arnaud.gabo...@a2ct2.com wrote:
 Dear list,

 I am still struggling with something that should be easy: I compare two data 
 frames with a lot of common rows and want to keep only rows that are NOT in 
 both data frames, unique.

 Here are an example of these data frame.

 reported -
 structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L), 
 .Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), class = 
 factor), Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 
 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 40L, -1L, -1L, -1L, 
 1L)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1L, 2L, 3L, 
 4L, 6L, 7L, 5L, 10L, 8L, 9L), class = data.frame)

 exportfile -
 structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, 
 GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 
 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 
 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(NA, 9L), class 
 = data.frame)

 I can rbind() them, thus resulting in one data frame with duplicated row, but 
 I have no idea how to delete duplicated rows. I have tried plyaing with 
 unique(), duplicated with no success

 v-rbind(exportfile,reported)
 v -
 structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C,
 Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa,
 Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11,
 ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792,
 24.81, 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45,
 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40,
 40, -1, -1, -1, 1, -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = 
 c(Product,
 Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5,
 6, 7, 8, 9, 11, 21, 31, 41, 61, 71, 51,
 10, 81, 91), class = data.frame)


 TY for your help

 Arnaud Gaboury

 A2CT2 Ltd.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread Arnaud Gaboury
No, but I tried your way too.

In fact, the only three unique rows are these ones:

 Product Price Nbr.Lots
   Cocoa  24405
   Cocoa  24501
   Cocoa  24406

Here is a dirty working trick I found :

 df-merge(exportfile,reported,all.y=T)
 df1-merge(exportfile,reported)
 dff1-do.call(paste,df)
 dff-do.call(paste,df)
 dff1-do.call(paste,df1)
 df[!dff %in% dff1,]
  Product Price Nbr.Lots
3   Cocoa  24405
4   Cocoa  24501
 

My two problems are : I do think it is not so a clean code, then I won't know 
by advance which of my two df will have the greates dimension (I can add some 
lines to deal with it, but again, seems very heavy).

I hoped I could find a better solution.


A2CT2 Ltd.


-Original Message-
From: jim holtman [mailto:jholt...@gmail.com] 
Sent: lundi 27 février 2012 18:42
To: Arnaud Gaboury
Cc: r-help@r-project.org
Subject: Re: [R] compare two data frames of different dimensions and only keep 
unique rows

is this what you want:

 v - rbind(reported, exportfile)
 v[!duplicated(v), ]
   ProductPrice Nbr.Lots
1Cocoa  2331.00  -61
2Cocoa  2356.00  -61
3Cocoa  2440.005
4Cocoa  2450.001
6 Coffee C   204.55   40
7 Coffee C   205.45   40
5   GC 17792.00   -1
10 Sugar No 1124.81   -1
8   ZS  1273.50   -1
9   ZS  1276.251
13   Cocoa  2440.006



On Mon, Feb 27, 2012 at 12:36 PM, Arnaud Gaboury arnaud.gabo...@a2ct2.com 
wrote:
 Dear list,

 I am still struggling with something that should be easy: I compare two data 
 frames with a lot of common rows and want to keep only rows that are NOT in 
 both data frames, unique.

 Here are an example of these data frame.

 reported -
 structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 
 5L, 5L), .Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), 
 class = factor), Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 
 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 
 40L, -1L, -1L, -1L, 1L)), .Names = c(Product, Price, Nbr.Lots), 
 row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 8L, 9L), class = 
 data.frame)

 exportfile -
 structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, 
 Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 
 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = 
 c(-61, -61, 6, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, 
 Nbr.Lots), row.names = c(NA, 9L), class = data.frame)

 I can rbind() them, thus resulting in one data frame with duplicated 
 row, but I have no idea how to delete duplicated rows. I have tried 
 plyaing with unique(), duplicated with no success

 v-rbind(exportfile,reported)
 v -
 structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, 
 Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa, 
 Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, 
 ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 
 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 
 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1, 
 -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = c(Product, 
 Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5, 6, 7, 
 8, 9, 11, 21, 31, 41, 61, 71, 51, 10, 81, 91), 
 class = data.frame)


 TY for your help

 Arnaud Gaboury

 A2CT2 Ltd.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread Rui Barradas
Hello,

 
 In fact, the only three unique rows are these ones:
 
  Product Price Nbr.Lots
Cocoa  24405
Cocoa  24501
Cocoa  24406
 

The code below doesn't give the three in one step.

(ixmat - apply(reported, 1, function(x) apply(exportfile, 1, function(y)
any(x != y
reported[apply(ixmat, 2, all), ]
exportfile[apply(ixmat, 1, all), ]

But it does give all three.

Problem: it's inefficient. If n1 - nrow(reported) and  n2 -
nrow(exportfile),
to form the index matrix alone it will make ncol times n1*n2 comparisons.
And the matrix dims are n1, n2 ...

Anyway, I hope it gives ideas,

Rui Barradas




--
View this message in context: 
http://r.789695.n4.nabble.com/compare-two-data-frames-of-different-dimensions-and-only-keep-unique-rows-tp4425379p4425605.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread Petr Savicky
On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
 No, but I tried your way too.
 
 In fact, the only three unique rows are these ones:
 
  Product Price Nbr.Lots
Cocoa  24405
Cocoa  24501
Cocoa  24406
 
 Here is a dirty working trick I found :
 
  df-merge(exportfile,reported,all.y=T)
  df1-merge(exportfile,reported)
  dff1-do.call(paste,df)
  dff-do.call(paste,df)
  dff1-do.call(paste,df1)
  df[!dff %in% dff1,]
   Product Price Nbr.Lots
 3   Cocoa  24405
 4   Cocoa  24501
  
 
 My two problems are : I do think it is not so a clean code, then I won't know 
 by advance which of my two df will have the greates dimension (I can add some 
 lines to deal with it, but again, seems very heavy).

Hi.

Try the following.

  setdiffDF - function(A, B)
  {
  A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ]
  }

  df1 - setdiffDF(reported, exportfile)
  df2 - setdiffDF(exportfile, reported)
  rbind(df1, df2)

I obtained

 Product Price Nbr.Lots
  3Cocoa  24405
  4Cocoa  24501
  31   Cocoa  24406

Is this correct? I see the row

  Cocoa  2440.006

only in exportfile and not in reported.

The trick with paste() is not a bad idea. A variant of
it is used also in the base function duplicated.matrix(),
since it contains

  apply(x, MARGIN, function(x) paste(x, collapse = \r))

If speed is critical, then possibly the paste() trick
written for the whole columns, for example

  paste(df[[1]], df[[2]], df[[3]], sep=\r)

and then setdiff() can be better.

Hope this helps.

Petr Savicky.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] compare two data frames of different dimensions and only keep unique rows

2012-02-27 Thread Rui Barradas
Hello,

I've made Petr's solution a bit more general


Petr Savicky wrote
 
 On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote:
 No, but I tried your way too.
 
 In fact, the only three unique rows are these ones:
 
  Product Price Nbr.Lots
Cocoa  24405
Cocoa  24501
Cocoa  24406
 
 Here is a dirty working trick I found :
 
  df-merge(exportfile,reported,all.y=T)
  df1-merge(exportfile,reported)
  dff1-do.call(paste,df)
  dff-do.call(paste,df)
  dff1-do.call(paste,df1)
  df[!dff %in% dff1,]
   Product Price Nbr.Lots
 3   Cocoa  24405
 4   Cocoa  24501
  
 
 My two problems are : I do think it is not so a clean code, then I won't
 know by advance which of my two df will have the greates dimension (I can
 add some lines to deal with it, but again, seems very heavy).
 
 Hi.
 
 Try the following.
 
   setdiffDF - function(A, B)
   {
   A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ]
   }
 
   df1 - setdiffDF(reported, exportfile)
   df2 - setdiffDF(exportfile, reported)
   rbind(df1, df2)
 
 I obtained
 
  Product Price Nbr.Lots
   3Cocoa  24405
   4Cocoa  24501
   31   Cocoa  24406
 
 Is this correct? I see the row
 
   Cocoa  2440.006
 
 only in exportfile and not in reported.
 
 The trick with paste() is not a bad idea. A variant of
 it is used also in the base function duplicated.matrix(),
 since it contains
 
   apply(x, MARGIN, function(x) paste(x, collapse = \r))
 
 If speed is critical, then possibly the paste() trick
 written for the whole columns, for example
 
   paste(df[[1]], df[[2]], df[[3]], sep=\r)
 
 and then setdiff() can be better.
 
 Hope this helps.
 
 Petr Savicky.
 
 __
 R-help@ mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 

It produces the symmetric difference for vectors, matrices, data.frames and
(so-so tested) lists.

#-
# First the set difference

`%-%` - function(x, y) UseMethod(%-%)
`%-%.default` - function(x, y){
f - function(A, B)
!duplicated(c(B, A))[length(B) + 1:length(A)]
ix - f(x, y)
x[ix]
}
`%-%.matrix` - `%-%.data.frame` - function(x, y){
f - function(A, B)
!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)]
ix - f(x, y)
x[ix, ]
}
`%-%.list` - function(x, y){
f - function(A, B)
if(class(A) == class(B)) A %-% B
lapply(y, function(Y) lapply(x, f, Y))
}

# Then the set symmetric difference
symdiff - function(x, y)  UseMethod(symdiff)
symdiff.default - function(x, y)
c(x %-% y, y %-% x)
symdiff.matrix - symdiff.data.frame - function(x, y){
xclass - class(x)
res - rbind(x %-% y, y %-% x)
class(res) - xclass
res
}
symdiff.list - function(x, y){
f - function(A, B)
if(class(A) == class(B)) symdiff(A, B)
lapply(y, function(Y) lapply(x, f, Y))
}

# Test it with data.frames first (the OP data)

reported %-% exportfile
exportfile %-% reported

symdiff(reported, exportfile)
symdiff(exportfile, reported)

#-
# And some other data types

x - 1:5
y - 3:8
x %-% y
y %-% x
symdiff(x, y)
symdiff(y, x)

X - list(a=x, rp=reported)
Y - list(b=y, ef=exportfile)
X %-% Y
Y %-% X
symdiff(X, Y)
symdiff(Y, X)

P.S. This question seems to pop-up repeatedly

Rui Barradas


--
View this message in context: 
http://r.789695.n4.nabble.com/compare-two-data-frames-of-different-dimensions-and-only-keep-unique-rows-tp4425379p4426607.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.