[R] compare two data frames of different dimensions and only keep unique rows
Dear list, I am still struggling with something that should be easy: I compare two data frames with a lot of common rows and want to keep only rows that are NOT in both data frames, unique. Here are an example of these data frame. reported - structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L), .Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), class = factor), Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 8L, 9L), class = data.frame) exportfile - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(NA, 9L), class = data.frame) I can rbind() them, thus resulting in one data frame with duplicated row, but I have no idea how to delete duplicated rows. I have tried plyaing with unique(), duplicated with no success v-rbind(exportfile,reported) v - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1, -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 21, 31, 41, 61, 71, 51, 10, 81, 91), class = data.frame) TY for your help Arnaud Gaboury A2CT2 Ltd. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compare two data frames of different dimensions and only keep unique rows
is this what you want: v - rbind(reported, exportfile) v[!duplicated(v), ] ProductPrice Nbr.Lots 1Cocoa 2331.00 -61 2Cocoa 2356.00 -61 3Cocoa 2440.005 4Cocoa 2450.001 6 Coffee C 204.55 40 7 Coffee C 205.45 40 5 GC 17792.00 -1 10 Sugar No 1124.81 -1 8 ZS 1273.50 -1 9 ZS 1276.251 13 Cocoa 2440.006 On Mon, Feb 27, 2012 at 12:36 PM, Arnaud Gaboury arnaud.gabo...@a2ct2.com wrote: Dear list, I am still struggling with something that should be easy: I compare two data frames with a lot of common rows and want to keep only rows that are NOT in both data frames, unique. Here are an example of these data frame. reported - structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L), .Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), class = factor), Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 8L, 9L), class = data.frame) exportfile - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(NA, 9L), class = data.frame) I can rbind() them, thus resulting in one data frame with duplicated row, but I have no idea how to delete duplicated rows. I have tried plyaing with unique(), duplicated with no success v-rbind(exportfile,reported) v - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1, -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 21, 31, 41, 61, 71, 51, 10, 81, 91), class = data.frame) TY for your help Arnaud Gaboury A2CT2 Ltd. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compare two data frames of different dimensions and only keep unique rows
No, but I tried your way too. In fact, the only three unique rows are these ones: Product Price Nbr.Lots Cocoa 24405 Cocoa 24501 Cocoa 24406 Here is a dirty working trick I found : df-merge(exportfile,reported,all.y=T) df1-merge(exportfile,reported) dff1-do.call(paste,df) dff-do.call(paste,df) dff1-do.call(paste,df1) df[!dff %in% dff1,] Product Price Nbr.Lots 3 Cocoa 24405 4 Cocoa 24501 My two problems are : I do think it is not so a clean code, then I won't know by advance which of my two df will have the greates dimension (I can add some lines to deal with it, but again, seems very heavy). I hoped I could find a better solution. A2CT2 Ltd. -Original Message- From: jim holtman [mailto:jholt...@gmail.com] Sent: lundi 27 février 2012 18:42 To: Arnaud Gaboury Cc: r-help@r-project.org Subject: Re: [R] compare two data frames of different dimensions and only keep unique rows is this what you want: v - rbind(reported, exportfile) v[!duplicated(v), ] ProductPrice Nbr.Lots 1Cocoa 2331.00 -61 2Cocoa 2356.00 -61 3Cocoa 2440.005 4Cocoa 2450.001 6 Coffee C 204.55 40 7 Coffee C 205.45 40 5 GC 17792.00 -1 10 Sugar No 1124.81 -1 8 ZS 1273.50 -1 9 ZS 1276.251 13 Cocoa 2440.006 On Mon, Feb 27, 2012 at 12:36 PM, Arnaud Gaboury arnaud.gabo...@a2ct2.com wrote: Dear list, I am still struggling with something that should be easy: I compare two data frames with a lot of common rows and want to keep only rows that are NOT in both data frames, unique. Here are an example of these data frame. reported - structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L), .Label = c(Cocoa, Coffee C, GC, Sugar No 11, ZS), class = factor), Price = c(2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L, 5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 8L, 9L), class = data.frame) exportfile - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(NA, 9L), class = data.frame) I can rbind() them, thus resulting in one data frame with duplicated row, but I have no idea how to delete duplicated rows. I have tried plyaing with unique(), duplicated with no success v-rbind(exportfile,reported) v - structure(list(Product = c(Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS, Cocoa, Cocoa, Cocoa, Cocoa, Coffee C, Coffee C, GC, Sugar No 11, ZS, ZS), Price = c(2331, 2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25, 2331, 2356, 2440, 2450, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1, -61, -61, 5, 1, 40, 40, -1, -1, -1, 1)), .Names = c(Product, Price, Nbr.Lots), row.names = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 21, 31, 41, 61, 71, 51, 10, 81, 91), class = data.frame) TY for your help Arnaud Gaboury A2CT2 Ltd. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compare two data frames of different dimensions and only keep unique rows
Hello, In fact, the only three unique rows are these ones: Product Price Nbr.Lots Cocoa 24405 Cocoa 24501 Cocoa 24406 The code below doesn't give the three in one step. (ixmat - apply(reported, 1, function(x) apply(exportfile, 1, function(y) any(x != y reported[apply(ixmat, 2, all), ] exportfile[apply(ixmat, 1, all), ] But it does give all three. Problem: it's inefficient. If n1 - nrow(reported) and n2 - nrow(exportfile), to form the index matrix alone it will make ncol times n1*n2 comparisons. And the matrix dims are n1, n2 ... Anyway, I hope it gives ideas, Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/compare-two-data-frames-of-different-dimensions-and-only-keep-unique-rows-tp4425379p4425605.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compare two data frames of different dimensions and only keep unique rows
On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote: No, but I tried your way too. In fact, the only three unique rows are these ones: Product Price Nbr.Lots Cocoa 24405 Cocoa 24501 Cocoa 24406 Here is a dirty working trick I found : df-merge(exportfile,reported,all.y=T) df1-merge(exportfile,reported) dff1-do.call(paste,df) dff-do.call(paste,df) dff1-do.call(paste,df1) df[!dff %in% dff1,] Product Price Nbr.Lots 3 Cocoa 24405 4 Cocoa 24501 My two problems are : I do think it is not so a clean code, then I won't know by advance which of my two df will have the greates dimension (I can add some lines to deal with it, but again, seems very heavy). Hi. Try the following. setdiffDF - function(A, B) { A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ] } df1 - setdiffDF(reported, exportfile) df2 - setdiffDF(exportfile, reported) rbind(df1, df2) I obtained Product Price Nbr.Lots 3Cocoa 24405 4Cocoa 24501 31 Cocoa 24406 Is this correct? I see the row Cocoa 2440.006 only in exportfile and not in reported. The trick with paste() is not a bad idea. A variant of it is used also in the base function duplicated.matrix(), since it contains apply(x, MARGIN, function(x) paste(x, collapse = \r)) If speed is critical, then possibly the paste() trick written for the whole columns, for example paste(df[[1]], df[[2]], df[[3]], sep=\r) and then setdiff() can be better. Hope this helps. Petr Savicky. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compare two data frames of different dimensions and only keep unique rows
Hello, I've made Petr's solution a bit more general Petr Savicky wrote On Mon, Feb 27, 2012 at 07:10:57PM +0100, Arnaud Gaboury wrote: No, but I tried your way too. In fact, the only three unique rows are these ones: Product Price Nbr.Lots Cocoa 24405 Cocoa 24501 Cocoa 24406 Here is a dirty working trick I found : df-merge(exportfile,reported,all.y=T) df1-merge(exportfile,reported) dff1-do.call(paste,df) dff-do.call(paste,df) dff1-do.call(paste,df1) df[!dff %in% dff1,] Product Price Nbr.Lots 3 Cocoa 24405 4 Cocoa 24501 My two problems are : I do think it is not so a clean code, then I won't know by advance which of my two df will have the greates dimension (I can add some lines to deal with it, but again, seems very heavy). Hi. Try the following. setdiffDF - function(A, B) { A[!duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)], ] } df1 - setdiffDF(reported, exportfile) df2 - setdiffDF(exportfile, reported) rbind(df1, df2) I obtained Product Price Nbr.Lots 3Cocoa 24405 4Cocoa 24501 31 Cocoa 24406 Is this correct? I see the row Cocoa 2440.006 only in exportfile and not in reported. The trick with paste() is not a bad idea. A variant of it is used also in the base function duplicated.matrix(), since it contains apply(x, MARGIN, function(x) paste(x, collapse = \r)) If speed is critical, then possibly the paste() trick written for the whole columns, for example paste(df[[1]], df[[2]], df[[3]], sep=\r) and then setdiff() can be better. Hope this helps. Petr Savicky. __ R-help@ mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. It produces the symmetric difference for vectors, matrices, data.frames and (so-so tested) lists. #- # First the set difference `%-%` - function(x, y) UseMethod(%-%) `%-%.default` - function(x, y){ f - function(A, B) !duplicated(c(B, A))[length(B) + 1:length(A)] ix - f(x, y) x[ix] } `%-%.matrix` - `%-%.data.frame` - function(x, y){ f - function(A, B) !duplicated(rbind(B, A))[nrow(B) + 1:nrow(A)] ix - f(x, y) x[ix, ] } `%-%.list` - function(x, y){ f - function(A, B) if(class(A) == class(B)) A %-% B lapply(y, function(Y) lapply(x, f, Y)) } # Then the set symmetric difference symdiff - function(x, y) UseMethod(symdiff) symdiff.default - function(x, y) c(x %-% y, y %-% x) symdiff.matrix - symdiff.data.frame - function(x, y){ xclass - class(x) res - rbind(x %-% y, y %-% x) class(res) - xclass res } symdiff.list - function(x, y){ f - function(A, B) if(class(A) == class(B)) symdiff(A, B) lapply(y, function(Y) lapply(x, f, Y)) } # Test it with data.frames first (the OP data) reported %-% exportfile exportfile %-% reported symdiff(reported, exportfile) symdiff(exportfile, reported) #- # And some other data types x - 1:5 y - 3:8 x %-% y y %-% x symdiff(x, y) symdiff(y, x) X - list(a=x, rp=reported) Y - list(b=y, ef=exportfile) X %-% Y Y %-% X symdiff(X, Y) symdiff(Y, X) P.S. This question seems to pop-up repeatedly Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/compare-two-data-frames-of-different-dimensions-and-only-keep-unique-rows-tp4425379p4426607.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.