Hi Henrik,

Thanks for pointing out the diffobj package and the clear example. Nice!
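
In case it helps anyone who finds this thread later, here is a minimal sketch of the "a bit fancier" comparison Bill describes in the quoted thread below: numeric columns are allowed to differ by a small tolerance, everything else has to match exactly. The helper name mismatched_rows() and the default tolerance are my own assumptions for illustration, not part of the thread or of any package, and the NA handling (NA in both = match, NA in one = mismatch) is just one reasonable choice.

```r
# Hypothetical helper (not from the thread or any package): return the row
# numbers where two data.frames differ. Numeric columns may differ by up to
# `tol`; all other columns are compared exactly after as.character().
mismatched_rows <- function(x1, x2, tol = 1e-8) {
  stopifnot(is.data.frame(x1), is.data.frame(x2),
            identical(dim(x1), dim(x2)),
            identical(names(x1), names(x2)))

  differs <- matrix(FALSE, nrow = nrow(x1), ncol = ncol(x1))
  for (j in seq_along(x1)) {
    a <- x1[[j]]
    b <- x2[[j]]
    if (is.numeric(a) && is.numeric(b)) {
      # Equal values (including matching Inf) or values within tol count as a match.
      d <- !(a == b | abs(a - b) <= tol)
    } else {
      d <- as.character(a) != as.character(b)
    }
    # Assumed NA policy: NA in both is a match, NA in only one is a mismatch.
    d[is.na(a) & is.na(b)] <- FALSE
    d[xor(is.na(a), is.na(b))] <- TRUE
    differs[, j] <- d
  }
  which(rowSums(differs) > 0)
}

# Bill's example data from the thread, plus a tiny numeric wobble in row 1
# that should be ignored under the tolerance.
x1 <- data.frame(X = 1:5, Y = rep(c("A", "B"), c(3, 2)))
x2 <- data.frame(X = c(1 + 1e-12, 2, -3, -4, 5), Y = rep(c("A", "B"), c(2, 3)))
bad <- mismatched_rows(x1, x2)
print(bad)
```

With real data the two inputs would come from readRDS() on each file; length(mismatched_rows(x1, x2)) then gives the count of non-matching records.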

On Sun, Jan 28, 2018 at 6:22 PM, Marsh Hardy ARA/RISK <[email protected]> wrote:
> Thanks, I think I've found the most succinct expression of differences in
> two data.frames...
>
> length(which( rowSums( x1 != x2 ) > 0))
>
> gives a count of the # of records in two data.frames that do not match.
>
> //
> ________________________________________
> From: Henrik Bengtsson [[email protected]]
> Sent: Sunday, January 28, 2018 11:12 AM
> To: Ulrik Stervbo
> Cc: Marsh Hardy ARA/RISK; [email protected]
> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>
> The diffobj package (https://cran.r-project.org/package=diffobj) is
> really helpful here. It provides "diff" functions diffPrint(),
> diffStr(), and diffChr() to compare two objects 'x' and 'y' and provide
> neat colorized summary output.
>
> Example:
>
> > iris2 <- iris
> > iris2[122:125,4] <- iris2[122:125,4] + 0.1
>
> > diffobj::diffPrint(iris2, iris)
> < iris2
> > iris
> @@ 121,8 / 121,8 @@
> ~     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
>   120          6.0         2.2          5.0         1.5 virginica
>   121          6.9         3.2          5.7         2.3 virginica
> < 122          5.6         2.8          4.9         2.1 virginica
> > 122          5.6         2.8          4.9         2.0 virginica
> < 123          7.7         2.8          6.7         2.1 virginica
> > 123          7.7         2.8          6.7         2.0 virginica
> < 124          6.3         2.7          4.9         1.9 virginica
> > 124          6.3         2.7          4.9         1.8 virginica
> < 125          6.7         3.3          5.7         2.2 virginica
> > 125          6.7         3.3          5.7         2.1 virginica
>   126          7.2         3.2          6.0         1.8 virginica
>   127          6.2         2.8          4.8         1.8 virginica
>
> What's not shown here is that the colored output (supported by many
> terminals these days) also highlights exactly which elements in those
> rows differ.
>
> /Henrik
>
> On Sun, Jan 28, 2018 at 12:17 AM, Ulrik Stervbo <[email protected]>
> wrote:
> > The anti_join from the package dplyr might also be handy.
> >
> > install.packages("dplyr")
> > library(dplyr)
> > anti_join(x1, x2)
> >
> > You can get help on the different functions by ?function.name(), so
> > ?anti_join() will bring you help - and examples - on the anti_join
> > function.
> >
> > It might be worth testing your approach on a small subset of the data. That
> > makes it easier for you to follow what happens and evaluate the outcome.
> >
> > HTH
> > Ulrik
> >
> > Marsh Hardy ARA/RISK <[email protected]> wrote on Sun, Jan 28, 2018, 04:14:
> >
> >> Cool, looks like that'd do it, almost as if converting an entire record to
> >> a character string and comparing strings.
> >>
> >> ________________________________________
> >> From: William Dunlap [[email protected]]
> >> Sent: Saturday, January 27, 2018 4:57 PM
> >> To: Marsh Hardy ARA/RISK
> >> Cc: Ulrik Stervbo; Eric Berger; [email protected]
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> If your two objects have class "data.frame" (look at class(objectName))
> >> and they both have the same number of columns and the same order of
> >> columns and the column types match closely enough (use all.equal(x1, x2)
> >> for that), then you can try
> >>    which( rowSums( x1 != x2 ) > 0)
> >> E.g.,
> >> > x1 <- data.frame(X=1:5, Y=rep(c("A","B"),c(3,2)))
> >> > x2 <- data.frame(X=c(1,2,-3,-4,5), Y=rep(c("A","B"),c(2,3)))
> >> > x1
> >>   X Y
> >> 1 1 A
> >> 2 2 A
> >> 3 3 A
> >> 4 4 B
> >> 5 5 B
> >> > x2
> >>    X Y
> >> 1  1 A
> >> 2  2 A
> >> 3 -3 B
> >> 4 -4 B
> >> 5  5 B
> >> > which( rowSums( x1 != x2 ) > 0)
> >> [1] 3 4
> >>
> >> If you want to allow small numeric differences but exact character
> >> matches you will have to get a bit fancier. Splitting the data.frames into
> >> character and numeric parts and comparing each works well.
> >>
> >> Bill Dunlap
> >> TIBCO Software
> >> wdunlap tibco.com
> >>
> >> On Sat, Jan 27, 2018 at 1:18 PM, Marsh Hardy ARA/RISK <[email protected]> wrote:
> >> Hi Guys, I apologize for my rank & utter newness at R.
> >>
> >> I used summary() and found about 95 variables, both character and numeric,
> >> all with "Length:368842", which I assume is the # of records.
> >>
> >> I'd like to know the record number (row #?) of any record where the data
> >> doesn't match in the 2 files of what should be the same output.
> >>
> >> Thanks in advance, M.
> >>
> >> //
> >> ________________________________________
> >> From: Ulrik Stervbo [[email protected]]
> >> Sent: Saturday, January 27, 2018 10:00 AM
> >> To: Eric Berger
> >> Cc: Marsh Hardy ARA/RISK; [email protected]
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> Also, it will be easier to provide helpful information if you'd describe
> >> what in your data you want to compare and what you hope to get out of the
> >> comparison.
> >>
> >> Best wishes,
> >> Ulrik
> >>
> >> Eric Berger <[email protected]> wrote on Sat, Jan 27, 2018, 08:18:
> >> Hi Marsh,
> >> An RDS is not a data structure such as a data.frame. It can be anything.
> >> For example if I want to save my objects a, b, c I could do:
> >> > saveRDS( list(a,b,c), file="tmp.RDS")
> >> Then read them back later with
> >> > myList <- readRDS( "tmp.RDS" )
> >>
> >> Do you have additional information about your "RDSs" ?
> >>
> >> Eric
> >>
> >>
> >> On Sat, Jan 27, 2018 at 6:54 AM, Marsh Hardy ARA/RISK <[email protected]>
> >> wrote:
> >>
> >> > Each RDS is 40 MBs. What's a slick code to compare them row by row, IDing
> >> > row numbers with mismatches?
> >> >
> >> > Thanks in advance.
> >> >
> >> > //

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

