Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
Currently unique() does the work of duplicated() internally and then extracts the unique elements. One could make a countUnique that simply counts, rather than allocating the logical return value of duplicated(). So much of the cost is in the hash operation, though, that it probably won't help much; that might depend on the sizes of things. The more unique elements, the better it would perform.

On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty wrote:
> How about unique them both and compare the lengths? It's less work,
> especially allocation.

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
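As a rough illustration of the countUnique idea, the sketch below only pins down the intended semantics in plain R. The name count_unique is hypothetical, and this version still allocates the logical vector from duplicated(); the saving discussed in the message would need a C-level implementation that counts inside the hash loop.

```r
# Hypothetical helper: number of distinct elements of x.
# NOTE: this still allocates the logical vector returned by duplicated();
# it only illustrates the semantics a C-level countUnique would have.
count_unique <- function(x) sum(!duplicated(x))

x <- c(3L, 1L, 3L, 2L, 1L)

# Agrees with the allocate-then-measure spelling.
count_unique(x) == length(unique(x))
```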
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
I was thinking something like:

    setequal <- function(x, y) {
        xu = unique(x)
        yu = unique(y)
        if (length(xu) != length(yu)) {
            return(FALSE)
        }
        return(all(match(xu, yu, 0L) > 0L))
    }

This lets you fail early for cheap (skipping the allocation from the "> 0L" comparisons). Whether or not this goes fast depends a lot on the uniqueness of x and y and on whether you want to optimize for the TRUE or the FALSE case. You'd do much better to make some real hashes in C and compare the keys, but it's probably not worth the complexity.

Pete

Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty wrote:
> How about unique them both and compare the lengths? It's less work,
> especially allocation.
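The early exit works because two sets can only be equal if they have the same number of distinct elements, and once the counts agree, containment in one direction implies equality. A small sketch of the three cases follows; the helper name set_equal_unique is hypothetical (chosen so that base::setequal is not masked) and the inputs are made up.

```r
set_equal_unique <- function(x, y) {
    xu <- unique(x)
    yu <- unique(y)
    # Different numbers of distinct elements: cannot be equal, skip match().
    if (length(xu) != length(yu)) return(FALSE)
    # Same count, so xu being contained in yu is enough.
    all(match(xu, yu, 0L) > 0L)
}

set_equal_unique(c(1, 2, 2, 3), c(3, 1, 2))   # same distinct elements
set_equal_unique(c(1, 2), c(1, 2, 3))         # rejected by the length check
set_equal_unique(c(1, 2), c(1, 3))            # same count, rejected by match()
```

Comparing lengths alone is therefore only a cheap rejection test; the match() call is still needed when the counts agree.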
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
Try this out. It looks like a 2X speedup for some cases and a wash in others. unique does two allocations, but skipping the "> 0L" allocation could make up for it.

    library(microbenchmark)
    library(RUnit)

    x = sample.int(1e4, 1e5, TRUE)
    y = sample.int(1e4, 1e5, TRUE)

    # Variant 1: early exit on length, then the usual "> 0L" test.
    set_equal <- function(x, y) {
        # Call the internal directly, bypassing the unique() generic.
        xu = .Internal(unique(x, FALSE, FALSE, NA))
        yu = .Internal(unique(y, FALSE, FALSE, NA))
        if (length(xu) != length(yu)) {
            return(FALSE)
        }
        return(all(match(xu, yu, 0L) > 0L))
    }

    # Variant 2: same early exit, but test for NA instead of comparing to 0L.
    set_equal2 <- function(x, y) {
        xu = .Internal(unique(x, FALSE, FALSE, NA))
        yu = .Internal(unique(y, FALSE, FALSE, NA))
        if (length(xu) != length(yu)) {
            return(FALSE)
        }
        return(!anyNA(match(xu, yu)))
    }

    microbenchmark(
        a = setequal(x, y),
        b = set_equal(x, y),
        c = set_equal2(x, y)
    )
    checkIdentical(setequal(x, y), set_equal(x, y))
    checkIdentical(setequal(x, y), set_equal2(x, y))

    # Same comparison when the two inputs are identical.
    x = y
    microbenchmark(
        a = setequal(x, y),
        b = set_equal(x, y),
        c = set_equal2(x, y)
    )
    checkIdentical(setequal(x, y), set_equal(x, y))
    checkIdentical(setequal(x, y), set_equal2(x, y))

Sorry, I'm probably over-posting today.

Regards,
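The two variants differ only in the final membership test. A short check of why they agree: match() returns NA for unmatched elements by default, so !anyNA(match(xu, yu)) asks the same question as all(match(xu, yu, 0L) > 0L) without building the intermediate logical vector. The helper names via_zero and via_na and the data below are made up for illustration.

```r
xu <- c(1L, 2L, 3L)
yu <- c(3L, 2L, 1L)
zu <- c(3L, 2L, 4L)

match(xu, yu)        # every element of xu has a position in yu
match(xu, zu)        # NA where xu has no counterpart in zu

# The two spellings of "is every element of a found in b?"
via_zero <- function(a, b) all(match(a, b, 0L) > 0L)
via_na   <- function(a, b) !anyNA(match(a, b))

identical(via_zero(xu, yu), via_na(xu, yu))
identical(via_zero(xu, zu), via_na(xu, zu))
```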
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
On 01/08/2015 01:30 PM, peter dalgaard wrote:
> If you look at the definition of %in%, you'll find that it is implemented
> using match, so if we did as you suggest, I give it about three days before
> someone suggests to inline the function call...

But you wouldn't bet money on that, right? Because you know you would lose.

> Readability of source code is not usually our prime concern.

Don't sacrifice readability if you do not have a good reason for it. What's your reason here? Are you seriously suggesting that inlining makes a significant difference? As Michael pointed out, the expensive operation here is the hashing. But sadly some people like inlining and want to use it everywhere: it's easy and they feel good about it, even if it hurts readability and maintainability. If you use x %in% y instead of the inlined version, then the day someone changes the implementation of x %in% y for something faster, or fixes a bug in it, your code will automatically benefit; right now it won't. More simply put: good readability generally leads to better code.

> The idea does have some merit, though. Apropos, why is there no
> setcontains()?

Wait... shouldn't everybody use all(match(x, y, nomatch = 0L) > 0L)?

H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
> why is there no setcontains()?

Several packages define is.subset(), which I am assuming is what you are proposing, but with its arguments reversed. E.g., package:algstat has

    is.subset <- function(x, y) all(x %in% y)
    containsQ <- function(y, x) all(x %in% y)

and package:rje has essentially the same is.subset. package:arulesSequences and package:arules have an S4 generic called is.subset, which is entirely different (it is not a predicate, but returns a matrix).

Bill Dunlap
TIBCO Software
wdunlap tibco.com
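To make the argument-order point concrete, here is a small usage sketch. The two definitions are copied from the message rather than loaded from algstat, and the vectors are invented for illustration.

```r
is.subset <- function(x, y) all(x %in% y)   # is x contained in y?
containsQ <- function(y, x) all(x %in% y)   # does y contain x?

small <- c("a", "b")
big   <- c("a", "b", "c")

is.subset(small, big)   # small is a subset of big
containsQ(big, small)   # same question, arguments swapped
is.subset(big, small)   # the reverse containment does not hold
```

A setcontains(x, y) in the style of the other set functions would presumably follow the containsQ order, but that is a guess about a function that does not exist in base R.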
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
How about unique them both and compare the lengths? It's less work, especially allocation.

Pete

Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com
Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup
If you look at the definition of %in%, you'll find that it is implemented using match, so if we did as you suggest, I give it about three days before someone suggests to inline the function call... Readability of source code is not usually our prime concern.

The idea does have some merit, though. Apropos, why is there no setcontains()?

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
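The definition is easy to inspect at the prompt. In base R, %in% is a one-line wrapper around match() with nomatch = 0L, which is why the two spellings give identical results; printing the function shows the exact definition for the running version. The sample vectors are illustrative.

```r
# Print the definition shipped with the running R version.
print(`%in%`)

x <- c(5L, 1L, NA, 7L)
y <- c(1L, 7L, 9L)

# Both spellings produce the same logical vector, with no NA in the result.
identical(x %in% y, match(x, y, nomatch = 0L) > 0L)
```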
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
Hi,

Current implementation:

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
    }

First, what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > 0L' with 'x %in% y' and 'y %in% x', respectively? They're strictly equivalent but the latter form is a lot more readable than the former (isn't this the raison d'être of %in%?):

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(c(x %in% y, y %in% x))
    }

Furthermore, replacing 'all(c(x %in% y, y %in% x))' with 'all(x %in% y) && all(y %in% x)' improves readability even more and, more importantly, reduces memory footprint significantly on big vectors (e.g. by 15% on integer vectors with 15M elements):

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(x %in% y) && all(y %in% x)
    }

It also seems to speed things up a little bit (not in a significant way though).

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
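A rough way to see where the memory saving comes from: c() has to allocate a new logical vector of length(x) + length(y) before all() can inspect it, whereas the && form reduces each %in% result separately and skips the second match entirely when the first all() is FALSE. The sizes below are arbitrary, and object.size() only measures the temporary, not peak usage, so this is an illustration rather than a reproduction of the 15% figure.

```r
x <- sample.int(1000L, 1e5, replace = TRUE)
y <- sample.int(1000L, 1e5, replace = TRUE)

in_xy <- x %in% y
in_yx <- y %in% x

# The concatenated form needs this extra temporary ...
object.size(c(in_xy, in_yx))

# ... while both forms return the same answer.
identical(all(c(in_xy, in_yx)), all(in_xy) && all(in_yx))
```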