Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread Michael Lawrence
Currently unique() does duplicated() internally and then extracts. One
could make a countUnique that simply counts, rather than allocate the
logical return value of duplicated(). So much of the cost is in the
hash operation, though, that it probably won't help much; that might
depend on the sizes of things. The more unique elements, the better it
would perform.
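
To make the idea concrete, here is a minimal R-level sketch. The name count_unique() is hypothetical (it is not an existing R function), and this version still allocates the logical vector returned by duplicated(); a real countUnique would have to be written in C to avoid that.

```r
# Illustration only: count_unique() is a hypothetical name, not part of R.
x <- c(3L, 1L, 3L, 2L, 1L)

# unique(x) gives the same result as extracting the non-duplicated elements.
stopifnot(identical(unique(x), x[!duplicated(x)]))

# R-level count of distinct values. This still allocates the logical
# vector from duplicated(), which a C-level counter would avoid.
count_unique <- function(x) sum(!duplicated(x))

count_unique(x)
```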


On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty haverty.pe...@gene.com
wrote:

> How about unique them both and compare the lengths?  It's less work,
> especially allocation.




__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread Peter Haverty
I was thinking something like:

setequal <- function(x, y) {
    xu = unique(x)
    yu = unique(y)
    if (length(xu) != length(yu)) { return(FALSE) }
    return( all(match(xu, yu, 0L) > 0L) )
}

This lets you fail early for cheap (skipping the allocation from the
"> 0L" comparisons).  Whether or not this goes fast depends a lot on the
uniqueness of x and y and whether or not you want to optimize for the
TRUE or FALSE case. You'd do much better to make some real hashes in C
and compare the keys, but it's probably not worth the complexity.
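
A small usage sketch of the approach above, cross-checked against base setequal(). The name set_equal_sketch() is hypothetical. The point it illustrates: once both vectors have the same number of distinct values, containment in one direction is enough to conclude the sets are equal.

```r
# set_equal_sketch() is a hypothetical name for the approach described above.
set_equal_sketch <- function(x, y) {
    xu <- unique(x)
    yu <- unique(y)
    # Different numbers of distinct values: the sets cannot be equal.
    if (length(xu) != length(yu)) return(FALSE)
    # Same number of distinct values, so one-directional containment suffices.
    all(match(xu, yu, 0L) > 0L)
}

cases <- list(
    list(c(1, 2, 2, 3), c(3, 1, 2)),   # same set, duplicates ignored
    list(c(1, 2),       c(1, 2, 3)),   # fails early on the length check
    list(c(1, 2, 4),    c(1, 2, 3))    # same size, different members
)

# Compare the sketch against base setequal() on each pair.
for (p in cases) {
    stopifnot(identical(set_equal_sketch(p[[1]], p[[2]]),
                        setequal(p[[1]], p[[2]])))
}
```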




Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com



Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread Peter Haverty
Try this out. It looks like a 2X speedup for some cases and a wash in
others.  unique does two allocations, but skipping the "> 0L" allocation
could make up for it.

library(microbenchmark)
library(RUnit)

x = sample.int(1e4, 1e5, TRUE)
y = sample.int(1e4, 1e5, TRUE)

# Compare the number of unique values first, then match in one direction.
set_equal <- function(x, y) {
    xu = .Internal(unique(x, FALSE, FALSE, NA))
    yu = .Internal(unique(y, FALSE, FALSE, NA))
    if (length(xu) != length(yu)) {
        return(FALSE)
    }
    return( all(match(xu, yu, 0L) > 0L) )
}

# Same idea, but test for NA in the match result instead of comparing to 0L.
set_equal2 <- function(x, y) {
    xu = .Internal(unique(x, FALSE, FALSE, NA))
    yu = .Internal(unique(y, FALSE, FALSE, NA))
    if (length(xu) != length(yu)) {
        return(FALSE)
    }
    return( !anyNA(match(xu, yu)) )
}

# Case 1: x and y sampled independently.
microbenchmark(
    a = setequal(x, y),
    b = set_equal(x, y),
    c = set_equal2(x, y)
)
checkIdentical(setequal(x, y), set_equal(x, y))
checkIdentical(setequal(x, y), set_equal2(x, y))

# Case 2: identical vectors.
x = y
microbenchmark(
    a = setequal(x, y),
    b = set_equal(x, y),
    c = set_equal2(x, y)
)
checkIdentical(setequal(x, y), set_equal(x, y))
checkIdentical(setequal(x, y), set_equal2(x, y))
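
The two variants differ only in how they test the match() result. The tiny example below (made-up vectors, not from the benchmark) suggests why the two tests agree: with the default nomatch, a missing element yields NA, and an NA in the input still matches an NA in the table, so anyNA() only fires on a genuine miss.

```r
# Made-up vectors for illustration.
xu <- c(1L, 2L, NA)
yu <- c(NA, 2L, 1L, 5L)
zu <- c(1L, 2L, 7L)

# Every element of xu (including NA) is found in yu: both tests say TRUE.
found_a <- all(match(xu, yu, 0L) > 0L)
found_b <- !anyNA(match(xu, yu))

# 7L is missing from yu: both tests say FALSE.
miss_a <- all(match(zu, yu, 0L) > 0L)
miss_b <- !anyNA(match(zu, yu))

stopifnot(identical(found_a, found_b), identical(miss_a, miss_b))
```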


Sorry, I'm probably over-posting today.

Regards,



Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread Hervé Pagès

On 01/08/2015 01:30 PM, peter dalgaard wrote:

> If you look at the definition of %in%, you'll find that it is implemented using
> match, so if we did as you suggest, I give it about three days before someone
> suggests to inline the function call...

But you wouldn't bet money on that, right? Because you know you would
lose.

> Readability of source code is not usually our prime concern.

Don't sacrifice readability if you do not have a good reason for it.
What's your reason here? Are you seriously suggesting that inlining
makes a significant difference? As Michael pointed out, the expensive
operation here is the hashing. But sadly some people like inlining and
want to use it everywhere: it's easy and they feel good about it, even
if it hurts readability and maintainability (if you use x %in% y
instead of the inlined version, the day someone changes the
implementation of x %in% y for something faster, or fixes a bug
in it, your code will automatically benefit; right now it won't).

More simply put: good readability generally leads to better code.

> The idea does have some merit, though.
>
> Apropos, why is there no setcontains()?

Wait... shouldn't everybody use all(match(x, y, nomatch = 0L) > 0L) ?

H.

> -pd






--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319



Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread William Dunlap
> why is there no setcontains()?

Several packages define is.subset(), which I am assuming is what you are
proposing, but with its arguments reversed.  E.g., package:algstat has
   is.subset <- function(x, y) all(x %in% y)
   containsQ <- function(y, x) all(x %in% y)
and package:rje has essentially the same is.subset.

package:arulesSequences and package:arules have an S4 generic called
is.subset, which is entirely different (it is not a predicate, but returns
a matrix).
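
For comparison, a setcontains() along the lines asked about might look like the sketch below. This is a hypothetical definition, not something in base R or in the packages named above; the argument order (container first) is an assumption and is the reverse of is.subset(x, y).

```r
# Hypothetical helper: TRUE when every element of y is also in x.
setcontains <- function(x, y) all(y %in% x)

setcontains(1:5, c(2, 4))   # is {2, 4} contained in {1, ..., 5}?
setcontains(c(2, 4), 1:5)   # the reverse question

# Set equality is containment in both directions.
a <- c(1, 2, 2, 3)
b <- c(3, 1, 2)
stopifnot(identical(setcontains(a, b) && setcontains(b, a), setequal(a, b)))
```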


Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread Peter Haverty
How about unique them both and compare the lengths?  It's less work,
especially allocation.



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com



Re: [Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-08 Thread peter dalgaard
If you look at the definition of %in%, you'll find that it is implemented using 
match, so if we did as you suggest, I give it about three days before someone 
suggests to inline the function call... Readability of source code is not 
usually our prime concern.
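
For readers who want to check this themselves, the snippet below prints the definition of %in% in the running session and compares the operator with an explicit match() call on a small made-up example. What gets printed depends on the R version in use, so no output is shown.

```r
# Print the definition of %in% as found in the base environment.
print(get("%in%", envir = baseenv()))

# Made-up example vectors.
x <- c("a", "b", "z")
y <- c("b", "c", "a")

# The operator and the explicit match() call give identical results here.
same <- identical(x %in% y, match(x, y, nomatch = 0L) > 0L)
same
```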

The idea does have some merit, though.

Apropos, why is there no setcontains()?

-pd


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com



[Rd] setequal: better readability, reduced memory footprint, and minor speedup

2015-01-06 Thread Hervé Pagès

Hi,

Current implementation:

  setequal <- function (x, y)
  {
    x <- as.vector(x)
    y <- as.vector(y)
    all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
  }

First, what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > 0L'
with 'x %in% y' and 'y %in% x', respectively? They're strictly
equivalent but the latter form is a lot more readable than the former
(isn't this the raison d'être of %in%?):

  setequal <- function (x, y)
  {
    x <- as.vector(x)
    y <- as.vector(y)
    all(c(x %in% y, y %in% x))
  }

Furthermore, replacing 'all(c(x %in% y, y %in% x))' with
'all(x %in% y) && all(y %in% x)' improves readability even more and,
more importantly, reduces memory footprint significantly on big vectors
(e.g. by 15% on integer vectors with 15M elements):

  setequal <- function (x, y)
  {
    x <- as.vector(x)
    y <- as.vector(y)
    all(x %in% y) && all(y %in% x)
  }

It also seems to speed up things a little bit (not in a significant
way though).
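
A small sketch (made-up vectors) checking that the three forms agree, and showing one reason the last form does less work: c() has to build a combined logical vector of length(x) + length(y), while && never concatenates and skips its right-hand side entirely when the left-hand side is already FALSE.

```r
# Made-up vectors: 99L is in x but not in y.
x <- c(1L, 2L, 99L)
y <- c(2L, 1L, 1L)

# The three forms discussed above.
v1 <- all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
v2 <- all(c(x %in% y, y %in% x))
v3 <- all(x %in% y) && all(y %in% x)
stopifnot(identical(v1, v2), identical(v2, v3))

# With &&, the right-hand side is not evaluated when the left is FALSE.
evaluated <- FALSE
res <- all(x %in% y) && { evaluated <- TRUE; all(y %in% x) }
stopifnot(!res, !evaluated)
```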

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319
