[Rd] I() in merge (was: Re: xftrm is more than 100x slower for AsIs than for character vectors)

2024-07-16 Thread Hilmar Berger via R-devel

Dear all,

actually, it is not clear to me why there is still a protection of the
added Row.names column in merge using I(). This seems to stem from a
time when R would automatically convert character vectors to factor in
data.frame on insert. However, I can't reproduce this behaviour even in
data.frames generated with stringsAsFactors = T in current versions of
R. Maybe the I() inserted in r 39026 can be removed altogether?

Best regards

Hilmar

On 14.07.24 19:09, HB via R-devel wrote:

Dear Ivan,

thanks for the confirmation and the proposed patch.

I just wanted to add some notes regarding the relevance of this: base::merge 
using by.x=0 or by.y=0 (i.e. matching on row.names) will automatically add a 
column Row.names which is I(row.names(x)) to the corresponding input table 
(using I() since  revision 39026 to avoid conversion of character to factor). 
When this column is used for sorting (sort=TRUE by default in merge; should 
happen at least if all.x=T or all.y=T), this will result in slower execution.

xtfrm.AsIs is unchanged since its addition in r50992 (likely unrelated to the 
former).

So I guess that this just went unnoticed since it will not cause problems on 
small data frames.

Best regards

Hilmar

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] xftrm is more than 100x slower for AsIs than for character vectors

2024-07-12 Thread Hilmar Berger via R-devel

Good evening,

I recently have observed slow merges when combining multiple data frames
derived from DataFrame and base::data.frame. I observed that the index
column of intermediate tables was of class  (automatically
converted from character). The problem occurred mainly when using the
sorted = T option in base::merge.

This can be traced to xtfrm.AsIs being more than 100 times slower than
the comparable function for character vectors.

x = paste0("A_", 1:1e5)
system.time({o <- xtfrm(x)})

#  user  system elapsed
#  0.325   0.005   0.332

x <- I(x)
system.time({o <- xtfrm(x)}) # this calls xtfrm.AsIs

# user  system elapsed
# 26.153   0.016  26.388

This can be finally traced to base::rank() (called from xtfrm.default),
where I found that

"NB: rank is not itself generic but xtfrm is, and rank(xtfrm(x), )
will have the desired result if there is a xtfrm method. Otherwise, rank
will make use of ==, >, is.na and extraction methods for classed
objects, possibly rather slowly. "

This *sounds* like the existence of xtfrm.AsIs should already be able to
accelerate the ranking, but this does not seem to work. xtfrm.AsIs does
not do anything for my case of class(x) == "AsIs" and just delegates to
xtfrm.default.

As a quick solution (and if there is no other fix), could we possibly
add a note to the help page of I() that sorting/ordering/ranking of AsIs
columns will be rather slow?

Thanks a lot!

Best regards

Hilmar

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; 
LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8    LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=de_DE.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats graphics  grDevices utils datasets  methods base

loaded via a namespace (and not attached):
[1] compiler_4.4.1

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Partial matching performance in data frame rownames using [

2023-12-21 Thread Hilmar Berger via R-devel

Dear Toby and Ivan,

thanks a lot for the proposed patch and this detailed analysis. The
timing analysis nicely shows what I suspected - that partial matching in
large tables (>>10^5 rows) can get prohibitively slow. For 10^6 rows
with 50% non-hits in exact matching I roughly would expect 10,000
seconds, i.e. almost 3h.

That would be quite slow even if one would want partial matching. My
suspicion, however, is that most users do not want partial matching at
all and use row name indexing using character vectors in the same way as
applied in data.table or tibble, i.e. as a unique key to table rows.

I can't remember a valid use case where I would have used partial
matching for a rowname index in the last 10 years, and I would be happy
to learn how widespread such use cases are.

Regarding the workaround, I do not fully agree that adding match() to
the call of [.data.frame() would be a preferable solution. In cases
where one cannot exclude that the data.frame will grow large and that
there might be considerable proportions of non-hits in exact matching,
the workaround would have to applied always in order to achieve a
predictable performance. Which, in my opinion, raises the question if
and when the ordinary, partial matching option would be still applicable.

I am not knowledgeable to say how much work it would be, but I believe
that we could test the impact of Ivan's proposed solution by running
CRAN/BioC package tests against a patched R compared to an unpatched
one. I can offer to have a look at failing test cases to see if those
are intentional or unintentional uses of partial matching.

Best regards

Hilmar

On 19.12.23 21:57, Toby Hocking wrote:

Hi Hilmar and Ivan,
I have used your code examples to write a blog post about this topic,
which has figures that show the asymptotic time complexity of the
various approaches,
https://tdhock.github.io/blog/2023/df-partial-match/
The asymptotic complexity of partial matching appears to be quadratic
O(N^2) whereas the other approaches are asymptotically faster: linear
O(N) or log-linear O(N log N).
I think that accepting Ivan's pmatch.rows patch would add un-necessary
complexity to base R, since base R already provides an efficient
work-around, d1[match(q1,rownames(d1)),]
I do think the CheckUserInterrupt patch is a good idea, though.
Best,
Toby

On Sat, Dec 16, 2023 at 2:49 AM Ivan Krylov  wrote:

On Wed, 13 Dec 2023 09:04:18 +0100
Hilmar Berger via R-devel  wrote:


Still, I feel that default partial matching cripples the functionality
of data.frame for larger tables.

Changing the default now would require a long deprecation cycle to give
everyone who uses `[.data.frame` and relies on partial matching
(whether they know it or not) enough time to adjust.

Still, adding an argument feels like a small change: edit
https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and
add a condition before calling pmatch(). Adjust the warning() for named
arguments. Don't forget to document the new argument in the man page at
https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd

Index: src/library/base/R/dataframe.R
===
--- src/library/base/R/dataframe.R  (revision 85664)
+++ src/library/base/R/dataframe.R  (working copy)
@@ -591,14 +591,14 @@
  ###  These are a little less general than S

  `[.data.frame` <-
-function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1)
+function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, 
pmatch.rows = TRUE)
  {
  mdrop <- missing(drop)
  Narg <- nargs() - !mdrop  # number of arg from x,i,j that were specified
  has.j <- !missing(j)
-if(!all(names(sys.call()) %in% c("", "drop"))
+if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows"))
 && !isS4(x)) # at least don't warn for callNextMethod!
-warning("named arguments other than 'drop' are discouraged")
+warning("named arguments other than 'drop', 'pmatch.rows' are 
discouraged")

  if(Narg < 3L) {  # list-like indexing or matrix indexing
  if(!mdrop) warning("'drop' argument will be ignored")
@@ -679,7 +679,11 @@
  ## for consistency with [, ]
  if(is.character(i)) {
  rows <- attr(xx, "row.names")
-i <- pmatch(i, rows, duplicates.ok = TRUE)
+i <- if (pmatch.rows) {
+pmatch(i, rows, duplicates.ok = TRUE)
+} else {
+match(i, rows)
+}
  }
  ## need to figure which col was selected:
  ## cannot use .subset2 directly as that may
@@ -699,7 +703,11 @@
   # as this can be expensive.
  if(is.character(i)) {
  rows <- attr(xx, "row.names")
-i

Re: [Rd] Partial matching performance in data frame rownames using [

2023-12-13 Thread Hilmar Berger via R-devel

Dear Ivan,

thanks a lot, that is helpful.

Still, I feel that default partial matching cripples the functionality
of data.frame for larger tables.

Thanks again and best regards

Hilmar

On 12.12.23 13:55, Ivan Krylov wrote:

В Mon, 11 Dec 2023 21:11:48 +0100
Hilmar Berger via R-devel  пишет:


What was unexpected is that in this case was that [.data.frame was
hanging for a long time (I waited about 10 minutes and then restarted
R). Also, this cannot be interrupted in interactive mode.

That's unfortunate. If an operation takes a long time, it ought to be
interruptible. Here's a patch that passes make check-devel:

--- src/main/unique.c   (revision 85667)
+++ src/main/unique.c   (working copy)
@@ -1631,6 +1631,7 @@
}
  }

+unsigned int ic = ;
  if(nexact < n_input) {
/* Second pass, partial matching */
for (R_xlen_t i = 0; i < n_input; i++) {
@@ -1642,6 +1643,10 @@
mtch = 0;
mtch_count = 0;
for (int j = 0; j < n_target; j++) {
+   if (!--ic) {
+   R_CheckUserInterrupt();
+   ic = ;
+   }
if (no_dups && used[j]) continue;
if (strncmp(ss, tar[j], temp) == 0) {
mtch = j + 1;



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Partial matching performance in data frame rownames using [

2023-12-11 Thread Hilmar Berger via R-devel

Dear all,

I have seen that others have discussed the partial matching behaviour of
data.frame[idx,] in the past, in particular with respect to unexpected
results sets.

I am aware of the fact that one can work around this using either
match() or switching to tibble/data.table or similar altogether.

I have a different issue with the partial matching, in particular its
performance when used on large data frames or more specifically, with
large queries matched against its row names.

I came across a case where I wanted to extract data from a large table
(approx 1M rows) using an index which matched only about 50% to the row
names, i.e. about 50% row name hits and 50% misses.

What was unexpected is that in this case was that [.data.frame was
hanging for a long time (I waited about 10 minutes and then restarted
R). Also, this cannot be interrupted in interactive mode.

ids <- paste0("cg", sprintf("%06d",0:(1e6-1)))
d1 <- data.frame(row.names=ids, v=1:(1e6) )

q1 <- sample(ids, 1e6, replace=F)
system.time({r <- d1[q1,,drop=F]})
#   user  system elapsed
#  0.464   0.000   0.465

# those will hang a long time, I stopped R after 10 minutes
q2 <- c(q1[1:5e5], gsub("cg", "ct", q1[(5e5+1):1e6]) )
system.time({r <- d1[q2,,drop=F]})

# same here
q3 <- c(q1[1:5e5], rep("FOO",5e5) )
system.time({r <- d1[q3,,drop=F]})

It seems that the penalty of partial matching the non-hits across the
whole row name vector is not negligible any more with large tables and
queries, compared to small and medium tables.

I checked and pmatch(q2, rownames(d1) is equally slow.

Is there a chance to a) document this in the help page ("with large
indexes/tables use match()") or even better b) add an exact flag to
[.data.frame ?

Thanks a lot!

Best regards

Hilmar

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel