Dear Toby and Ivan,

thanks a lot for the proposed patch and this detailed analysis. The
timing analysis nicely shows what I suspected - that partial matching in
large tables (>>10^5 rows) can get prohibitively slow. For 10^6 rows
with 50% non-hits in exact matching I roughly would expect 10,000
seconds, i.e. almost 3h.

That would be quite slow even if one would want partial matching. My
suspicion, however, is that most users do not want partial matching at
all and use row name indexing using character vectors in the same way as
applied in data.table or tibble, i.e. as a unique key to table rows.

I can't remember a valid use case where I would have used partial
matching for a rowname index in the last 10 years, and I would be happy
to learn how widespread such use cases are.

Regarding the workaround, I do not fully agree that adding match() to
the call of [.data.frame() would be a preferable solution. In cases
where one cannot exclude that the data.frame will grow large and that
there might be considerable proportions of non-hits in exact matching,
the workaround would have to applied always in order to achieve a
predictable performance. Which, in my opinion, raises the question if
and when the ordinary, partial matching option would be still applicable.

I am not knowledgeable to say how much work it would be, but I believe
that we could test the impact of Ivan's proposed solution by running
CRAN/BioC package tests against a patched R compared to an unpatched
one. I can offer to have a look at failing test cases to see if those
are intentional or unintentional uses of partial matching.

Best regards

Hilmar

On 19.12.23 21:57, Toby Hocking wrote:
Hi Hilmar and Ivan,
I have used your code examples to write a blog post about this topic,
which has figures that show the asymptotic time complexity of the
various approaches,
https://tdhock.github.io/blog/2023/df-partial-match/
The asymptotic complexity of partial matching appears to be quadratic
O(N^2) whereas the other approaches are asymptotically faster: linear
O(N) or log-linear O(N log N).
I think that accepting Ivan's pmatch.rows patch would add un-necessary
complexity to base R, since base R already provides an efficient
work-around, d1[match(q1,rownames(d1)),]
I do think the CheckUserInterrupt patch is a good idea, though.
Best,
Toby

On Sat, Dec 16, 2023 at 2:49 AM Ivan Krylov <krylov.r...@gmail.com> wrote:
On Wed, 13 Dec 2023 09:04:18 +0100
Hilmar Berger via R-devel <r-devel@r-project.org> wrote:

Still, I feel that default partial matching cripples the functionality
of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give
everyone who uses `[.data.frame` and relies on partial matching
(whether they know it or not) enough time to adjust.

Still, adding an argument feels like a small change: edit
https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and
add a condition before calling pmatch(). Adjust the warning() for named
arguments. Don't forget to document the new argument in the man page at
https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd

Index: src/library/base/R/dataframe.R
===================================================================
--- src/library/base/R/dataframe.R      (revision 85664)
+++ src/library/base/R/dataframe.R      (working copy)
@@ -591,14 +591,14 @@
  ###  These are a little less general than S

  `[.data.frame` <-
-    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1)
+    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, 
pmatch.rows = TRUE)
  {
      mdrop <- missing(drop)
      Narg <- nargs() - !mdrop  # number of arg from x,i,j that were specified
      has.j <- !missing(j)
-    if(!all(names(sys.call()) %in% c("", "drop"))
+    if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows"))
         && !isS4(x)) # at least don't warn for callNextMethod!
-        warning("named arguments other than 'drop' are discouraged")
+        warning("named arguments other than 'drop', 'pmatch.rows' are 
discouraged")

      if(Narg < 3L) {  # list-like indexing or matrix indexing
          if(!mdrop) warning("'drop' argument will be ignored")
@@ -679,7 +679,11 @@
              ## for consistency with [, <length-1>]
              if(is.character(i)) {
                  rows <- attr(xx, "row.names")
-                i <- pmatch(i, rows, duplicates.ok = TRUE)
+                i <- if (pmatch.rows) {
+                    pmatch(i, rows, duplicates.ok = TRUE)
+                } else {
+                    match(i, rows)
+                }
              }
              ## need to figure which col was selected:
              ## cannot use .subset2 directly as that may
@@ -699,7 +703,11 @@
                   # as this can be expensive.
      if(is.character(i)) {
          rows <- attr(xx, "row.names")
-        i <- pmatch(i, rows, duplicates.ok = TRUE)
+        i <- if (pmatch.rows) {
+            pmatch(i, rows, duplicates.ok = TRUE)
+        } else {
+            match(i, rows)
+        }
      }
      for(j in seq_along(x)) {
          xj <- xx[[ sxx[j] ]]
Index: src/library/base/man/Extract.data.frame.Rd
===================================================================
--- src/library/base/man/Extract.data.frame.Rd  (revision 85664)
+++ src/library/base/man/Extract.data.frame.Rd  (working copy)
@@ -15,7 +15,7 @@
    Extract or replace subsets of data frames.
  }
  \usage{
-\method{[}{data.frame}(x, i, j, drop = )
+\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE)
  \method{[}{data.frame}(x, i, j) <- value
  \method{[[}{data.frame}(x, ..., exact = TRUE)
  \method{[[}{data.frame}(x, i, j) <- value
@@ -45,6 +45,9 @@
      column is selected.}

     \item{exact}{logical: see \code{\link{[}}, and applies to column names.}
+
+   \item{pmatch.rows}{logical: whether to perform partial matching on
+     row names in case \code{i} is a character vector.}
  }
  \details{
    Data frames can be indexed in several modes.  When \code{[} and


system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]})
#    user  system elapsed
#   0.478   0.004   0.482

Unfortunately, that would be only the beginning. The prose in the whole
?`[.data.frame` would have to be updated; the new behaviour would have
to be tested in tests/**.R. There may be very good reasons why named
arguments to `[` other than drop= are discouraged for data.frames. I'm
afraid I lack the whole-project view to consider whether such an
addition would be safe.

--
Best regards,
Ivan

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to