Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:
 Allan Engelhardt wrote:
   
 Immaterial, yes, but it is always good to test :) and your solution
 *is* faster and it is even faster if you can assume byte strings:
 

 :)

 indeed;  though if the speed is immaterial (and in this case it
 supposedly was), it's probably not worth risking fixed=TRUE removing
 '.tif' from the middle of the name, however unlikely this might be (cf
 murphy's laws).

 but if you can assume that each string ends with a '.tif' (or any other
 \..{3} substring), then substr is marginally faster than sub, even as a
 three-pass approach, while avoiding the risk of removing '.tif' from the
 middle:

 strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
     paste(sample(letters, 10), collapse='')))
 library(rbenchmark)
 benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
     substr={basenames=basename(strings)
         substr(basenames, 1, nchar(basenames)-4)},
     sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
 #     test elapsed
 # 1 substr   3.176
 # 2    sub   3.296
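
to make the difference between the two strategies concrete, here is a small c sketch, assuming single-byte strings. drop_last and remove_first are hypothetical helpers that only mimic what substr(x, 1, nchar(x)-4) and sub('.tif', '', x, fixed=TRUE) do to a single string; they are not how R implements either function:

```c
#include <stddef.h>
#include <string.h>

/* drop the last n bytes of s in place; the whole string is dropped
 * when it is shorter than n (cf. substr(x, 1, nchar(x) - n)) */
void drop_last(char *s, size_t n)
{
    size_t len = strlen(s);
    s[len > n ? len - n : 0] = '\0';
}

/* remove the first occurrence of pat from s in place, wherever it is
 * (cf. sub(pat, '', x, fixed=TRUE)) */
void remove_first(char *s, const char *pat)
{
    size_t plen = strlen(pat);
    char *hit = strstr(s, pat);

    if (hit != NULL && plen > 0)
        /* shift the tail, including the terminating nul, over the match */
        memmove(hit, hit + plen, strlen(hit + plen) + 1);
}
```

for a name such as my.tiffany.tif, drop_last(s, 4) leaves my.tiffany, whereas remove_first(s, ".tif") leaves myfany.tif.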
   

btw., i wonder why negative indices default to 1 in substr:

substr('foobar', -5, 5)
# fooba
# substr('foobar', 1, 5)
substr('foobar', 2, -2)
# 
# substr('foobar', 2, 1)

this does not seem to be documented in ?substr.  there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):

# hypothetical
substr('foobar', -5, 5)
# ooba
# substr('foobar', 6-5+1, 5)
substr('foobar', 2, -2)
# ooba
# substr('foobar', 2, 6-2+1)
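
a minimal standalone sketch of this from-the-end convention in c, for illustration only. the names from_end and substr_from_end are hypothetical (they are not part of R), the code assumes single-byte characters, and the clamping of out-of-range indices is an extra safety measure of the sketch:

```c
#include <stddef.h>
#include <string.h>

/* hypothetical helper: map a 1-based index onto a string of length len;
 * a negative index counts from the end, so -1 is the last character */
long from_end(long idx, long len)
{
    return idx < 0 ? len + idx + 1 : idx;
}

/* copy characters start..stop (1-based, inclusive) of s into out, which
 * holds outsize bytes; assumes single-byte characters */
void substr_from_end(const char *s, long start, long stop,
                     char *out, size_t outsize)
{
    long len = (long) strlen(s);
    size_t n;

    if (outsize == 0)
        return;
    start = from_end(start, len);
    stop = from_end(stop, len);
    /* clamping is an addition of this sketch, to stay inside s */
    if (start < 1)
        start = 1;
    if (stop > len)
        stop = len;
    if (start > stop) {          /* empty or inverted range */
        out[0] = '\0';
        return;
    }
    n = (size_t) (stop - start + 1);
    if (n > outsize - 1)         /* truncate to the output buffer */
        n = outsize - 1;
    memcpy(out, s + (start - 1), n);
    out[n] = '\0';
}
```

with this mapping, substr_from_end("foobar", -5, 5, ...) and substr_from_end("foobar", 2, -2, ...) both give ooba, matching the hypothetical output above.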

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

svn co https://svn.r-project.org/R/trunk r-devel
cd r-devel
# modifications made to src/main/character.c
svn diff > character.c.diff
svn revert -R .
patch -p0 < character.c.diff
   
./configure
make
make check-all
# no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
    paste(sample(letters, 10), collapse='')))
library(rbenchmark)
benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
    substr=substr(basename(strings), 1, -5),
    'substr-nchar'={
        basenames=basename(strings)
        substr(basenames, 1, nchar(basenames)-4) },
    sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
#           test elapsed
# 1       substr   2.981
# 2 substr-nchar   3.206
# 3          sub   3.273

if this sounds interesting, i can update the docs accordingly.

vQ
Index: src/main/character.c
===================================================================
--- src/main/character.c	(revision 48667)
+++ src/main/character.c	(working copy)
@@ -244,7 +244,12 @@
 	ss = CHAR(el);
 	slen = strlen(ss); /* FIXME -- should handle embedded nuls */
 	buf = R_AllocStringBuffer(slen+1, cbuff);
-	if (start < 1) start = 1;
+	if (start == 0) 
+		start = 1;
+	else if (start < 0) 
+		start = slen + start + 1;
+	if (stop < 0) 
+		stop = slen + stop + 1;
 	if (start > stop || start > slen) {
 		buf[0] = '\0';
 	} else {
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] split strings

2009-05-28 Thread William Dunlap


Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

 -Original Message-
 From: r-devel-boun...@r-project.org 
 [mailto:r-devel-boun...@r-project.org] On Behalf Of Wacek Kusnierczyk
 Sent: Thursday, May 28, 2009 5:30 AM
 Cc: R help project; r-devel@r-project.org; Allan Engelhardt
 Subject: Re: [Rd] [R] split strings
 
 (diverted to r-devel, a source code patch attached)
 
 
 btw., i wonder why negative indices default to 1 in substr:
 
 substr('foobar', -5, 5)
 # fooba
 # substr('foobar', 1, 5)
 substr('foobar', 2, -2)
 # 
 # substr('foobar', 2, 1)
 
 this does not seem to be documented in ?substr.

Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1




Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
William Dunlap wrote:

 Would your patched code affect the following
 use of regexpr's output as input to substr, to
 pull out the matched text from the string?
 > x<-c("ooo","good food","bad")
 > r<-regexpr("o+", x)
 > substring(x,r,attr(r,"match.length")+r-1)
 [1] "ooo" "oo"  ""
   

no; same output

 > substr(x,r,attr(r,"match.length")+r-1)
 [1] "ooo" "oo"  ""
   

no; same output

 > r
 [1]  1  2 -1
 attr(,"match.length")
 [1]  3  2 -1
 > attr(r,"match.length")+r-1
 [1]  3  3 -3
 attr(,"match.length")
 [1]  3  2 -1
   

for the positive indices there is no change, as you might expect.

if i understand your concern, the issue is that regexpr returns -1 (with
the corresponding attribute -1) where there is no match.  in this case,
you expect "" as the substring.

if there is no match, we have:

start = r = -1 (the start index you provide)
stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)

for a string of length n, my patch computes the final indices as follows:

start' = n + start + 1
stop' = n + stop + 1

whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
that is, stop' < start', hence an empty string is returned, by virtue of
the original code.  (see the sources for details.)
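
the argument can also be checked mechanically. a small c sketch, assuming the index mapping from the patch (slen + index + 1 for negative values) followed by the emptiness test of the original code; the function names are made up for illustration:

```c
/* index mapping as in the patch: a zero start becomes 1, negative
 * values are offsets from the end of a string of length slen */
long patched_start(long start, long slen)
{
    if (start == 0)
        return 1;
    return start < 0 ? slen + start + 1 : start;
}

long patched_stop(long stop, long slen)
{
    return stop < 0 ? slen + stop + 1 : stop;
}

/* nonzero when substr would return an empty string, using the test
 * from the original code on the mapped indices */
int yields_empty(long start, long stop, long slen)
{
    long s = patched_start(start, slen);
    long e = patched_stop(stop, slen);

    return s > e || s > slen;
}
```

for a failed match (start = -1, stop = -3) the mapped indices are n and n - 2, so yields_empty(-1, -3, n) is nonzero for any n >= 0.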

does this answer your question?

vQ



Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
Wacek Kusnierczyk wrote:
 start = r = -1 (the start index you provide)
 stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)

 for a string of length n, my patch computes the final indices as follows:

 start' = n + start + 1
 stop' = n + stop + 1

 whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
   

except for that stop - start = -3 - -1 = -2, but that's still negative,
i.e., stop' < start'.
silly me, sorry.

vQ

 that is, stop' < start', hence an empty string is returned, by virtue of
 the original code.  (see the sources for details.)

 does this answer your question?


