Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
Wacek Kusnierczyk wrote:
> William Dunlap wrote:
>   
>> Would your patched code affect the following
>> use of regexpr's output as input to substr, to
>> pull out the matched text from the string?
>>> x<-c("ooo","good food","bad")
>>> r<-regexpr("o+", x)
>>> substring(x,r,attr(r,"match.length")+r-1)
>>[1] "ooo" "oo"  ""   
>>   
>> 
>
> no; same output
>
>   
>>> substr(x,r,attr(r,"match.length")+r-1)
>>[1] "ooo" "oo"  ""   
>>   
>> 
>
> no; same output
>
>   
>>> r
>>[1]  1  2 -1
>>attr(,"match.length")
>>[1]  3  2 -1
>>> attr(r,"match.length")+r-1
>>[1]  3  3 -3
>>attr(,"match.length")
>>[1]  3  2 -1
>>   
>> 
>
> for the positive indices there is no change, as you might expect.
>
> if i understand your concern, the issue is that regexpr returns -1 (with
> the corresponding attribute -1) where there is no match.  in this case,
> you expect "" as the substring. 
>
> if there is no match, we have:
>
> start = r = -1 (the start index you provide)
> stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)
>
> for a string of length n, my patch computes the final indices as follows:
>
> start' = n + start + 1
> stop' = n + stop + 1
>
> whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
>   

except that stop - start = -3 - (-1) = -2, but that's still negative,
i.e., stop' < start'.
silly me, sorry.
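
The corrected arithmetic can be checked mechanically. The sketch below is illustrative only: `from_end` and `no_match_is_empty` are made-up names, not functions from R's sources. It applies the index rule from the attached diff (a negative index i becomes slen + i + 1) to the no-match pair start = -1, stop = -3. The additive constant cancels in the difference, so stop' - start' is -2 for every string length.

```c
#include <assert.h>

/* Hypothetical helper: map a negative 1-based index to a position
   counted from the end of a string of length slen, following the
   rule in the diff (i -> slen + i + 1). Non-negative indices are
   returned unchanged. */
int from_end(int slen, int i)
{
    return i < 0 ? slen + i + 1 : i;
}

/* Hypothetical helper: returns 1 if the no-match pair
   (start = -1, stop = -3) still gives stop' < start' for a string
   of length slen, i.e. an empty substring; returns 0 otherwise. */
int no_match_is_empty(int slen)
{
    int start = from_end(slen, -1);
    int stop  = from_end(slen, -3);
    return stop < start;
}
```

For "bad" (slen = 3) this gives start' = 3 and stop' = 1, so the `start > stop` test in the existing code returns the empty string.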

vQ

> that is, stop' < start', hence an empty string is returned, by virtue of
> the original code.  (see the sources for details.)
>
> does this answer your question?
>
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
William Dunlap wrote:
>
> Would your patched code affect the following
> use of regexpr's output as input to substr, to
> pull out the matched text from the string?
>> x<-c("ooo","good food","bad")
>> r<-regexpr("o+", x)
>> substring(x,r,attr(r,"match.length")+r-1)
>[1] "ooo" "oo"  ""   
>   

no; same output

>> substr(x,r,attr(r,"match.length")+r-1)
>[1] "ooo" "oo"  ""   
>   

no; same output

>> r
>[1]  1  2 -1
>attr(,"match.length")
>[1]  3  2 -1
>> attr(r,"match.length")+r-1
>[1]  3  3 -3
>attr(,"match.length")
>[1]  3  2 -1
>   

for the positive indices there is no change, as you might expect.

if i understand your concern, the issue is that regexpr returns -1 (with
the corresponding attribute -1) where there is no match.  in this case,
you expect "" as the substring. 

if there is no match, we have:

start = r = -1 (the start index you provide)
stop = attr(r, "match.length") + r - 1 = -1 + -1 - 1 = -3 (the stop index you provide)

for a string of length n, my patch computes the final indices as follows:

start' = n + start + 1
stop' = n + stop + 1

whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. 
that is, stop' < start', hence an empty string is returned, by virtue of
the original code.  (see the sources for details.)

does this answer your question?

vQ



Re: [Rd] [R] split strings

2009-05-28 Thread William Dunlap


Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-devel-boun...@r-project.org 
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Wacek Kusnierczyk
> Sent: Thursday, May 28, 2009 5:30 AM
> Cc: R help project; r-devel@r-project.org; Allan Engelhardt
> Subject: Re: [Rd] [R] split strings
> 
> (diverted to r-devel, a source code patch attached)
> 
> Wacek Kusnierczyk wrote:
> > Allan Engelhardt wrote:
> >   
> >> Immaterial, yes, but it is always good to test :) and your solution
> >> *is* faster and it is even faster if you can assume byte strings:
> >> 
> >
> > :)
> >
> > indeed;  though if the speed is immaterial (and in this case it
> > supposedly was), it's probably not worth risking fixed=TRUE removing
> > '.tif' from the middle of the name, however unlikely this might be (cf
> > murphy's laws).
> >
> > but if you can assume that each string ends with a '.tif' (or any other
> > \..{3} substring), then substr is marginally faster than sub, even as a
> > three-pass approach, while avoiding the risk of removing '.tif' from the
> > middle:
> >
> > strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
> >     paste(sample(letters, 10), collapse='')))
> > library(rbenchmark)
> > benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
> >     substr={basenames=basename(strings); substr(basenames, 1,
> >         nchar(basenames)-4)},
> >     sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
> > #     test elapsed
> > # 1 substr   3.176
> > # 2    sub   3.296
> >   
> 
> btw., i wonder why negative indices default to 1 in substr:
> 
> substr('foobar', -5, 5)
> # "fooba"
> # substr('foobar', 1, 5)
> substr('foobar', 2, -2)
> # ""
> # substr('foobar', 2, 1)
> 
> this does not seem to be documented in ?substr.

Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1

> there are ways to make negative indices meaningful, e.g., by taking
> them as indexing from behind (as in, e.g., perl):
> 
> # hypothetical
> substr('foobar', -5, 5)
> # "ooba"
> # substr('foobar', 6-5+1, 5)
> substr('foobar', 2, -2)
> # "ooba"
> # substr('foobar', 2, 6-2+1)
> 
> there is a trivial fix to src/main/character.c that gives substr the
> extended functionality -- see the attached patch.  the patch has been
> created and tested as follows:
> 
> svn co https://svn.r-project.org/R/trunk r-devel
> cd r-devel
> # modifications made to src/main/character.c
> svn diff > character.c.diff
> svn revert -R .
> patch -p0 < character.c.diff
>
> ./configure
> make
> make check-all
> # no problems reported
> 
> with the patched substr, the original problem can now be solved more
> concisely, using a two-pass approach, with performance still better than
> the sub/fixed/bytes one, as follows:
> 
> strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
>     paste(sample(letters, 10), collapse='')))
> library(rbenchmark)
> benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>     substr=substr(basename(strings), 1, -5),
>     'substr-nchar'={
>         basenames=basename(strings)
>         substr(basenames, 1, nchar(basenames)-4) },
>     sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
> #           test elapsed
> # 1       substr   2.981
> # 2 substr-nchar   3.206
> # 3          sub   3.273
> 
> if this sounds interesting, i can update the docs accordingly.
> 
> vQ
> 



Re: [Rd] [R] split strings

2009-05-28 Thread Wacek Kusnierczyk
(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:
> Allan Engelhardt wrote:
>   
>> Immaterial, yes, but it is always good to test :) and your solution
>> *is* faster and it is even faster if you can assume byte strings:
>> 
>
> :)
>
> indeed;  though if the speed is immaterial (and in this case it
> supposedly was), it's probably not worth risking fixed=TRUE removing
> '.tif' from the middle of the name, however unlikely this might be (cf
> murphy's laws).
>
> but if you can assume that each string ends with a '.tif' (or any other
> \..{3} substring), then substr is marginally faster than sub, even as a
> three-pass approach, while avoiding the risk of removing '.tif' from the
> middle:
>
> strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
>     paste(sample(letters, 10), collapse='')))
> library(rbenchmark)
> benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>     substr={basenames=basename(strings); substr(basenames, 1,
>         nchar(basenames)-4)},
>     sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
> #     test elapsed
> # 1 substr   3.176
> # 2    sub   3.296
>   

btw., i wonder why negative indices default to 1 in substr:

substr('foobar', -5, 5)
# "fooba"
# substr('foobar', 1, 5)
substr('foobar', 2, -2)
# ""
# substr('foobar', 2, 1)

this does not seem to be documented in ?substr.  there are ways to make
negative indices meaningful, e.g., by taking them as indexing from
behind (as in, e.g., perl):

# hypothetical
substr('foobar', -5, 5)
# "ooba"
# substr('foobar', 6-5+1, 5)
substr('foobar', 2, -2)
# "ooba"
# substr('foobar', 2, 6-2+1)
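
To make the two readings of a negative index concrete, here is a small C sketch of the index adjustment only, with no string handling. `struct span`, `clamp_rule` and `from_end_rule` are illustrative names, not identifiers from R's sources. The first rule mirrors the single line that the attached diff removes (`if (start < 1) start = 1;`); whether the unpatched code adjusts `stop` elsewhere is not visible in the diff context, so the sketch assumes it does not. The second rule mirrors the lines the diff adds.

```c
#include <assert.h>

/* Effective 1-based, inclusive (start, stop) pair after adjustment. */
struct span { int start; int stop; };

/* Current behaviour, as far as the diff context shows: a start
   below 1 becomes 1; stop is passed through unchanged (assumption). */
struct span clamp_rule(int slen, int start, int stop)
{
    struct span s;
    (void) slen;                    /* length plays no role here */
    s.start = start < 1 ? 1 : start;
    s.stop = stop;
    return s;
}

/* Proposed behaviour from the diff: a start of 0 becomes 1, and a
   negative start or stop counts from the end of the string. */
struct span from_end_rule(int slen, int start, int stop)
{
    struct span s;
    if (start == 0)
        s.start = 1;
    else if (start < 0)
        s.start = slen + start + 1;
    else
        s.start = start;
    s.stop = stop < 0 ? slen + stop + 1 : stop;
    return s;
}
```

For 'foobar' (slen = 6), `clamp_rule` maps (-5, 5) to (1, 5) and leaves (2, -2) with start > stop, which gives the empty string. `from_end_rule` maps both calls to (2, 5), i.e. "ooba".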

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch.  the patch has been
created and tested as follows:

svn co https://svn.r-project.org/R/trunk r-devel
cd r-devel
# modifications made to src/main/character.c
svn diff > character.c.diff
svn revert -R .
patch -p0 < character.c.diff
   
./configure
make
make check-all
# no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
    paste(sample(letters, 10), collapse='')))
library(rbenchmark)
benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
    substr=substr(basename(strings), 1, -5),
    'substr-nchar'={
        basenames=basename(strings)
        substr(basenames, 1, nchar(basenames)-4) },
    sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
#           test elapsed
# 1       substr   2.981
# 2 substr-nchar   3.206
# 3          sub   3.273

if this sounds interesting, i can update the docs accordingly.

vQ
Index: src/main/character.c
===================================================================
--- src/main/character.c	(revision 48667)
+++ src/main/character.c	(working copy)
@@ -244,7 +244,12 @@
 	ss = CHAR(el);
 	slen = strlen(ss); /* FIXME -- should handle embedded nuls */
 	buf = R_AllocStringBuffer(slen+1, &cbuff);
-	if (start < 1) start = 1;
+	if (start == 0) 
+		start = 1;
+	else if (start < 0) 
+		start = slen + start + 1;
+	if (stop < 0) 
+		stop = slen + stop + 1;
 	if (start > stop || start > slen) {
 		buf[0] = '\0';
 	} else {
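
For readers who want to try the index rule outside of R, the following is a minimal standalone sketch, not the code in `src/main/character.c`. `substr_from_end` is a made-up name. It handles single-byte strings only, uses `malloc` where R uses `R_AllocStringBuffer`, and adds one guard that the diff above does not contain: after the from-the-end adjustment, a start that is still below 1 (for example start = -10 on a 6-character string) is clamped to 1. That guard is my assumption about a sensible behaviour, not something the patch specifies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of substr with from-the-end indexing (single-byte strings).
   Indices are 1-based and inclusive. Returns a newly allocated
   string that the caller must free, or NULL if allocation fails. */
char *substr_from_end(const char *ss, int start, int stop)
{
    int slen = (int) strlen(ss);
    char *buf = malloc((size_t) slen + 1);
    if (buf == NULL)
        return NULL;

    /* Index adjustment as in the diff. */
    if (start == 0)
        start = 1;
    else if (start < 0)
        start = slen + start + 1;
    if (stop < 0)
        stop = slen + stop + 1;

    /* Extra guard, NOT in the diff (assumption): keep start >= 1
       when a negative start exceeds the string length. */
    if (start < 1)
        start = 1;

    if (start > stop || start > slen) {
        buf[0] = '\0';                 /* empty result */
    } else {
        size_t n;
        if (stop > slen)
            stop = slen;               /* do not read past the end */
        n = (size_t) (stop - start + 1);
        memcpy(buf, ss + start - 1, n);
        buf[n] = '\0';
    }
    return buf;
}
```

With this sketch, `substr_from_end("abcdefghij.tif", 1, -5)` drops the four-character extension, which corresponds to the `substr(basename(strings), 1, -5)` call in the benchmark, and the no-match pair (-1, -3) yields an empty string.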