[R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Jannis
Dear R users,


As I understand the documentation of sub() and regexp, the following code:



sub('[[:digit:]]{1,2}', '', '9ewww')



... should yield:

'ewww'


It returns, however:

'www'


Why is this the case? My pattern should replace only one (minimum) to two 
(maximum) digits, and not the 'e' in the string. Am I misinterpreting 
something here?
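
To make the expectation concrete, here is a minimal sketch that extracts what the pattern itself matches, instead of looking at what sub() leaves behind. The extra inputs '12ewww' and '123ewww' are illustrative and not from the original post, and perl = TRUE is an assumption made so that the result does not depend on the default regex engine.

```r
# Pattern from the question: one or two consecutive digits.
pattern <- "[[:digit:]]{1,2}"

# '9ewww' is the original input; the other two are illustrative additions.
inputs <- c("9ewww", "12ewww", "123ewww")

# regexpr() locates the first match in each string and regmatches()
# extracts it. perl = TRUE (an assumption here) selects the PCRE engine.
matched <- regmatches(inputs, regexpr(pattern, inputs, perl = TRUE))
print(matched)
```

Under that reading of {1,2}, the match is at most two digits long and never includes the following letter.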


Thanks for any ideas
Jannis


> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Duncan Murdoch

I get your expected output of 'ewww' running R 2.14.0 or 2.14.0-patched on 
Windows, so it's not a universal problem...


Duncan Murdoch




Re: [R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Prof Brian Ripley
This is, as far as I can see, an instance of bug PR#14408: in UTF-8 
locales, the grammar that the TRE engine generates for repetitions is 
buggy in some odd cases.  And as the author has vanished, our hopes of his 
fixing it are slim.


Try perl=TRUE.
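
As a minimal sketch of that workaround, the call from the original question can be rerun with perl = TRUE, which makes sub() use the PCRE engine instead of TRE. The variable names are illustrative only.

```r
# Input and pattern from the original question.
x <- "9ewww"
pattern <- "[[:digit:]]{1,2}"

# perl = TRUE switches sub() from the default TRE engine to PCRE,
# which sidesteps the repetition bug described above.
res_perl <- sub(pattern, "", x, perl = TRUE)
print(res_perl)  # "ewww"

# The same argument is accepted by gsub(), regexpr(), grepl() and friends.
res_all <- gsub(pattern, "", "9ewww42", perl = TRUE)
print(res_all)   # "ewww"
```

The trade-off is that perl = TRUE has to be passed on every call, since it is not the default.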




--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Sarah Goslee
But I do get the incorrect result on R 2.14.0 on Linux:
> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"

And also:

> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"

But:
> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"

So it seems to be something about the way the curly braces are
handled, but only with certain patterns: character classes such as
[[:digit:]] and \\d are affected, while literal characters are not:

> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"


But, as Prof. Ripley's email suggests, perl=TRUE solves the problem.
(I was trying out various combinations when it appeared in my inbox.)
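
The combinations above can be checked in one go with a small comparison table. This is only a sketch: the data frame layout and column names are made up for illustration, and the values in the tre column depend on the R version, platform and locale, so no output is claimed for them here.

```r
# Patterns and inputs taken from the experiments above.
cases <- data.frame(
  pattern = c("[[:digit:]]{1,2}", "[[:digit:]]{1,2}", "\\d{1,2}",
              "\\d", "\\d*", "e{1,2}", "9{1,2}"),
  input   = c("9ewww", "ewww9", "ewww9",
              "ewww9", "9ewww", "9ewww", "9ewww"),
  stringsAsFactors = FALSE
)

# sub() is not vectorised over 'pattern', so apply it pair by pair.
run_sub <- function(use_perl) {
  mapply(function(p, s) sub(p, "", s, perl = use_perl),
         cases$pattern, cases$input, USE.NAMES = FALSE)
}

cases$tre   <- run_sub(FALSE)  # default engine; results may differ by setup
cases$pcre  <- run_sub(TRUE)   # PCRE engine
cases$agree <- cases$tre == cases$pcre
print(cases)
```

Rows where agree is FALSE point at patterns for which the two engines disagree on the current system.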

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base




-- 
Sarah Goslee
http://www.functionaldiversity.org



Re: [R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Jannis
Thanks to all who replied. perl = TRUE indeed seems to fix the problem. It 
would be great, however, to keep others from stumbling into this pitfall by 
fixing the issue, if that is possible. But as Prof. Ripley mentioned, fixing 
this might be difficult or impossible, so we might have to live with it.
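
Until a fix exists, one way to avoid the pitfall in one's own code might be a thin wrapper that always uses the PCRE engine. This is a sketch only: sub_pcre() is a hypothetical helper name, not a function in base R.

```r
# Hypothetical helper (not part of base R): sub() with perl = TRUE fixed.
# Other arguments such as ignore.case are passed through via '...'.
sub_pcre <- function(pattern, replacement, x, ...) {
  sub(pattern, replacement, x, ..., perl = TRUE)
}

# Both inputs discussed in this thread.
cleaned <- sub_pcre("[[:digit:]]{1,2}", "", c("9ewww", "ewww9"))
print(cleaned)  # "ewww" "ewww"
```

Because perl = TRUE is hard-wired, passing perl again through '...' raises an error, which here serves as a guard against switching back by accident.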


By the way, is there an easily accessible and searchable list of such bugs for 
R (just for the future)?


Thanks a lot
Jannis





Re: [R] unexpected behaviour of sub() / usage of regexp

2011-12-09 Thread Prof Brian Ripley


http://www.bugs.r-project.org

I'm not sure how obvious it would be from the bug report that it is the 
same problem.  I happened to have worked on trying to solve it.



