Re: [R] unexpected behaviour of sub() / usage of regexp
On 09/12/2011 14:49, Jannis wrote:
> Thanks to all who replied. perl = TRUE indeed seems to fix the problem.
> It would be great, however, to prevent others from stumbling into this
> pitfall by fixing the issue, if that is possible. But as Prof. Ripley
> mentioned, fixing this might be difficult or impossible, so we might
> have to live with it.
>
> By the way, is there an easily accessible and searchable list of such
> bugs for R (just for the future)?

http://www.bugs.r-project.org

I'm not sure how obvious it would be that it is the same problem. I happened to have worked on trying to solve it.

> Thanks a lot
> Jannis

-- 
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] unexpected behaviour of sub() / usage of regexp
Thanks to all who replied. perl = TRUE indeed seems to fix the problem. It would be great, however, to prevent others from stumbling into this pitfall by fixing the issue, if that is possible. But as Prof. Ripley mentioned, fixing this might be difficult or impossible, so we might have to live with it.

By the way, is there an easily accessible and searchable list of such bugs for R (just for the future)?

Thanks a lot
Jannis

----- Original Message -----
From: Sarah Goslee
To: Duncan Murdoch
Cc: Jannis; "r-help@r-project.org"
Sent: 15:37 Friday, 9 December 2011
Subject: Re: [R] unexpected behaviour of sub() / usage of regexp

But I do get the incorrect result on R 2.14.0 on Linux:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"

And also:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"
> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"

But:

> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"

So it seems to be something about the way the curly braces are handled, but only with certain groups:

> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"

But, as Prof. Ripley's email suggests, perl=TRUE solves the problem. (I was trying out various combinations when it appeared in my inbox.)

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Sarah Goslee
http://www.functionaldiversity.org
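One way to avoid stumbling into this pitfall in one's own code, as long as the default engine is not fixed, is to route calls through a small wrapper that always selects the PCRE engine. The following is only a minimal sketch: `sub_pcre` is a hypothetical helper name and not part of R, while `perl = TRUE` is the standard `sub()` argument mentioned in this thread.

```r
# Hypothetical convenience wrapper (not part of base R): behaves like sub(),
# but always uses the PCRE engine instead of the default TRE engine.
sub_pcre <- function(pattern, replacement, x, ...) {
  sub(pattern, replacement, x, perl = TRUE, ...)
}

# The pattern from the original report: remove one or two digits.
result <- sub_pcre("[[:digit:]]{1,2}", "", "9ewww")
print(result)
# [1] "ewww"
```

Note that callers of this sketch must not pass `perl` themselves, since the wrapper already sets it.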
Re: [R] unexpected behaviour of sub() / usage of regexp
But I do get the incorrect result on R 2.14.0 on Linux:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"

And also:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"
> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"

But:

> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"

So it seems to be something about the way the curly braces are handled, but only with certain groups:

> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"

But, as Prof. Ripley's email suggests, perl=TRUE solves the problem. (I was trying out various combinations when it appeared in my inbox.)

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Fri, Dec 9, 2011 at 9:25 AM, Duncan Murdoch wrote:
> On 09/12/2011 9:20 AM, Jannis wrote:
>>
>> Dear R users,
>>
>> the way I understand the documentation of sub() and regexp the following
>> code:
>>
>> sub('[[:digit:]]{1,2}', '', '9ewww')
>>
>> ... should yield:
>>
>> 'ewww'
>>
>> It returns, however:
>>
>> 'www'
>>
>> Why is this the case? My code should just substitute 1 (minimum) or up to
>> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
>> misinterpret something here?
>
> I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on
> Windows. So it's not a universal problem...
>
> Duncan Murdoch
>
>> Thanks for any ideas
>> Jannis
>>
>> > sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: i686-pc-linux-gnu (32-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Sarah Goslee
http://www.functionaldiversity.org
Re: [R] unexpected behaviour of sub() / usage of regexp
This is AFAICS an instance of bug PR#14408: it seems that in UTF-8 locales the grammar generated by the TRE engine for repetitions is in odd cases buggy. And as the author has vanished, our hopes of his fixing it are slim.

Try perl=TRUE.

On 09/12/2011 14:20, Jannis wrote:
> Dear R users,
>
> the way I understand the documentation of sub() and regexp the following
> code:
>
> sub('[[:digit:]]{1,2}', '', '9ewww')
>
> ... should yield:
>
> 'ewww'
>
> It returns, however:
>
> 'www'
>
> Why is this the case? My code should just substitute 1 (minimum) or up to
> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
> misinterpret something here?
>
> Thanks for any ideas
> Jannis
>
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
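To see whether a particular installation is affected, the patterns tried in this thread can be run through both the default (TRE) engine and PCRE and the results compared. What follows is a sketch under assumptions: `compare_engines` and `cases` are illustrative names, not part of R. The PCRE column should be correct everywhere; whether the default column differs depends on the R version and locale, so no output is shown for it.

```r
# Illustrative helper (not part of base R): run the same substitution with
# the default engine and with PCRE, and return both results side by side.
compare_engines <- function(pattern, x) {
  data.frame(
    pattern = pattern,
    input   = x,
    default = sub(pattern, "", x),               # default engine (TRE)
    pcre    = sub(pattern, "", x, perl = TRUE),  # PCRE engine
    stringsAsFactors = FALSE
  )
}

# Pattern/input pairs taken from the examples posted in this thread.
cases <- list(
  c("[[:digit:]]{1,2}", "9ewww"),
  c("[[:digit:]]{1,2}", "ewww9"),
  c("\\d{1,2}",         "ewww9"),
  c("\\d",              "ewww9"),
  c("e{1,2}",           "9ewww"),
  c("9{1,2}",           "9ewww")
)

results <- do.call(rbind, lapply(cases, function(p) compare_engines(p[1], p[2])))

# TRUE where both engines agree; FALSE rows point at the suspected bug.
results$agree <- results$default == results$pcre
print(results)
```

Any row with `agree` equal to `FALSE` would indicate a pattern for which `perl = TRUE` is needed on that machine.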
Re: [R] unexpected behaviour of sub() / usage of regexp
On 09/12/2011 9:20 AM, Jannis wrote:
> Dear R users,
>
> the way I understand the documentation of sub() and regexp the following
> code:
>
> sub('[[:digit:]]{1,2}', '', '9ewww')
>
> ... should yield:
>
> 'ewww'
>
> It returns, however:
>
> 'www'
>
> Why is this the case? My code should just substitute 1 (minimum) or up to
> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
> misinterpret something here?

I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on Windows. So it's not a universal problem...

Duncan Murdoch

> Thanks for any ideas
> Jannis
>
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
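Since the result differs between machines, it can help to record which locale a session is actually running in before comparing outputs. The sketch below assumes, as suggested elsewhere in this thread, that the problem is tied to UTF-8 locales; `Sys.getlocale()` and `l10n_info()` are base R functions, and the variable names are illustrative only.

```r
# Report the character-type locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")
cat("LC_CTYPE:", ctype, "\n")

# l10n_info() returns a list whose "UTF-8" element is TRUE in a UTF-8 locale.
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
cat("UTF-8 locale:", is_utf8, "\n")

# Whatever the locale, the PCRE engine gives the expected result here.
cleaned <- sub("[[:digit:]]{1,2}", "", "9ewww", perl = TRUE)
print(cleaned)
# [1] "ewww"
```

Including these two locale lines alongside `sessionInfo()` in a report makes it easier to tell whether two differing results come from differing locales.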