Re: [R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)
Yep. You are right. That is better.

-tgs

On Thu, May 29, 2014 at 5:23 PM, Ista Zahn wrote:

> Why not just set the encoding after the fact, as I suggested?
>
> string1 <- "X"; string1
> new_string1 <- gsub("X","\u2265",string1); new_string1
> Encoding(new_string1) <- "UTF-8"; new_string1
>
> Best,
> Ista

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)
Hi Thomas,

On Thu, May 29, 2014 at 9:15 AM, Thomas Stewart wrote:
> Thanks to Ista Zahn, I was able to find a workaround. The key
> seems to be that string1 needs to be encoded as UTF-8 prior to being passed
> to gsub. For whatever reason,
>
> Encoding(string1) <- "UTF-8"
>
> does not change the encoding on my Windows machine.

Right, because "ASCII strings will never be marked with a declared
encoding" (read ?Encoding again).

> The workaround: I
> paste an obvious UTF-8 character "\u00A0" to the start of the string, send
> the string through gsub, then remove the "\u00A0" character from the output.
>
> string1 <- "\u00A0text X"; string1
> Encoding(string1)
> new_string1 <- gsub("X","\u2265",string1); new_string1
> new_string2 <- substring(new_string1,2); new_string2
>
> If you know of a less hackish way to accomplish this, I'm interested to
> hear it.

Why not just set the encoding after the fact, as I suggested?

string1 <- "X"; string1
new_string1 <- gsub("X","\u2265",string1); new_string1
Encoding(new_string1) <- "UTF-8"; new_string1

Best,
Ista

> However, this workaround is sufficient for now.
>
> Thanks,
> -tgs
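The suggestion above (run gsub first, then declare the result's encoding) can be wrapped in a small helper. This is only a sketch: `gsub_utf8` is a hypothetical name, not a base R function, and it assumes the bytes that gsub returns really are UTF-8, as the charToRaw() output in this thread shows for this case. Declaring UTF-8 on a string whose bytes are in some other encoding would mislabel it.

```r
# Hypothetical helper, not part of base R: run gsub(), then declare the
# result to be UTF-8, following the "set the encoding after the fact" advice.
# Assumption: the result bytes are valid UTF-8, which holds when the input
# is ASCII and the replacement is written as a \u escape.
gsub_utf8 <- function(pattern, replacement, x, ...) {
  out <- gsub(pattern, replacement, x, ...)
  Encoding(out) <- "UTF-8"  # has no effect on elements that are pure ASCII
  out
}

res <- gsub_utf8("X", "\u2265", "text X")
Encoding(res)  # "UTF-8"
```

Because the mark is applied to the output, the input string does not need the "\u00A0" prefix trick.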
Re: [R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)
Thanks to Ista Zahn, I was able to find a workaround. The key seems to be that string1 needs to be encoded as UTF-8 prior to being passed to gsub. For whatever reason,

Encoding(string1) <- "UTF-8"

does not change the encoding on my Windows machine.

The workaround: I paste an obvious UTF-8 character "\u00A0" to the start of the string, send the string through gsub, then remove the "\u00A0" character from the output.

string1 <- "\u00A0text X"; string1
Encoding(string1)
new_string1 <- gsub("X","\u2265",string1); new_string1
new_string2 <- substring(new_string1,2); new_string2

If you know of a less hackish way to accomplish this, I'm interested to hear it. However, this workaround is sufficient for now.

Thanks,
-tgs
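The observation that `Encoding(string1) <- "UTF-8"` changes nothing can be checked directly. The sketch below uses illustrative variable names (`ascii`, `marked`) that do not appear in the thread. It shows that an ASCII-only string stays "unknown", while a string containing a non-ASCII \u escape is already marked UTF-8, which is why the "\u00A0" prefix has the effect described.

```r
# An ASCII-only string: declaring an encoding has no effect.
ascii <- "text X"
Encoding(ascii) <- "UTF-8"
Encoding(ascii)   # "unknown"

# A string written with a non-ASCII \u escape is marked UTF-8 by the parser.
marked <- "\u00A0text X"
Encoding(marked)  # "UTF-8"
```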
Re: [R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)
On May 28, 2014, at 7:25 PM, Thomas Stewart wrote:

> Can anyone help me understand the following behavior?
>
> I want to replace the letter 'X' in the string 'text X' with '≥' (\u2265).
> The output from gsub is not what I expect. It gives: "text â‰¥".
>
> Now, suppose I want to replace the character '≤' in the string
> 'text ≤' with '≥'. Then, gsub gives the expected, desired output.
>
> What am I missing?
>
> Thanks for any insight.
> -tgs
>
> Minimal Working Example:
>
> string1 <- "text X"; string1
> new_string1 <- gsub("X","\u2265",string1); new_string1

Try this instead:

> new_string1 <- gsub("X","\\\u2265",string1); new_string1
[1] "text ≥"

Each "\" needs to be escaped, both the "\" in \u2265 as well as the "\" that escapes it.

> nchar("\\")
[1] 1
> nchar("\\\u2265")
[1] 2

You would be well-served by spending effort at reading:

?Quotes

--
David.

>
> string2 <- "text \u2264"; string2
> new_string2 <- gsub("\u2264","\u2265",string2); new_string2
>
> charToRaw(new_string1)
> charToRaw(new_string2)
>
> sessionInfo()
>
> ## OUTPUT
>
>> string1 <- "text X"; string1
> [1] "text X"
>
>> new_string1 <- gsub("X","\u2265",string1); new_string1
> [1] "text â‰¥"
>
>> string2 <- "text \u2264"; string2
> [1] "text ≤"
>
>> new_string2 <- gsub("\u2264","\u2265",string2); new_string2
> [1] "text ≥"
>
>> charToRaw(new_string1)
> [1] 74 65 78 74 20 e2 89 a5

> charToRaw("\\\u2265")
[1] 5c e2 89 a5

>
>> charToRaw(new_string2)
> [1] 74 65 78 74 20 e2 89 a5
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>

It was a good idea to post sessionInfo(), but it would have been even better to have posted in plain text.

> [[alternative HTML version deleted]]
>

--
David Winsemius
Alameda, CA, USA
[R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)
Can anyone help me understand the following behavior?

I want to replace the letter 'X' in the string 'text X' with '≥' (\u2265). The output from gsub is not what I expect. It gives: "text â‰¥".

Now, suppose I want to replace the character '≤' in the string 'text ≤' with '≥'. Then, gsub gives the expected, desired output.

What am I missing?

Thanks for any insight.
-tgs

Minimal Working Example:

string1 <- "text X"; string1
new_string1 <- gsub("X","\u2265",string1); new_string1

string2 <- "text \u2264"; string2
new_string2 <- gsub("\u2264","\u2265",string2); new_string2

charToRaw(new_string1)
charToRaw(new_string2)

sessionInfo()

## OUTPUT

> string1 <- "text X"; string1
[1] "text X"

> new_string1 <- gsub("X","\u2265",string1); new_string1
[1] "text â‰¥"

> string2 <- "text \u2264"; string2
[1] "text ≤"

> new_string2 <- gsub("\u2264","\u2265",string2); new_string2
[1] "text ≥"

> charToRaw(new_string1)
[1] 74 65 78 74 20 e2 89 a5

> charToRaw(new_string2)
[1] 74 65 78 74 20 e2 89 a5

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C  LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.0.2

[[alternative HTML version deleted]]
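The two charToRaw() results in the post are byte-for-byte identical, so the difference between the two printed strings lies in how those bytes are interpreted, not in the bytes themselves. The sketch below is an illustration and not part of the original post: it takes the three trailing bytes and reads them once as UTF-8 and once as Windows-1252, the code page named in the sessionInfo() output. The variable names are made up, and the encoding name "CP1252" is assumed to be accepted by the platform's iconv.

```r
# The three bytes that charToRaw() reports after "text ".
bytes <- as.raw(c(0xe2, 0x89, 0xa5))

# Read as UTF-8, they form one character, U+2265.
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"
utf8ToInt(as_utf8)    # 8805, i.e. 0x2265

# Read as Windows-1252 (one byte per character), they form three characters.
as_cp1252 <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")
utf8ToInt(as_cp1252)  # 226 8240 165, i.e. U+00E2 U+2030 U+00A5
```

The second reading is the "â‰¥" sequence shown for new_string1, which is consistent with that result carrying no UTF-8 mark and being printed in the native code page.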