I would guess that the original URLs were encoded somehow (non-ASCII), and the 
person who received them didn't understand how to deal with them either and 
url-encoded them with the thought that they would not lose information that 
way. Unfortunately, they probably lost the meta information as to how they were 
originally encoded, and without that this turns into a detective job that will 
likely need C's ability (perhaps via Rcpp) to ignore type information to put 
things back. If you are lucky all strings were originally encoded the same 
way... if really lucky they were all UTF-8 or UTF-16 (the latter would have nuls 
and other odd bytes). Proceeding with the broken strings you have now will almost 
certainly not work. The fragments shown are not even vaguely recognizable as 
URLs, so I don't see how we can do anything meaningful with them.
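
As a first diagnostic step, it may help to separate strings whose percent 
escapes are well formed from those that are not. In the fragments quoted 
below, "%gI" and "%us" are not valid escapes (a "%" must be followed by two 
hexadecimal digits), and the warning indicates that the resulting out-of-range 
values are coerced to 0, which is why different inputs both end up as a nul. 
The following is only a rough sketch: has_bad_escape() and safe_url_decode() 
are hypothetical helper names, the third test string is invented for 
illustration, and this does not recover the original encoding.

```r
# Hypothetical helper: TRUE where a string contains a "%" that is NOT
# followed by two hexadecimal digits, i.e. a malformed percent escape.
has_bad_escape <- function(x) {
  grepl("%(?![0-9A-Fa-f]{2})", x, perl = TRUE)
}

# Hypothetical helper: decode only the well-formed strings and return NA
# for the rest, so the malformed ones can be inspected separately.
# Note: a well-formed but literal "%00" would still make URLdecode() fail.
safe_url_decode <- function(x) {
  bad <- has_bad_escape(x)
  out <- rep(NA_character_, length(x))
  out[!bad] <- vapply(x[!bad], utils::URLdecode, character(1),
                      USE.NAMES = FALSE)
  out
}

# The first two strings come from the original question; the third is an
# invented, well-formed example for contrast.
agents <- c("0;%20@%gIL", "%20%use", "Mozilla%2F5.0")

print(has_bad_escape(agents))   # flags the first two strings
print(safe_url_decode(agents))  # NA for flagged strings, decoded otherwise
```

Counting how many strings are flagged should give a sense of whether the 
data are uniformly damaged or whether only some records were mangled.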

Please read the Posting Guide. One point made there to note is that if C 
becomes part of the question then R-devel becomes the more appropriate list. 
The other is that for all of these lists plain text email is expected (not 
HTML). 
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On September 1, 2014 9:02:33 AM PDT, Oliver Keyes <oke...@wikimedia.org> wrote:
>Hey all,
>
>So, I'm attempting to decode some (and I don't know why anyone did this)
>URL-encoded user agents. Running URLdecode over them generates the error:
>
>"Error in rawToChar(out) : embedded nul in string"
>
>Okay, so there's an embedded nul - fair enough. Presumably decoding the URL
>is exposing it in a format R doesn't like. Except when I try to dig down
>and work out what an encoded nul looks like, in order to simply remove them
>with something like gsub(), I end up with several different strings, all of
>which apparently resolve to an embedded nul:
>
>> URLdecode("0;%20@%gIL")
>Error in rawToChar(out) : embedded nul in string: '0; @\0L'
>In addition: Warning message:
>In URLdecode("0;%20@%gIL") :
>  out-of-range values treated as 0 in coercion to raw
>> URLdecode("%20%use")
>Error in rawToChar(out) : embedded nul in string: ' \0e'
>In addition: Warning message:
>In URLdecode("%20%use") :
>  out-of-range values treated as 0 in coercion to raw
>
>I'm a relative newb to encodings, so maybe the fault is simply in my
>understanding of how this should work, but - why are both strings being
>read as including nuls, despite having different values? And how would I go
>about removing said nuls?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
