On Thu, Feb 17, 2011 at 17:56, <isaacschlue...@gmail.com> wrote:

> So, \0 is valid UTF-8, but \xC0\x80 is the encoding for the 0 code-point in
> "Modified UTF-8", which is backwards compatible with null-terminated
> c-strings.

Yes, where "Modified UTF-8" is invalid UTF-8 and would probably be mangled by a
validating UTF-8 parser replacing \xC0\x80 by, e.g., U+FFFD (as recommended by
RFC 3629). I.e., it's not something I would recommend. If we ever change our
UTF-8 parsing to be validating (which isn't unthinkable), using overlong
encodings for zero would fail.

> It's interesting to me that v8 *does* terminate on \0 if a length is not
> given, no matter what the bytes are.

That's because the function serves two purposes: it creates strings from either
ASCII or UTF-8 inputs (which is reasonable since ASCII is a subset of UTF-8).
For the common ASCII case using a C string literal, it allows omitting the
length. For UTF-8 input, the length is expected to be given (as per the comment
on the function declaration). So it's just to make it easy to write:

  String::New("my string");

It's a convenience that's not supposed to be used with UTF-8 input.

>> If null-terminated UTF-8 strings are common, it would be better to add a
>> separate function to create JS strings from those.
>
> That would be very agreeable. Would it be better to add an optional flag? It
> seems like it's not much of a change in behavior.

It would change the signature of the existing method, so I think I would prefer
a separate function. They may both delegate to the same implementation with a
flag, though.

>> If you are creating the UTF-8 byte sequences yourself, you are likely to
>> already know where it ends, and it's just a matter of not throwing that
>> information away. If you are using a library that returns null-terminated
>> UTF-8 sequences, then it's obviously not as simple.
>
> Here's a bit more information about a use case that prompted this. (It's
> worse than a library.)
>
> Node's "Buffer" objects represent a block of bytes outside of v8's heap. The
> Buffer::ToString method takes the block of bytes, and passes it to v8's
> String::New method.
> The length of the buffer is known, and is provided to String::New. However,
> the length of the string it contains might *not* be known.

If the code putting data into the buffer is under your control, you should know
the end point, but I can see that that might not always be the case, e.g., in
the case of externally supplied zero-or-length terminated data.

> For example, say that you are writing a client for a binary protocol. Part of
> the message contains a block of data that is 10 bytes long, which contains a
> string that is between 1 and 10 characters, terminated with a \0 byte if it is
> less than 10 bytes long. Anything between the first \0 and the 10th byte is
> considered garbage, and if no \0 byte is found, then the string is 10 bytes
> long.
>
> It makes sense to allocate a 10-byte buffer to read in this block, since you
> know that it will be 10 bytes of the message. It would be disastrous to rely on
> the string ending in \0, since the 11th byte is the start of some *other* part
> of the message (or another Buffer, or maybe just some random spot in memory).
> So, we must provide a length. However, if there are only 5 characters in this
> block, you only really want a 5-byte string.

Yes, I see the problem. If the buffer is full, there is no space for zero at the
end. It's similar to how strncpy works.

If it's really important to avoid running through the data twice (i.e., long
data), then it's probably also important to make the loop as speedy as possible.
In that case we don't want to switch on a flag for each byte read, so we should
have separate loop implementations for the code that checks for zero-bytes and
the one that doesn't.

> The options seem to be:
>
> 1. Scan the bytes twice. Once to get the length, and a second time in v8 to
>    convert them to a string. This seems suboptimal.
> 2. Tell v8 to terminate on \0 bytes in String::New.

I can see a need for the latter, and I would suggest a separate function, e.g.,
String::NewZeroTerminated. I would still recommend against embedding \xC0\x80 or
similar invalid UTF-8 sequences. If you have zero-termination, you don't get to
have zero-characters in the string as well.

However, the way NewStringFromUTF8 currently works, it's going to traverse the
data two or three times anyway: first to check if the string is all ASCII (which
can bail out early but might not), then, if it's not ASCII, to traverse the
UTF-8 to find the length of the resulting string, and then finally to copy the
characters to a newly allocated string. I wouldn't worry about an extra strlen
unless I had profiler data saying that decoding was a bottleneck (and in that
case, I might want to do more than just avoiding one of the loops).

/L
--
Lasse R.H. Nielsen
l...@google.com
'Faith without judgement merely degrades the spirit divine'
Google Denmark ApS - Frederiksborggade 20B, 1 sal - 1360 København K - Denmark -
CVR nr. 28 86 69 84

--
v8-dev mailing list
v8-dev@googlegroups.com
http://groups.google.com/group/v8-dev
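
For illustration, the "extra strlen" in the fixed-size field case discussed in
this thread could be a bounded scan along the lines of the sketch below. The
name BoundedLength is hypothetical and not part of V8 or Node. A plain strlen
would not be safe here, since a completely full field has no terminating \0,
so the sketch uses memchr with the field size as the limit.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper, not part of V8 or Node.
// Returns the number of bytes before the first \0 in a fixed-size field,
// or field_size if the field contains no \0 at all (the field is full).
// It never reads past field + field_size.
std::size_t BoundedLength(const char* field, std::size_t field_size) {
  const void* nul = std::memchr(field, '\0', field_size);
  if (nul == nullptr) {
    return field_size;  // No terminator: the string fills the whole field.
  }
  return static_cast<std::size_t>(static_cast<const char*>(nul) - field);
}
```

The result would then be passed as the explicit length argument to String::New
instead of the full buffer size, so a 10-byte field holding 5 characters
yields a 5-character string and the bytes after the first \0 are ignored.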