On Thu, Feb 17, 2011 at 17:56, <isaacschlue...@gmail.com> wrote:

> So, \0 is valid UTF-8, but \xC0\x80 is the encoding for the 0 code-point in
> "Modified UTF-8", which is backwards compatible with null-terminated
> c-strings.
>

Yes, where "Modified UTF-8" is invalid UTF-8 and would probably be mangled
by a validating UTF-8 parser replacing \xC0\x80 by, e.g., U+FFFD (as
recommended by RFC 3629).
I.e., it's not something I would recommend. If we ever change our UTF-8
parsing to be validating (which isn't unthinkable), using overlong encodings
for zero would fail.
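
To make the failure mode concrete, here is a minimal sketch of a validating
decoder. DecodeUtf8Strict is a hypothetical helper, not V8 code, and emitting
one U+FFFD per rejected byte is just one possible replacement policy. It
rejects overlong forms by checking the decoded value against the smallest code
point allowed for the sequence length, so \xC0\x80 never decodes to U+0000.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of V8): strict UTF-8 decoding that emits
// U+FFFD for every byte that does not start a well-formed sequence.
std::vector<uint32_t> DecodeUtf8Strict(const unsigned char* data, size_t length) {
  assert(data != nullptr || length == 0);
  const uint32_t kReplacement = 0xFFFD;
  std::vector<uint32_t> out;
  size_t i = 0;
  while (i < length) {
    const unsigned char lead = data[i];
    if (lead < 0x80) {  // plain ASCII, including a literal zero byte
      out.push_back(lead);
      ++i;
      continue;
    }
    size_t extra = 0;        // continuation bytes expected after the lead
    uint32_t min_value = 0;  // smallest code point legal for this length
    uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) {
      extra = 1; min_value = 0x80; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      extra = 2; min_value = 0x800; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      extra = 3; min_value = 0x10000; cp = lead & 0x07;
    } else {  // stray continuation byte or invalid lead byte
      out.push_back(kReplacement);
      ++i;
      continue;
    }
    bool ok = (i + extra < length);  // the whole sequence must fit
    for (size_t k = 1; ok && k <= extra; ++k) {
      const unsigned char c = data[i + k];
      if ((c & 0xC0) != 0x80) {
        ok = false;
      } else {
        cp = (cp << 6) | (c & 0x3F);
      }
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (ok && (cp < min_value || cp > 0x10FFFF ||
               (cp >= 0xD800 && cp <= 0xDFFF))) {
      ok = false;
    }
    if (ok) {
      out.push_back(cp);
      i += extra + 1;
    } else {
      out.push_back(kReplacement);
      ++i;
    }
  }
  return out;
}
```

With this policy, the bytes 61 C0 80 62 decode to 'a', U+FFFD, U+FFFD, 'b',
while a plain 00 byte decodes to U+0000 as usual.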


> It's interesting to me that v8 *does* terminate on \0 if a length is not
> given, no matter what the bytes are.


That's because the function serves two purposes: It creates strings from
either ASCII or UTF-8 inputs (which is reasonable since ASCII is a subset of
UTF-8). For the common ASCII case using a C string literal, it allows
omitting the length. For UTF-8 input, the length is expected to be given (as
per the comment on the function declaration).
So it's just to make it easy to write:
   String::New("my string");
It's a convenience that's not supposed to be used with UTF-8 input.

>> If null-terminated UTF-8 strings are common, it would be better to add a
>> separate function to create JS strings from those.
>
> That would be very agreeable.  Would it be better to add an optional flag?
> It seems like it's not much of a change in behavior.


It would change the signature of the existing method, so I think I would
prefer a separate function. They may both delegate to the same
implementation with a flag, though.


>> If you are creating the UTF-8 byte sequences yourself, you are likely to
>> already know where it ends, and it's just a matter of not throwing that
>> information away. If you are using a library that returns null-terminated
>> UTF-8 sequences, then it's obviously not as simple.
>
> Here's a bit more information about a use case that prompted this.  (It's
> worse than a library.)
>
> Node's "Buffer" objects represent a block of bytes outside of v8's heap.
> The Buffer::ToString method takes the block of bytes, and passes it to v8's
> String::New method.  The length of the buffer is known, and is provided to
> String::New.  However, the length of the string it contains might *not* be
> known.
>

If the code putting data into the buffer is under your control, you should
know the end point, but I can see that that might not always be the case,
e.g., for externally supplied data that is terminated by either a zero byte
or a length limit.


> For example, say that you are writing a client for a binary protocol.  Part
> of the message contains a block of data that is 10 bytes long, which
> contains a string that is between 1 and 10 characters, terminated with a \0
> byte if it is less than 10 bytes long.  Anything between the first \0 and
> the 10th byte is considered garbage, and if no \0 byte is found, then the
> string is 10 bytes long.
>
> It makes sense to allocate a 10-byte buffer to read in this block, since
> you know that it will be 10 bytes of the message.  It would be disastrous
> to rely on the string ending in \0, since the 11th byte is the start of
> some *other* part of the message (or another Buffer, or maybe just some
> random spot in memory).  So, we must provide a length.  However, if there
> are only 5 characters in this block, you only really want a 5-byte string.
>

Yes, I see the problem. If the buffer is full, there is no space for zero at
the end. It's similar to how strncpy works.

If it's really important to avoid running through the data twice (i.e., long
data), then it's probably also important to make the loop as speedy as
possible. In that case we don't want to switch on a flag for each byte read,
so we should have separate loop implementations for the code that checks for
zero-bytes and the one that doesn't.


> The options seem to be:
>
> 1. Scan the bytes twice.  Once to get the length, and a second time in v8
>    to convert them to a string.  This seems suboptimal.
> 2. Tell v8 to terminate on \0 bytes in String::New.


I can see a need for the latter, and I would suggest a separate function,
e.g., String::NewZeroTerminated.
I would still recommend against embedding \xC0\x80 or similar invalid UTF-8
sequences. If you have zero-termination, you don't get to have
zero-characters in the string as well.

However, the way NewStringFromUTF8 currently works, it's going to
traverse the data two or three times anyway - first to check if the string
is all ASCII (which can bail out early but might not), then, if it's not
ASCII, traverse the UTF-8 to find the length of the resulting string, and
then finally to copy the characters to a newly allocated string.
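
The first two traversals described above can be sketched roughly as follows.
This is an illustration with hypothetical function names, not V8's actual
code, and the length pass assumes the input is already well-formed UTF-8. The
third traversal, the copy into the newly allocated string, is omitted.

```cpp
#include <cassert>
#include <cstddef>

// Pass 1 (hypothetical): true if every byte is 7-bit ASCII. It bails out at
// the first non-ASCII byte, but scans everything for pure ASCII input.
bool IsAllAscii(const unsigned char* data, size_t length) {
  assert(data != nullptr || length == 0);
  for (size_t i = 0; i < length; ++i) {
    if (data[i] >= 0x80) return false;
  }
  return true;
}

// Pass 2 (hypothetical): number of UTF-16 code units the decoded string
// needs, assuming well-formed UTF-8 input.
size_t Utf16Length(const unsigned char* data, size_t length) {
  assert(data != nullptr || length == 0);
  size_t units = 0;
  for (size_t i = 0; i < length; ++i) {
    if ((data[i] & 0xC0) != 0x80) ++units;  // each non-continuation byte starts a character
    if (data[i] >= 0xF0) ++units;           // 4-byte sequences need a surrogate pair
  }
  return units;
}
```

Only after both passes is the size of the result known, which is why the copy
ends up as a separate, final loop over the same bytes.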

I wouldn't worry about an extra strlen unless I had profiler data saying
that decoding was a bottleneck (and in that case, I might want to do more
than just avoiding one of the loops).
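
For reference, the "extra strlen" for a fixed-size field is a single bounded
scan. A minimal sketch, where FieldStringLength is a hypothetical helper and
not an existing V8 or Node function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: length of the string stored in a fixed-width field.
// The string ends at the first zero byte, or fills the whole field if there
// is none. Never reads past field_size bytes.
size_t FieldStringLength(const char* field, size_t field_size) {
  assert(field != nullptr);
  const void* nul = std::memchr(field, '\0', field_size);
  if (nul == nullptr) return field_size;
  return static_cast<size_t>(static_cast<const char*>(nul) - field);
}
```

The result can then be passed as the length argument to String::New, so the
existing function never sees the garbage after the zero byte.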

/L
-- 
Lasse R.H. Nielsen
l...@google.com
'Faith without judgement merely degrades the spirit divine'
Google Denmark ApS - Frederiksborggade 20B, 1 sal - 1360 København K -
Denmark - CVR nr. 28 86 69 84

-- 
v8-dev mailing list
v8-dev@googlegroups.com
http://groups.google.com/group/v8-dev
