On Jan 5, 2012, at 5:07 PM, Andy Fingerhut wrote:

> I realize that with variable-length multi-byte character encodings like 
> UTF-8, it would be a bad idea to seek to a random byte position and start 
> trying to decode a UTF-8 character starting at that byte position.  I'm 
> thinking of cases where you have an index of byte positions of interest you 
> want to jump to in the future that are known to be the first byte of a 
> character in the appropriate encoding.  I also realize that one must be very 
> cautious in writing to the middle of such a file, since byte lengths of 
> strings are variable.


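I can't help too much, but the comment about UTF-8 rang a bell.  It's actually 
not that hard to find the start of a valid character after jumping to a random 
byte position.  You just need to be able to back up a few bytes (at most three 
in current UTF-8, where a character is never longer than four bytes).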

http://en.wikipedia.org/wiki/UTF-8

>       * All continuation bytes (byte nos. 2-6 in the table above) have 10 as 
> their two most-significant bits (bits 7-6); in contrast, the first byte never 
> has 10 as its two most-significant bits. As a result, it is immediately 
> obvious whether any given byte anywhere in a (valid) UTF-8 stream represents 
> the first byte of a byte sequence corresponding to a single character, or a 
> continuation byte of such a byte sequence.

>       * As a consequence of no. 3 above, starting with any arbitrary byte 
> anywhere in a (valid) UTF-8 stream, it is necessary to back up by only at 
> most five bytes in order to get to the beginning of the byte sequence 
> corresponding to a single character (three bytes in actual UTF-8 as explained 
> in the next section). If it is not possible to back up, or a byte is missing 
> because of e.g. a communication failure, one single character can be 
> discarded, and the next character be correctly read.
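
To make the back-up step concrete, here is a minimal sketch in Python (chosen 
only for brevity; the same bit test works in any language).  The function name 
char_start is made up for this example and is not from any library, and the 
sketch assumes the buffer holds valid UTF-8.

```python
def char_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the character containing buf[pos].

    Hypothetical helper for illustration. Assumes buf is valid UTF-8 and
    0 <= pos < len(buf).
    """
    start = pos
    # Continuation bytes have the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    # Step backwards until we reach a byte that is not a continuation byte.
    while start > 0 and (buf[start] & 0xC0) == 0x80:
        start -= 1
    return start


# 'a' is 1 byte, the euro sign is 3 bytes, and U+1D11E is 4 bytes.
data = "a\u20ac\U0001D11E".encode("utf-8")

# Byte 6 lies in the middle of the last character; back up to its first byte.
start = char_start(data, 6)
print(start, data[start:].decode("utf-8"))
```

With a file you would seek to the byte position, read a few bytes before it, 
and apply the same test.  In valid UTF-8 the loop never steps back more than 
three times.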


-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
