[ 
https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749083#action_12749083
 ] 

Curt Arnold commented on COUCHDB-345:
-------------------------------------

ISO-8859-1, Cp1252 and Latin-1 are near synonyms for encoding the first 256 
character points in Unicode as single-byte values, and they are incapable of 
representing any other character without some escape mechanism.  Any arbitrary 
sequence of bytes is a valid ISO-8859-1 sequence and can be decoded into a 
sequence of Unicode characters.
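
To illustrate that point (my own sketch, not from the report): Python's 
"latin-1" codec implements ISO-8859-1, and every possible byte value decodes 
successfully under it:

```python
# Every one of the 256 possible byte values is a valid ISO-8859-1
# (a.k.a. latin-1) character, so decoding arbitrary bytes never fails.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

assert len(text) == 256
# Byte 0xD8 maps directly to code point U+00D8 ("Ø"), the character
# used in the issue's test string.
assert text[0xD8] == "\u00d8"
```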

UTF-8 is a variable-byte encoding of the full Unicode character repertoire.  
Character values from \u0000 to \u007F are represented as a single byte, while 
other characters require 2-6 bytes to encode.  Unlike ISO-8859-1, not every 
sequence of bytes is valid and can be converted back to Unicode character 
points.  If I remember correctly, a byte with the high bit set must begin a 
multi-byte sequence whose trailing bytes fall in a restricted continuation 
range, so many runs of high-bit bytes are invalid.  The test data in the last 
two cases are valid ISO-8859-1 sequences, but they cannot be interpreted as 
UTF-8 since they contain byte sequences that cannot be converted back into 
Unicode code points.
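
A short Python sketch (my own illustration, not part of the report) shows the 
asymmetry: the same bytes that decode cleanly as ISO-8859-1 are rejected as 
UTF-8:

```python
# 0xD8 is the byte from the issue's test string. As ISO-8859-1 it is "Ø",
# but as UTF-8 it is a lead byte that must be followed by a continuation
# byte in the range 0x80-0xBF, so 0xD8 followed by an ASCII letter is invalid.
raw = b"\xd8fail"

assert raw.decode("latin-1") == "\u00d8fail"   # always succeeds

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8, as the rest of the stack later discovers")
```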

If it were just an encoding mismatch and the data were being misinterpreted, 
you would lay the blame on the client.  However, in this case, data can go 
into the database that the rest of the stack can't process, since it contains 
invalid sequences.

The RFC mentions the two variants of UTF-16 and UCS-4; however, the ISO-8859-1 
sequences could not be interpreted using any of those encodings, since the 
first two characters of a JSON text must be ASCII.  Only certain sequences of 
bytes can appear at the start of JSON in any of those encodings, and the byte 
sequences sent in the last two cases don't match any of those patterns.  
Sniffing the encoding would work in a similar manner to XML, as described in 
http://www.w3.org/TR/REC-xml/#sec-guessing.
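
Since the first two characters of a JSON text are ASCII, the pattern of NUL 
bytes in the first four bytes identifies the encoding, per RFC 4627 section 3. 
A rough Python sketch of that table (the helper name is my own):

```python
def sniff_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four
    bytes, using the NUL-byte patterns of RFC 4627 section 3."""
    b = data[:4]
    if b[:2] == b"\x00\x00":
        return "utf-32-be"          # 00 00 00 xx
    if b[0:1] == b"\x00":
        return "utf-16-be"          # 00 xx 00 xx
    if b[1:3] == b"\x00\x00":
        return "utf-32-le"          # xx 00 00 00
    if b[1:2] == b"\x00":
        return "utf-16-le"          # xx 00 xx 00
    return "utf-8"                  # no NULs in the first two characters

assert sniff_json_encoding(b'{"a"') == "utf-8"
assert sniff_json_encoding('{"a"'.encode("utf-16-le")) == "utf-16-le"
```

Note that the ISO-8859-1 test data contains no NUL bytes, so a sniffer like 
this would classify it as UTF-8, where it then fails to decode.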

> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
>                 Key: COUCHDB-345
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-345
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badtext.tar.gz, enctest.zip
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value 
> that cannot be retrieved. This results from not escaping a non-ASCII value 
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) 
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail", 
>     "ok": true, 
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs", 
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the 
> bad_utf8_character_code exception thrown by a design document attempting to 
> map() the bad document caused Futon to fail silently in building the view, 
> with no indication (except via debug log) that there was a failure. The log 
> indicated two attempts to build the view, both failing, followed by an 
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not 
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have 
> rejected the PUT/POST, or should have escaped the input itself before the 
> insertion.
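
The escaping the reporter suggests corresponds to what standard JSON 
serializers do when asked for ASCII-safe output; a quick Python illustration 
(mine, not from the report):

```python
import json

# With ensure_ascii=True (json.dumps's default), non-ASCII characters are
# escaped to \uXXXX, so the wire format is pure ASCII and unambiguous in
# any of the encodings the RFC allows.
doc = {"text": "\u00d8"}          # "Ø", the character from the test string
body = json.dumps(doc)

assert body == '{"text": "\\u00d8"}'
body.encode("ascii")              # no non-ASCII bytes on the wire
```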

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
