[ 
https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706689#action_12706689
 ] 

Curt Arnold commented on COUCHDB-345:
-------------------------------------

I'm not a CouchDB or Python developer, but I'm interested in the issue and 
thought I might help clarify what you have encountered.

According to http://www.ietf.org/rfc/rfc4627.txt?number=4627, JSON data must be 
encoded in Unicode and the default encoding is UTF-8.  I'm assuming that is the 
definitive statement on JSON encoding.

It should not be necessary to escape non-US-ASCII characters as long as the 
document is properly encoded in UTF-8, one of the UTF-16s or UCS-4.  What I'm 
guessing is happening in your code fragment is that a Latin-1 (ISO-8859-1) byte 
sequence is sent in the request, which the PUT operation should have rejected 
since it is bad JSON.
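
To illustrate that guess, here is a minimal Python 3 sketch (not the attached 
sample code, which I have not reproduced). The field name "text" and the helper 
is_valid_utf8 are made up for the example. It shows that the single byte D8 is 
fine as Latin-1 but is not valid UTF-8, while the same character encoded as 
UTF-8 needs no \u escaping at all.

```python
import json

# Hypothetical request bodies; the "text" field is illustrative only.
latin1_body = b'{"text": "\xd8"}'                  # Ø as the single Latin-1 byte D8
utf8_body = '{"text": "\u00d8"}'.encode("utf-8")   # Ø as the UTF-8 byte pair C3 98


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_ok = is_valid_utf8(latin1_body)   # D8 starts a 2-byte sequence that never completes
utf8_ok = is_valid_utf8(utf8_body)

# A properly encoded document parses with the raw character in place.
decoded = json.loads(utf8_body.decode("utf-8"))
```

A check along these lines on the server side is the sort of thing that could 
reject the first body at PUT time instead of failing later on retrieval.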

If that interpretation is correct and shooting from the hip, I think:

1. The Python API should have either rejected the standard string or converted 
it to a Unicode string before sending the request.

2. CouchDB should reject any document that is not valid UTF-8, UTF-16BE, 
UTF-16LE or UCS-4.

3. CouchDB should not require non-US-ASCII characters to be escaped.

4. CouchDB should ignore any encoding specified in the header.  Limiting input 
to Unicode encodings means that the encoding can be reliably determined from 
the document itself, whereas an encoding declared in the header may well be 
wrong.
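
As a sketch of why the encoding can be determined without the header: RFC 4627 
section 3 notes that the first two characters of a JSON text are always ASCII, 
so the pattern of zero bytes in the first four octets identifies the encoding. 
The function name detect_json_encoding below is my own invention and this is 
only an illustration in Python 3, not anything CouchDB actually does.

```python
import json


def detect_json_encoding(raw: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets.

    Hypothetical helper following the null-byte patterns in RFC 4627 section 3.
    """
    if len(raw) < 4:
        # A valid JSON text this short can only be UTF-8.
        return "utf-8"
    nulls = tuple(b == 0 for b in raw[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return patterns.get(nulls, "utf-8")           # xx xx xx xx


# Hypothetical document; the "text" field is illustrative only.
doc = '{"text": "\u00d8"}'
encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
detected = [detect_json_encoding(doc.encode(enc)) for enc in encodings]
round_trip = [
    json.loads(doc.encode(enc).decode(detect_json_encoding(doc.encode(enc))))
    for enc in encodings
]
```

Note that this only picks among the Unicode encodings. A Latin-1 body would be 
classified as UTF-8 and then fail to decode, which is the rejection suggested 
in point 2.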




> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
>                 Key: COUCHDB-345
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-345
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badtext.tar.gz
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value 
> that cannot be retrieved. This results from not escaping a non-ASCII value 
> into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) 
> in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail", 
>     "ok": true, 
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs", 
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the 
> bad_utf8_character_code exception thrown by a design document attempting to 
> map() the bad document caused Futon to fail silently in building the view, 
> with no indication (except via debug log) that there was a failure. The log 
> indicated two attempts to build the view, both failing, followed by an 
> uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not 
> handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have 
> rejected the PUT/POST, or should have escaped the input itself before the 
> insertion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
