[ 
https://issues.apache.org/jira/browse/SOLR-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748228#action_12748228
 ] 

Yonik Seeley commented on SOLR-1091:
------------------------------------

Echoing the param through the Python writer (which escapes characters outside
the ASCII range) shows that the internal UTF-16 string, after decoding the
invalid UTF-8, is
\udbc0\udc78

This surrogate pair represents Unicode code point 1048696 (U+100078), which
encoded as UTF-8 should be
f4 80 81 b8  (4 bytes).

Thus, I'm thinking that it's perhaps a Jetty bug: it may not be able to handle
characters outside the BMP (Basic Multilingual Plane)?
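
The arithmetic above can be checked with a short Python sketch. This is only
an illustration, not code from Solr or Jetty: the helper name
combine_surrogates is hypothetical, and it relies on Python's "surrogatepass"
error handler to decode the 6 bytes from the report the same lenient way the
server apparently did.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The 6 bytes from the report: each surrogate is encoded on its own as 3 bytes,
# which is not valid UTF-8.
raw = b"\xED\xAF\x80\xED\xB1\xB8"

# "surrogatepass" makes Python decode the surrogates instead of rejecting them.
units = raw.decode("utf-8", "surrogatepass")
high, low = (ord(ch) for ch in units)

code_point = combine_surrogates(high, low)

# Proper UTF-8 encoding of the single supplementary code point.
utf8 = chr(code_point).encode("utf-8")

print(hex(high), hex(low))                             # 0xdbc0 0xdc78
print(code_point)                                      # 1048696
print(" ".join(f"{b:02x}" for b in utf8), len(utf8))   # f4 80 81 b8 4
```

The input is 6 bytes while the well-formed UTF-8 form is 4 bytes, which matches
the 4-versus-6 length mismatch described in the issue.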

> "phps" (serialized PHP) writer produces invalid output
> ------------------------------------------------------
>
>                 Key: SOLR-1091
>                 URL: https://issues.apache.org/jira/browse/SOLR-1091
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.3
>         Environment: Sun JRE 1.6.0 on Centos 5
>            Reporter: frank farmer
>            Priority: Minor
>             Fix For: 1.4
>
>
> The serialized PHP output writer can output invalid string lengths for 
> certain (unusual) input values.  Specifically, I had a document containing 
> the following 6 byte character sequence: \xED\xAF\x80\xED\xB1\xB8
> I was able to create a document in the index containing this value without 
> issue; however, when fetching the document back out using the serialized PHP 
> writer, it returns a string like the following:
> s:4:"􀁸";
> Note that the string length specified is 4, while the string is actually 6 
> bytes long.
> When using PHP's native serialize() function, it correctly sets the length to 
> 6:
> # php -r 'var_dump(serialize("\xED\xAF\x80\xED\xB1\xB8"));'
> string(13) "s:6:"􀁸";"
> The "wt=php" writer, which produces output to be parsed with eval(), doesn't 
> have any trouble with this string.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
