Hi Mathias,

> > > I assume that the char[] returned from
> > > IndexableBinaryStringTools.encode is encoded in UTF-8 again
> > > and then stored. At some point the information is lost and
> > > cannot be recovered.
> >
> > Can you give an example? This should not happen.
>
> My character array returned by IndexableBinaryStringTools.encode looks
> like following:
>
> char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};
[...]
> BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.
>
> Why has the string representation changed? From the changed string I
> cannot decode the correct ID.

Looks to me like the returned value is in a Solr-internal form of XML
character escaping: \u0000 is represented as "#0;" and \u0008 is
represented as "#8;". (The escaping code is in
solr/src/java/org/apache/common/util/XML.java.)

You can get the value back in its original binary form by unescaping
the /#[0-9]+;/ format. Here is a test illustrating this fix that I
added to SolrExampleTests, then ran from SolrExampleEmbeddedTest:

==============
  @Test
  public void testIndexableBinary() throws Exception {
    // Empty the database...
    server.deleteByQuery( "*:*" ); // delete everything!
    server.commit();
    assertNumFound( "*:*", 0 ); // make sure the index is empty

    byte[] binary = new byte[] {
      (byte)0, (byte)0, (byte)0x84, (byte)0xF0, (byte)0x6A, (byte)0,
      (byte)4, (byte)0, (byte)0, (byte)0, (byte)2, (byte)0 };
    int encodedLen = IndexableBinaryStringTools.getEncodedLength
      (binary, 0, binary.length);
    char encoded[] = new char[encodedLen];
    IndexableBinaryStringTools.encode
      (binary, 0, binary.length, encoded, 0, encoded.length);
    final String encodedString = new String(encoded);
    log.info("Encoded: " + stringToIntSequence(encodedString));

    // Expected encoded: { 0, 8508, 3392, 64, 0, 8, 0, 0 }
    String expectedEncoded
      = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
    assertEquals(stringToIntSequence(expectedEncoded),
                 stringToIntSequence(encodedString));

    // Index the encoded string as the document ID
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", encodedString);
    server.add(doc);
    server.commit();

    // Retrieve it and undo the "#N;" escaping before comparing
    SolrQuery query = new SolrQuery();
    query.setQuery("*:*");
    QueryResponse rsp = server.query(query);
    SolrDocument retrievedDoc = rsp.getResults().get(0);
    String retrievedEncoded = (String)retrievedDoc.getFieldValue("id");
    String unescapedRetrievedEncoded
      = unescapeSolrXMLEscaping(retrievedEncoded);
    assertEquals(stringToIntSequence(encodedString),
                 stringToIntSequence(unescapedRetrievedEncoded));
  }

  // Renders each char as "code (char)", comma-separated, for readable diffs
  String stringToIntSequence(String str) {
    StringBuilder builder = new StringBuilder();
    for (int chnum = 0 ; chnum < str.length() ; ++chnum) {
      if (chnum > 0) {
        builder.append(", ");
      }
      builder.append((int)str.charAt(chnum))
             .append(" (").append(str.charAt(chnum)).append(")");
    }
    return builder.toString();
  }

  // Needs java.util.regex.Matcher and java.util.regex.Pattern
  String unescapeSolrXMLEscaping(String escaped) {
    StringBuffer unescaped = new StringBuffer();
    Matcher matcher = Pattern.compile("#(\\d+);").matcher(escaped);
    while (matcher.find()) {
      String replacement = String.format
        ("%c",(char)Integer.parseInt(matcher.group(1)));
      // quoteReplacement: '\' and '$' are special in replacement strings
      matcher.appendReplacement
        (unescaped, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(unescaped);
    return unescaped.toString();
  }
==============

Steve
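
P.S. In case it helps to try the unescaping without a Solr instance, here is a minimal standalone sketch of the round trip. The class name EscapeRoundTrip and the escapeControlChars method are hypothetical: escapeControlChars only mimics the "#N;" form observed above by assuming that characters below U+0020 are the ones escaped, and it is not Solr's actual escaping code. Note also that a value which legitimately contains text like "#12;" would be altered by the unescaping, so treat this as a workaround rather than a general decoder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeRoundTrip {
    private static final Pattern ESCAPED = Pattern.compile("#(\\d+);");

    // Hypothetical stand-in for the escaping step (an assumption, not Solr's
    // code): every character below U+0020 is written as "#N;".
    public static String escapeControlChars(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < 0x20) {
                out.append('#').append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Replaces each "#N;" sequence with the character whose code is N.
    public static String unescape(String escaped) {
        StringBuffer out = new StringBuffer();
        Matcher m = ESCAPED.matcher(escaped);
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1)));
            // Quote the replacement so '$' and '\' are taken literally.
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Same eight chars as the encoded ID discussed in this thread.
        String original = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
        String escaped = escapeControlChars(original);
        String restored = unescape(escaped);
        System.out.println("round trip ok: " + original.equals(restored));
    }
}
```

The unescape method is the same idea as unescapeSolrXMLEscaping in the test; the pattern is compiled once and the replacement is quoted so that decoded '$' or '\' characters cannot be misread as group references.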