RE: IndexableBinaryStringTools (was FieldCache)

Steven A Rowe Sat, 13 Nov 2010 10:51:25 -0800

Hi Mathias,

> > > I assume that the char[] returned from
> > > IndexableBinaryStringTools.encode is encoded in UTF-8 again
> > > and then stored. At some point the information is lost and
> > > cannot be recovered.
> >
> > Can you give an example?  This should not happen.
> 
> My character array returned by IndexableBinaryStringTools.encode looks
> like the following:
> 
> char[] encoded = new char[] {0, 8508, 3392, 64, 0, 8, 0, 0};
[...]
> BTW: I've tested it with EmbeddedSolrServer and Solr/Lucene trunk.
> 
> Why has the string representation changed? From the changed string I
> cannot decode the correct ID.


Looks to me like the returned value is in a Solr-internal form of XML character 
escaping: \u0000 is represented as "#0;" and \u0008 is represented as "#8;".  
(The escaping code is in solr/src/common/org/apache/solr/common/util/XML.java.)

You can get the value back in its original binary form by unescaping the 
/#[0-9]+;/ format.  Here is a test illustrating this fix that I added to 
SolrExampleTests, then ran from SolrExampleEmbeddedTest:

==============
  @Test
  public void testIndexableBinary() throws Exception {
    // Empty the database...
    server.deleteByQuery( "*:*" );// delete everything!
    server.commit();
    assertNumFound( "*:*", 0 ); // make sure the index is empty
 
    byte[] binary = new byte[] 
      { (byte)0, (byte)0, (byte)0x84, (byte)0xF0, (byte)0x6A, (byte)0, 
        (byte)4, (byte)0, (byte)0,    (byte)0,    (byte)2,    (byte)0 };
    int encodedLen = IndexableBinaryStringTools.getEncodedLength
      (binary, 0, binary.length);
    char encoded[] = new char[encodedLen];
    IndexableBinaryStringTools.encode
      (binary, 0, binary.length, encoded, 0, encoded.length);
    final String encodedString = new String(encoded);
    log.info("Encoded: " + stringToIntSequence(encodedString));
    // Expected encoded: {         0, 8508, 3392,   64,    0,    8,    0,    0 }
    String expectedEncoded = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
    assertEquals(stringToIntSequence(expectedEncoded),
                 stringToIntSequence(encodedString));
      
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", encodedString);
    server.add(doc);
    server.commit();
        
    SolrQuery query = new SolrQuery();
    query.setQuery("*:*");
    QueryResponse rsp = server.query(query);
    SolrDocument retrievedDoc = rsp.getResults().get(0);
    String retrievedEncoded = (String)retrievedDoc.getFieldValue("id");
    String unescapedRetrievedEncoded =
      unescapeSolrXMLEscaping(retrievedEncoded);
    assertEquals(stringToIntSequence(encodedString), 
                 stringToIntSequence(unescapedRetrievedEncoded));
  }
    
  String stringToIntSequence(String str) {
    StringBuilder builder = new StringBuilder();
    for (int chnum = 0 ; chnum < str.length() ; ++chnum) {
      if (chnum > 0) {
        builder.append(", ");
      }
      builder.append((int)str.charAt(chnum))
        .append(" (").append(str.charAt(chnum)).append(")");
    }
    return builder.toString();
  }
  String unescapeSolrXMLEscaping(String escaped) {
    StringBuffer unescaped = new StringBuffer();
    Matcher matcher = Pattern.compile("#(\\d+);").matcher(escaped);
    while (matcher.find()) {
      char ch = (char)Integer.parseInt(matcher.group(1));
      // quoteReplacement: '$' and '\' are special in replacement strings
      matcher.appendReplacement
        (unescaped, Matcher.quoteReplacement(String.valueOf(ch)));
    }
    matcher.appendTail(unescaped);
    return unescaped.toString();
  }
==============
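
If you want to check the round trip without a running Solr instance, here is a minimal standalone sketch. The class name SolrEscapeSketch and its escape() method are hypothetical stand-ins, not Solr code: escape() assumes that control characters below U+0020, other than tab, line feed and carriage return, are written as "#N;" with N in decimal, which matches the output you reported but is not copied from XML.java. unescape() is the same idea as the helper in the test above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in class, not part of Solr or Lucene.
class SolrEscapeSketch {
  private static final Pattern ESCAPED = Pattern.compile("#(\\d+);");

  // Assumed escaping rule: control characters below U+0020, other than
  // tab, line feed and carriage return, become "#<decimal code>;".
  static String escape(String raw) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < raw.length(); i++) {
      char ch = raw.charAt(i);
      if (ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r') {
        out.append('#').append((int) ch).append(';');
      } else {
        out.append(ch);
      }
    }
    return out.toString();
  }

  // Turns every "#<digits>;" sequence back into the character it encodes.
  static String unescape(String escaped) {
    StringBuffer out = new StringBuffer();
    Matcher matcher = ESCAPED.matcher(escaped);
    while (matcher.find()) {
      char ch = (char) Integer.parseInt(matcher.group(1));
      // quoteReplacement keeps '$' and '\' from being read as group references
      matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(ch)));
    }
    matcher.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // Same char sequence as in the test: 0, 8508, 3392, 64, 0, 8, 0, 0
    String original = "\u0000\u213C\u0D40\u0040\u0000\u0008\u0000\u0000";
    String escaped = escape(original);
    System.out.println(escaped.equals("#0;\u213C\u0D40@#0;#8;#0;#0;"));
    System.out.println(unescape(escaped).equals(original));
  }
}
```

One caveat with this approach: the unescaping cannot tell an escaped control character from a literal "#8;" that was already present in the original value, so it is only safe when the stored values never contain that pattern themselves.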

Steve
