> 1. Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47

Latest SolrJ that I downloaded a couple of days ago.

> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?

We are running an instance of MediaWiki, so the text goes through a
couple of transformations: wiki markup -> html -> plain text.
It's at this last step that I take a "snippet" and insert that into Solr.

My snippet code is:

    // article.java
    public String getSnippet(int maxlen) {
        int length = getPlainText().length() >= maxlen
            ? maxlen
            : getPlainText().length();
        return getPlainText().substring(0, length);
    }

    // ... later on .... add to solr
    doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory, I am getting the whole article if it's less than 1K chars,
and a maximum of 1K chars if it's bigger. I initialized this String from
the DB by using the String constructor where I pass in the
charset/collation:

    text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, accessing a substring of a UTF-8 encoded
string should not break up the UTF-8 code point. Is that an incorrect
assumption? If so, what is the best way to break up a UTF-8 encoded string
and get approximately that many characters? Exactness is not a requirement.

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris Hostetter<hossman_luc...@fucit.org> wrote:
>
> 1. Exactly which version of Solr / SolrJ are you using?
>
> 2. ...
>
> : >>>> I am using the SolrJ client to add documents to my index. My field
> : >>>> is a normal "text" field type and the text itself is the first 1000
> : >>>> characters of an article.
>
> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?
>
> What does your *indexing* code look like? ... Can you add some debugging to
> the SolrJ client when you *add* this doc to print out exactly what those
> 1000 characters are?
>
> My hunch: when you are extracting the first 1000 characters, you're
> getting only the first half of a character ...or... you are getting docs
> with less than 1000 characters and winding up with a buffer (char[]?) that
> has garbage at the end; SolrJ isn't complaining on the way in, but
> something farther down (maybe before indexing, maybe after) is seeing that
> garbage and cutting the field off at that point.
>
>
>
> -Hoss
>
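
A note on the substring question above: once the bytes are decoded, a Java String holds UTF-16 code units whatever charset it was built from, and substring() counts those units. A character outside the Basic Multilingual Plane takes two units (a surrogate pair), so a cut at a fixed char index can in principle land between them. Whether that is what happens with these documents is not established in this thread. The following is only a minimal sketch of truncating by code points instead; the class name SnippetUtil and the method truncate are hypothetical and are not part of Solr, SolrJ, or the article class quoted above.

```java
// Hypothetical helper, not part of Solr/SolrJ: truncates by Unicode code
// points so that a surrogate pair is never cut in half.
final class SnippetUtil {

    private SnippetUtil() {
    }

    /**
     * Returns the first maxCodePoints code points of text, or the whole
     * text if it is shorter. Returns "" for null input or a non-positive limit.
     */
    static String truncate(String text, int maxCodePoints) {
        if (text == null || maxCodePoints <= 0) {
            return "";
        }
        // Count code points, not UTF-16 chars.
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Char index that lies exactly maxCodePoints code points from the start.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }
}
```

With this sketch, the call in the indexing code would become something like doc.addField("text_snippet_t", SnippetUtil.truncate(article.getPlainText(), 1000)). The result may be slightly longer than 1000 chars when supplementary characters are present, which should be acceptable given that exactness is not a requirement.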