> 1.  Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
The latest SolrJ, which I downloaded a couple of days ago.

> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?

We are running an instance of MediaWiki, so the text goes through a
couple of transformations: wiki markup -> html -> plain text.
It's at this last step that I take a "snippet" and insert it into Solr.

My snippet code is:

// article.java
public String getSnippet(int maxlen) {
    int length = getPlainText().length() >= maxlen ? maxlen
                                                   : getPlainText().length();
    return getPlainText().substring(0, length);
}

// ... later on ... add to Solr
doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory, I am getting the whole article if it's less than 1K chars,
and a maximum of 1K chars if it's bigger. I initialized this String
from the DB by using the String constructor that takes a charset name:

text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, taking a substring of a String that was
decoded from UTF-8 should not break up a code point. Is that an
incorrect assumption? If so, what is the best way to cut up such a
string and get approximately that many characters? Exactness is not a
requirement.
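
In case it matters, here is roughly what I'd guess a safer version of
getSnippet() would look like, backing the cut point off if it would land
in the middle of a surrogate pair. This is just an untested sketch of my
guess, not what I'm actually running:

// Untested sketch of a safer getSnippet(): back the cut point off by
// one char if it would split a surrogate pair, which would otherwise
// become an invalid byte sequence when the String is re-encoded as UTF-8.
public String getSnippetSafe(int maxlen) {
    String text = getPlainText();
    if (text.length() <= maxlen) {
        return text;
    }
    int end = maxlen;
    if (Character.isHighSurrogate(text.charAt(end - 1))) {
        end--;  // keep the high/low surrogate pair together
    }
    return text.substring(0, end);
}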

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> 1.  Exactly which version of Solr / SolrJ are you using?
>
> 2. ...
>
> : >>>> I am using the SolrJ client to add documents to my index. My field
> : >>>> is a normal "text" field type and the text itself is the first 1000
> : >>>> characters of an article.
>
> Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?
>
> What does your *indexing* code look like? ... Can you add some debugging to
> the SolrJ client when you *add* this doc to print out exactly what those
> 1000 characters are?
>
> My hunch: when you are extracting the first 1000 characters, you're
> getting only the first half of a character ...or... you are getting docs
> with less than 1000 characters and winding up with a buffer (char[]?) that
> has garbage at the end; SolrJ isn't complaining on the way in, but
> something farther down (maybe before indexing, maybe after) is seeing that
> garbage and cutting the field off at that point.
>
>
>
> -Hoss
>
>
