Firstly, to everyone who has been helping me, thank you very much. All
this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch, and for a
couple of days we were OK, but now the errors are back.

It happens on different input documents, so what was broken before now
works: the documents that were having issues are fine after a fresh
re-index.

An issue we are seeing now is that an XML response from Solr will
contain the "tail" of an earlier response, for example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr. Using the Solr web
interface in Firefox, Firefox chokes because it tries to parse that
and, of course, it's invalid XML, but I can retrieve it via curl.
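
If it helps, here is roughly the kind of check that spots the problem
outside the browser (just a sketch; the host and query are
placeholders, not our real ones):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Fetch one XML response and look for anything after the closing
    // </response> tag, which is where the "tail" of the earlier
    // response shows up.
    public class CheckResponseTail {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/select?q=*:*&wt=xml");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            in.close();
            int end = body.indexOf("</response>");
            if (end < 0) {
                System.err.println("no closing </response> tag at all");
            } else if (body.substring(end + "</response>".length()).trim().length() > 0) {
                System.err.println("extra data after </response> -- tail of an earlier response?");
            } else {
                System.out.println("response looks well-formed");
            }
        }
    }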

Has anyone seen this before?

Regarding the earlier questions:

> i assume you are correct, but you listed several steps of transformation
> above, are you certain they all work correctly and produce valid UTF-8?

Yes. I have looked at the source and contacted the author of the
conversion library we are using, and I have verified that if UTF-8 goes
in then UTF-8 comes out, and UTF-8 is definitely going in.
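
To be extra safe, this is the kind of strict check I could add on our
side before anything is handed to Solr (a sketch only; textFromDB is
the raw byte[] that comes out of the database, as in the snippet
quoted below):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // Strict decode plus a snippet-boundary check (sketch).
    static String safeSnippet(byte[] textFromDB, int max) throws CharacterCodingException {
        // Unlike new String(bytes, "UTF-8"), which quietly replaces bad
        // bytes, this throws if the input is not valid UTF-8.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        String text = decoder.decode(ByteBuffer.wrap(textFromDB)).toString();

        // Don't let the cut land in the middle of a surrogate pair,
        // i.e. never split a code point across the 1K boundary.
        String snippet = text.substring(0, Math.min(max, text.length()));
        if (snippet.length() > 0
                && Character.isHighSurrogate(snippet.charAt(snippet.length() - 1))) {
            snippet = snippet.substring(0, snippet.length() - 1);
        }
        return snippet;
    }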

I don't think sending over an actual input document would help because
the failing document seems to change. Besides, this latest issue looks
more like the previous response's buffer not being cleared, or
something along those lines.

What's strange is that if I wait a few minutes and reload, the buffer
is cleared and I get back a valid response. It's intermittent, but it
appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago,
approximately when these issues started happening (but it's hard to say
whether that is the cause, because at the same time we switched from a
PHP to a Java indexing client).
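
For what it's worth, the add path in the Java client is roughly along
these lines, with the kind of debug print suggested earlier so we can
see the exact characters and bytes going in (a sketch from memory;
"server" and the logging are approximations, not a paste of our code):

    import java.nio.charset.Charset;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // server is our CommonsHttpSolrServer instance and article is the
    // converted wiki page; both names are placeholders here.
    SolrInputDocument doc = new SolrInputDocument();
    String snippet = article.getSnippet(1000);
    // Print exactly what is about to be indexed: char count, UTF-8
    // byte count, and the value itself.
    System.err.println("snippet chars=" + snippet.length()
            + " utf8Bytes=" + snippet.getBytes(Charset.forName("UTF-8")).length
            + " value=[" + snippet + "]");
    doc.addField("text_snippet_t", snippet);
    server.add(doc);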

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : We are running an instance of MediaWiki so the text goes through a
> : couple of transformations: wiki markup -> html -> plain text.
> : It's at this last step that I take a "snippet" and insert that into Solr.
>        ...
> : doc.addField("text_snippet_t", article.getSnippet(1000));
>
> ok, well first off: that's not the field where you are having problems,
> is it?  if i remember correctly from your previous posts, wasn't the
> response getting aborted in the middle of the Contents field?
>
> : and a maximum of 1K chars if it's bigger. I initialized this String
> : from the DB by using the String constructor where I pass in the
> : charset/collation
> :
> : text = new String(textFromDB, "UTF-8");
> :
> : So to the best of my knowledge, accessing a substring of a UTF-8
> : encoded string should not break up the UTF-8 code point. Is that an
>
> i assume you are correct, but you listed several steps of transformation
> above, are you certain they all work correctly and produce valid UTF-8?
>
> this leads back to my suggestion before....
>
> : > Can you put the original (pre solr, pre solrj, raw untouched, etc...)
> : > file that this solr doc came from online somewhere?
> : >
> : > What does your *indexing* code look like? ... Can you add some debugging to
> : > the SolrJ client when you *add* this doc to print out exactly what those
> : > 1000 characters are?
>
>
> -Hoss
>
