Hi Markus

I've seen a similar issue before (though not with Solr) when processing files as XML.
In our case the problem was caused by processing a UTF-16 file with a byte
order mark. This presented itself to the XML parser as 0xffff, which it
rejects as an invalid character (the BOM itself, U+FEFF, would be encoded
as EF BB BF in UTF-8). This caused the UTF-8 aware parser to choke.

I don't want to get involved in any Unicode / UTF war, as I'm confused
enough as it stands, but could you check for UTF-16 files before
processing?
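
Something along these lines might do as a first check. It is only a rough sketch in Python, not anything from Nutch or SolrJ: the names sniff_bom and sniff_file are made up for illustration, and it only looks at the first few bytes, so UTF-16 files without a BOM would slip through.

```python
import codecs

# Known BOM signatures. UTF-32 comes first because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path: str):
    """Read just enough of a file (4 bytes) to check for a BOM."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))
```

Any file for which sniff_file returns a "utf-16-..." name could then be skipped or re-encoded before it is handed to the indexer.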

lee c

On 27 June 2011 14:26, Thomas Fischer <fischer...@aon.at> wrote:
> Hello,
>
> On 27.06.2011 at 12:40, Markus Jelsma wrote:
>
>> Hi,
>>
>> I came across the indexing error below. It happened in a huge batch update
>> from Nutch with SolrJ 3.1. Since the crawl was huge, it is very hard to trace
>> the error back to a specific document. So I'll try my luck here: has anyone
>> seen this before with SolrJ 3.1? Is there anything else on the Nutch side I
>> should have taken care of?
>>
>> Thanks!
>>
>>
>> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
>> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>>       at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>
> and loads of other rubbish and
>
>>       ... 26 more
>
>
> I see this as a problem of Solr error reporting. It is not only obnoxiously
> "loud" (white on grey with oversized fonts), but also less useful than it
> should be.
> Instead of telling the user where the error occurred (i.e. while reading
> which file, at which line and column), it unravels the stack. This is
> useless if the program just choked on some unexpected input, like a typo in a
> schema or config file or an invalid character in a file to be indexed.
> I don't know if this is due to Tomcat, the logging system, or Solr itself,
> but it is annoying.
>
> And yes, I've seen something like this before and found the error not by 
> inspecting solr but by opening the suspected files with an appropriate 
> browser (e.g. Firefox) which tells me exactly where something goes wrong.
>
> All the best
> Thomas
>
>
