Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
I suggest avoiding illegal UTF-8 characters by pre-filtering your content stream before loading.

Unicode    UTF-8 (hex)
U+07FF     df bf
U+0800     e0 a0 80

So there is no UTF-8 0xffff. It is illegal.

Regards

On 27.06.2011 12:40, Markus Jelsma wrote:
> Hi,
>
> I came across the indexing error below. It happened in a huge batch update from Nutch with SolrJ 3.1. Since the crawl was huge, it is very hard to trace the error back to a specific document. So I try my luck here: has anyone seen this before with SolrJ 3.1? Anything else on the Nutch part I should have taken care of?
>
> Thanks!
>
> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     at org.mortbay.jetty.Server.handle(Server.java:326)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>     at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
>     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
>     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
> So there is no UTF-8 0xffff. It is illegal.

You are wrong: it is legally encoded as a three-byte sequence: ef bf bf
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On 27.06.2011 14:02, Robert Muir wrote:
> On Mon, Jun 27, 2011 at 7:11 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>> So there is no UTF-8 0xffff. It is illegal.
>
> You are wrong: it is legally encoded as a three-byte sequence: ef bf bf

Unicode U+FFFF is the UTF-8 byte sequence ef bf bf, that is right. But I was saying that UTF-8 0xffff (which is the byte sequence ff ff) is illegal, and that is what the java.io.CharConversionException is complaining about: "Invalid UTF-8 character 0xffff". Don't mix up Unicode with UTF-8. Sorry, but I think you are wrong ;-)
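Both halves of this exchange can be checked directly. The following is only a sketch using the JDK's own UTF-8 encoder and decoder; the class and method names are made up for illustration and are not part of Solr, Nutch or Woodstox. It prints the UTF-8 bytes of a code point and asks a strict decoder whether a raw byte sequence is well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Check {

    // Returns the UTF-8 bytes of a single code point as lowercase hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True if the bytes are well-formed UTF-8 according to the JDK's strict decoder.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0xFFFF));                                          // ef bf bf
        System.out.println(isWellFormedUtf8(new byte[] {(byte) 0xff, (byte) 0xff})); // false
    }
}
```

So the code point U+FFFF has a well-formed three-byte encoding, while the two bytes ff ff are never valid UTF-8. Neither fact says anything yet about whether an XML parser accepts the character.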
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On Mon, Jun 27, 2011 at 8:30 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
> Unicode U+FFFF is the UTF-8 byte sequence ef bf bf, that is right. But I was saying that UTF-8 0xffff (which is the byte sequence ff ff) is illegal, and that is what the java.io.CharConversionException is complaining about: "Invalid UTF-8 character 0xffff". Don't mix up Unicode with UTF-8. Sorry, but I think you are wrong ;-)

Hi, there is no such thing as "UTF-8 0xffff", nor is there any such thing as a "UTF-8 character", despite what this XML parser might say.

This is just a stupid XML parser: like other stupid things about XML, it says "illegal this" or "illegal that" for arbitrary sets of Unicode (such as control characters).

You can tell the XML parser is totally broken when it uses the phrase "UTF-8 character". This term does not exist.
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On 27.06.2011 14:35, Robert Muir wrote:
> Hi, there is no such thing as "UTF-8 0xffff", nor is there any such thing as a "UTF-8 character", despite what this XML parser might say.
>
> This is just a stupid XML parser: like other stupid things about XML, it says "illegal this" or "illegal that" for arbitrary sets of Unicode (such as control characters).
>
> You can tell the XML parser is totally broken when it uses the phrase "UTF-8 character". This term does not exist.

Correct!!!
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
> Correct!!!

But what I said is totally different from what you said. You are still wrong.
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On 27.06.2011 14:48, Robert Muir wrote:
> But what I said is totally different from what you said. You are still wrong.

http://www.unicode.org/faq//utf_bom.html
See "Q: What is a UTF?"
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Actually - you are both wrong!

It is true that 0xffff is a valid UTF-8 character, and not a valid UTF-8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character.

And Robert - if the wording offends you, you might want to send a note to Tatu (http://jira.codehaus.org/) suggesting that he alter the wording of the error message :)

-Mike

On 06/27/2011 09:01 AM, Bernd Fehling wrote:
> http://www.unicode.org/faq//utf_bom.html
> See "Q: What is a UTF?"
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
OK - re-reading your message, it seems that may be what you were trying to say too, Robert. FWIW I agree with you that XML is rigid, sometimes for purely arbitrary reasons.

But nobody has really helped Markus here - unfortunately, there is no easy way out of this mess. What I do to handle issues like this is to wrap the stream I'm handing to the parser in some kind of cleanup stream that handles a few yucky issues. You could, e.g., just strip out invalid XML characters. Maybe Nutch should be doing this, or at least handling the error better?

-Mike

On 06/27/2011 09:19 AM, Mike Sokolov wrote:
> Actually - you are both wrong!
>
> It is true that 0xffff is a valid UTF-8 character, and not a valid UTF-8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character.
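A "cleanup stream" of the kind described here could look roughly like the sketch below. This is not Mike's actual code and not anything shipped with Solr or Nutch: the class name CleanupReader and the choice to pass the filter in as a predicate are assumptions made for illustration. It wraps any Reader and silently drops every char the predicate rejects before the XML parser sees it.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.IntPredicate;

// Hypothetical wrapper; name and design are illustrative only.
final class CleanupReader extends FilterReader {
    private final IntPredicate keep;

    CleanupReader(Reader in, IntPredicate keep) {
        super(in);
        this.keep = keep;
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip rejected characters; -1 still signals end of stream.
        while ((c = super.read()) != -1 && !keep.test(c)) {
            // character dropped
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the accepted characters to the front of the chunk.
            int out = off;
            for (int i = off; i < off + n; i++) {
                if (keep.test(buf[i])) {
                    buf[out++] = buf[i];
                }
            }
            // If the whole chunk was dropped, read again instead of returning 0.
            if (out > off) {
                return out - off;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed filter: keep tab, LF, CR and everything from U+0020 up,
        // except U+FFFE and U+FFFF. Surrogate halves pass through on the
        // assumption that the input contains only well-formed pairs.
        IntPredicate xmlSafe = c ->
                c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c != 0xFFFE && c != 0xFFFF);
        try (Reader r = new CleanupReader(new StringReader("ab\uFFFFc\u0000d"), xmlSafe)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            System.out.println(sb);
        }
    }
}
```

Filtering at the Reader level keeps the cleanup independent of where the text comes from, at the cost of working on UTF-16 code units rather than full code points.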
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Hello,

On 27.06.2011 at 12:40, Markus Jelsma wrote:
> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)

and loads of other rubbish and
    ... 26 more

I see this as a problem of Solr's error reporting. It is not only obnoxiously loud (white on grey with oversized fonts), but also less useful than it should be. Instead of telling the user where the error occurred (i.e. while reading which file, at which line and column), it unravels the stack. This is useless if the program just choked on some unexpected input, like a typo in a schema or config file, or an invalid character in a file to be indexed.

I don't know if this is due to Tomcat, the logging system, or Solr itself, but it is annoying.

And yes, I've seen something like this before, and I found the error not by inspecting Solr but by opening the suspected files with an appropriate browser (e.g. Firefox), which tells me exactly where something goes wrong.

All the best
Thomas
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Hi,

It's the same error I mentioned here: http://lucene.472066.n3.nabble.com/strange-utf-8-problem-td3094473.html
Also, if you use Solr 1.4.1 there is no problem like that.
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Hi Markus,

I've seen a similar issue before (but not with Solr) when processing files as XML. In our case the problem was due to processing a UTF-16 file with a byte order mark. This presents itself as 0xffff to the XML parser, which is not used by UTF-8 (the BOM Unicode would be represented as efbfbf in UTF-8). This caused the UTF-8-aware parser to choke.

I don't want to get involved in any Unicode / UTF war, as I'm confused enough as it stands, but could you check for UTF-16 files before processing?

lee c
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
On Monday 27 June 2011 16:33:16 lee carroll wrote:
> I've seen a similar issue before (but not with Solr) when processing files as XML. In our case the problem was due to processing a UTF-16 file with a byte order mark. [...] Could you check for UTF-16 files before processing?

Some files may be UTF-16, but I cannot confirm it right now. On the other hand, Nutch should have no trouble processing UTF-16.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
I don't think this is a BOM - that would be 0xfeff. Anyway, the problem we usually see when processing XML with BOMs is in UTF-8 (which really doesn't need a BOM since it's a byte stream anyway): if you transform the stream (bytes) into a reader (chars) before the XML parser can see it, the parser treats the BOM as white space. But in that case you typically get a more specific error about invalid characters in the XML prolog, not just a random invalid character error.

-Mike

On 06/27/2011 10:33 AM, lee carroll wrote:
> In our case the problem was due to processing a UTF-16 file with a byte order mark. This presents itself as 0xffff to the XML parser, which is not used by UTF-8 (the BOM Unicode would be represented as efbfbf in UTF-8).
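For the earlier suggestion to check for UTF-16 files before processing, the first bytes of the raw stream are enough. The sketch below is an illustration only (BomSniffer and detectBom are invented names, not a Nutch or Solr API). The byte patterns are the standard ones: U+FEFF is serialized as ef bb bf in UTF-8, fe ff in UTF-16BE and ff fe in UTF-16LE.

```java
// Hypothetical helper, for illustration only.
final class BomSniffer {

    // Returns a charset name guessed from a leading byte order mark,
    // or null if the data does not start with one of the known marks.
    // Limitation: ff fe 00 00 (a UTF-32LE mark) is reported as UTF-16LE.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detectBom(sample));
    }
}
```

The check has to run on the bytes, before anything turns them into chars; once a Reader has decoded the stream with the wrong charset, the mark is no longer recognizable.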
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Hi all, thanks for your comments. I seem to have fixed it by now by simply stripping away all non-character code points [1] by iterating over the individual chars and checking them against:

if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; }

Comments?

[1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Of course it doesn't work like this: use AND instead of OR!

On Monday 27 June 2011 17:50:01 Markus Jelsma wrote:
> Hi all, thanks for your comments. I seem to have fixed it by now by simply stripping away all non-character code points [1] by iterating over the individual chars and checking them against:
>
> if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; }
>
> Comments?
>
> [1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
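The AND/OR correction matters because a test of the form "x != A || x != B" is true for every x, so the condition as first posted lets everything through. One way to sidestep the boolean juggling is to write the positive test ("is this a noncharacter?") and negate it once. The sketch below does that; it is not the code that ended up in Nutch, the names are invented, and it treats the range U+FDD0..U+FDEF as inclusive, since both endpoints are themselves noncharacters.

```java
// Hypothetical helper, for illustration only.
final class NonCharStripper {

    // True for Unicode noncharacters: U+FDD0..U+FDEF and the last two
    // code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    static boolean isNonCharacter(int cp) {
        int low = cp % 0x10000;
        return low == 0xFFFE || low == 0xFFFF || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    // Returns the input with all noncharacter code points removed.
    static String stripNonCharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isNonCharacter(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripNonCharacters("a\uFFFFb\uFDD0c"));
    }
}
```

Iterating by code point rather than by char means the per-plane noncharacters above U+FFFF are caught as well.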
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list: http://en.wikipedia.org/wiki/XML#Valid_characters You'll see that article talks about XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox parser used in Solr adheres to that convention. But note the restriction about control characters needing to be encoded - I'm not sure, but it might also be best to strip out chars 32 except for \r, \n and \t. You definitely need to remove \0 also... On 06/27/2011 11:59 AM, Markus Jelsma wrote: Of course it doesn't work like this: use AND instead of OR! On Monday 27 June 2011 17:50:01 Markus Jelsma wrote: Hi all, thanks for your comments. I seem to have fixed it by now by simply stripping away all non-character codepoints [1] by iterating over the individual chars and checking them against: if (ch % 0x1 != 0x || ch % 0x1 != 0xfffe || (ch= 0xfdd0 ch = 0xfdef)) { pass; } Comments? [1]: http://unicode.org/cldr/utility/list- unicodeset.jsp?a=[:Noncharacter_Code_Point=True:] On Monday 27 June 2011 12:40:16 Markus Jelsma wrote: Hi, I came across the indexing error below. It happened in a huge batch update from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace the error back to a specific document. So i try my luck here: anyone seen this before with SolrJ 3.1? Anything else on the Nutch part i should have taken care off? Thanks! 
Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
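For readers following the thread, the noncharacter check discussed above could be sketched roughly as follows, with the conditions combined so that a code point is dropped when it matches any of the noncharacter ranges (U+FDD0..U+FDEF, or the last two code points of any plane). This is only an illustration, not the code that went into Nutch or SolrJ; the class name `NoncharacterFilter` and its method names are made up for this example.

```java
// Hypothetical helper, not part of Nutch or SolrJ.
final class NoncharacterFilter {

    /** True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane. */
    static boolean isNoncharacter(int cp) {
        int low = cp & 0xFFFF; // same as cp % 0x10000 for non-negative cp
        return low == 0xFFFE
            || low == 0xFFFF
            || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    /** Returns a copy of the input with all noncharacter code points dropped. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so supplementary characters are seen whole.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by `char` is a design choice here: it lets the check also catch noncharacters above the Basic Multilingual Plane, such as U+1FFFE, which are stored as surrogate pairs in a Java `String`.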
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Of course, I can't print the system bell and stuff like that in XML. I'll improve the method to get rid of non-printable control characters as well.

On Monday 27 June 2011 18:16:08 Mike Sokolov wrote:
Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list:

http://en.wikipedia.org/wiki/XML#Valid_characters

You'll see that the article talks about XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox parser used in Solr adheres to that convention. But note the restriction about control characters needing to be encoded - I'm not sure, but it might also be best to strip out chars < 32 except for \r, \n and \t. You definitely need to remove \0 also...
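A stricter filter along the lines suggested in this exchange might keep only code points allowed by the XML 1.0 `Char` production: tab, line feed, carriage return, U+0020..U+D7FF, U+E000..U+FFFD and U+10000..U+10FFFF. That drops \0, the system bell and the other control characters below 32, as well as U+FFFE and U+FFFF. The sketch below assumes XML 1.0 rules rather than XML 1.1, and the class name `XmlCharFilter` is hypothetical; it is not what Solr or Woodstox use internally.

```java
// Hypothetical helper, assuming the XML 1.0 Char production as the whitelist.
final class XmlCharFilter {

    /** True if the code point may appear literally in an XML 1.0 document. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input containing only valid XML 1.0 characters. */
    static String stripInvalid(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // An unpaired surrogate comes back as a value in U+D800..U+DFFF
            // and is rejected by isValidXml10.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Note that this whitelist still lets through noncharacters in the supplementary planes (for example U+1FFFE), so it may be worth combining it with a noncharacter check if those should be removed as well.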