Of course, I can't put the system bell and similar control characters in XML. I'll improve the method so that it strips non-printable control characters as well.
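For the archives, here is a rough sketch of what the improved check could look like. It is not the code that currently runs in my indexer: the class and method names (XmlCharSanitizer, isAllowed, strip) are made up for this mail, and I am assuming the XML 1.0 character ranges plus the Unicode noncharacters are the right set to filter on. It iterates over code points rather than chars so that supplementary characters and unpaired surrogates are handled.

```java
// Hypothetical helper, names invented for illustration only.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {}

    /** True if the code point is legal in XML 1.0 and is not a Unicode noncharacter. */
    static boolean isAllowed(int cp) {
        // XML 1.0 Char production: tab, LF, CR, and the three ranges below.
        // This already excludes NUL, the bell (0x07) and the other C0 controls,
        // as well as unpaired surrogates (0xD800-0xDFFF).
        boolean xmlChar = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);

        // Noncharacters: U+FDD0..U+FDEF plus the last two code points of
        // every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
        boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp & 0xFFFE) == 0xFFFE;

        return xmlChar && !nonCharacter;
    }

    /** Returns a copy of the input with every disallowed code point removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

The idea would be to call XmlCharSanitizer.strip() on every field value before it is added to the SolrInputDocument. I chose a whitelist over a list of known-bad values because it also catches control characters I have not run into yet.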

On Monday 27 June 2011 18:16:08 Mike Sokolov wrote:
> Markus - if you want to make sure not to offend XML parsers, you should
> strip all characters not in this list:
>
> http://en.wikipedia.org/wiki/XML#Valid_characters
>
> You'll see that article talks about XML 1.1, which accepts a wider range
> of characters than XML 1.0, and I believe the Woodstox parser used in
> Solr adheres to that convention. But note the restriction about control
> characters needing to be encoded - I'm not sure, but it might also be
> best to strip out chars < 32 except for \r, \n and \t. You definitely
> need to remove \0 also...
>
> On 06/27/2011 11:59 AM, Markus Jelsma wrote:
> > Of course it doesn't work like this: use AND instead of OR!
> >
> > On Monday 27 June 2011 17:50:01 Markus Jelsma wrote:
> >> Hi all, thanks for your comments. I seem to have fixed it by now by
> >> simply stripping away all non-character codepoints [1] by iterating
> >> over the individual chars and checking them against:
> >>
> >> if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; }
> >>
> >> Comments?
> >>
> >> [1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> >>
> >> On Monday 27 June 2011 12:40:16 Markus Jelsma wrote:
> >>> Hi,
> >>>
> >>> I came across the indexing error below. It happened in a huge batch
> >>> update from Nutch with SolrJ 3.1. Since the crawl was huge it is very
> >>> hard to trace the error back to a specific document. So I'll try my luck
> >>> here: anyone seen this before with SolrJ 3.1? Anything else on the
> >>> Nutch part I should have taken care of?
> >>>
> >>> Thanks!
> >>>
> >>> Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
> >>> INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
> >>> Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
> >>> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
> >>>     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
> >>>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
> >>>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
> >>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
> >>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
> >>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> >>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >>>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >>>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >>>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >>>     at org.mortbay.jetty.Server.handle(Server.java:326)
> >>>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >>>     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> >>>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
> >>>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
> >>>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >>>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> >>>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> >>> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
> >>>     at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> >>>     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> >>>     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> >>>     at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> >>>     at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> >>>     at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> >>>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> >>>     ... 26 more

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350