Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list:

http://en.wikipedia.org/wiki/XML#Valid_characters

You'll see that the article also covers XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox parser used in Solr adheres to that convention. But note the restriction that control characters must be encoded as character references. I'm not sure, but it might also be best to strip out chars < 32 except for \r, \n and \t. You definitely need to remove \0 as well, since neither XML version allows it.
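
To make that concrete, here is a minimal sketch of such a filter in Java. It assumes the conservative XML 1.0 character ranges rather than the wider XML 1.1 set, and the class and method names (XmlCharFilter, isValidXml10, stripInvalidXmlChars) are made up for illustration; they are not part of Nutch, SolrJ or Woodstox.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD      // \t, \n, \r
            || (cp >= 0x20 && cp <= 0xD7FF)             // excludes other controls and \0
            || (cp >= 0xE000 && cp <= 0xFFFD)           // excludes surrogates, U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);       // supplementary planes
    }

    /** Returns a copy of the input with every invalid code point dropped. */
    public static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Iterate by code point so valid surrogate pairs stay together;
            // an unpaired surrogate comes back as its own value and is dropped.
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

You would run each field value through stripInvalidXmlChars before adding it to the document you send to Solr. Dropping the characters is the simplest choice; replacing them with a space would be an equally reasonable variant.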

On 06/27/2011 11:59 AM, Markus Jelsma wrote:
Of course it doesn't work like this: use AND instead of OR!

On Monday 27 June 2011 17:50:01 Markus Jelsma wrote:
Hi all, thanks for your comments. I seem to have fixed it by now by simply
stripping away all non-character codepoints [1] by iterating over the
individual chars and checking them against:

if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; }
Comments?

[1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
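
For reference, a minimal sketch of a noncharacter test in Java. It assumes the usual Unicode definition: the 32 code points U+FDD0 through U+FDEF, plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, and so on up to U+10FFFF), 66 in total. The class and method names are hypothetical and not taken from Nutch or Solr.

```java
final class NoncharacterFilter {
    private NoncharacterFilter() {}

    /** True for the 66 Unicode noncharacter code points. */
    public static boolean isNoncharacter(int cp) {
        // Low 16 bits equal to 0xFFFE or 0xFFFF: the last two code points of a plane.
        boolean endOfPlane = (cp & 0xFFFE) == 0xFFFE;
        // The contiguous block U+FDD0..U+FDEF; both bounds must hold, hence &&.
        boolean inFddBlock = cp >= 0xFDD0 && cp <= 0xFDEF;
        return endOfPlane || inFddBlock;
    }

    /** Returns a copy of the input without noncharacters; everything else is kept. */
    public static String stripNoncharacters(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Work on code points so noncharacters above U+FFFF are caught too.
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

The test is phrased positively (is this a noncharacter?) and negated once at the call site, which avoids the easy mistake of chaining != comparisons with the wrong boolean operator. Note that this only removes noncharacters; control characters such as \0 pass through unchanged.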

On Monday 27 June 2011 12:40:16 Markus Jelsma wrote:
Hi,

I came across the indexing error below. It happened during a huge batch
update from Nutch with SolrJ 3.1. Since the crawl was huge, it is very
hard to trace the error back to a specific document. So I'll try my luck
here: has anyone seen this before with SolrJ 3.1? Is there anything else on
the Nutch side I should have taken care of?

Thanks!


Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
         at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
         at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
         at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
         at org.mortbay.jetty.Server.handle(Server.java:326)
         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
         at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
         at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
         at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
         at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
         at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
         at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
         at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
         at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
         at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
         at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
         ... 26 more
