Re: How to remove control characters in stored value at Solr side
On 9/18/2017 12:45 PM, Markus Jelsma wrote:
> But, can you then explain why Apache Nutch with SolrJ had this problem? It
> seems that by default SolrJ does use XML as transport format. We have always
> used SolrJ which i assumed would default to javabin, but we had this exact
> problem anyway, and solved it by stripping non-character code points.
>
> When we use SolrJ for querying we clearly see wt=javabin in the logs, but
> updates showed the problem. Can we fix it anywhere?

The wt parameter controls the *response*, not the *request*.

The cloud client started using javabin by default for requests in version 4.6 (SOLR-5223), but the HTTP client used XML for requests by default up until version 5.5 (SOLR-8595).

The current trunk Nutch code is using SolrJ 5.4.1 and HttpSolrClient, which means that Nutch is sending XML to Solr. The wt parameter on those requests is javabin, so the response that Solr sends back is binary.

SolrJ should handle translating the input so that it's valid XML, but maybe there are characters that SolrJ's XML request writer doesn't (or can't) handle correctly.

Thanks,
Shawn
RE: How to remove control characters in stored value at Solr side
Ah, thanks!

-Original message-
> From:Chris Hostetter
> Sent: Monday 18th September 2017 23:11
> To: solr-user@lucene.apache.org
> Subject: RE: How to remove control characters in stored value at Solr side
>
> : But, can you then explain why Apache Nutch with SolrJ had this problem?
> : It seems that by default SolrJ does use XML as transport format. We have
> : always used SolrJ which i assumed would default to javabin, but we had
> : this exact problem anyway, and solved it by stripping non-character code
> : points.
> :
> : When we use SolrJ for querying we clearly see wt=javabin in the logs,
> : but updates showed the problem. Can we fix it anywhere?
>
> wt=javabin indicates what *response* format the client (ie: solrj) is
> requesting from the server ... the format used for the *request* body is
> determined by the client based on the Content-Type of the ContentStream
> it sends to Solr.
>
> When using SolrJ, and sending arbitrary/abstract SolrRequest objects,
> the "RequestWriter" configured on the SolrClient is what specifies the
> Content-Type to use (and is in charge of serializing the java objects
> appropriately)
>
> BinaryRequestWriter (which uses javabin format to serialize SolrRequest
> objects when building ContentStreams) has been the default since Solr
> 5.5/6.0 (see SOLR-8595)
>
> -Hoss
> http://www.lucidworks.com/
RE: How to remove control characters in stored value at Solr side
: But, can you then explain why Apache Nutch with SolrJ had this problem?
: It seems that by default SolrJ does use XML as transport format. We have
: always used SolrJ which i assumed would default to javabin, but we had
: this exact problem anyway, and solved it by stripping non-character code
: points.
:
: When we use SolrJ for querying we clearly see wt=javabin in the logs,
: but updates showed the problem. Can we fix it anywhere?

wt=javabin indicates what *response* format the client (i.e. SolrJ) is requesting from the server. The format used for the *request* body is determined by the client, based on the Content-Type of the ContentStream it sends to Solr.

When using SolrJ and sending arbitrary/abstract SolrRequest objects, the "RequestWriter" configured on the SolrClient is what specifies the Content-Type to use (and is in charge of serializing the Java objects appropriately).

BinaryRequestWriter (which uses the javabin format to serialize SolrRequest objects when building ContentStreams) has been the default since Solr 5.5/6.0 (see SOLR-8595).

-Hoss
http://www.lucidworks.com/
RE: How to remove control characters in stored value at Solr side
I agree. But, can you then explain why Apache Nutch with SolrJ had this problem? It seems that by default SolrJ does use XML as transport format. We have always used SolrJ, which I assumed would default to javabin, but we had this exact problem anyway, and solved it by stripping non-character code points.

When we use SolrJ for querying we clearly see wt=javabin in the logs, but updates showed the problem. Can we fix it anywhere?

Thanks,
Markus

-Original message-
> From:Chris Hostetter
> Sent: Monday 18th September 2017 20:29
> To: solr-user@lucene.apache.org
> Subject: RE: How to remove control characters in stored value at Solr side
>
> : You can not do this in Solr, you cannot even send non-character code
> : points in the first place. For Apache Nutch we solved the problem by
>
> Strictly speaking: this is false. You *can* send control characters to solr
> as field values -- assuming your transport format allows it.
>
> Example: using javabin to send SolrInputDocuments from a SolrJ client
> doesn't care if the field value Strings have control characters in them.
> Likewise it should be possible to send many control characters when using
> JSON formatted updates -- let alone using something like DIH to pull blog
> data from a DB, or the Extracting Request handler which might find
> control-characters in MS-Word or PDF docs.
>
> In all of those cases, an UpdateProcessor to strip out the unwanted
> characters can/will work well.
>
> In the specific case discussed in this thread (based on the eventual stack
> trace posted) an UpdateProcessor will *not* work because the fundamental
> problem is that the control characters in question mean that the "XML-ish"
> looking bytes being sent to Solr by the client are not actually valid XML
> -- because by definition XML can not contain those invalid
> control-characters.
>
> -Hoss
> http://www.lucidworks.com/
RE: How to remove control characters in stored value at Solr side
: You can not do this in Solr, you cannot even send non-character code
: points in the first place. For Apache Nutch we solved the problem by

Strictly speaking: this is false. You *can* send control characters to Solr as field values -- assuming your transport format allows it.

Example: using javabin to send SolrInputDocuments from a SolrJ client doesn't care if the field value Strings have control characters in them. Likewise it should be possible to send many control characters when using JSON formatted updates -- let alone using something like DIH to pull blog data from a DB, or the Extracting Request handler which might find control characters in MS-Word or PDF docs.

In all of those cases, an UpdateProcessor to strip out the unwanted characters can/will work well.

In the specific case discussed in this thread (based on the eventual stack trace posted) an UpdateProcessor will *not* work, because the fundamental problem is that the control characters in question mean that the "XML-ish" looking bytes being sent to Solr by the client are not actually valid XML -- because by definition XML can not contain those invalid control characters.

-Hoss
http://www.lucidworks.com/
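To make the "not actually valid XML" point concrete, here is a minimal sketch that feeds an update-style document containing a vertical tab (code 11, the character from the stack trace in this thread) to an XML parser. It assumes the JDK's built-in StAX implementation rather than the Woodstox parser Solr uses, so the exact exception message will differ; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Solr or SolrJ.
class XmlControlCharCheck {

    // Returns true if a StAX parser can read the whole document without error.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // Raised for malformed input, including characters XML 1.0 forbids.
            return false;
        }
    }

    public static void main(String[] args) {
        String clean = "<add><doc><field name=\"body\">one two</field></doc></add>";
        // Same document, but with a raw vertical tab (code 11) in the field value.
        String dirty = "<add><doc><field name=\"body\">one" + (char) 11
                + "two</field></doc></add>";
        System.out.println("clean parses: " + parses(clean));
        System.out.println("dirty parses: " + parses(dirty));
    }
}
```

Because the failure happens while the request body is being parsed, nothing later in the update chain ever sees the document, which is why an UpdateProcessor cannot help in this case.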
Re: How to remove control characters in stored value at Solr side
Looks as though the problem is in parsing some malformed XML, based on what I'm seeing:

...
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
...

(char #11 is a vertical tab.)

This should be fixed outside Solr, but if that is not practical, and you could live with dropping the offending document(s), then you might want to investigate the TolerantUpdateProcessorFactory (Solr 6.1 or later).

-Simon

On Thu, Sep 14, 2017 at 3:56 PM, arnoldbronley wrote:
> Thanks for information. Here is the full stack trace. I thought to handle it
> from client side but client apps are not under my control and I don't have
> access to them.
>
> org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11))
>  at [row,col {unknown-source}]: [1,413]
> [...]
> Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
>  at [row,col {unknown-source}]: [1,413]
> [...]
Re: How to remove control characters in stored value at Solr side
Thanks for information. Here is the full stack trace. I thought to handle it from client side but client apps are not under my control and I don't have access to them.

org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11))
 at [row,col {unknown-source}]: [1,413]
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:179)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:518)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
 at [row,col {unknown-source}]: [1,413]
        at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:674)
        at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4576)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2881)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
        at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:397)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:249)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        ... 32 more

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: How to remove control characters in stored value at Solr side
@Arnold: are these non-UTF-8 control characters (which is what the Nutch issue was about), or otherwise legal UTF-8 characters that Solr is for some reason choking on?

If you could provide a full stack trace it would be really helpful.

On Thu, Sep 14, 2017 at 2:55 PM, Markus Jelsma wrote:
> Hello,
>
> You can not do this in Solr, you cannot even send non-character code
> points in the first place. For Apache Nutch we solved the problem by
> stripping those non-character code points from Strings before putting them
> in SolrDocument. Check the ticket, you can easily reuse the strip method.
>
> Perhaps it would be a good idea to move the method to SolrDocument or
> somewhere in SolrJ in the first place, so others don't have to bother with
> this problem.
>
> Regards,
> Markus
>
> https://issues.apache.org/jira/browse/NUTCH-1016
>
> -Original message-
> > From:Arnold Bronley
> > Sent: Thursday 14th September 2017 19:46
> > To: solr-user@lucene.apache.org
> > Subject: How to remove control characters in stored value at Solr side
> >
> > I know I can apply PatternReplaceFilterFactory to remove control characters
> > from indexed value. However, is it possible to do similar thing for stored
> > value? Because of some control characters included in indexing request,
> > Solr throws Illegal Character Exception.
RE: How to remove control characters in stored value at Solr side
Hello,

You cannot do this in Solr; you cannot even send non-character code points in the first place. For Apache Nutch we solved the problem by stripping those non-character code points from Strings before putting them in SolrDocument. Check the ticket, you can easily reuse the strip method.

Perhaps it would be a good idea to move the method to SolrDocument or somewhere in SolrJ in the first place, so others don't have to bother with this problem.

Regards,
Markus

https://issues.apache.org/jira/browse/NUTCH-1016

-Original message-
> From:Arnold Bronley
> Sent: Thursday 14th September 2017 19:46
> To: solr-user@lucene.apache.org
> Subject: How to remove control characters in stored value at Solr side
>
> I know I can apply PatternReplaceFilterFactory to remove control characters
> from indexed value. However, is it possible to do similar thing for stored
> value? Because of some control characters included in indexing request,
> Solr throws Illegal Character Exception.
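For readers who want to see what client-side stripping can look like, here is a minimal sketch. It is not the method from NUTCH-1016; it simply keeps the code points allowed by the XML 1.0 "Char" production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and drops everything else, which is an assumption about what "safe to send as XML" should mean. The class and method names are hypothetical.

```java
// Hypothetical helper, not the NUTCH-1016 implementation.
class XmlCharStripper {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input without code points that XML 1.0 forbids.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A field value containing a vertical tab (code 11).
        String dirty = "line one" + (char) 11 + "line two";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

A client would apply something like this to each String value before adding it to the document it sends. Tab, line feed and carriage return are kept, since XML allows them.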
Re: How to remove control characters in stored value at Solr side
Sounds as though an update request processor will do that, and also eliminate the need to use the PatternReplaceFilterFactory downstream. Take a look at the documentation in https://lucene.apache.org/solr/guide/6_6/update-request-processors.html. I'm thinking that the RegexReplaceProcessorFactory might work for this.

best
-Simon

On Thu, Sep 14, 2017 at 1:46 PM, Arnold Bronley wrote:
> I know I can apply PatternReplaceFilterFactory to remove control characters
> from indexed value. However, is it possible to do similar thing for stored
> value? Because of some control characters included in indexing request,
> Solr throws Illegal Character Exception.
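As a rough illustration of the kind of pattern such a regex-based processor could be given, here is a small sketch in plain Java. It assumes the processor accepts a java.util.regex pattern and a replacement string; the character class below (C0 controls other than tab, line feed and carriage return) and the class name are my own choices, not something taken from the Solr documentation.

```java
import java.util.regex.Pattern;

// Hypothetical helper that only demonstrates the regular expression itself.
class ControlCharRegex {

    // C0 control characters except tab (0x09), LF (0x0A) and CR (0x0D).
    static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Removes every match of the pattern from the value.
    static String strip(String value) {
        return CONTROL_CHARS.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "stored" + (char) 11 + " value";
        System.out.println(strip(dirty));
    }
}
```

A pattern like this can only help when the request reaches the update chain at all, that is, when the transport format itself tolerates the control characters.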