Re: How to remove control characters in stored value at Solr side

2017-09-19 Thread Shawn Heisey
On 9/18/2017 12:45 PM, Markus Jelsma wrote:
> But, can you then explain why Apache Nutch with SolrJ had this problem? It 
> seems that by default SolrJ does use XML as transport format. We have always 
> used SolrJ which i assumed would default to javabin, but we had this exact 
> problem anyway, and solved it by stripping non-character code points.
>
> When we use SolrJ for querying we clearly see wt=javabin in the logs, but 
> updates showed the problem. Can we fix it anywhere?

The wt parameter controls the *response*, not the *request*.

The cloud client started using javabin by default for requests in
version 4.6 (SOLR-5223), but the HTTP client used XML for requests by
default up until version 5.5 (SOLR-8595).  The current trunk Nutch code
is using SolrJ 5.4.1 and HttpSolrClient, which means that Nutch is
sending XML to Solr.  The wt parameter on those requests is javabin, so
the response that Solr sends back is binary.
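
As a rough illustration of that split, here is a minimal sketch using the JDK's own java.net.http API (Java 11 or later) rather than SolrJ. It is not how SolrJ builds its requests internally. The URL, the core name "collection1", and the class and method names are all hypothetical. The request is only built, never sent: the point is that the wt parameter in the URL and the Content-Type of the body are set independently.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RequestVersusResponseFormat {
    // Hypothetical Solr URL and core name, used only for illustration.
    static final String UPDATE_URL =
            "http://localhost:8983/solr/collection1/update?wt=javabin";

    // Builds (but does not send) an update request whose body is XML,
    // while the wt parameter asks Solr for a javabin response.
    static HttpRequest buildXmlUpdate(String xmlBody) {
        return HttpRequest.newBuilder(URI.create(UPDATE_URL))
                .header("Content-Type", "application/xml; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(xmlBody))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request =
                buildXmlUpdate("<add><doc><field name=\"id\">1</field></doc></add>");
        // The query string carries the requested response format.
        System.out.println("response format: " + request.uri().getQuery());
        // The Content-Type header describes the request body format.
        System.out.println("request format:  "
                + request.headers().firstValue("Content-Type").orElse("none"));
    }
}
```

Seeing wt=javabin in the Solr logs therefore says nothing about how the update body itself was encoded.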

SolrJ should handle translating the input so that it's valid XML, but
maybe there are characters that SolrJ's XML request writer doesn't (or
can't) handle correctly.

Thanks,
Shawn



RE: How to remove control characters in stored value at Solr side

2017-09-19 Thread Markus Jelsma
Ah, thanks!

 
 


RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Chris Hostetter

: But, can you then explain why Apache Nutch with SolrJ had this problem? 
: It seems that by default SolrJ does use XML as transport format. We have 
: always used SolrJ which i assumed would default to javabin, but we had 
: this exact problem anyway, and solved it by stripping non-character code 
: points.
: 
: When we use SolrJ for querying we clearly see wt=javabin in the logs, 
: but updates showed the problem. Can we fix it anywhere?

wt=javabin indicates what *response* format the client (ie: solrj) is 
requesting from the server ... the format used for the *request* body is 
determined by the client based on the Content-Type of the ContentStream 
it sends to Solr.

When using SolrJ, and sending arbitrary/abstract SolrRequest objects, 
the "RequestWriter" configured on the SolrClient is what specifies the 
Content-Type to use (and is in charge of serializing the Java objects 
appropriately).

BinaryRequestWriter (which uses javabin format to serialize SolrRequest 
objects when building ContentStreams) has been the default since Solr 
5.5/6.0 (see SOLR-8595).


-Hoss
http://www.lucidworks.com/


RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Markus Jelsma
I agree.

But can you then explain why Apache Nutch with SolrJ had this problem? It 
seems that by default SolrJ does use XML as transport format. We have always 
used SolrJ, which I assumed would default to javabin, but we had this exact 
problem anyway, and solved it by stripping non-character code points.

When we use SolrJ for querying we clearly see wt=javabin in the logs, but 
updates showed the problem. Can we fix it anywhere?

Thanks,
Markus
 


RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Chris Hostetter

: You can not do this in Solr, you cannot even send non-character code 
: points in the first place. For Apache Nutch we solved the problem by 

Strictly speaking: this is false.  You *can* send control characters to Solr 
as field values -- assuming your transport format allows it.

Example: using javabin to send SolrInputDocuments from a SolrJ client 
doesn't care if the field value Strings have control characters in them.  
Likewise it should be possible to send many control characters when using 
JSON formatted updates -- let alone using something like DIH to pull blog 
data from a DB, or the Extracting Request Handler which might find 
control-characters in MS-Word or PDF docs.

In all of those cases, an UpdateProcessor to strip out the unwanted 
characters can/will work well.

In the specific case discussed in this thread (based on the eventual stack 
trace posted) an UpdateProcessor will *not* work, because the fundamental 
problem is that the control characters in question mean that the "XML-ish" 
looking bytes being sent to Solr by the client are not actually valid XML 
-- by definition XML can not contain those invalid control-characters -- 
so the request fails in the XML parser before any UpdateProcessor runs.
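
To make the XML point concrete, here is a minimal Java sketch that feeds a vertical tab (code 11, the character from the stack trace) to the JDK's built-in SAX parser. This is not Solr code: per the stack trace Solr parses with Woodstox, and the class name, method name, and the sample update document are made up for the example. It only shows that a conforming XML parser rejects such input outright.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ControlCharInXml {
    // Returns true if the JDK's SAX parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness violation, including illegal characters.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String clean = "<add><doc><field name=\"text\">abc def</field></doc></add>";
        // Same document, but with a vertical tab (code 11) inside the field value.
        String dirty = "<add><doc><field name=\"text\">abc" + (char) 11
                + "def</field></doc></add>";
        System.out.println("clean accepted: " + isWellFormed(clean));
        System.out.println("dirty accepted: " + isWellFormed(dirty));
    }
}
```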


-Hoss
http://www.lucidworks.com/


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Looks as though the problem is in parsing some malformed XML, based on
what I'm seeing:

...
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
((CTRL-CHAR, code 11))
... ( char #11 is a vertical tab).

This should be fixed outside Solr, but if that is not practical, and you
could live with dropping the offending document(s), then you might want to
investigate the TolerantUpdateProcessorFactory (Solr 6.1 or later).

-Simon



Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread arnoldbronley
Thanks for information. Here is the full stack trace. I thought to handle it
from client side but client apps are not under my control and I don't have
access to them.

org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11))
 at [row,col {unknown-source}]: [1,413]
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:179)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:518)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
 at [row,col {unknown-source}]: [1,413]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:674)
    at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4576)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2881)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1073)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:397)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:249)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
    ... 32 more



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
@Arnold: are these non-UTF-8 control characters (which is what the Nutch
issue was about) or otherwise legal UTF-8 characters which Solr is choking
on for some reason?

If you could provide a full stack trace it would be really helpful.




RE: How to remove control characters in stored value at Solr side

2017-09-14 Thread Markus Jelsma
Hello,

You cannot do this in Solr; you cannot even send non-character code points in 
the first place. For Apache Nutch we solved the problem by stripping those 
non-character code points from Strings before putting them in SolrDocument. 
Check the ticket, you can easily reuse the strip method.

Perhaps it would be a good idea to move the method to SolrDocument or somewhere 
in SolrJ in the first place, so others don't have to bother with this problem.
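
A client-side stripping step of the kind described here could look roughly like the sketch below. This is not the method from NUTCH-1016; it is an independent illustration that keeps only code points permitted by the Char production of XML 1.0 (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names are invented for the example.

```java
class XmlCharStripper {
    // True for code points allowed by the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the value with every disallowed code point removed.
    static String strip(String value) {
        StringBuilder out = new StringBuilder(value.length());
        value.codePoints()
                .filter(XmlCharStripper::isXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // A vertical tab (code 11) sits between "abc" and "def".
        String dirty = "abc" + (char) 11 + "def";
        System.out.println(strip(dirty));
    }
}
```

Applying such a method to each field value before it goes into the document means the client never produces a request body that an XML parser would reject.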

Regards,
Markus

https://issues.apache.org/jira/browse/NUTCH-1016

 
 
-Original message-
> From:Arnold Bronley 
> Sent: Thursday 14th September 2017 19:46
> To: solr-user@lucene.apache.org
> Subject: How to remove control characters in stored value at Solr side
> 
> I know I can apply PatternReplaceFilterFactory to remove control characters
> from the indexed value. However, is it possible to do a similar thing for the
> stored value? Because of some control characters included in the indexing
> request, Solr throws an Illegal Character Exception.
> 


Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Sounds as though an update request processor will do that, and also
eliminate the need to use the PatternReplaceFilterFactory downstream.

Take a look at the documentation in
https://lucene.apache.org/solr/guide/6_6/update-request-processors.html.
I'm thinking that the RegexReplaceProcessorFactory might work for this.
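
For what it's worth, a regular expression for this job might look like the one in the sketch below, shown in plain java.util.regex so its behaviour can be checked in isolation. The pattern itself is an assumption (control characters minus tab, line feed and carriage return), the class and method names are made up, and the processor configuration is deliberately not shown. As other replies in this thread point out, an update processor only sees documents that Solr has already managed to parse.

```java
import java.util.regex.Pattern;

class ControlCharRegex {
    // ASCII control characters, except tab, line feed and carriage return.
    static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    // Removes every match of the pattern from the value.
    static String clean(String value) {
        return CONTROL_CHARS.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // A vertical tab (code 11) sits between "abc" and "def".
        String dirty = "abc" + (char) 11 + "def";
        System.out.println(clean(dirty));
    }
}
```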

best

-Simon

On Thu, Sep 14, 2017 at 1:46 PM, Arnold Bronley 
wrote:

> I know I can apply PatternReplaceFilterFactory to remove control characters
> from the indexed value. However, is it possible to do a similar thing for the
> stored value? Because of some control characters included in the indexing
> request, Solr throws an Illegal Character Exception.
>