I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl
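The URL-encoding workaround Karl discusses in this thread could, as a rough sketch, look like the following connector-side Java. The helper name and the space handling are assumptions for illustration, not anything in the ManifoldCF code base:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FileNameEncoding {
    // Hypothetical helper: percent-encode a file name so that whitespace and
    // double quotes survive transport to Solr in a multipart form field.
    public static String encodeFileName(String fileName) {
        try {
            // URLEncoder emits '+' for spaces (form encoding); switch to %20
            // for a path-style percent encoding.
            return URLEncoder.encode(fileName, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // e.g. a Documentum PDF whose name contains spaces and double quotes
        System.out.println(encodeFileName("quarterly \"draft\" report.pdf"));
        // prints: quarterly%20%22draft%22%20report.pdf
    }
}
```

As Karl notes, whether such an encoding is safe to adopt depends on which characters Solr actually accepts in the file-name field, and on the impact on existing users.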
On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <daddy...@gmail.com> wrote:

There is no good way to handle a case where Solr doesn't like the file name. About the only thing that could be done would be to encode the filename using something like URL encoding. This might have some effects on existing users, but more importantly, we would really need to know what characters were legal before adopting that solution.

I am not entirely sure how the file name is transmitted to Solr when using multipart forms, but how that is done is critical to knowing what to do.

Karl

On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <tthamizhara...@worldbankgroup.org> wrote:

Hi Karl,

Thanks for the update!

As per the response from the Solr team, expandMacros=false was added to the output connector as an additional parameter.

After adding expandMacros=false, the indexing job completes, but a few of the documents fail with a "Missing content stream" error and are not indexed into Solr.

As per our analysis, the file names of the PDF documents we are trying to index from Documentum contain whitespace and special characters such as double quotes, which makes the files unreadable and causes the "Missing content stream" error.

If there is any workaround for this issue, kindly share it with us.

Regards,

Tamizh Kumaran Thamizharasan

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Wednesday, June 14, 2017 7:20 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Here's the response:

>>>>>>

Karl -

There's expandMacros=false, as covered here: https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr?
Is there some kind of "${…" being sent as a parameter? Just curious what's getting you into this in the first place. But disabling it is probably your most desirable solution.

Erik

<<<<<<

Karl

On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddy...@gmail.com> wrote:

Here's the question I posted:

>>>>>>

Hi all,

I've got a ManifoldCF user who is posting content to Solr using the MCF Solr output connector. This connector uses SolrJ under the covers -- a fairly recent version -- but also has overridden some classes to ensure that multipart form posts will be used for most content.

The problem is that, for a specific document, the user is getting a StringIndexOutOfBounds exception in Solr, as follows:

>>>>>>

2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
    at java.lang.String.substring(String.java:1911)
    at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
    at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
    at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
    at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)

<<<<<<

It looks worrisome to me that there's now possibly some kind of "macro expansion" being triggered within parameters sent to Solr. Can anyone tell me either (a) how to disable this feature, or (b) how the MCF Solr output connector should escape posted parameters so that Solr does not attempt any macro expansion? If the latter, I also need to know when this feature appeared, since whether or not to do the escaping will obviously depend on the precise version of the Solr instance involved.

I'm also quite concerned that considerations of backwards compatibility may have been lost at some point with Solr, since heretofore I could count on older versions of SolrJ working with newer versions of Solr. Please clarify what the current policy is....

Thanks,

Karl

<<<<<<

On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <daddy...@gmail.com> wrote:

I posted the pertinent question to the Solr dev list. Let's see what they say.

Thanks,

Karl

On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddy...@gmail.com> wrote:

Hi,

The exception in the solr.log should be reported as a Solr bug. It is not emanating from the Tika extractor (Solr Cell); it is in Solr itself.

I wish there were an easy fix for this. The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't.
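The negative substring index in the trace discussed in this thread is consistent with an unterminated "${" sequence appearing in a request parameter. The following is a simplified sketch of that arithmetic; it is an assumed reconstruction for illustration, not Solr's actual MacroExpander code:

```java
public class MacroSketch {
    // Simplified model of ${...} macro scanning: find the opening "${",
    // then look for the closing "}". If a parameter value contains an
    // unterminated "${" (say, from an unusual file name), indexOf returns
    // -1 and the computed macro-body length goes negative -- the kind of
    // arithmetic that produces "String index out of range: -N".
    public static int macroBodyLength(String param) {
        int open = param.indexOf("${");
        if (open < 0) {
            return 0; // no macro present
        }
        int close = param.indexOf('}', open);
        return close - (open + 2); // negative when '}' is missing
    }

    public static void main(String[] args) {
        System.out.println(macroBodyLength("file ${CURRENT.pdf")); // negative
        System.out.println(macroBodyLength("plain.pdf"));          // 0
    }
}
```

This is why expandMacros=false sidesteps the crash: with macro expansion disabled, parameter values containing "${" are never scanned at all.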
MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>>

https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)

<<<<<<

Karl

On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <tthamizhara...@worldbankgroup.org> wrote:

Hi Karl,

After configuring Solr to ignore Tika errors by adding the Tika transformer to the job, the following behavior is observed:

1) ManifoldCF fetches content from Documentum that is null, and tries to push it to the output connector (Solr).

2) Solr cannot accept null as a value and throws a "Missing content stream" error.

3) Each agent thread in ManifoldCF is internally held up with different r_object_ids that have no body content, and it keeps trying to push the content to Solr after each failure, but Solr cannot accept it and throws the same error.

4) Over time, the ManifoldCF job stops with the error thrown by Solr.

Please let us know if there is any configuration change that can help resolve this issue.

Please find attached the ManifoldCF error log, Solr error log, and agent log.

Regards,

Tamizh Kumaran.

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, June 13, 2017 2:23 PM
To: user@manifoldcf.apache.org
Cc: Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
Subject: Re: ManifoldCF documentum indexing issue

Hi Tamizh,

The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'. The message seemingly indicates that the error was *received* from the Solr server for one specific document.
ManifoldCF does not recognize the error as innocuous, so it will retry for a while until it eventually gives up and halts the job. However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?

There should also be the following:

(1) A record of the attempts in the manifoldcf.log file, with an MCF stack trace attached to each one;

(2) Simple history records for that document of type INGESTDOCUMENT;

(3) Solr log entries that include a Solr stack trace.

The last one would be the most helpful. It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way. You can (and should) configure your Solr to ignore Tika errors.

Thanks,

Karl

On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <tthamizhara...@worldbankgroup.org> wrote:

Hi,

ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model and is integrated with PostgreSQL 9.3. The expected setup is to crawl Documentum content and push it to Solr 5.3.2. The crawler-ui app is installed on Tomcat, and the startup script points at the MCF properties.xml during server startup. ManifoldCF, the bundled ZooKeeper, and Tomcat all run on the same host, under Red Hat Enterprise Linux Server release 6.9 (Santiago). The database runs on a Windows box.

ZooKeeper is integrated with the database through properties.xml and properties-global.xml.

ZooKeeper and the Documentum-related processes (registry and server) are up, and the two agents (start-agents.sh and start-agents-2.sh) are started; these produce multiple threads to index the Documentum content into Solr through ManifoldCF.

The connection counts currently configured in ManifoldCF are as below.
Solr output max connections: 25

Documentum repository max connections: 25

properties.xml:

<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>

<property name="org.apache.manifoldcf.crawler.threads" value="25"/>

Total Documentum document count: 0.5 million

After the job is started, it indexes some 20,000+ documents and then terminates with the below error on the ManifoldCF job:

Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188

Please find attached the ManifoldCF error log and agent log.

Please let me know your observations on the cause of the issue and on the thread configuration used for crawling. Please share your thoughts.

Regards,

Tamizh Kumaran
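One generic mitigation for the null-content documents described in this thread is to screen out empty streams before they ever reach the output connector, so Solr never sees the "Missing content stream" case and the agent threads never enter the retry loop. This is not existing ManifoldCF functionality being quoted; it is a hedged sketch of the kind of pre-send check a custom transformation stage might apply:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class ContentGuard {
    // Hypothetical pre-send check: only forward a document to the output
    // connector when it actually has a content stream and a positive
    // declared length.
    public static boolean shouldSend(InputStream content, long binaryLength) {
        return content != null && binaryLength > 0;
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream("hello".getBytes());
        System.out.println(shouldSend(body, 5)); // true: has content
        System.out.println(shouldSend(null, 0)); // false: skip the document
    }
}
```

Documents rejected by such a check would need to be recorded as skipped rather than retried, since (as Karl notes above) a repeated 500 from Solr is otherwise treated as a service interruption and eventually halts the job.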