Hi,

The exception in the solr.log should be reported as a Solr bug. It is not emanating from the Tika extractor (Solr Cell), but is in Solr itself.
I wish there was an easy fix for this. The problem is *not* an empty stream; it's that Solr is attempting to do something with it that it shouldn't. MCF just gets back a 500 error from Solr, and we can't recover from that.

>>>>>> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500) <<<<<<

Karl

On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <tthamizhara...@worldbankgroup.org> wrote:

> Hi Karl,
>
> After configuring Solr to ignore Tika errors by adding Tika transformer in the job, below behavior is observed.
>
> 1) ManifoldCF fetches the content from documentum, which contains null content and tries to push it to the output connector(Solr).
>
> 2) Solr couldn’t accept the null as a value and throwing “Missing content stream” error.
>
> 3) Each agent thread In ManifoldCF internally held-up with different r_object_id’s that don’t have body content and keeps trying to push the content to Solr after each failure, but Solr couldn’t accept the content and throws the same error.
>
> 4) Over the time, the manifold job stops with the error thrown by Solr
>
> Please let know if there is any configuration change which can help us resolve this issue.
>
> Please find the attached manifoldCF error log,Solr error log and agent log.
>
> Regards,
>
> Tamizh Kumaran.
>
> *From:* Karl Wright [mailto:daddy...@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
> Hi Tamizh,
>
> The reported error is 'Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188'. The message seemingly indicates that the error was *received* from the solr server for one specific document.
> ManifoldCF does not recognize the error as being innocuous and therefore it will retry for a while until it eventually gives up and halts the job. However, I cannot find that exact text anywhere in the Solr output connector code, so I wonder if you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack trace attached to each one;
>
> (2) Simple history records for that document that are of the type INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
> The last one is the one that would be the most helpful. It is possible that you are seeing a problem in Solr Cell (Tika) that is manifesting itself in this way. You can (and should) configure your Solr to ignore Tika errors.
>
> Thanks,
>
> Karl
>
> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <tthamizhara...@worldbankgroup.org> wrote:
>
> Hi,
>
> The Manifoldcf 2.7.1 is running in the multiprocess zk model and integrated with PostgreSQL 9.3. The expected setup is to crawl the Documentum contents and pushed on to the output SOLR 5.3.2. The crawler-ui app is installed on the tomcat and startup script is pointed with the MF properties.xml during server startup. Manifold along with the bundled ZK, tomcat are running on the same host with OS as Red Hat Enterprise Linux Server release 6.9 (Santiago). The DB is running on a windows box.
>
> The ZK is integrated with the DB through the properties.xml and properties-global.xml
>
> The ZK, the documentum related processes(registry and server) are up and the two agents (start-agents.sh and start-agents-2.sh) are started which produce multiple threads to index the Documentum contents into SOLR through ManifoldCF.
>
> The Current no of the connections configured on the MF are as below.
>
> SOLR Output max connection : 25
>
> Document repository Max Connections: 25
>
> Properties.xml:
>
> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>
> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>
> Total documentum document count : 0.5 million
>
> After the Job is started, it indexed some 20000+ documents and gets terminated with the below error on the Manifold JOB.
>
> Error: Repeated service interruptions - failure processing document: Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of range: -188
>
> Please find the attached manifoldCF error log and agent log.
>
> Please let me know the observations on the cause of the issue and the configuration on the threads used for crawling. Please share your thoughts.
>
> Regards,
>
> Tamizh Kumaran