Hi Cihad,
OCR processing is expensive in both resources and time, so when you send
several files to Tika at the same time, you increase the processing time of
each file, which results in timeouts on the connector side like the ones you
experienced. By decreasing the number of files processed at once, you reduce
the processing time of each file and therefore the probability of hitting a
timeout (unless you change the timeout value, of course).
The timeout parameters of the Tika connector exist for exactly that reason, and
you used them well.
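As a rough illustration of the effect described above, here is a toy model in
Python. It is only a sketch: the processor-sharing assumption, the function
name and all the numbers (40 s per file, capacity of 2, 60 s timeout) are
hypothetical and are not measurements or a description of Tika internals.

```python
def per_file_seconds(base_seconds, concurrent_files, server_capacity):
    """Toy processor-sharing model: once the number of concurrent files
    exceeds what the server can handle in parallel, every file slows down
    proportionally."""
    slowdown = max(1.0, concurrent_files / server_capacity)
    return base_seconds * slowdown


SOCKET_TIMEOUT = 60  # seconds, hypothetical connector setting

for n in (1, 5):
    t = per_file_seconds(base_seconds=40, concurrent_files=n, server_capacity=2)
    status = "timeout" if t > SOCKET_TIMEOUT else "ok"
    print(f"{n} file(s) at once: ~{t:.0f}s per file -> {status}")
```

Under these assumed numbers, a single file stays below the timeout while five
files sent together do not, which matches the behaviour you observed.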
Concerning the error: in any corpus of files, there is a very high probability
that some files are problematic for Tika and cause timeouts; OCR processing is
not the only thing that triggers that kind of problem. So a choice had to be
made about how to deal with those errors: either raise an error in the Tika
connector that stops the job, or accept that the error will happen often, log
it in the Simple History and ignore it so that the job can continue. The second
option was retained because otherwise more than 90% of crawl jobs involving
Tika in an enterprise environment would fail, and it would be nearly impossible
to fix or filter out all the problematic files.
Concerning the Solr insertion, the connector only triggers an error if the
Solr indexing cannot be done; this is not linked to any previous connector in
the pipeline and never will be. In your case, when a file times out in Tika,
its content and metadata cannot be retrieved from the Tika server, so the
document is indexed as is, without them. The ingest itself succeeds, so there
is no error to trigger.
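The behaviour described above can be sketched as follows. This is not the
connector's actual code: `run_job`, `extract`, `index` and the history tuples
are hypothetical names used only to show the "log it and continue" policy and
why the ingest step still reports OK.

```python
def run_job(documents, extract, index, history):
    """Process every document; an extraction timeout is logged, not fatal."""
    for doc_id, raw in documents:
        try:
            content, metadata = extract(raw)
        except TimeoutError as exc:
            # Extraction failed: record it in the history and carry on
            # with an empty document instead of stopping the whole job.
            history.append((doc_id, "extract", f"ERROR: {exc}"))
            content, metadata = "", {}
        # Indexing is independent of the previous step: it only fails
        # if the index call itself raises.
        index(doc_id, content, metadata)
        history.append((doc_id, "document ingest", "OK"))


def fake_extract(raw):
    # Stand-in for the extraction service: one input always times out.
    if raw == b"slow":
        raise TimeoutError("socket timeout")
    return raw.decode(), {"length": len(raw)}


indexed = {}
history = []
run_job(
    [("a.pdf", b"hello"), ("b.pdf", b"slow")],
    fake_extract,
    lambda doc_id, content, metadata: indexed.update({doc_id: content}),
    history,
)
for entry in history:
    print(entry)
```

In this sketch `b.pdf` ends up indexed with empty content, one error entry and
one OK ingest entry, which is the situation you see in Solr and in the Simple
History.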
Cheers,
Julien
From: Cihad Guzel
Sent: Thursday, October 20, 2022 03:17
To: julien.massi...@francelabs.com
Cc: dev; u...@manifoldcf.apache.org
Subject: Re: Tika Service Rmeta Connector Error
Hi,
The problem goes away when I increase the socket timeout on the ManifoldCF Tika
connector edit page. However, I think "document ingest (Solr)" should not be
reported as OK when there is such a problem.
Regards,
Cihad Güzel
On Thu, 20 Oct 2022 at 02:28, Cihad Guzel <cguz...@gmail.com> wrote:
Hi Julien,
I ran the Tika 2.x service using the official Tika image available on Docker
Hub. I am using ManifoldCF version 2.3. I activated the tika-service-rmeta
connector in ManifoldCF and created a job for a folder with 5 files in it.
However, OCR was not performed on some of the files: when I look at Solr, the
content of some files is empty. I also got the error messages found in the
attachment.
In a second test, I created 5 separate jobs, one for each of the 5 files. When
I ran these jobs, I did not encounter any problems.
When I send these 5 files directly to the Tika service using curl, it also
works correctly.
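For reference, a direct request like the one described above can be sketched in
Python as follows. The `/rmeta` endpoint and port 9998 are the Tika server
defaults; the function names, the base URL and the 60-second timeout are
assumptions for illustration and would need to match the actual setup.

```python
import json
import urllib.request


def build_rmeta_request(file_bytes, base_url="http://localhost:9998"):
    """Build a PUT request for the Tika server /rmeta endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/rmeta",
        data=file_bytes,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract(file_bytes, timeout_seconds=60, base_url="http://localhost:9998"):
    """Send one file and return the parsed JSON metadata list.

    A socket timeout is raised as an exception by urlopen if the server
    does not answer within timeout_seconds.
    """
    req = build_rmeta_request(file_bytes, base_url)
    with urllib.request.urlopen(req, timeout=timeout_seconds) as resp:
        return json.load(resp)
```

Calling `extract(open("scan.pdf", "rb").read())` (with a hypothetical file
name) against a running server sends one file at a time, which is the case
that works without timeouts.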
When I examine the Simple History report, I see error messages for some files,
as shown in the attached picture.
Could the Tika connector have a bug that causes an error when sending multiple
files to Tika? Could it be related to this issue?
https://issues.apache.org/jira/browse/CONNECTORS-1733
Regards,
Cihad Güzel