RE: Tika Service Rmeta Connector Error

2022-10-20 Thread Julien Massiera
Hi Cihad,

OCR processing takes a lot of resources and time, so when you send several
files to Tika at the same time, you increase the processing time for each
file, which results in timeouts on the connector side like the ones you
experienced. By decreasing the number of files processed at once, you reduce
the processing time for each file and therefore the probability of hitting a
timeout (unless you change the timeout value, of course). The timeout
parameters of the Tika connector are there for that reason, and you used them
well.
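
As a rough illustration of the reasoning above, here is a toy estimate in
Python. It assumes the Tika server shares a fixed OCR capacity evenly between
the files it receives at once, which is a simplification; the function names,
the "ocr_slots" parameter and the numbers are hypothetical and are not
ManifoldCF or Tika settings.

```python
def estimated_seconds_per_file(solo_seconds: float,
                               concurrent_files: int,
                               ocr_slots: int = 1) -> float:
    """Estimate how long one file takes when several are processed at once.

    Assumption: processing time grows linearly once the number of files
    exceeds the number of OCR slots the server can run in parallel.
    """
    if concurrent_files < 1 or ocr_slots < 1:
        raise ValueError("concurrent_files and ocr_slots must be >= 1")
    slowdown = max(1.0, concurrent_files / ocr_slots)
    return solo_seconds * slowdown


def exceeds_timeout(solo_seconds: float,
                    concurrent_files: int,
                    socket_timeout_seconds: float,
                    ocr_slots: int = 1) -> bool:
    """Return True when the estimated per-file time is above the timeout."""
    estimate = estimated_seconds_per_file(solo_seconds, concurrent_files, ocr_slots)
    return estimate > socket_timeout_seconds
```

With these made-up numbers, a file that needs 40 seconds alone stays under a
60 second timeout, but five such files sent together would each need about
200 seconds and exceed it.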

Concerning the error, in any corpus of files there is a very high probability
that some files are problematic for Tika and cause timeouts; OCR processing is
not the only thing that triggers that kind of problem. So a choice had to be
made about how to deal with those errors: either trigger an error in the Tika
connector that stops the job, or accept that the error will happen often, log
it in the Simple History, and ignore it so that the job continues. The second
option was chosen because otherwise more than 90% of crawl jobs involving Tika
in an enterprise environment would fail, and it would be nearly impossible to
fix or filter out all the problematic files.
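
The "log it and continue" choice described above can be sketched as follows.
This is a minimal Python illustration and not the connector's actual code:
"run_job", "extract" and "history" are hypothetical names, and the extractor
is assumed to raise TimeoutError when Tika does not answer in time.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_job(paths: Iterable[str],
            extract: Callable[[str], str],
            history: List[Tuple[str, str]]) -> Dict[str, str]:
    """Process every path; a timeout is recorded and the job keeps going."""
    results: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except TimeoutError as exc:
            # Record the failure (the role played by the Simple History)
            # instead of aborting the whole job.
            history.append((path, f"extraction timed out: {exc}"))
            # The document still moves on, only without extracted content.
            results[path] = ""
    return results
```

The alternative policy would re-raise the exception inside the loop, which
stops the job at the first problematic file.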

Concerning the Solr insertion, the Solr connector only triggers an error if
the Solr indexing itself cannot be done; that is not linked to any previous
connector in the pipeline and never will be. In your case, when a file times
out in Tika, its content and metadata cannot be retrieved from the Tika
server, so the document is indexed as is, without them. The ingest itself
works, so there is no error to trigger.

Cheers,

Julien

From: Cihad Guzel
Sent: Thursday, 20 October 2022 03:17
To: julien.massi...@francelabs.com
Cc: dev ; u...@manifoldcf.apache.org
Subject: Re: Tika Service Rmeta Connector Error

Hi,

The problem goes away when I increase the socket timeout from the mfc tika 
connector edit page. I think "document ingest (Solr)" should not be OK when 
there is such a problem.

Regards,
Cihad Güzel

On Thu, 20 Oct 2022 at 02:28, Cihad Guzel <cguz...@gmail.com> wrote:

 Hi Julien,

I ran the tika 2x service using the official tika available on docker hub. I am 
using MFC version 2.3. I activated the tika-service-rmeta connector for MFC. I 
created a job on mfc for a folder with 5 files in it. But OCR was not performed 
on some of the files. When I look at Solr, the content of some files seems 
empty. I also got the error messages found in the attachment.

In the second test I made, this time I created 5 separate jobs to include each 
of the 5 files one by one. When I ran these jobs, I did not encounter any 
problems.

When I send these 5 files directly to the tika-service using curl it also works 
correctly.
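
For reference, a direct request of this kind can be sketched in Python. The
exact curl command is not shown in this thread, so the sketch only assumes a
Tika Server reachable on its default port 9998 and its /rmeta endpoint, which
accepts a PUT with the file body; the function names are hypothetical.

```python
import json
import urllib.request


def build_rmeta_request(base_url: str, data: bytes) -> urllib.request.Request:
    """Build a PUT request for the /rmeta endpoint of a Tika Server."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/rmeta",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def extract_with_timeout(base_url: str, path: str, timeout_seconds: float):
    """Send one file to Tika and return the parsed JSON response.

    An exception is raised if the server does not answer within
    timeout_seconds, which is the situation discussed in this thread.
    """
    with open(path, "rb") as handle:
        request = build_rmeta_request(base_url, handle.read())
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.load(response)
```

For example, extract_with_timeout("http://localhost:9998", "scan.pdf", 300)
would send a hypothetical scan.pdf with a 5 minute limit.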

When I examine the Simple History Report, I see error messages for some files 
as in the attached picture.

Could the Tika connector have a bug that causes an error when sending multiple
files to Tika? Could it have something to do with this issue?
https://issues.apache.org/jira/browse/CONNECTORS-1733



Regards,


Cihad Güzel



Re: Tika Service Rmeta Connector Error

2022-10-19 Thread Cihad Guzel
Hi,

The problem goes away when I increase the socket timeout from the mfc tika
connector edit page. I think "document ingest (Solr)" should not be OK when
there is such a problem.

Regards,
Cihad Güzel


On Thu, 20 Oct 2022 at 02:28, Cihad Guzel wrote:

>  Hi Julien,
>
> I ran the tika 2x service using the official tika available on docker hub.
> I am using MFC version 2.3. I activated the tika-service-rmeta connector
> for MFC. I created a job on mfc for a folder with 5 files in it. But OCR
> was not performed on some of the files. When I look at Solr, the content of
> some files seems empty. I also got the error messages found in the
> attachment.
>
> In the second test I made, this time I created 5 separate jobs to include
> each of the 5 files one by one. When I ran these jobs, I did not encounter
> any problems.
>
> When I send these 5 files directly to the tika-service using curl it also
> works correctly.
>
> When I examine the Simple History Report, I see error messages for some
> files as in the attached picture.
>
> Could Tika connector have a bug that will cause an error while sending
> multiple files to tika? Could it have something to do with this issue?
> https://issues.apache.org/jira/browse/CONNECTORS-1733
> [image: Screen Shot 2022-10-20 at 02.08.11.png]
> Regards,
> Cihad Güzel
>