[jira] [Resolved] (CONNECTORS-1738) Suggestion for adding function that allows setting timeout values for Elasticsearch Output Connector

2022-10-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1738.
-
Fix Version/s: ManifoldCF 2.24
   Resolution: Fixed

r1904741


> Suggestion for adding function that allows setting timeout values for 
> Elasticsearch Output Connector
> 
>
> Key: CONNECTORS-1738
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1738
> Project: ManifoldCF
> Issue Type: Improvement
> Reporter: Nguyen Huu Nhat
> Assignee: Karl Wright
> Priority: Major
> Fix For: ManifoldCF 2.24
>
> Attachments: EditConnection.PNG, ViewConnection.PNG, patch.txt
>
>
> Hi there,
> While using the Elasticsearch Output Connector, I have run into cases 
> that required the values of *socketTimeout* and *connectionTimeout* to be 
> increased.
> However, because those values are hardcoded in the source code as 
> 90(ms) and 6(ms) respectively, it is quite troublesome to change them 
> in such cases.
> For this reason, instead of hardcoding them, I think it would be better if the 
> values of *socketTimeout* and *connectionTimeout* could be edited through the 
> WebUI, on the connection settings screen.
> ManifoldCF already has a few other connectors that support setting 
> *socketTimeout* and *connectionTimeout*, such as Generic and Confluence.
> Therefore, I would like to suggest modifying the Elasticsearch Output 
> Connector's source code so that the *socketTimeout* and 
> *connectionTimeout* values can be set when needed.
> h3. +*1. Connector Name*+
> ElasticSearch Output Connector
> h3. +*2. Improvement Detail*+
> On the connection settings screen (WebUI), add handling that enables 
> setting the values of *socketTimeout* and *connectionTimeout*.
> ※ The default values for *socketTimeout* and *connectionTimeout* remain 
> 90 and 6 (ms), as they are now.
> The connection settings screen will look like this:
> !EditConnection.PNG!
> h3. +*3. Suggested source code (based on release 2.22.1)*+
> Because the changes touch many files and the number of LOC is quite 
> large,
> I have attached the patch file here; please check it out.
> [^patch.txt]
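
The idea in the description above can be sketched as follows. This is a minimal illustration only, assuming the connection settings reach the connector as a map of strings. The class name `TimeoutSettings`, the parameter keys, and the fallback values supplied by the caller are hypothetical; none of this is taken from patch.txt or from the ManifoldCF source.

```java
import java.util.Map;

/** Hypothetical holder for the two timeouts discussed in this issue. */
final class TimeoutSettings {
    final int socketTimeoutMs;
    final int connectionTimeoutMs;

    TimeoutSettings(int socketTimeoutMs, int connectionTimeoutMs) {
        this.socketTimeoutMs = socketTimeoutMs;
        this.connectionTimeoutMs = connectionTimeoutMs;
    }

    /**
     * Reads both timeouts from connection parameters (assumed key names),
     * falling back to the caller-supplied defaults when a value is
     * missing, blank, non-numeric, or not positive.
     */
    static TimeoutSettings fromParams(Map<String, String> params,
                                      int defaultSocketMs,
                                      int defaultConnectionMs) {
        return new TimeoutSettings(
            parseOrDefault(params.get("socketTimeout"), defaultSocketMs),
            parseOrDefault(params.get("connectionTimeout"), defaultConnectionMs));
    }

    private static int parseOrDefault(String raw, int fallback) {
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : fallback;
        } catch (NumberFormatException e) {
            // An unparsable entry keeps the previous (default) behaviour.
            return fallback;
        }
    }
}
```

Falling back to the defaults on bad input means an existing connection that has never stored these parameters would keep behaving as it did when the values were hardcoded.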



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (CONNECTORS-1738) Suggestion for adding function that allows setting timeout values for Elasticsearch Output Connector

2022-10-20 Thread Karl Wright (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1738:
---

Assignee: Karl Wright

> Suggestion for adding function that allows setting timeout values for 
> Elasticsearch Output Connector
> 
>
> Key: CONNECTORS-1738
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1738
> Project: ManifoldCF
> Issue Type: Improvement
> Reporter: Nguyen Huu Nhat
> Assignee: Karl Wright
> Priority: Major
> Attachments: EditConnection.PNG, ViewConnection.PNG, patch.txt
>
>





RE: Tika Service Rmeta Connector Error

2022-10-20 Thread Julien Massiera
Hi Cihad,

 

OCR processing takes a lot of resources and time, so when you send 
several files to Tika at the same time, you increase the processing time for 
each file, which results in timeouts on the connector side like the ones you 
experienced. By decreasing the number of files processed at once, you reduce 
the processing time for each file and therefore the probability of 
hitting a timeout (assuming you don't change the timeout value, of course). 
The timeout parameters of the Tika connector exist for that reason, and you 
used them well. 

Concerning the error: in any corpus of files, it is very likely 
that some files are problematic for Tika and cause timeouts, and OCR processing is 
not the only thing that triggers this kind of problem. So a choice had to be made 
about how to deal with those errors: either raise an error in the Tika 
connector that stops the job, or accept that the error will happen 
often, log it in the Simple History, and ignore it so that the job 
continues. The second option was chosen because otherwise more 
than 90% of crawl jobs involving Tika in an enterprise environment would fail, 
and it would be nearly impossible to fix or filter out all the problematic files.

Concerning the Solr insertion: the connector only raises an error if the 
Solr indexing cannot be done, which is independent of any previous connector 
in the pipeline and always will be. In your case, when a file times out in 
Tika, its content and metadata cannot be retrieved from the Tika server, so the 
document is indexed without them, and since the ingest itself succeeds, there is 
no error to raise.
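
The behaviour described in the two paragraphs above can be sketched roughly as follows. This is a simplified illustration, not the actual connector code: the class `LogAndContinueSketch`, the `Extractor` interface, and the `Doc` type are hypothetical stand-ins for the Tika call, the pipeline document, and the Simple History.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

final class LogAndContinueSketch {
    /** Hypothetical stand-in for a call to the Tika server. */
    interface Extractor {
        String extract(String fileName) throws TimeoutException;
    }

    /** What is handed to the output stage for one file. */
    static final class Doc {
        final String name;
        final String content; // empty when extraction timed out

        Doc(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    /** Stand-in for the Simple History entries. */
    final List<String> history = new ArrayList<>();

    /**
     * Extracts every file. A timeout is logged and the file is still
     * passed on with empty content, so one bad file never stops the job.
     */
    List<Doc> process(List<String> files, Extractor extractor) {
        List<Doc> out = new ArrayList<>();
        for (String file : files) {
            String content;
            try {
                content = extractor.extract(file);
            } catch (TimeoutException e) {
                history.add("extraction timeout: " + file);
                content = "";
            }
            out.add(new Doc(file, content));
        }
        return out;
    }
}
```

In this sketch the output stage receives every document, including the ones with empty content, which is why the ingest step has nothing of its own to report as an error.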

 

Cheers,

Julien 

 

 

From: Cihad Guzel  
Sent: Thursday, October 20, 2022 03:17
To: julien.massi...@francelabs.com
Cc: dev; u...@manifoldcf.apache.org
Subject: Re: Tika Service Rmeta Connector Error

 

Hi,

The problem goes away when I increase the socket timeout on the MFC Tika 
connector edit page. I think "document ingest (Solr)" should not be reported 
as OK when there is such a problem.

Regards,


Cihad Güzel

 

On Thu, 20 Oct 2022 at 02:28, Cihad Guzel <cguz...@gmail.com> wrote:

Hi Julien,

I ran the Tika 2.x service using the official Tika image available on Docker Hub. 
I am using MFC version 2.3. I activated the tika-service-rmeta connector for MFC. 
I created a job in MFC for a folder with 5 files in it, but OCR was not performed 
on some of the files. When I look at Solr, the content of some files is 
empty. I also got the error messages found in the attachment.

In a second test, I created 5 separate jobs, each covering one 
of the 5 files. When I ran these jobs, I did not encounter any 
problems.

When I send these 5 files directly to the Tika service using curl, it also works 
correctly.

When I examine the Simple History Report, I see error messages for some files, 
as in the attached picture.

Could the Tika connector have a bug that causes an error when multiple 
files are sent to Tika? Could it be related to this issue?
https://issues.apache.org/jira/browse/CONNECTORS-1733



Regards,


Cihad Güzel