Mingchun Zhao created CONNECTORS-1746:
-----------------------------------------

             Summary: Add execution conditions for PostgreSQL's ANALYZE 
command to keep crawling from becoming extremely slow.
                 Key: CONNECTORS-1746
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1746
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
         Environment: I am using ManifoldCF 2.24 with PostgreSQL 12.14 as the 
database. 
            Reporter: Mingchun Zhao


Sometimes the crawl stops processing documents for a while, and nothing is 
logged about long-running queries. Running the 'ANALYZE' command manually 
restores performance, which suggests that a bad query plan is causing the 
problem.

Therefore, in addition to the current configuration parameter 
org.apache.manifoldcf.db.postgres.analyze.<tablename>, I think 'ANALYZE' 
should also be executed in the following situations:
1. When, after the job starts, the number of records in the table exceeds the 
number required to create a query plan.
2. When crawling performance degrades, for example when the document 
processing rate drops below a specified threshold.

How about adding two parameters to control when 'ANALYZE' is executed, as 
follows?
1. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumrowcount`
Specifies how many records must accumulate before the first 'ANALYZE' is 
carried out on the specified table. Defaults to 100.
2. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumprocessrate`
Specifies the minimum number of documents processed in the last minute. If the 
actual processing rate falls below this value, 'ANALYZE' is carried out. 
Defaults to 1.
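
To make the proposal concrete, the two conditions could be combined roughly as 
in the sketch below. This is only an illustration of the idea, not existing 
ManifoldCF code: the class `AnalyzeTrigger`, its method `shouldAnalyze`, and 
the way the row count and processing rate are passed in are all hypothetical. 
It also assumes that "accumulated" means the row count has reached the 
threshold (>=).

```java
/**
 * Hypothetical sketch of the proposed ANALYZE trigger for one table.
 * Nothing here is part of ManifoldCF; all names are made up for illustration.
 */
final class AnalyzeTrigger {
    // Proposed <tablename>.minimumrowcount (suggested default: 100).
    private final long minimumRowCount;
    // Proposed <tablename>.minimumprocessrate, in documents per minute
    // (suggested default: 1).
    private final long minimumProcessRate;
    // Tracks whether the first row-count-based ANALYZE has been requested.
    private boolean firstAnalyzeDone = false;

    AnalyzeTrigger(long minimumRowCount, long minimumProcessRate) {
        this.minimumRowCount = minimumRowCount;
        this.minimumProcessRate = minimumProcessRate;
    }

    /**
     * Decides whether ANALYZE should run now.
     *
     * @param rowCount                current number of rows in the table
     * @param docsProcessedLastMinute documents processed in the last minute
     */
    boolean shouldAnalyze(long rowCount, long docsProcessedLastMinute) {
        // Condition 1: the table has reached the minimum row count for the
        // first time since the job started (assumed to mean >=).
        if (!firstAnalyzeDone && rowCount >= minimumRowCount) {
            firstAnalyzeDone = true;
            return true;
        }
        // Condition 2: the processing rate has fallen below the threshold.
        return docsProcessedLastMinute < minimumProcessRate;
    }
}
```

With the suggested defaults, `new AnalyzeTrigger(100, 1)` would request one 
'ANALYZE' when the table first reaches 100 rows, and again whenever no 
document was processed in the last minute. A real implementation would 
probably also need to tell an idle job apart from a stalled one, and to limit 
how often 'ANALYZE' is repeated while the rate stays low.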



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
