Mingchun Zhao created CONNECTORS-1746:
-----------------------------------------
Summary: Add execution conditions for PostgreSQL's ANALYZE
command to keep crawling from becoming extremely slow.
Key: CONNECTORS-1746
URL: https://issues.apache.org/jira/browse/CONNECTORS-1746
Project: ManifoldCF
Issue Type: Improvement
Components: Web connector
Environment: I am using ManifoldCF 2.24 with PostgreSQL 12.14 as the
database.
Reporter: Mingchun Zhao
Sometimes crawling stops processing documents for a while, and nothing is
logged about long-running queries. Performance can be restored by running the
'ANALYZE' command manually, which suggests that a bad query plan caused this
performance problem.
Therefore, in addition to the current configuration parameter
org.apache.manifoldcf.db.postgres.analyze.<tablename>, I think 'ANALYZE' should
also be executed in the following situations.
1. When, after the job starts, the number of records in the table exceeds the
number required for creating a query plan.
2. When the crawling performance slows down. For example, if the document
processing rate drops below a specified threshold.
How about adding two parameters to handle the timing of 'ANALYZE' execution as
below?
1. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumrowcount`
Specifies how many records should accumulate before 'ANALYZE' is run on the
specified table for the first time. Defaults to 100.
2. `org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumprocessrate`
Specifies the minimum number of documents processed in the last minute. If the
actual processing rate falls below this, 'ANALYZE' will be run. Defaults to 1.
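
To make the proposal concrete, here is a minimal sketch of how the two
parameters could drive the decision. It is not ManifoldCF code: the class
`AnalyzePolicy`, its methods, and the use of a plain Map for configuration are
hypothetical, and `jobqueue` is only an example table name. It also assumes the
row-count condition fires once per job and that "exceeds" means "reaches or
exceeds".

```java
import java.util.Map;

// Hypothetical sketch; none of these names exist in ManifoldCF.
final class AnalyzePolicy {
    static final String PREFIX = "org.apache.manifoldcf.db.postgres.analyze.";

    final long minimumRowCount;    // proposed default: 100
    final long minimumProcessRate; // proposed default: 1 document per minute
    private boolean firstAnalyzeDone = false;

    AnalyzePolicy(long minimumRowCount, long minimumProcessRate) {
        this.minimumRowCount = minimumRowCount;
        this.minimumProcessRate = minimumProcessRate;
    }

    // Reads the two proposed parameters for one table, applying the defaults.
    static AnalyzePolicy fromProperties(Map<String, String> props, String tableName) {
        long rows = Long.parseLong(
            props.getOrDefault(PREFIX + tableName + ".minimumrowcount", "100"));
        long rate = Long.parseLong(
            props.getOrDefault(PREFIX + tableName + ".minimumprocessrate", "1"));
        return new AnalyzePolicy(rows, rate);
    }

    // Returns true when 'ANALYZE' should be run on the table.
    boolean shouldAnalyze(long rowCount, long docsProcessedLastMinute) {
        // Condition 1: first time the table holds enough rows (assumed one-shot).
        if (!firstAnalyzeDone && rowCount >= minimumRowCount) {
            firstAnalyzeDone = true;
            return true;
        }
        // Condition 2: the processing rate has dropped below the threshold.
        return docsProcessedLastMinute < minimumProcessRate;
    }

    public static void main(String[] args) {
        AnalyzePolicy policy = fromProperties(Map.of(), "jobqueue");
        System.out.println(policy.shouldAnalyze(50, 10));
        System.out.println(policy.shouldAnalyze(100, 10));
        System.out.println(policy.shouldAnalyze(200, 0));
    }
}
```

With the default rate of 1, the second condition triggers only when no
documents were processed in the last minute, so a real implementation would
probably also need a minimum interval between 'ANALYZE' runs to avoid repeating
it while a job is idle.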
--
This message was sent by Atlassian Jira
(v8.20.10#820010)