[ 
https://issues.apache.org/jira/browse/CONNECTORS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingchun Zhao updated CONNECTORS-1746:
--------------------------------------
    Description: 
Sometimes crawling stops processing documents for a while, and nothing is 
logged about long-running queries. Performance can be restored by running the 
'ANALYZE' command manually, which suggests that a bad query plan caused the 
slowdown.

Therefore, in addition to the existing configuration parameter 
'org.apache.manifoldcf.db.postgres.analyze.<tablename>', 'ANALYZE' should also 
be executed in the following situations:
1. When, after the job starts, the number of records in the table exceeds the 
number required to create an execution plan.
2. When crawling performance slows down, for example when the document 
processing rate drops below a specified threshold.

  was:
Sometimes crawling stops processing documents for a while, and nothing is 
logged about long-running queries. Performance can be restored by running the 
'ANALYZE' command manually, which suggests that a bad query plan caused the 
slowdown.

Therefore, in addition to the existing configuration parameter 
'org.apache.manifoldcf.db.postgres.analyze.<tablename>', 'ANALYZE' should also 
be executed in the following situations:
1. When, after the job starts, the number of records in the table exceeds the 
number required to create an execution plan.
2. When crawling performance slows down, for example when the document 
processing rate drops below a specified threshold.

So, how about adding two parameters to control when 'ANALYZE' is executed, as 
below?
1. 'org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumrowcount'
Specifies how many records must be inserted before the first 'ANALYZE' is run 
on the specified table. Defaults to 100.
2. 'org.apache.manifoldcf.db.postgres.analyze.<tablename>.minimumprocessrate'
Specifies the minimum number of documents processed per minute. If the document 
processing rate drops below this threshold, 'ANALYZE' is executed. Defaults 
to 1.
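
The two proposed parameters could be evaluated roughly as in the sketch below. 
This is only an illustration of the idea, not ManifoldCF code: the class 
AnalyzeTriggerSketch, its method names, and the "jobqueue" table name used in 
main are hypothetical, and the caller is assumed to supply the row and rate 
counters. Only the two property names and their defaults (100 and 1) come from 
the proposal.

```java
import java.util.Properties;

/** Hypothetical sketch of the proposed ANALYZE trigger; not part of ManifoldCF. */
final class AnalyzeTriggerSketch {
    // Property prefix taken from the existing parameter named in the issue.
    static final String PREFIX = "org.apache.manifoldcf.db.postgres.analyze.";

    final long minimumRowCount;      // proposed default: 100
    final double minimumProcessRate; // documents per minute, proposed default: 1
    private boolean firstAnalyzeDone = false;

    AnalyzeTriggerSketch(Properties props, String tableName) {
        minimumRowCount = Long.parseLong(
            props.getProperty(PREFIX + tableName + ".minimumrowcount", "100"));
        minimumProcessRate = Double.parseDouble(
            props.getProperty(PREFIX + tableName + ".minimumprocessrate", "1"));
    }

    /**
     * Returns true when ANALYZE should be run on the table.
     * Both arguments are assumed to be tracked by the caller.
     */
    boolean shouldAnalyze(long rowsInsertedSinceJobStart, double docsProcessedPerMinute) {
        // Condition 1: the first ANALYZE, once enough rows have been inserted.
        if (!firstAnalyzeDone && rowsInsertedSinceJobStart >= minimumRowCount) {
            firstAnalyzeDone = true;
            return true;
        }
        // Condition 2: the processing rate has dropped below the threshold.
        return docsProcessedPerMinute < minimumProcessRate;
    }

    public static void main(String[] args) {
        // No properties set, so the proposed defaults apply.
        AnalyzeTriggerSketch trigger = new AnalyzeTriggerSketch(new Properties(), "jobqueue");
        System.out.println(trigger.shouldAnalyze(50, 10.0));  // too few rows, rate is fine
        System.out.println(trigger.shouldAnalyze(100, 10.0)); // row threshold reached
        System.out.println(trigger.shouldAnalyze(150, 0.5));  // rate below threshold
    }
}
```

A real implementation would probably also need to limit how often the rate 
condition can fire, since an idle or paused job would otherwise report a rate 
of zero and request 'ANALYZE' on every check.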


> Adding conditions to execute PostgreSQL's ANALYZE command to prevent crawling 
> from becoming extremely slow.
> --------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1746
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1746
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>         Environment: Using ManifoldCF 2.24 with PostgreSQL 12.14 as the 
> database. 
>            Reporter: Mingchun Zhao
>            Priority: Major
>
> Sometimes crawling stops processing documents for a while, and nothing is 
> logged about long-running queries. Performance can be restored by running the 
> 'ANALYZE' command manually, which suggests that a bad query plan caused the 
> slowdown.
> Therefore, in addition to the existing configuration parameter 
> 'org.apache.manifoldcf.db.postgres.analyze.<tablename>', 'ANALYZE' should also 
> be executed in the following situations:
> 1. When, after the job starts, the number of records in the table exceeds the 
> number required to create an execution plan.
> 2. When crawling performance slows down, for example when the document 
> processing rate drops below a specified threshold.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
