Hi,

Did you find a solution for this, or do you still have the history
disabled?

I'm having the same problem, and we are using PostgreSQL as the database.

Regards

On Sun, 29 Jan 2023 at 05:48, Artem Abeleshev <[email protected]>
wrote:

> Hi everyone!
>
> We are using ManifoldCF 2.22.1 with multiple nodes in production, and I am
> investigating a problem we have run into recently (it has happened at least
> 5-6 times already). A couple of our jobs ended up with the following error:
>
> ```
> Error: ERROR: duplicate key value violates unique constraint
> "repohistory_pkey" Detail: Key (id)=(1672652357009) already exists.
> ```
>
> and the following entry appears in the logs of one of the nodes:
>
> ```
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: ERROR:
> duplicate key value violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
>         at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:638)
> ~[mcf-core.jar:?]
>         at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:665)
> ~[mcf-core.jar:?]
>         at
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:187)
> ~[mcf-core.jar:?]
>         at
> org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:68)
> ~[mcf-core.jar:?]
>         at
> org.apache.manifoldcf.crawler.repository.RepositoryHistoryManager.addRow(RepositoryHistoryManager.java:202)
> ~[mcf-pull-agent.jar:?]
>         at
> org.apache.manifoldcf.crawler.repository.RepositoryConnectionManager.recordHistory(RepositoryConnectionManager.java:706)
> ~[mcf-pull-agent.jar:?]
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordActivity(WorkerThread.java:1878)
> ~[mcf-pull-agent.jar:?]
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocument(WebcrawlerConnector.java:1470)
> ~[?:?]
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:753)
> ~[?:?]
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:402)
> [mcf-pull-agent.jar:?]
> Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value
> violates unique constraint "repohistory_pkey"
>   Detail: Key (id)=(1673507409625) already exists.
>         at
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:136)
> ~[postgresql-42.1.3.jar:42.1.3]
>         at
> org.apache.manifoldcf.core.database.Database.execute(Database.java:916)
> ~[mcf-core.jar:?]
>         at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696)
> ~[mcf-core.jar:?]
> ```
>
> First, I noticed that the IDs of entities in ManifoldCF are actually
> timestamps. So I became curious how it handles duplicates and started
> digging through the sources to get an idea of how the IDs are generated. I
> found that IDs are generated by the `IDFactory`
> (`org.apache.manifoldcf.core.interfaces.IDFactory`). `IDFactory` uses
> a pool of IDs. Each time we need a new ID, it is taken from the
> pool. If the pool is empty, `IDFactory` generates another 100
> entries. To make sure IDs do not overlap, the last generated ID is
> stored in ZooKeeper, so each time `IDFactory` starts generating the
> next batch of IDs, it starts from the last ID generated. This part
> looks clean to me.
>
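To make the scheme described above easier to reason about, here is a minimal sketch of a pooled, timestamp-based ID generator. It is not the actual `IDFactory` code. The class name `PooledIdSketch`, the `sharedLastId` field and the `refill` method are hypothetical, and the `AtomicLong` is only an in-memory stand-in for the "last ID" value that the real system keeps in ZooKeeper. The batch size of 100 comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

class PooledIdSketch {
    // Assumption: in-memory stand-in for the shared "last ID" stored in ZooKeeper.
    static final AtomicLong sharedLastId = new AtomicLong(0L);

    // Batch size taken from the description above (100 IDs per refill).
    static final int BATCH_SIZE = 100;

    // Local pool of pre-reserved IDs for this instance ("node").
    private final Deque<Long> pool = new ArrayDeque<>();

    // Local (JVM-level) locking: only one thread may touch the pool at a time.
    public synchronized long nextId() {
        if (pool.isEmpty()) {
            refill();
        }
        return pool.removeFirst();
    }

    // Global step: atomically reserve a whole batch that starts at the current
    // time, or just after the last reserved ID, whichever is larger.
    private void refill() {
        final long now = System.currentTimeMillis();
        long newLast = sharedLastId.updateAndGet(
            last -> Math.max(now, last + 1) + BATCH_SIZE - 1);
        for (long id = newLast - BATCH_SIZE + 1; id <= newLast; id++) {
            pool.addLast(id);
        }
    }

    public static void main(String[] args) {
        // Two instances play the role of two nodes sharing one "last ID".
        PooledIdSketch nodeA = new PooledIdSketch();
        PooledIdSketch nodeB = new PooledIdSketch();
        System.out.println("node A: " + nodeA.nextId());
        System.out.println("node B: " + nodeB.nextId());
    }
}
```

In this sketch two nodes can never hand out the same ID as long as every batch reservation goes through the shared value atomically. If the real implementation behaves the same way, a duplicate key would point to a batch being reserved outside the shared lock, or to the same row being inserted twice, rather than to the pooling idea itself.
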
> The next investigation concerned locking. Obviously, during ID
> generation we have to handle synchronization at the thread level (local JVM)
> and at the global level (ZooKeeper). Both the global and the local locking
> also look fine.
>
> The other observation I made is that all cases happen while saving
> repository history records. So the next idea was that the same
> record was probably being stored repeatedly. But this part seems quite hard
> to investigate, as many service layers can call this code.
>
> For now I have simply disabled history completely by setting the
> `org.apache.manifoldcf.crawler.repository.store_history` property to
> `false` in ZooKeeper. If you have any ideas or experience
> that could shed some light on the problem, I would greatly appreciate it.
>
> Thank you!
>
> With respect,
> Artem Abeleshev
>
