Hi everyone!
We are using ManifoldCF 2.22.1 with multiple nodes in production, and I am
investigating a problem we've hit recently (it has happened at least 5-6
times already). A couple of our jobs end up with the following error:
```
Error: ERROR: duplicate key value violates unique constraint "repohistory_pkey"
  Detail: Key (id)=(1672652357009) already exists.
```
and the following log entry appears in the logs of one of the nodes:
```
org.apache.manifoldcf.core.interfaces.ManifoldCFException: ERROR: duplicate key value violates unique constraint "repohistory_pkey"
  Detail: Key (id)=(1673507409625) already exists.
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.reinterpretException(DBInterfacePostgreSQL.java:638) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:665) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:187) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:68) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.crawler.repository.RepositoryHistoryManager.addRow(RepositoryHistoryManager.java:202) ~[mcf-pull-agent.jar:?]
	at org.apache.manifoldcf.crawler.repository.RepositoryConnectionManager.recordHistory(RepositoryConnectionManager.java:706) ~[mcf-pull-agent.jar:?]
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordActivity(WorkerThread.java:1878) ~[mcf-pull-agent.jar:?]
	at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocument(WebcrawlerConnector.java:1470) ~[?:?]
	at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:753) ~[?:?]
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:402) [mcf-pull-agent.jar:?]
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "repohistory_pkey"
  Detail: Key (id)=(1673507409625) already exists.
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:300) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169) ~[postgresql-42.1.3.jar:42.1.3]
	at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:136) ~[postgresql-42.1.3.jar:42.1.3]
	at org.apache.manifoldcf.core.database.Database.execute(Database.java:916) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:696) ~[mcf-core.jar:?]
```
First, I noticed that entity IDs in ManifoldCF are actually timestamps. So
I became curious how it handles duplicates and started digging through the
sources to get an idea of how ids are generated. I found that ids are
generated by the `IDFactory`
(`org.apache.manifoldcf.core.interfaces.IDFactory`). `IDFactory` uses a
pool of ids: each time a new id is needed, it is taken from the pool, and
when the pool is empty `IDFactory` generates another 100 entries. To make
sure id ranges do not overlap, the last generated id is stored in
ZooKeeper, so each time `IDFactory` generates the next batch of ids, it
starts after the last id it handed out. This part looks clean to me.
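As I understand it, the mechanism can be sketched roughly like this (all
names below are hypothetical stand-ins, not the real `IDFactory` code; in
particular, the persisted high-water mark lives in ZooKeeper, not in a
field):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a pooled, timestamp-seeded id factory.
public class PooledIdFactory {
    private static final int POOL_SIZE = 100;
    private final Deque<Long> pool = new ArrayDeque<>();
    // Stands in for the value persisted in ZooKeeper: the last id handed
    // out by any node. Refills must start strictly after this value.
    private long lastPersistedId = 0L;

    public synchronized long nextId() {
        if (pool.isEmpty()) {
            refill();
        }
        return pool.removeFirst();
    }

    // Generates the next batch of ids. Seeds from the current time, but
    // never falls back behind the persisted high-water mark, so batches
    // from successive refills cannot overlap.
    private void refill() {
        long start = Math.max(System.currentTimeMillis(), lastPersistedId + 1);
        for (int i = 0; i < POOL_SIZE; i++) {
            pool.addLast(start + i);
        }
        lastPersistedId = start + POOL_SIZE - 1; // written to ZooKeeper in the real code
    }
}
```

With this scheme a duplicate should be impossible as long as every refill
actually observes the high-water mark written by the previous refill.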
My next investigation concerned locking. Obviously, during id generation
synchronization has to be handled both at the thread level (local JVM) and
at the global level (ZooKeeper). Both the global and the local locking also
look fine.
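For clarity, this is the two-level locking order I would expect around a
pool refill (again a rough sketch with hypothetical names, not the actual
ManifoldCF code): a JVM-local lock keeps threads in the same process from
refilling concurrently, and a cross-process lock (ZooKeeper in ManifoldCF)
keeps other nodes out while the shared counter is read and advanced.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TwoLevelLocking {
    private final ReentrantLock localLock = new ReentrantLock();

    // Hypothetical stand-ins for the ZooKeeper-backed primitives.
    public interface GlobalLock { void acquire(); void release(); }
    public interface SharedCounter { long read(); void write(long v); }

    // Returns the first id of a freshly reserved batch, after advancing
    // the shared counter under both locks.
    public long refillStart(GlobalLock globalLock, SharedCounter counter, int batchSize) {
        localLock.lock();                 // local (JVM) level
        try {
            globalLock.acquire();         // global (cross-node) level
            try {
                long last = counter.read();
                long start = Math.max(System.currentTimeMillis(), last + 1);
                counter.write(start + batchSize - 1);
                return start;
            } finally {
                globalLock.release();
            }
        } finally {
            localLock.unlock();
        }
    }
}
```

If both levels hold, two nodes can never reserve overlapping batches; the
duplicate would then have to come from somewhere outside this path.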
The other observation I made is that all the failures happen while saving
repository history records. So my next idea was that perhaps the same
record was being stored repeatedly, but that part is quite hard to
investigate, since a lot of service layers can call this code.
For now I have simply disabled history completely by placing the
`org.apache.manifoldcf.crawler.repository.store_history` property with the
value `false` into ZooKeeper. If you have any ideas, or experience that
could shed some light on this problem, it would be greatly appreciated.
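For reference, in a single-process deployment I believe the equivalent
switch would go into `properties.xml` in the standard ManifoldCF property
format (sketch; in our multi-node setup the property lives in ZooKeeper
instead):

```xml
<configuration>
  <!-- Disable writing of repository history records -->
  <property name="org.apache.manifoldcf.crawler.repository.store_history" value="false"/>
</configuration>
```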
Thank you!
With respect,
Artem Abeleshev