Re: max_pred_locks_per_transaction
I will upload the log to Jira or place it on the web as soon as the job ends.

Erlend

On 07.02.13 15.34, Karl Wright wrote:

FWIW, the way this diagnostic code works is that it simply dumps ALL the handles that are currently considered to be in use. Some of them are leaked, and some are just being used, so it's really the sum total of all of them that will help us narrow down where the leaks are occurring. I looked at the one location you provided; it's entirely unremarkable, and it's not even happening in a transaction. If similar locations show up again and again, it will mean that the problem must be in the part of the code that allocates connections at the lowest level. Before I start picking that code apart line by line, though, I'd like all the data.

Karl

On Thu, Feb 7, 2013 at 9:07 AM, Karl Wright daddy...@gmail.com wrote:

Thanks, this is great. Could you include ALL of the 'possibly leaked' connections? I'm looking for the ones that show up over and over again. Please attach that part of the log to the ticket.

Karl

On Thu, Feb 7, 2013 at 9:04 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

On 06.02.13 15.24, Karl Wright wrote:

FWIW, schema creation has been working fine here, and all tests that exercise the schema have been passing fine as well.

My fault, but everything is successfully set up now.

I think I have reproduced the issue. The crawl is still running, but I guess it will stop very soon due to the log entries below. The last URL fetched before the first of these entries was:

https://www.duo.uio.no/handle/123456789/31/discover?fq=author_filter%3Aaaberge%2C%5C+astrid%5C%7C%5C%7C%5C%7CAaberge%2C%5C+Astrid

It seems that this URL is part of a search, so maybe we can get around this problem by filtering out all search-related URLs. The trouble is that it is already filtered out, but only by post filtering. I can try to adjust that on the next attempt.

INFO 2013-02-07 14:32:08,166 (Worker thread '13') - WEB: FETCH URL|https://www.duo.uio.no/handle/123456789/31/discover?fq=author_filter%3Aaaberge%2C%5C+astrid%5C%7C%5C%7C%5C%7CAaberge%2C%5C+Astrid|1360243927960+204|200|24992|
WARN 2013-02-07 14:32:08,228 (Worker thread '16') - Out of db connections, list of outstanding ones follows.
WARN 2013-02-07 14:32:08,229 (Worker thread '16') - Found a possibly leaked db connection
java.lang.Exception: Possibly leaked db connection
    at org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection(ConnectionPool.java:72)
    at org.apache.manifoldcf.core.database.ConnectionFactory.getConnectionWithRetries(ConnectionFactory.java:125)
    at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:96)
    at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:758)
    at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1430)
    at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
    at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
    at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:803)
    at org.apache.manifoldcf.core.database.BaseTable.performQuery(BaseTable.java:221)
    at org.apache.manifoldcf.crawler.jobs.HopCount.processFind(HopCount.java:674)
    at org.apache.manifoldcf.crawler.jobs.HopCount.findHopCounts(HopCount.java:649)
    at org.apache.manifoldcf.crawler.jobs.JobManager.findHopCounts(JobManager.java:3777)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:224)
WARN 2013-02-07 14:32:08,233 (Worker thread '16') - Found a possibly leaked db connection

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
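The tracking approach described in the message above - constructing an Exception at allocation time purely to capture the caller's stack trace, then dumping all stored traces when the pool runs dry - can be sketched roughly as follows. This is a minimal illustration under assumed names, not the actual ManifoldCF ConnectionPool; only the two log messages mirror the output above.

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of allocation-site tracking for pooled handles.
public class TrackedPool<T>
{
  private final List<T> free;
  // Maps each checked-out handle to an Exception whose only purpose is to
  // have captured, at construction time, the allocator's stack trace.
  private final Map<T,Exception> outstanding = new IdentityHashMap<T,Exception>();

  public TrackedPool(List<T> handles)
  {
    free = new ArrayList<T>(handles);
  }

  public synchronized T acquire()
  {
    if (free.isEmpty())
    {
      // Pool exhausted: dump where every outstanding handle was allocated.
      System.err.println("Out of db connections, list of outstanding ones follows.");
      for (Exception site : outstanding.values())
      {
        System.err.println("Found a possibly leaked db connection");
        site.printStackTrace();
      }
      return null;
    }
    T handle = free.remove(free.size()-1);
    // The exception is never thrown; creating it records the caller's stack.
    outstanding.put(handle, new Exception("Possibly leaked db connection"));
    return handle;
  }

  public synchronized void release(T handle)
  {
    if (outstanding.remove(handle) != null)
      free.add(handle);
  }
}

A handle that is acquired but never released keeps its entry in the outstanding map, so its allocation stack reappears in every subsequent dump; that repetition is exactly what Karl is asking Erlend to look for.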
Re: max_pred_locks_per_transaction
I have some time this week to investigate this further, now that I have finally delivered a heavy job involving SAML integration. I will look through the source code as well and try to log even more if necessary.

Maciej: Can you please send me a detailed description of your similar problems?

I previously mentioned that the "exclude from index" functionality in the exclusion tab could be the source of the problem, but that no longer seems to be the case, since I got the same problem with another job that has no settings there. Just a guess, but I think the source of the problem is problematic documents, for instance invalid XML documents which the crawler tries to parse. Karl: If necessary, you can try to crawl these documents yourself, but first I need to isolate the problem to one particular host.

Thanks for the warning about the lack of backward compatibility in this test; I will only use our test server, where data integrity is not that important.

I'm starting a new crawl right away and will report thereafter.

Erlend

On 05.02.13 14.10, Karl Wright wrote:

Ok, it is clear from this that most of your threads are waiting to get a connection, and there are no connections to be found. This is exactly the problem that Maciej reported, which I created the CONNECTORS-638 ticket for.

There has to be a connection leak somewhere. Obviously it is not a common situation, or the problem would arise almost right away; it probably occurs as a result of some error condition or pathway that is relatively uncommon.

The diagnostic code that is now checked into trunk should work as follows:

(1) First, check out and build trunk. Since there are schema changes in trunk vs. older versions of ManifoldCF, you cannot go backwards and run an older version on a particular database instance once you've run trunk. Keep that in mind.

(2) Add a line to the properties.xml file, as follows:

<property name="org.apache.manifoldcf.database.connectiontracking" value="true"/>

(3) Start the system up and let it run.

(4) When it fails, you should start to see dumps in the log, produced by code like this:

Logging.db.warn("Out of db connections, list of outstanding ones follows.");
for (WrappedConnection c : outstandingConnections)
{
  Logging.db.warn("Found a possibly leaked db connection",c.getInstantiationException());
}

... which will dump where all the offending connections were allocated. Hopefully this will point us at what the problem is. If there seems to be no consistency here, I'll have to explore the possibility that there are bugs in the connection allocation/free code, but we'll see.

Karl

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
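For reference, step (2) written out in context would look roughly like this, assuming the usual <configuration> root element of a ManifoldCF properties.xml; the surrounding entries are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- ... existing properties ... -->
  <!-- Diagnostic switch from trunk: record an allocation stack trace
       for every db connection handed out by the pool. -->
  <property name="org.apache.manifoldcf.database.connectiontracking" value="true"/>
</configuration>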
Re: max_pred_locks_per_transaction
Hi Erlend,

Various crawler features are not where the problem is going to be coming from. It will be coming from the basic process of queuing documents and processing them, and from the interaction with PostgreSQL.

If you are worried that you won't be able to reproduce the issue, my suggestion is to set up a job similar to the one that hung before. As long as you are using the same PostgreSQL server (not instance), the conditions will be ripe for the same kind of handle leakage as before. If you had tons of jobs running and have no idea which one caused the problem, don't worry - the only thing that might actually matter is the kind of jobs you are running (web, RSS, etc.), along with possibly the overall system load.

Karl

On Wed, Feb 6, 2013 at 4:48 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

I have some time this week to investigate this further, now that I have finally delivered a heavy job involving SAML integration. I will look through the source code as well and try to log even more if necessary.

Maciej: Can you please send me a detailed description of your similar problems?

I previously mentioned that the "exclude from index" functionality in the exclusion tab could be the source of the problem, but that no longer seems to be the case, since I got the same problem with another job that has no settings there. Just a guess, but I think the source of the problem is problematic documents, for instance invalid XML documents which the crawler tries to parse. Karl: If necessary, you can try to crawl these documents yourself, but first I need to isolate the problem to one particular host.

Thanks for the warning about the lack of backward compatibility in this test; I will only use our test server, where data integrity is not that important.

I'm starting a new crawl right away and will report thereafter.

Erlend

On 05.02.13 14.10, Karl Wright wrote:

Ok, it is clear from this that most of your threads are waiting to get a connection, and there are no connections to be found. This is exactly the problem that Maciej reported, which I created the CONNECTORS-638 ticket for.

There has to be a connection leak somewhere. Obviously it is not a common situation, or the problem would arise almost right away; it probably occurs as a result of some error condition or pathway that is relatively uncommon.

The diagnostic code that is now checked into trunk should work as follows:

(1) First, check out and build trunk. Since there are schema changes in trunk vs. older versions of ManifoldCF, you cannot go backwards and run an older version on a particular database instance once you've run trunk. Keep that in mind.

(2) Add a line to the properties.xml file, as follows:

<property name="org.apache.manifoldcf.database.connectiontracking" value="true"/>

(3) Start the system up and let it run.

(4) When it fails, you should start to see dumps in the log, produced by code like this:

Logging.db.warn("Out of db connections, list of outstanding ones follows.");
for (WrappedConnection c : outstandingConnections)
{
  Logging.db.warn("Found a possibly leaked db connection",c.getInstantiationException());
}

... which will dump where all the offending connections were allocated. Hopefully this will point us at what the problem is. If there seems to be no consistency here, I'll have to explore the possibility that there are bugs in the connection allocation/free code, but we'll see.

Karl

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: max_pred_locks_per_transaction
On 06.02.13 12.56, Erlend Garåsen wrote:

Just a question. I see the following dumping into my logs every second even though no jobs are running at the moment. Is this related to connection tracking? I'm just afraid that it will be difficult to find the relevant log entries with such active logging.

Never mind - I had several other log levels set to debug, so that was the reason.

I'm afraid that trunk does not work well on Resin. If I go to "List all jobs", I get the following error: java.lang.IllegalStateException: can't forward after writing HTTP headers. I will open a ticket if this error does not disappear after a second deployment. I saw this about a year ago, but I cannot remember the cause. I will investigate this as well.

Full stack trace:

[2013-02-06 13:09:09.443] {resin-port-127.0.0.1:6945-43} java.lang.IllegalStateException: can't forward after writing HTTP headers
    at com.caucho.jsp.PageContextImpl.forward(PageContextImpl.java:1122)
    at _jsp._listjobs__jsp._jspService(_listjobs__jsp.java:267)
    at _jsp._listjobs__jsp._jspService(_listjobs__jsp.java:36)
    at com.caucho.jsp.JavaPage.service(JavaPage.java:64)
    at com.caucho.jsp.Page.pageservice(Page.java:542)
    at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194)
    at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:156)
    at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289)
    at com.caucho.server.hmux.HmuxRequest.handleInvocation(HmuxRequest.java:468)
    at com.caucho.server.hmux.HmuxRequest.handleRequestImpl(HmuxRequest.java:369)
    at com.caucho.server.hmux.HmuxRequest.handleRequest(HmuxRequest.java:336)
    at com.caucho.network.listen.TcpSocketLink.dispatchRequest(TcpSocketLink.java:1301)
    at com.caucho.network.listen.TcpSocketLink.handleRequest(TcpSocketLink.java:1257)
    at com.caucho.network.listen.TcpSocketLink.handleRequestsImpl(TcpSocketLink.java:1241)
    at com.caucho.network.listen.TcpSocketLink.handleRequests(TcpSocketLink.java:1149)
    at com.caucho.network.listen.TcpSocketLink.handleAcceptTaskImpl(TcpSocketLink.java:950)
    at com.caucho.network.listen.ConnectionTask.runThread(ConnectionTask.java:117)
    at com.caucho.network.listen.ConnectionTask.run(ConnectionTask.java:93)
    at com.caucho.network.listen.SocketLinkThreadLauncher.handleTasks(SocketLinkThreadLauncher.java:169)
    at com.caucho.network.listen.TcpSocketAcceptThread.run(TcpSocketAcceptThread.java:61)
    at com.caucho.env.thread2.ResinThread2.runTasks(ResinThread2.java:173)
    at com.caucho.env.thread2.ResinThread2.run(ResinThread2.java:118)

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: max_pred_locks_per_transaction
The JSP forward error is occurring because you are getting some kind of error listing jobs, and the UI is attempting to forward you to the error page. Unfortunately, under Resin that apparently is not possible, since Resin sets some HTTP headers before we have any chance of doing the forward operation. So I think this is likely a Resin-only problem.

Karl

On Wed, Feb 6, 2013 at 7:15 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

On 06.02.13 12.56, Erlend Garåsen wrote:

Just a question. I see the following dumping into my logs every second even though no jobs are running at the moment. Is this related to connection tracking? I'm just afraid that it will be difficult to find the relevant log entries with such active logging.

Never mind - I had several other log levels set to debug, so that was the reason.

I'm afraid that trunk does not work well on Resin. If I go to "List all jobs", I get the following error: java.lang.IllegalStateException: can't forward after writing HTTP headers. I will open a ticket if this error does not disappear after a second deployment. I saw this about a year ago, but I cannot remember the cause. I will investigate this as well.

Full stack trace:

[2013-02-06 13:09:09.443] {resin-port-127.0.0.1:6945-43} java.lang.IllegalStateException: can't forward after writing HTTP headers
    at com.caucho.jsp.PageContextImpl.forward(PageContextImpl.java:1122)
    at _jsp._listjobs__jsp._jspService(_listjobs__jsp.java:267)
    at _jsp._listjobs__jsp._jspService(_listjobs__jsp.java:36)
    at com.caucho.jsp.JavaPage.service(JavaPage.java:64)
    at com.caucho.jsp.Page.pageservice(Page.java:542)
    at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:194)
    at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:156)
    at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:289)
    at com.caucho.server.hmux.HmuxRequest.handleInvocation(HmuxRequest.java:468)
    at com.caucho.server.hmux.HmuxRequest.handleRequestImpl(HmuxRequest.java:369)
    at com.caucho.server.hmux.HmuxRequest.handleRequest(HmuxRequest.java:336)
    at com.caucho.network.listen.TcpSocketLink.dispatchRequest(TcpSocketLink.java:1301)
    at com.caucho.network.listen.TcpSocketLink.handleRequest(TcpSocketLink.java:1257)
    at com.caucho.network.listen.TcpSocketLink.handleRequestsImpl(TcpSocketLink.java:1241)
    at com.caucho.network.listen.TcpSocketLink.handleRequests(TcpSocketLink.java:1149)
    at com.caucho.network.listen.TcpSocketLink.handleAcceptTaskImpl(TcpSocketLink.java:950)
    at com.caucho.network.listen.ConnectionTask.runThread(ConnectionTask.java:117)
    at com.caucho.network.listen.ConnectionTask.run(ConnectionTask.java:93)
    at com.caucho.network.listen.SocketLinkThreadLauncher.handleTasks(SocketLinkThreadLauncher.java:169)
    at com.caucho.network.listen.TcpSocketAcceptThread.run(TcpSocketAcceptThread.java:61)
    at com.caucho.env.thread2.ResinThread2.runTasks(ResinThread2.java:173)
    at com.caucho.env.thread2.ResinThread2.run(ResinThread2.java:118)

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
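The servlet-spec rule behind the exception is that RequestDispatcher.forward() is only legal while the response is still uncommitted; once a container has flushed the status line and headers, forward() throws exactly this IllegalStateException. A hypothetical sketch of the defensive pattern (illustrative servlet code, not ManifoldCF's actual UI; the error page path is made up):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ErrorForwardServlet extends HttpServlet
{
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws ServletException, IOException
  {
    try
    {
      // ... render the jobs list ...
    }
    catch (Exception e)
    {
      if (!resp.isCommitted())
      {
        // Safe: no headers or body have been flushed yet.
        req.getRequestDispatcher("/error.jsp").forward(req, resp);
      }
      else
      {
        // A container that has already written headers (as Resin has here)
        // would throw IllegalStateException on forward(), so fall back to
        // writing into the already-open response.
        resp.getWriter().println("Error: " + e.getMessage());
      }
    }
  }
}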
max_pred_locks_per_transaction
After we started to crawl journals, the crawler just stopped after a couple of hours. Running MCF in debug mode gave me the stack trace shown below. I think we need to adjust some PG parameters, perhaps max_pred_locks_per_transaction. The database admins are now asking me which value to set. They have increased it, but I don't know whether it is sufficient.

Another thing: I did not see this message until I changed the log level to debug, but log4j should catch these error messages with the warn level enabled. So maybe it is a dead end, i.e. there is a totally different cause and this just appeared by coincidence.

ERROR 2013-01-19 03:47:15,049 (Worker thread '49') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
    at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
    at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
    at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
    at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
    at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
    at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:803)
    at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
Caused by: org.postgresql.util.PSQLException: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
    at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
    at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
DEBUG 2013-01-19 03:47:23,386 (Idle cleanup thread) - Checking for connections, idleTimeout: 1358563583386

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: max_pred_locks_per_transaction
Hi Erlend,

Leaving logging at the default values would have shown the ERROR message you have below, so the cause of the pause must have been something else. When ManifoldCF seems to make no progress, the first thing to do is look at the simple history and see if it is retrying on something for some reason. If that is not helpful, get a thread dump; you can use jstack for that purpose.

As for the PostgreSQL parameters, max_pred_locks_per_transaction seems to be PostgreSQL 9 black magic. Here's the documentation:

http://www.postgresql.org/docs/9.1/static/runtime-config-locks.html

The default is 64, but they don't say how it is allocated. I'd guess therefore that you should try 50% higher and see if that works, e.g. 96. I guess the limit is the amount of shared memory your OS allows you to allocate.

Karl

On Fri, Jan 25, 2013 at 5:26 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

After we started to crawl journals, the crawler just stopped after a couple of hours. Running MCF in debug mode gave me the stack trace shown below. I think we need to adjust some PG parameters, perhaps max_pred_locks_per_transaction. The database admins are now asking me which value to set. They have increased it, but I don't know whether it is sufficient.

Another thing: I did not see this message until I changed the log level to debug, but log4j should catch these error messages with the warn level enabled. So maybe it is a dead end, i.e. there is a totally different cause and this just appeared by coincidence.

ERROR 2013-01-19 03:47:15,049 (Worker thread '49') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
    at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
    at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
    at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
    at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
    at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
    at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:803)
    at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
Caused by: org.postgresql.util.PSQLException: ERROR: out of shared memory
  Hint: You might need to increase max_pred_locks_per_transaction.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273)
    at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
    at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
DEBUG 2013-01-19 03:47:23,386 (Idle cleanup thread) - Checking for connections, idleTimeout: 1358563583386

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
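For reference, Karl's suggestion translates to the following postgresql.conf change. The value 96 is his rough "50% above the default of 64" guess, not a verified number, and per the linked documentation this parameter can only be set at server start, so a PostgreSQL restart is needed:

# postgresql.conf
# Predicate locks back the SERIALIZABLE isolation level in PostgreSQL 9.1+.
max_pred_locks_per_transaction = 96   # default is 64

The effective value can then be verified from psql:

SHOW max_pred_locks_per_transaction;

(For the thread-dump suggestion, running jstack <pid> against the ManifoldCF JVM's process id prints a stack trace for every thread.)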