I just used svn to look for differences in the web connector since release 0.6. There is a change to the HTML parser to allow for handling default values for <option> tags, and a change that fixes an IndexOutOfBounds exception. Neither of these can possibly affect socket timeouts.
I also looked at the Solr connector (presuming that is what you are using as an output connector). No changes at all since 0.6.

So honestly, I can see no significant changes whatsoever in how a web crawler indexing into Solr would behave. If you are seeing differences, therefore, I simply cannot account for them.

Karl

On Fri, Oct 19, 2012 at 5:01 AM, Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp> wrote:
> Due to the error, I had to downgrade to a lower version so I haven't found
> the MySQL error code yet.
>
> I installed MCF1.0 in a different environment where crawlable contents are
> different from the above environment.
> I could not reproduce the Database exception but socket timeout occurred.
> In the same environment, I ran MCF0.6 and it completed crawling without
> socket timeout.
> Like you said, socket timeout seems to be a different problem from the
> Database exception.
>
> 2012/10/18 Karl Wright <daddy...@gmail.com>
>>
>> So, what was the resolution of this problem? Any news?
>> Karl
>>
>> On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright <daddy...@gmail.com> wrote:
>> > The only change is that the MySQL driver now performs ANALYZE
>> > operations on the fly in order to keep the database operating at high
>> > efficiency. This is CONNECTORS-510. It is possible that, on a large
>> > database table, these operations will cause others to wait long enough
>> > so that their timeout is exceeded. Such an event does not take place
>> > while the load tests run, however. If you want to turn off the
>> > analyze operation, you can do that by setting a per-table property to
>> > override the analyze default of 10000 operations:
>> >
>> > analyzeThreshold = ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,10000);
>> >
>> > The table in question is "jobqueue". If you set this value to
>> > something like 1000000000 and you still see MySQL timeouts, then this
>> > new code is not the problem. And, like I said, the best solution is
>> > to recognize the error and retry, but first I would need the error
>> > code. Adding an appropriate output of sqlState around line 123 of
>> >
>> > framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
>> >
>> > would allow us to see what code to catch when it happens again.
>> >
>> > For the Web connector, the only modifications have been to how it
>> > handles 500 errors, which is now coded correctly to avoid an
>> > IndexOutOfBounds exception. This has nothing to do with
>> > socket exceptions, which have external causes only.
>> >
>> > Karl
>> >
>> >
>> > On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
>> > <shigeki.kobayas...@g.softbank.co.jp> wrote:
>> >> Hi Karl,
>> >>
>> >> I was comparing version 1.0 with old trunk based on version 0.6
>> >> implementing CONNECTORS-501 (Medium-scale web crawl with
>> >> hopcount-based filtering fails to find correct number of documents).
>> >>
>> >> Running each version with the same MySQL setting and the same
>> >> throttling, somehow the version 1.0 hangs with the error.
>> >> Since the old trunk completes crawling, I wonder if something has
>> >> changed.
>> >>
>> >> Just to make sure I will recheck if there are any wrong settings in
>> >> MCF.
>> >>
>> >> Thanks.
>> >>
>> >> Regards,
>> >>
>> >> Shigeki
>> >>
>> >> 2012/10/10 Karl Wright <daddy...@gmail.com>
>> >>>
>> >>> Hi Shigeki,
>> >>>
>> >>> The socket timeout exception is only a warning. It means that some
>> >>> site you are crawling did not accept a socket connection within the
>> >>> allowed time (5 minutes I think). The Web Connector will retry the
>> >>> connection a few times, and if it is still rejected, it will
>> >>> eventually give up on that page. One thing you want to check, though,
>> >>> is that you are using proper throttling, because if you aren't then
>> >>> one cause of this problem is that the webmaster of the site you are
>> >>> trying to crawl may have blocked you from accessing it.
>> >>>
>> >>> The database exception is more problematic. It means that MySQL
>> >>> thinks it took too long for a specific transaction to complete, and
>> >>> the database aborted the transaction due to a timeout. There are two
>> >>> ways of dealing with this issue. One way is to modify your MySQL
>> >>> configuration to increase the transaction timeout value to some high
>> >>> number. The second way is to modify ManifoldCF to recognize the
>> >>> timeout error specifically, and cause a retry. But in order to do the
>> >>> latter, I would need to know what SQL error code MySQL returns for
>> >>> this situation, which will mean we either need to look it up (if we
>> >>> can), or modify a ManifoldCF instance to log it when this problem
>> >>> occurs.
>> >>>
>> >>> Please let me know how you would like to proceed.
>> >>>
>> >>> Karl
>> >>>
>> >>> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
>> >>> <shigeki.kobayas...@g.softbank.co.jp> wrote:
>> >>> >
>> >>> > Hi
>> >>> >
>> >>> > I am having trouble crawling the web using MCF1.0.
>> >>> > I run MCF with MySQL 5.5 and Tomcat 6.0.
>> >>> > It should keep crawling contents, but MCF prints the following
>> >>> > Database exception log, then hangs.
>> >>> > After the DB Exception, a Socket Timeout Exception occurs.
>> >>> >
>> >>> > Has anyone faced this problem?
>> >>> >
>> >>> > --Database Exception log:
>> >>> >
>> >>> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> >>> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
>> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> >>> >
>> >>> >
>> >>> > ---- Socket Timeout:
>> >>> >
>> >>> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout exception trying to close connection: Read timed out
>> >>> > java.net.SocketTimeoutException: Read timed out
>> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >>> >         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>> >>> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> >>> > INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Socket timeout: Read timed out
>> >>> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch exception for 'http://xxxxxx/...'
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Socket timeout: Read timed out
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> >>> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Socket timeout: Read timed out
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >>> >         ... 1 more
>> >>> > Caused by: java.net.SocketTimeoutException: Read timed out
>> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
>> >>> >         ... 2 more
>> >>> > WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service interruption reported for job 1349774325961 connection 'WEB': Socket timeout: Read timed out
>> >>> >
>> >>> >
>> >>> > Regards,
>> >>> >
>> >>> > Shigeki
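
For reference, the "recognize the error and retry" approach discussed earlier in the thread could look roughly like the sketch below. This is not ManifoldCF code: the class LockWaitRetry and its methods are hypothetical names made up for illustration. It assumes the failure surfaces as a java.sql.SQLException whose vendor error code is 1205, the number MySQL documents for ER_LOCK_WAIT_TIMEOUT ("Lock wait timeout exceeded; try restarting transaction"). The thread never confirmed what the driver actually reported, so the sketch also prints sqlState and the error code for every failure so they can be checked first.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class LockWaitRetry {
    // Assumption: MySQL's documented error number for ER_LOCK_WAIT_TIMEOUT.
    // Verify against the logged value before relying on it.
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    /** True when the vendor error code matches the assumed lock-wait-timeout code. */
    static boolean isLockWaitTimeout(SQLException e) {
        return e.getErrorCode() == ER_LOCK_WAIT_TIMEOUT;
    }

    /**
     * Runs the work, retrying only when it fails with a lock wait timeout.
     * Any other failure, or a timeout on the last attempt, is rethrown.
     */
    static <T> T runWithRetry(Callable<T> work, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (SQLException e) {
                // Log the values needed to decide which code to catch.
                System.err.println("SQL failure: sqlState=" + e.getSQLState()
                        + " errorCode=" + e.getErrorCode()
                        + " message=" + e.getMessage());
                if (!isLockWaitTimeout(e) || attempt >= maxAttempts) {
                    throw e;
                }
                // Wait a little longer after each failed attempt.
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }
}
```

The whole transaction would go inside the Callable, since MySQL's message asks for the transaction to be restarted rather than a single statement. Matching on the vendor error code rather than the message text is a design choice of this sketch; if the logged sqlState turns out to be more reliable for the driver in use, the check in isLockWaitTimeout is the only place to change.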