[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

2011-10-27 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137765#comment-13137765
 ] 

Karl Wright commented on CONNECTORS-281:


With the fixes as checked in so far, there's no appreciable difference between 
early parts of the crawl and later parts.  So I'm going to resolve this issue.


 RSS connector takes nearly a second to fetch a document even with no 
 throttling
 ---

 Key: CONNECTORS-281
 URL: https://issues.apache.org/jira/browse/CONNECTORS-281
 Project: ManifoldCF
  Issue Type: Bug
  Components: RSS connector
Affects Versions: ManifoldCF 0.4
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.4


 The RSS connector load test shows that the RSS connector is overthrottling, 
 for some reason.
 10-24-2011 05:30:50.423   fetch   http://localhost:8189/rss/gen.php?doc=4&feed=782&type=doc   200   46   843
 ... Where 843 ms is taken to fetch a document of size 46 bytes.  This is with 
 connection parameters as follows:
 Parameters:   Robots usage=none
 Max fetches per minute=100
 Email address=someb...@somewhere.com
 KB per second=100
 Max server connections=100
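
As a rough sanity check on the numbers above, the following sketch works out what the two rate limits would allow if they acted as simple caps. That interpretation is an assumption, as is taking 1 KB to be 1024 bytes; the class and method names are hypothetical and are not part of ManifoldCF.

```java
final class ThrottleBudget {

    /** Minimum spacing between fetch starts, in ms, under a fetches-per-minute cap. */
    static double minIntervalMs(int maxFetchesPerMinute) {
        return 60_000.0 / maxFetchesPerMinute;
    }

    /** Time in ms to transfer the given bytes under a KB-per-second cap (assumes 1 KB = 1024 bytes). */
    static double transferMs(long bytes, int kbPerSecond) {
        return bytes * 1000.0 / (kbPerSecond * 1024.0);
    }

    public static void main(String[] args) {
        // Values taken from the connection parameters and the log line above.
        System.out.printf("min interval between fetches: %.1f ms%n", minIntervalMs(100));
        System.out.printf("transfer time for 46 bytes:   %.3f ms%n", transferMs(46, 100));
    }
}
```

Under these assumptions the bandwidth cap allows the 46-byte document through in well under a millisecond, and the fetch-rate cap corresponds to 600 ms between fetches. Neither figure alone reaches the 843 ms observed.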

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

2011-10-24 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133951#comment-13133951
 ] 

Karl Wright commented on CONNECTORS-281:


A thread dump shows all worker threads waiting on database functionality, with 
one interesting detail: a full 11 of the 30 threads are waiting to RETURN 
connections to the pool:

Worker thread '5' daemon prio=6 tid=0x055b5c00 nid=0x1c14 waiting for monitor entry [0x05def000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.bitmechanic.sql.ConnectionPool.returnConnection(ConnectionPool.java:474)
        - waiting to lock 0x292ad3a0 (a com.bitmechanic.sql.ConnectionPool)
        at com.bitmechanic.sql.PooledConnection.close(PooledConnection.java:202)
        at org.apache.manifoldcf.core.database.ConnectionFactory.releaseConnection(ConnectionFactory.java:113)
        at org.apache.manifoldcf.core.database.Database.endTransaction(Database.java:330)
        at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.endTransaction(DBInterfacePostgreSQL.java:1112)
        at org.apache.manifoldcf.core.database.BaseTable.endTransaction(BaseTable.java:274)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.noteDocumentIngest(IncrementalIngester.java:1373)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:503)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentRecord(IncrementalIngester.java:325)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.recordDocument(WorkerThread.java:1556)
        at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.processDocuments(RSSConnector.java:1281)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)

It's not clear at all why this should be.  The only possible hint is that 
there's one thread waiting on GETTING a connection from the pool:

Worker thread '19' daemon prio=6 tid=0x045e1400 nid=0x1b94 waiting for monitor entry [0x0624f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.bitmechanic.sql.ConnectionPool.getConnection(ConnectionPool.java:375)
        - waiting to lock 0x292ad3a0 (a com.bitmechanic.sql.ConnectionPool)
        at com.bitmechanic.sql.ConnectionPoolManager.connect(ConnectionPoolManager.java:442)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.manifoldcf.core.database.ConnectionFactory.getConnectionWithRetries(ConnectionFactory.java:144)
        at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:90)
        at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:502)
        at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1152)
        at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
        at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:168)
        at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:860)
        at org.apache.manifoldcf.crawler.jobs.Carrydown.getDataValuesAsFiles(Carrydown.java:603)
        at org.apache.manifoldcf.crawler.jobs.JobManager.retrieveParentDataAsFiles(JobManager.java:4263)
        at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.retrieveParentDataAsFiles(WorkerThread.java:1211)
        at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:818)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)

This has me wondering if we're seeing a bug in the pool driver.  I'll try to 
confirm with further stack traces, since this does not explain the high CPU 
usage.
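
For illustration only, here is a minimal sketch of how a single pool-wide monitor can produce the picture in the two traces above: one thread holds the monitor during a slow checkout, and every thread trying to return a connection shows up as BLOCKED on the same lock. The SingleLockPool class and its methods are hypothetical stand-ins, not the com.bitmechanic implementation, and the sketch does not establish that this is what the pool driver is actually doing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class PoolMonitorContention {

    /** Hypothetical pool whose checkout and return paths share one monitor. */
    static final class SingleLockPool {
        private int idle = 0;

        /** Checkout that does slow work (simulated by a latch) while holding the monitor. */
        synchronized void slowGet(CountDownLatch entered, CountDownLatch release) {
            entered.countDown();
            try {
                release.await(); // the monitor stays held while this thread waits
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Returning is cheap, but it needs the same monitor as slowGet. */
        synchronized void returnConnection() {
            idle++;
        }
    }

    /** Starts the given number of returner threads and reports how many end up BLOCKED. */
    static int countBlockedReturners(int returners) {
        SingleLockPool pool = new SingleLockPool();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            Thread holder = new Thread(() -> pool.slowGet(entered, release), "holder");
            holder.start();
            entered.await(); // the holder now owns the pool monitor

            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < returners; i++) {
                Thread t = new Thread(pool::returnConnection, "returner-" + i);
                t.start();
                threads.add(t);
            }

            // Poll thread states until every returner is blocked, or give up after 5 seconds.
            long deadline = System.nanoTime() + 5_000_000_000L;
            int blocked;
            do {
                blocked = 0;
                for (Thread t : threads) {
                    if (t.getState() == Thread.State.BLOCKED) {
                        blocked++;
                    }
                }
                if (blocked < returners) {
                    Thread.sleep(10);
                }
            } while (blocked < returners && System.nanoTime() - deadline < 0);

            release.countDown(); // let the holder finish so the returners can proceed
            holder.join();
            for (Thread t : threads) {
                t.join();
            }
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked returners: " + countBlockedReturners(11));
    }
}
```

In a dump taken while the holder is inside slowGet, each returner would appear as "waiting for monitor entry", much like worker thread '5' above. Whether the real pool holds its lock across slow work is not known from these traces.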




[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

2011-10-24 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133957#comment-13133957
 ] 

Karl Wright commented on CONNECTORS-281:


A second capture shows a much more expected mix of thread activities.  Most of 
the threads are waiting for database activities, mostly inserting documents and 
managing carrydown information.  The non-database work seems to be fetching 
documents and parsing URLs, which is as it should be.






[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

2011-10-24 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133960#comment-13133960
 ] 

Karl Wright commented on CONNECTORS-281:


A third dump shows the temporary file tracker to be a potential bottleneck.  
Many threads are blocked waiting on the same monitor:

Worker thread '7' daemon prio=6 tid=0x055b6800 nid=0x165c waiting for monitor entry [0x05e8f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.manifoldcf.core.system.ManifoldCF$FileTrack.addFile(ManifoldCF.java:1177)
        - waiting to lock 0x292a2728 (a org.apache.manifoldcf.core.system.ManifoldCF$FileTrack)
        at org.apache.manifoldcf.core.system.ManifoldCF.addFile(ManifoldCF.java:701)
        at org.apache.manifoldcf.crawler.connectors.rss.DataCache.addData(DataCache.java:67)
        at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.getDocumentVersions(RSSConnector.java:1065)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)

... for both adding and deleting files.
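
One way to take a shared monitor out of the add and delete paths, sketched here purely as an illustration, is to keep the tracked files in a concurrent set. The ConcurrentFileTrack class below is hypothetical: it is not the ManifoldCF FileTrack code, and it is not necessarily the change that was checked in for this issue.

```java
import java.io.File;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentFileTrack {

    // A concurrent set lets many threads add and remove entries without one global lock.
    private final Set<File> files = ConcurrentHashMap.newKeySet();

    /** Starts tracking a temporary file. */
    void addFile(File f) {
        files.add(f);
    }

    /**
     * Stops tracking a file and reports whether it was tracked.
     * A real tracker would presumably also delete the file on disk;
     * that is omitted to keep the sketch free of side effects.
     */
    boolean removeFile(File f) {
        return files.remove(f);
    }

    int size() {
        return files.size();
    }

    /** Adds perThread distinct files from each of several threads and returns the tracked count. */
    static int trackConcurrently(int threads, int perThread) {
        ConcurrentFileTrack track = new ConcurrentFileTrack();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    track.addFile(new File("tmp-" + id + "-" + i));
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return track.size();
    }

    public static void main(String[] args) {
        System.out.println("tracked files: " + trackConcurrently(8, 100));
    }
}
```

The trade-off is that a concurrent set only protects the bookkeeping. Any operation that must see a consistent snapshot of all tracked files, such as cleanup at shutdown, would still need its own coordination.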






[jira] [Commented] (CONNECTORS-281) RSS connector takes nearly a second to fetch a document even with no throttling

2011-10-24 Thread Karl Wright (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134662#comment-13134662
 ] 

Karl Wright commented on CONNECTORS-281:


Use of temporary files, plus carrydown data, seems to be what makes the RSS 
connector significantly slower than a file crawl.  I'm still trying to assess 
whether the carrydown data is what slows things down later in the crawl.

