Re: CrawlerCommons ManifoldCF
Hi, We could reuse RobotsData indeed and refactor it a bit. Ken, you said you'd be keen to contribute your code for robot parsing as well - do you think it would be quicker than refactoring Manifold's code? Or does it support additional features? What about Droids? Julien PS: Anyone attending BerlinBuzzwords next week? On 2 June 2011 17:57, Karl Wright daddy...@gmail.com wrote: I don't think it would be hard to peel out the robots parser, although obviously it would need refactoring to live in a more standard library environment. If you want to look at it, it is in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java Look for the static class RobotsData, around line 299. Karl On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Karl, Maybe a good start would be to identify which parts of your crawler could be shared and would not take too much effort to be made generic. I haven't looked at the code of the crawler in great detail, but do you think the robots parser would be a good candidate? Julien On 2 June 2011 16:23, Karl Wright daddy...@gmail.com wrote: Absolutely! We're a bit thin on active committers at the moment, which will probably limit our ability to take any highly active roles in your development process. But we do have a pile of code which you might be able to leverage, and once there is common functionality available I think we'd all prefer to use that rather than home-grown code. How would you prefer that we proceed? Karl On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, I'd just like to mention Crawler Commons, which is an effort between the committers of various crawl-related projects (Nutch, Bixo or Heritrix) to put some basic functionalities in common. 
We currently have mostly a top level domain finder and a sitemap parser, but are definitely planning to have other things there as well, e.g. robots.txt parser, protocol handler etc... Would you like to get involved? There are quite a few things that the crawler in Manifold could reuse or contribute to. Best, Julien -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com
Re: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org)
I've signed off our report, thanks Karl for taking care. Tommaso 2011/6/2 Karl Wright daddy...@gmail.com I've now edited the page accordingly. Let me know of any changes you'd like to see. Karl On Thu, Jun 2, 2011 at 4:18 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: it sounds good to me, any others? Tommaso 2011/6/1 Karl Wright daddy...@gmail.com Here's my proposed text: ManifoldCF --Description-- ManifoldCF is an incremental crawler framework and set of connectors designed to pull documents from various kinds of repositories into search engine indexes or other targets. The current bevy of connectors includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio (Autonomy), SharePoint (Microsoft), RSS feeds, and web content. ManifoldCF also provides components for individual document security within a target search engine, so that repository security access conventions can be enforced in the search results. ManifoldCF has been in incubation since January, 2010. It was originally a planned subproject of Lucene but is now a likely top-level project. --A list of the three most important issues to address in the move towards graduation-- 1. We need at least one additional active committer, as well as additional users and repeat contributors 2. We may want another release before graduating 3. We'd like to see long-term contributions for project testing, especially infrastructure access --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be aware of?-- All issues have been addressed to our satisfaction at this time. --How has the community developed since the last report?-- A book has been completed, and is becoming available in early-release form, available from Manning Publishing. We have signed up two new committers and one new mentor. We continue to have user community interest. We've had a number of extremely helpful bug reports and contributions from the field. 
--How has the project developed since the last report?-- An 0.1 release was made on January 31, 2011, and a 0.2 release occurred on May 17, 2011. Another release is being considered. Signed off by mentor: Karl On Wed, Jun 1, 2011 at 10:23 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: I think the successful release should be mentioned too :-) Tommaso 2011/6/1 Karl Wright daddy...@gmail.com The March report looked like this: ManifoldCF --Description-- ManifoldCF is an incremental crawler framework and set of connectors designed to pull documents from various kinds of repositories into search engine indexes or other targets. The current bevy of connectors includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio (Autonomy), SharePoint (Microsoft), RSS feeds, and web content. ManifoldCF also provides components for individual document security within a target search engine, so that repository security access conventions can be enforced in the search results. ManifoldCF has been in incubation since January, 2010. It was originally a planned subproject of Lucene but is now a likely top-level project. --A list of the three most important issues to address in the move towards graduation-- 1. We need at least three additional active committers, as well as additional users and repeat contributors 2. We should have at least one or two more releases before graduating 3. We'd like to see long-term contributions for project testing, especially infrastructure access --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be aware of?-- All issues have been addressed to our satisfaction at this time. --How has the community developed since the last report?-- A book is being written, and has entered the early-release phase, available from Manning Publishing. We continue to have user community interest. We've had a number of extremely helpful bug reports and contributions from the field. The active committer list remains short, however. 
--How has the project developed since the last report?-- An 0.1 release was made on January 31, 2011, and another release is being considered. Contributions extending the FileNet connector have been made, as well as contributions to the Solr connector. Signed off by mentor: Grant Ingersoll I'd like to mention our new committers and mentor, and the completion of the book. Anything else that should be added? Karl -- Forwarded message -- From: no-re...@apache.org Date: Wed, Jun 1, 2011 at 10:00 AM Subject: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org) To: connectors-dev@incubator.apache.org Dear ManifoldCF Developers, This email was sent by an automated system on behalf of the
RE: CrawlerCommons ManifoldCF
Thanks Julien; I found it, strange... Yes, I need to separate the Robots Rules Parser, if BIXO agrees... ManifoldCF current style: 1. Open socket 2. Load 500 kbits (in 2 milliseconds) 3. Sleep 998 milliseconds Just because there is a user interface where we set the bandwidth limit to 500 kbps (probably 50 kbytes). So it will be hard... I'd like to see HttpClient instead... or, if crawler-commons includes a fetcher, to see that... even better if the fetcher is rich enough to support POST (there was some interest at Droids) Existing code seems outdated: why should an external server allocate resources (TCP and HTTP handler) which are not used 99.8% of the time? But reuse of Robots Rules is most important; Nutch has some problems too... Thanks -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: June-03-11 7:01 AM To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com Subject: Re: CrawlerCommons ManifoldCF There is a link to the discussion group on the main page, becoming a member of the group is pretty straightforward On 3 June 2011 00:36, Fuad Efendi f...@efendi.ca wrote: I mean join button at http://code.google.com/p/crawler-commons/ I am well familiar with BIXO and Droids; it will be hard to make minor changes in ManifoldCF... although it's possible (without crawler part, only robots rules parser)... -Fuad -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: June-02-11 7:05 PM To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com Subject: RE: CrawlerCommons ManifoldCF I'd like to join this project but can't find join button :) Thanks! Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. 
http://www.tokenizer.ca/ Data Mining, Vertical Search -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: June-02-11 11:11 AM To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com Subject: CrawlerCommons ManifoldCF Hi guys, I'd just like to mention Crawler Commons, which is an effort between the committers of various crawl-related projects (Nutch, Bixo or Heritrix) to put some basic functionalities in common. We currently have mostly a top level domain finder and a sitemap parser, but are definitely planning to have other things there as well, e.g. robots.txt parser, protocol handler etc... Would you like to get involved? There are quite a few things that the crawler in Manifold could reuse or contribute to. Best, Julien -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com
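The throttling pattern Fuad describes earlier in this thread (read a burst, then sleep out the rest of the second) comes down to one line of arithmetic. The sketch below only illustrates that description; it is not ManifoldCF's throttling code. The class name `ThrottleSketch` and the method `sleepMillis` are hypothetical, and it assumes 1 kbit = 1000 bits.

```java
public class ThrottleSketch {

    /**
     * Milliseconds to sleep so that bitsRead, averaged over the elapsed time
     * plus the sleep, does not exceed limitBitsPerSecond.
     * (Hypothetical helper, not part of ManifoldCF.)
     */
    public static long sleepMillis(long bitsRead, long limitBitsPerSecond, long elapsedMillis) {
        if (limitBitsPerSecond <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        // Minimum time the transfer is allowed to take at the configured rate.
        long requiredMillis = (bitsRead * 1000L) / limitBitsPerSecond;
        // Never return a negative sleep if the read was already slow enough.
        return Math.max(0L, requiredMillis - elapsedMillis);
    }

    public static void main(String[] args) {
        // 500 kbit read in 2 ms against a 500 kbit/s limit.
        System.out.println(sleepMillis(500_000L, 500_000L, 2L)); // prints 998
    }
}
```

With these numbers the connection is busy for 2 ms out of every 1000 ms, which is the 99.8% idle time Fuad objects to.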
[jira] [Commented] (CONNECTORS-114) Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB
[ https://issues.apache.org/jira/browse/CONNECTORS-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043369#comment-13043369 ] Karl Wright commented on CONNECTORS-114: Remaining issues with HSQLDB have been resolved, so I'm closing this ticket. r1131056. Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB --- Key: CONNECTORS-114 URL: https://issues.apache.org/jira/browse/CONNECTORS-114 Project: ManifoldCF Issue Type: Bug Components: Framework core Reporter: Karl Wright Fix For: ManifoldCF 0.3 Derby seems to have multiple problems: (1) It has internal deadlocks, which even if caught cause poor performance due to stalling (CONNECTORS-111); (2) It has no support for certain SQL constructs (CONNECTORS-109 and CONNECTORS-110); (3) It locks up entirely for some people (CONNECTORS-100). HSQLDB has been recommended as another potential embedded database that might work better. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (CONNECTORS-114) Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB
[ https://issues.apache.org/jira/browse/CONNECTORS-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-114. Resolution: Fixed Fix Version/s: ManifoldCF 0.3 Assignee: Karl Wright I have not yet made HSQLDB the official Derby replacement, but it is currently a better embedded option for many situations than Derby is. Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB --- Key: CONNECTORS-114 URL: https://issues.apache.org/jira/browse/CONNECTORS-114 Project: ManifoldCF Issue Type: Bug Components: Framework core Reporter: Karl Wright Assignee: Karl Wright Fix For: ManifoldCF 0.3 Derby seems to have multiple problems: (1) It has internal deadlocks, which even if caught cause poor performance due to stalling (CONNECTORS-111); (2) It has no support for certain SQL constructs (CONNECTORS-109 and CONNECTORS-110); (3) It locks up entirely for some people (CONNECTORS-100). HSQLDB has been recommended as another potential embedded database that might work better. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CONNECTORS-206) HSQLDB is now a first-class ManifoldCF database; we should describe how to use it in the documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-206: --- Affects Version/s: ManifoldCF 0.3 HSQLDB is now a first-class ManifoldCF database; we should describe how to use it in the documentation -- Key: CONNECTORS-206 URL: https://issues.apache.org/jira/browse/CONNECTORS-206 Project: ManifoldCF Issue Type: Improvement Components: Documentation Affects Versions: ManifoldCF 0.3 Reporter: Karl Wright We're currently missing pretty much all mention of HSQLDB in the documentation. This includes how to enable it: org.apache.manifoldcf.databaseimplementationclass value org.apache.manifoldcf.core.database.DBInterfaceHSQLDB ... as well as the property it has for pointing at the database instance: org.apache.manifoldcf.hsqldbdatabasepath value relative path In addition to the site documentation for how to use it, we should also consider making HSQLDB be the default example database, since it seems to have fewer real problems than Derby. But this must wait until a test suite is written for this database. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Exception Handling
Your choice of exception would have been fine if this was a repository connector, but output connectors do not have the same ability to abort jobs via ManifoldCFExceptions at this time. (You can create a ticket if you think this is how it should work). But if you want the job to abort, you probably want to throw a ServiceInterruption exception, with zero retries. You have a choice of skip or abort job as actions. I recently made this work, so let me know if you encounter any problems. http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java Karl On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote: So my output connector connects to another repository. If I can't log in to that repository, I execute the following line: throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(), e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR); ManifoldCF continues the crawl and actually puts out a WARN message. I expected ManifoldCF to halt the job and show the error in the UI; at least that is my desired outcome. Do I need a different exception type to throw besides Repository Connection Error? 
Here is what I get in the log file: WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564) Caused by: 
org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158) at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149) at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121) at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:202) ... 13 more Caused by: java.net.ConnectException: Connection timed out: connect at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(Unknown Source) at java.net.PlainSocketImpl.connectToAddress(Unknown Source) at java.net.PlainSocketImpl.connect(Unknown Source) at java.net.SocksSocketImpl.connect(Unknown Source) at java.net.Socket.connect(Unknown Source) at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123) at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148) ... 20 more
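Karl's suggestion in this message (throw a ServiceInterruption with zero retries and the abort action) might look roughly like the sketch below. `ServiceInterruptionSketch` is a hypothetical stand-in so the example compiles on its own; its constructor parameters (retry time, fail time, retry count, abort flag) are an assumption about the shape of the real class and should be checked against the ServiceInterruption.java source linked above.

```java
import java.net.ConnectException;

public class OutputConnectorAbortSketch {

    /** Hypothetical stand-in for ManifoldCF's ServiceInterruption; field names are assumptions. */
    public static class ServiceInterruptionSketch extends Exception {
        public final long retryTime;      // earliest time for another attempt (ms since epoch)
        public final long failTime;       // time after which no more attempts are made
        public final int failRetryCount;  // retries allowed before the failure action applies
        public final boolean abortOnFail; // true = abort the job, false = skip the document

        public ServiceInterruptionSketch(String message, Throwable cause, long retryTime,
                                         long failTime, int failRetryCount, boolean abortOnFail) {
            super(message, cause);
            this.retryTime = retryTime;
            this.failTime = failTime;
            this.failRetryCount = failRetryCount;
            this.abortOnFail = abortOnFail;
        }
    }

    /** Wraps a login failure so that it asks for zero retries and a job abort. */
    public static ServiceInterruptionSketch abortJob(String txn, Exception e) {
        long now = System.currentTimeMillis();
        return new ServiceInterruptionSketch(
            "txn [" + txn + "] failed with error " + e, e,
            now,   // no delay, since no retry is wanted
            now,   // fail immediately
            0,     // zero retries
            true); // abort the job rather than skip the document
    }

    public static void main(String[] args) {
        ServiceInterruptionSketch si =
            abortJob("login", new ConnectException("Connection refused"));
        System.out.println(si.getMessage()
            + " retries=" + si.failRetryCount + " abort=" + si.abortOnFail);
    }
}
```

In a real output connector the exception would be thrown from the method that fails (for example wherever the login is attempted), not just constructed as it is here.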
Re: Exception Handling
Actually, looking at the code, the REPOSITORY_CONNECTION_ERROR type ManifoldCFException error is retried very specifically in this way for both repository and output connectors. Any other ManifoldCFException type (except INTERRUPTED) will cause the job to abort. The reason for this special behavior for this ManifoldCFException type I'm having a hard time recollecting; but I seem to recall vaguely it had something to do with the LiveLink connector. I'll post later if it comes back to me. Karl On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote: Your choice of exception would have been fine if this was a repository connector, but output connectors do not have the same ability to abort jobs via ManifoldCFExceptions at this time. (You can create a ticket if you think this is how it should work). But if you want the job to abort, you probably want to throw a ServiceInterruption exception, with zero retries. You have a choice of skip or abort job as actions. I recently made this work, so let me know if you encounter any problems. http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java Karl On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote: So my output connector connects to another repository. If I can't log in to that repository, I execute the following line: throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(), e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR); ManifoldCF continues the crawl and actually puts out a WARN message. I expected ManifoldCF to halt the job and show the error in the UI; at least that is my desired outcome. Do I need a different exception type to throw besides Repository Connection Error? 
Here is what I get in the log file: WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564) Caused by: 
org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158) at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149) at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121) at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:202) ... 13 more Caused by: java.net.ConnectException: Connection timed out: connect at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(Unknown Source) at
Re: Exception Handling
I remember now. The problem was that the LiveLink API code, under certain conditions, lied about the error it got back from the server. Under these conditions, therefore, a job would sometimes abort if a transient error occurred. The fix for this problem was made at the framework level because the CIFS connector also suffers from this same kind of problem, where a network glitch could cause a job to incorrectly abort for connection reasons. In both cases, the underlying problems were resolved eventually by other means - in the case of LiveLink, by periodically restarting the LiveLink server, and in the case of CIFS, by fixing a too-short timeout in jcifs. So, in theory, this retry logic could be removed. I'll create a ticket to research this further. Karl On Fri, Jun 3, 2011 at 1:29 PM, Karl Wright daddy...@gmail.com wrote: Actually, looking at the code, the REPOSITORY_CONNECTION_ERROR type ManifoldCFException error is retried very specifically in this way for both repository and output connectors. Any other ManifoldCFException type (except INTERRUPTED) will cause the job to abort. The reason for this special behavior for this ManifoldCFException type I'm having a hard time recollecting; but I seem to recall vaguely it had something to do with the LiveLink connector. I'll post later if it comes back to me. Karl On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote: Your choice of exception would have been fine if this was a repository connector, but output connectors do not have the same ability to abort jobs via ManifoldCFExceptions at this time. (You can create a ticket if you think this is how it should work). But if you want the job to abort, you probably want to throw a ServiceInterruption exception, with zero retries. You have a choice of skip or abort job as actions. I recently made this work, so let me know if you encounter any problems. 
http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java Karl On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote: So my output connector connects to another repository. If I can't log in to that repository, I execute the following line: throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(), e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR); ManifoldCF continues the crawl and actually puts out a WARN message. I expected ManifoldCF to halt the job and show the error in the UI; at least that is my desired outcome. Do I need a different exception type to throw besides Repository Connection Error? Here is what I get in the log file: WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137) at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564) Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158) at
[jira] [Created] (CONNECTORS-207) ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second retry, but should probably abort the job instead
ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second retry, but should probably abort the job instead -- Key: CONNECTORS-207 URL: https://issues.apache.org/jira/browse/CONNECTORS-207 Project: ManifoldCF Issue Type: Bug Components: Framework crawler agent Affects Versions: ManifoldCF 0.2, ManifoldCF 0.1, ManifoldCF 0.3 Reporter: Karl Wright The way a worker thread treats ManifoldCFException type REPOSITORY_CONNECTION_ERROR is no longer correct. It should probably just allow the job to be aborted with no retries. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CONNECTORS-207) ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-minute retry, but may want to abort the job instead
[ https://issues.apache.org/jira/browse/CONNECTORS-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-207: --- Description: The way a worker thread treats ManifoldCFException type REPOSITORY_CONNECTION_ERROR is to wait 5 minutes and retry. It might want to just allow the job to be aborted with no retries. The current behavior is not actually *wrong*, but the circumstances under which it was added were the result of severe problems at various sites that were unrelated to ManifoldCF. was: The way a worker thread treats ManifoldCFException type REPOSITORY_CONNECTION_ERROR is no longer correct. It should probably just allow the job to be aborted with no retries. Priority: Minor (was: Major) Summary: ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-minute retry, but may want to abort the job instead (was: ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second retry, but should probably abort the job instead) ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-minute retry, but may want to abort the job instead -- Key: CONNECTORS-207 URL: https://issues.apache.org/jira/browse/CONNECTORS-207 Project: ManifoldCF Issue Type: Bug Components: Framework crawler agent Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2, ManifoldCF 0.3 Reporter: Karl Wright Priority: Minor The way a worker thread treats ManifoldCFException type REPOSITORY_CONNECTION_ERROR is to wait 5 minutes and retry. It might want to just allow the job to be aborted with no retries. The current behavior is not actually *wrong*, but the circumstances under which it was added were the result of severe problems at various sites that were unrelated to ManifoldCF. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
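The behaviour described in this ticket and in the "Exception Handling" thread can be summarised as a small decision table. The sketch below is not the WorkerThread code: the class, the enum names and the `decide` method are hypothetical, and the `HANDLED_SEPARATELY` outcome for INTERRUPTED is an assumption (the thread only says that this type does not abort the job). Only the five-minute delay and the rule that every other type aborts the job come from the discussion.

```java
public class WorkerErrorPolicySketch {

    /** Simplified error kinds; the real ManifoldCFException may define more types. */
    public enum ErrorType { REPOSITORY_CONNECTION_ERROR, INTERRUPTED, OTHER }

    /** What the worker does with the job after an error (names are hypothetical). */
    public enum Action { RETRY_LATER, HANDLED_SEPARATELY, ABORT_JOB }

    /** Five-minute retry delay, as stated in the updated ticket description. */
    public static final long RETRY_DELAY_MS = 5L * 60L * 1000L;

    public static Action decide(ErrorType type) {
        switch (type) {
            case REPOSITORY_CONNECTION_ERROR:
                return Action.RETRY_LATER;        // current behaviour: wait, then retry
            case INTERRUPTED:
                return Action.HANDLED_SEPARATELY; // assumption: not a job-level failure
            default:
                return Action.ABORT_JOB;          // every other type aborts the job
        }
    }

    /** Time of the next attempt for an error that is retried. */
    public static long nextAttempt(long nowMillis) {
        return nowMillis + RETRY_DELAY_MS;
    }

    public static void main(String[] args) {
        for (ErrorType t : ErrorType.values()) {
            System.out.println(t + " -> " + decide(t));
        }
    }
}
```

The change proposed in the ticket would amount to moving REPOSITORY_CONNECTION_ERROR into the default branch.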
Re: Exception Handling
CONNECTORS-207 describes the situation. Karl On Fri, Jun 3, 2011 at 1:41 PM, Karl Wright daddy...@gmail.com wrote: I remember now. The problem was that the LiveLink API code, under certain conditions, lied about the error it got back from the server. Under these conditions, therefore, a job would sometimes abort if a transient error occurred. The fix for this problem was made at the framework level because the CIFS connector also suffers from this same kind of problem, where a network glitch could cause a job to incorrectly abort for connection reasons. In both cases, the underlying problems were resolved eventually by other means - in the case of LiveLink, by periodically restarting the LiveLink server, and in the case of CIFS, by fixing a too-short timeout in jcifs. So, in theory, this retry logic could be removed. I'll create a ticket to research this further. Karl On Fri, Jun 3, 2011 at 1:29 PM, Karl Wright daddy...@gmail.com wrote: Actually, looking at the code, the REPOSITORY_CONNECTION_ERROR type ManifoldCFException error is retried very specifically in this way for both repository and output connectors. Any other ManifoldCFException type (except INTERRUPTED) will cause the job to abort. The reason for this special behavior for this ManifoldCFException type I'm having a hard time recollecting; but I seem to recall vaguely it had something to do with the LiveLink connector. I'll post later if it comes back to me. Karl On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote: Your choice of exception would have been fine if this was a repository connector, but output connectors do not have the same ability to abort jobs via ManifoldCFExceptions at this time. (You can create a ticket if you think this is how it should work). But if you want the job to abort, you probably want to throw a ServiceInterruption exception, with zero retries. You have a choice of skip or abort job as actions. 
I recently made this work, so let me know if you encounter any problems. http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java Karl On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote: So my output connector connects to another repository. If I can't login to that repository, I execute the following line throw new ManifoldCFException(txn [ + txn + ] failed with error + e.toString(), e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR); ManifoldCF continues the crawl and actually puts out a WARN message. I expected ManifoldCF to hault the job and show the error in the UI, at least that is my desired out come. Do I need a different exception type to throw besides Repository Connection Error? Here is what I get in the log file: WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134) at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137) at 
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564) Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused at
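The handling described in this thread can be summarized as a small decision table. The sketch below is illustrative only: the class, the enum names, `decide`, and `nextAttemptTime` are assumptions invented for this example and are not taken from the ManifoldCF source. The thread does not say what happens for INTERRUPTED beyond "does not abort the job", so that case is modeled here as simply passing the exception along.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only: every name here is a stand-in, not a real ManifoldCF class. */
final class WorkerRetrySketch {
    /** Stand-ins for the ManifoldCFException types named in the thread. */
    enum ErrorType { REPOSITORY_CONNECTION_ERROR, INTERRUPTED, OTHER }

    /** What the worker does next; PROPAGATE is an assumed label for the INTERRUPTED case. */
    enum Action { RETRY_LATER, ABORT_JOB, PROPAGATE }

    /** The five-minute wait mentioned in CONNECTORS-207. */
    static final long RETRY_DELAY_MS = TimeUnit.MINUTES.toMillis(5);

    /** Maps an error type to the behavior described in the thread. */
    static Action decide(ErrorType type) {
        switch (type) {
            case REPOSITORY_CONNECTION_ERROR:
                return Action.RETRY_LATER; // wait, then try again
            case INTERRUPTED:
                return Action.PROPAGATE;   // assumption: hand the interruption upward
            default:
                return Action.ABORT_JOB;   // any other type aborts the job
        }
    }

    /** Earliest time (ms since epoch) at which a retry would be attempted. */
    static long nextAttemptTime(long nowMs) {
        return nowMs + RETRY_DELAY_MS;
    }

    public static void main(String[] args) {
        for (ErrorType t : ErrorType.values()) {
            System.out.println(t + " -> " + decide(t));
        }
    }
}
```

The change floated in CONNECTORS-207 would amount to moving REPOSITORY_CONNECTION_ERROR into the default branch.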
Re: Exception Handling
Got it working. I have a question about retryTime and failTime: from your reply I got the impression that the user gets the choice to skip or abort, so what do you set these parameters to? 0? Thanks!

On 6/3/2011 12:11 PM, Karl Wright wrote:

Your choice of exception would have been fine if this were a repository connector, but output connectors do not have the same ability to abort jobs via ManifoldCFExceptions at this time. (You can create a ticket if you think this is how it should work.) But if you want the job to abort, you probably want to throw a ServiceInterruption exception, with zero retries. You have a choice of skip or abort job as actions. I recently made this work, so let me know if you encounter any problems.

http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

Karl

On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote:

So my output connector connects to another repository. If I can't log in to that repository, I execute the following line:

throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(), e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR);

ManifoldCF continues the crawl and actually puts out a WARN message. I expected ManifoldCF to halt the job and show the error in the UI; at least, that is my desired outcome. Do I need to throw a different exception type besides Repository Connection Error?

Here is what I get in the log file:

WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:202)
    ... 13 more
Caused by: java.net.ConnectException: Connection timed out: connect
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(Unknown Source)
    at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
    at
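To make the retryTime/failTime question concrete, here is a minimal sketch of how the values mentioned in this thread (a retry time, a fail time, a retry count, and a skip-or-abort choice) could fit together. `InterruptionSketch` is a hypothetical stand-in written for this example; it is not the real ServiceInterruption class, and its constructor, field names, and give-up rule are assumptions. The ServiceInterruption.java source linked above remains the authority on the actual parameters and their meaning.

```java
/** Hypothetical stand-in for illustration; the real ServiceInterruption may differ. */
final class InterruptionSketch extends Exception {
    final long retryTime;   // assumed: earliest time (ms since epoch) for another attempt
    final long failTime;    // assumed: time (ms since epoch) after which retrying stops
    final int retriesLeft;  // assumed: attempts still allowed before giving up
    final boolean abortJob; // true: abort the job on failure; false: skip the document

    InterruptionSketch(String message, Throwable cause, long retryTime,
                       long failTime, int retriesLeft, boolean abortJob) {
        super(message, cause);
        this.retryTime = retryTime;
        this.failTime = failTime;
        this.retriesLeft = retriesLeft;
        this.abortJob = abortJob;
    }

    /** "Zero retries, abort the job": in this sketch the two times are both set to now. */
    static InterruptionSketch abortImmediately(String message, Throwable cause, long now) {
        return new InterruptionSketch(message, cause, now, now, 0, true);
    }

    /** In this sketch, giving up happens when no retries remain or the fail time has passed. */
    boolean shouldGiveUp(long now) {
        return retriesLeft <= 0 || now >= failTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        InterruptionSketch s = abortImmediately("txn [login] failed",
                new RuntimeException("connection refused"), now);
        System.out.println("give up: " + s.shouldGiveUp(now) + ", abort job: " + s.abortJob);
    }
}
```

Under these assumed rules, a retry count of zero makes the outcome independent of the two time values, which is why the sketch sets both to the current time; whether the framework treats them the same way is not stated in the thread.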
[jira] [Commented] (CONNECTORS-204) Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it
[ https://issues.apache.org/jira/browse/CONNECTORS-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044007#comment-13044007 ]

Karl Wright commented on CONNECTORS-204:

r1131177 has part of the code.

Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it

Key: CONNECTORS-204
URL: https://issues.apache.org/jira/browse/CONNECTORS-204
Project: ManifoldCF
Issue Type: Improvement
Components: Build
Reporter: Karl Wright
Assignee: Karl Wright

The latest HSQLDB fixes and features make it an attractive alternative to Derby. But we need a test target that exercises it.
Strange Exception
So I've been trying to figure this out for days now and I'm still not even close. I'm getting this in the log file:

FATAL 2011-06-03 14:26:39,188 [Worker thread '22'] (WorkerThread.java:955) - Error tossed: null
java.lang.NullPointerException
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:153)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)

When I go to line 153 of DupFinderConnector, I'm calling:

boolean isDuplicate = dataManager.insertData(timeStamp, rcdCounter++, documentURI, outputDescription, authorityNameString, document.getBinaryLength(), 1, hashsumHexValue, inputStream);

I added a log statement to print out all the parameters. The only null one is authorityNameString. Also, this error shows up in only a few worker threads, not all of them, and not in every crawl job. Any explanation or clue as to what I should be looking for? Thanks!
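One general Java point may help narrow this down. When the top frame of a NullPointerException is the calling method itself (here addOrReplaceDocument at line 153), the null was dereferenced on that line, not inside the callee. Passing a null argument such as authorityNameString does not throw at the call site. On a line like the one quoted, the usual candidates are a null receiver (dataManager or document) or the unboxing of a null wrapper value; which of these applies, if any, cannot be determined from the post. The sketch below demonstrates the difference using a hypothetical `DataManager` stand-in with a simplified signature, not the poster's real class.

```java
/** Illustrative only: DataManager here is a made-up stand-in with a simplified signature. */
final class NullReceiverSketch {
    static final class DataManager {
        boolean insertData(String documentURI, String authorityNameString) {
            // A null argument is harmless unless the callee dereferences it.
            return authorityNameString == null;
        }
    }

    /** Calls a method on a null receiver and reports which method the top frame names. */
    static String topFrameWhenReceiverIsNull() {
        DataManager dataManager = null;
        try {
            dataManager.insertData("file:/a", "auth");
            return "no exception";
        } catch (NullPointerException e) {
            // The failing frame is this method, because the dereference happened here.
            return e.getStackTrace()[0].getMethodName();
        }
    }

    /** Passes a null argument to a non-null receiver; no exception at the call site. */
    static boolean callWithNullArgument() {
        DataManager dataManager = new DataManager();
        return dataManager.insertData("file:/a", null);
    }

    public static void main(String[] args) {
        System.out.println("null receiver, top frame: " + topFrameWhenReceiverIsNull());
        System.out.println("null argument accepted: " + callWithNullArgument());
    }
}
```

Since the failure appears in only some worker threads, logging dataManager and document themselves (not just the arguments) immediately before the call would show whether one of them is null on the threads that fail.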