Re: CrawlerCommons ManifoldCF

2011-06-03 Thread Julien Nioche
Hi,

We could reuse RobotsData indeed and refactor it a bit.

Ken, you said you'd be keen to contribute your code for robot parsing as
well - do you think it would be quicker than refactoring Manifold's code? Or
does it support additional features? What about Droids?

Julien

PS: Anyone attending BerlinBuzzwords next week?


On 2 June 2011 17:57, Karl Wright daddy...@gmail.com wrote:

 I don't think it would be hard to peel out the robots parser, although
 obviously it would need refactoring to live in a more standard library
 environment.  If you want to look at it, it is in:


 https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

 Look for the static class RobotsData, around line 299.

 Karl



 On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
  Hi Karl,
 
  Maybe a good start would be to identify which parts of your crawler could be
  shared and would not take too much effort to be made generic. I haven't
  looked at the crawler code in great detail, but do you think the
  robots parser would be a good candidate?
 
  Julien
 
  On 2 June 2011 16:23, Karl Wright daddy...@gmail.com wrote:
 
  Absolutely!
  We're a bit thin on active committers at the moment, which will
  probably limit our ability to take any highly active roles in your
  development process.  But we do have a pile of code which you might be
  able to leverage, and once there is common functionality available I
  think we'd all prefer to use that rather than home-grown code.
 
  How would you prefer that we proceed?
 
  Karl
 
 
  On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
  lists.digitalpeb...@gmail.com wrote:
   Hi guys,
  
   I'd just like to mention Crawler Commons, which is an effort between the
   committers of various crawl-related projects (Nutch, Bixo or Heritrix) to
   put some basic functionality in common. We currently have mostly a top
   level domain finder and a sitemap parser, but are definitely planning to
   have other things there as well, e.g. a robots.txt parser, protocol handler,
   etc.
  
   Would you like to get involved? There are quite a few things that the
   crawler in Manifold could reuse or contribute to.
  
   Best,
  
   Julien
  
   --
   *
   *Open Source Solutions for Text Engineering
  
   http://digitalpebble.blogspot.com/
   http://www.digitalpebble.com
  
 
 
 
 
  --
  *
  *Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
 




-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org)

2011-06-03 Thread Tommaso Teofili
I've signed off our report, thanks Karl for taking care.
Tommaso

2011/6/2 Karl Wright daddy...@gmail.com

 I've now edited the page accordingly.  Let me know of any changes
 you'd like to see.
 Karl

 On Thu, Jun 2, 2011 at 4:18 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  it sounds good to me, any others?
  Tommaso
 
  2011/6/1 Karl Wright daddy...@gmail.com
 
  Here's my proposed text:
 
  ManifoldCF
 
  --Description--
 
  ManifoldCF is an incremental crawler framework and set of connectors
  designed to pull documents from various kinds of repositories into
  search engine indexes or other targets. The current bevy of connectors
  includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio
  (Autonomy), SharePoint (Microsoft), RSS feeds, and web content.
  ManifoldCF also provides components for individual document security
  within a target search engine, so that repository security access
  conventions can be enforced in the search results.
 
  ManifoldCF has been in incubation since January, 2010. It was
  originally a planned subproject of Lucene but is now a likely
  top-level project.
 
  --A list of the three most important issues to address in the move
  towards graduation--
 
  1. We need at least one additional active committer, as well as
  additional users and repeat contributors
  2. We may want another release before graduating
  3. We'd like to see long-term contributions for project testing,
  especially infrastructure access
 
  --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to
  be aware of?--
 
  All issues have been addressed to our satisfaction at this time.
 
  --How has the community developed since the last report?--
 
  A book has been completed and is becoming available in early-release
  form from Manning Publishing.  We have signed up two new
  committers and one new mentor.  We continue to have user community
  interest.  We've had a number of extremely helpful bug reports and
  contributions from the field.
 
  --How has the project developed since the last report?--
 
  A 0.1 release was made on January 31, 2011, and a 0.2 release
  occurred on May 17, 2011.  Another release is being considered.
 
  Signed off by mentor:
 
 
 
  Karl
 
  On Wed, Jun 1, 2011 at 10:23 AM, Tommaso Teofili
  tommaso.teof...@gmail.com wrote:
   I think the successful release should be mentioned too :-)
   Tommaso
  
   2011/6/1 Karl Wright daddy...@gmail.com
  
   The March report looked like this:
  
   ManifoldCF
  
   --Description--
  
   ManifoldCF is an incremental crawler framework and set of connectors
   designed to pull documents from various kinds of repositories into
   search engine indexes or other targets. The current bevy of connectors
   includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio
   (Autonomy), SharePoint (Microsoft), RSS feeds, and web content.
   ManifoldCF also provides components for individual document security
   within a target search engine, so that repository security access
   conventions can be enforced in the search results.
  
   ManifoldCF has been in incubation since January, 2010. It was
   originally a planned subproject of Lucene but is now a likely
   top-level project.
  
   --A list of the three most important issues to address in the move
   towards graduation--
  
   1. We need at least three additional active committers, as well as
   additional users and repeat contributors
   2. We should have at least one or two more releases before graduating
   3. We'd like to see long-term contributions for project testing,
   especially infrastructure access
  
   --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to
   be aware of?--
  
   All issues have been addressed to our satisfaction at this time.
  
   --How has the community developed since the last report?--
  
   A book is being written, and has entered the early-release phase,
   available from Manning Publishing.  We continue to have user community
   interest.  We've had a number of extremely helpful bug reports and
   contributions from the field.  The active committer list remains
   short, however.
  
   --How has the project developed since the last report?--
  
   An 0.1 release was made on January 31, 2011, and another release is
   being considered.  Contributions extending the FileNet connector have
   been made, as well as contributions to the Solr connector.
  
   Signed off by mentor: Grant Ingersoll
  
   I'd like to mention our new committers and mentor, and the completion
   of the book.  Anything else that should be added?
   Karl
  
  
   -- Forwarded message --
   From:  no-re...@apache.org
   Date: Wed, Jun 1, 2011 at 10:00 AM
   Subject: Incubator PMC/Board report for June 2011
   (connectors-dev@incubator.apache.org)
   To: connectors-dev@incubator.apache.org
  
  
   Dear ManifoldCF Developers,
  
   This email was sent by an automated system on behalf of the 

RE: CrawlerCommons ManifoldCF

2011-06-03 Thread Fuad Efendi
Thanks Julien; I found it, strange...

Yes, I need to separate Robots Rules Parser, if BIXO agrees...


ManifoldCF's current style:

1. Open a socket
2. Load 500 kbits (in 2 milliseconds)
3. Sleep 998 milliseconds

This is just because there is a user interface where we set the bandwidth
limit to 500 kbps (probably 50 kbytes)
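
The burst-then-sleep pattern in the three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not taken from the ManifoldCF source. It shows how a per-second bandwidth limit turns into a sleep time after each burst (500 kbit/s works out to 62,500 bytes per second).

```java
// Hypothetical sketch of burst-then-sleep throttling; not ManifoldCF's actual code.
final class BurstThrottle {
    private final long bytesPerSecond;

    BurstThrottle(long bitsPerSecond) {
        if (bitsPerSecond < 8) {
            throw new IllegalArgumentException("limit must be at least 8 bits per second");
        }
        // 8 bits per byte: 500,000 bit/s becomes 62,500 bytes/s.
        this.bytesPerSecond = bitsPerSecond / 8;
    }

    long bytesPerSecond() {
        return bytesPerSecond;
    }

    /**
     * Milliseconds to sleep after a burst so that the average rate over the
     * whole window (transfer time plus sleep time) stays at the limit.
     */
    long sleepMillisAfterBurst(long bytesRead, long elapsedMillis) {
        long windowMillis = (bytesRead * 1000L) / bytesPerSecond;
        return Math.max(0L, windowMillis - elapsedMillis);
    }
}
```

With a 500 kbit/s limit, a burst of 62,500 bytes that took 2 ms leaves 998 ms of sleep, during which the connection stays open but idle.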

So it will be hard... I'd like to see HttpClient instead... or, if
crawler-commons includes a fetcher, to see that... even better if the fetcher
is rich enough to support POST (there was some interest at Droids)

The existing code seems outdated: why should an external server allocate
resources (TCP and HTTP handler) that are not used 99.8% of the time?

But reuse of the robots rules is most important; Nutch has some problems
too...


Thanks




-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: June-03-11 7:01 AM
To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com
Subject: Re: CrawlerCommons  ManifoldCF

There is a link to the discussion group on the main page, becoming a member
of the group is pretty straightforward

On 3 June 2011 00:36, Fuad Efendi f...@efendi.ca wrote:

 I mean join button at http://code.google.com/p/crawler-commons/
 I am well familiar with BIXO and Droids; it will be hard to make minor 
 changes in ManifoldCF... although it's possible (without crawler 
 part, only robots rules parser)...
 -Fuad


 -----Original Message-----
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: June-02-11 7:05 PM
 To: connectors-dev@incubator.apache.org; 
 crawler-comm...@googlegroups.com
 Subject: RE: CrawlerCommons  ManifoldCF

 I'd like to join this project but can't find join button :) Thanks!

 Fuad Efendi
 +1 416-993-2060
 http://www.linkedin.com/in/liferay

 Tokenizer Inc.
 http://www.tokenizer.ca/
 Data Mining, Vertical Search

 -----Original Message-----
 From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
 Sent: June-02-11 11:11 AM
 To: connectors-dev@incubator.apache.org; 
 crawler-comm...@googlegroups.com
 Subject: CrawlerCommons  ManifoldCF

 Hi guys,

 I'd just like to mention Crawler Commons, which is an effort between the
 committers of various crawl-related projects (Nutch, Bixo or Heritrix)
 to put some basic functionality in common. We currently have mostly
 a top level domain finder and a sitemap parser, but are definitely
 planning to have other things there as well, e.g. a robots.txt parser,
 protocol handler, etc.

 Would you like to get involved? There are quite a few things that the 
 crawler in Manifold could reuse or contribute to.

 Best,

 Julien

 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com




--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com



[jira] [Commented] (CONNECTORS-114) Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB

2011-06-03 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043369#comment-13043369
 ] 

Karl Wright commented on CONNECTORS-114:


Remaining issues with HSQLDB have been resolved, so I'm closing this ticket.
r1131056.


 Derby seems too unstable in multithreaded situations to be a good database 
 for ManifoldCF, so try to add support for HSQLDB
 ---

 Key: CONNECTORS-114
 URL: https://issues.apache.org/jira/browse/CONNECTORS-114
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Reporter: Karl Wright
 Fix For: ManifoldCF 0.3


 Derby seems to have multiple problems:
 (1) It has internal deadlocks, which even if caught cause poor performance 
 due to stalling (CONNECTORS-111);
 (2) It has no support for certain SQL constructs (CONNECTORS-109 and 
 CONNECTORS-110);
 (3) It locks up entirely for some people (CONNECTORS-100).
 HSQLDB has been recommended as another potential embedded database that might 
 work better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (CONNECTORS-114) Derby seems too unstable in multithreaded situations to be a good database for ManifoldCF, so try to add support for HSQLDB

2011-06-03 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-114.


   Resolution: Fixed
Fix Version/s: ManifoldCF 0.3
 Assignee: Karl Wright

I have not yet made HSQLDB the official Derby replacement, but it is currently 
a better embedded option for many situations than Derby is.

 Derby seems too unstable in multithreaded situations to be a good database 
 for ManifoldCF, so try to add support for HSQLDB
 ---

 Key: CONNECTORS-114
 URL: https://issues.apache.org/jira/browse/CONNECTORS-114
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 0.3


 Derby seems to have multiple problems:
 (1) It has internal deadlocks, which even if caught cause poor performance 
 due to stalling (CONNECTORS-111);
 (2) It has no support for certain SQL constructs (CONNECTORS-109 and 
 CONNECTORS-110);
 (3) It locks up entirely for some people (CONNECTORS-100).
 HSQLDB has been recommended as another potential embedded database that might 
 work better.



[jira] [Updated] (CONNECTORS-206) HSQLDB is now a first-class ManifoldCF database; we should describe how to use it in the documentation

2011-06-03 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-206:
---

Affects Version/s: ManifoldCF 0.3

 HSQLDB is now a first-class ManifoldCF database; we should describe how to 
 use it in the documentation
 --

 Key: CONNECTORS-206
 URL: https://issues.apache.org/jira/browse/CONNECTORS-206
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Documentation
Affects Versions: ManifoldCF 0.3
Reporter: Karl Wright

 We're currently missing pretty much all mention of HSQLDB in the 
 documentation.  This includes how to enable it:
 org.apache.manifoldcf.databaseimplementationclass value 
 org.apache.manifoldcf.core.database.DBInterfaceHSQLDB
 ... as well as the property it has for pointing at the database instance:
 org.apache.manifoldcf.hsqldbdatabasepath value relative path
 In addition to the site documentation for how to use it, we should also 
 consider making HSQLDB be the default example database, since it seems to 
 have fewer real problems than Derby.  But this must wait until a test suite 
 is written for this database.
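
 As a rough illustration of the two settings named above, the sketch below collects them and renders one line per setting. The property names and the implementation class come from the ticket text; the `<property name="..." value="..."/>` element shape, the helper class `HsqldbSettings`, and the example path are assumptions and should be checked against the actual configuration file format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: gathers the two HSQLDB settings mentioned in the ticket.
final class HsqldbSettings {
    static final String IMPLEMENTATION_PROPERTY =
        "org.apache.manifoldcf.databaseimplementationclass";
    static final String IMPLEMENTATION_CLASS =
        "org.apache.manifoldcf.core.database.DBInterfaceHSQLDB";
    static final String PATH_PROPERTY =
        "org.apache.manifoldcf.hsqldbdatabasepath";

    /** Returns both settings in insertion order; databasePath is a relative path chosen by the caller. */
    static Map<String, String> forPath(String databasePath) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(IMPLEMENTATION_PROPERTY, IMPLEMENTATION_CLASS);
        settings.put(PATH_PROPERTY, databasePath);
        return settings;
    }

    /** Renders each setting as one property element (assumed element shape). */
    static String toPropertyLines(Map<String, String> settings) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            out.append("<property name=\"").append(e.getKey())
               .append("\" value=\"").append(e.getValue()).append("\"/>\n");
        }
        return out.toString();
    }
}
```

 For example, `HsqldbSettings.toPropertyLines(HsqldbSettings.forPath("hsqldb-data"))` yields two lines, one selecting the HSQLDB implementation class and one pointing at the hypothetical relative path `hsqldb-data`.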



Re: Exception Handling

2011-06-03 Thread Karl Wright
Your choice of exception would have been fine if this was a repository
connector, but output connectors do not have the same ability to abort
jobs via ManifoldCFExceptions at this time.  (You can create a ticket
if you think this is how it should work).  But if you want the job to
abort, you probably want to throw a ServiceInterruption exception,
with zero retries.  You have a choice of skip or abort job as
actions.  I recently made this work, so let me know if you encounter
any problems.
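
The suggestion above might look roughly like the following in an output connector. The real ServiceInterruption class is not reproduced here, so the sketch uses a stand-in, `ServiceInterruptionSketch`, whose constructor parameters (retry time, fail time, retry count, abort flag) are assumptions; check them against the source linked below before relying on this shape.

```java
// Stand-in for illustration only; NOT the real ManifoldCF class.
class ServiceInterruptionSketch extends Exception {
    final long retryTime;      // when the framework may try again (ms since epoch)
    final long failTime;       // when the framework should stop retrying (ms since epoch)
    final int failRetryCount;  // number of retries allowed before giving up
    final boolean abortOnFail; // true: abort the job; false: skip the document

    ServiceInterruptionSketch(String message, Throwable cause, long retryTime,
                              long failTime, int failRetryCount, boolean abortOnFail) {
        super(message, cause);
        this.retryTime = retryTime;
        this.failTime = failTime;
        this.failRetryCount = failRetryCount;
        this.abortOnFail = abortOnFail;
    }
}

final class LoginFailureExample {
    /** Wraps a failed transaction so that the job aborts with zero retries. */
    static ServiceInterruptionSketch abortWithoutRetry(String txn, Exception e) {
        long now = System.currentTimeMillis();
        return new ServiceInterruptionSketch(
            "txn [" + txn + "] failed with error " + e, e,
            now, now, 0, true);
    }
}
```

A connector would then `throw LoginFailureExample.abortWithoutRetry("login", e)` from its catch block; passing `false` for the last argument would select the skip action instead of aborting the job.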

http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

Karl

On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote:
 So my output connector connects to another repository.  If I can't log in to
 that repository, I execute the following line:

   throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(),
     e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR);

 ManifoldCF continues the crawl and actually puts out a WARN message.  I
 expected ManifoldCF to halt the job and show the error in the UI; at least
 that is my desired outcome.  Do I need to throw a different exception type
 besides Repository Connection Error?  Here is what I get in the log file:

 WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) - Connection service interruption reported for job 1306961303236 connection 'FileShare': txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login] failed with error org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)
 Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://valadbld:34544 refused
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:202)
    ... 13 more
 Caused by: java.net.ConnectException: Connection timed out: connect
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(Unknown Source)
    at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
    ... 20 more



Re: Exception Handling

2011-06-03 Thread Karl Wright
Actually, looking at the code, a ManifoldCFException of type
REPOSITORY_CONNECTION_ERROR is retried very specifically in this way for
both repository and output connectors.  Any other ManifoldCFException
type (except INTERRUPTED) will cause the job to abort.  I'm having a
hard time recollecting the reason for this special behavior, but I
vaguely recall that it had something to do with the LiveLink connector.
I'll post later if it comes back to me.
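
The handling described above can be modelled as a small decision table. This is a hypothetical sketch, not the worker thread's actual code: the enum and method names are invented for illustration, and what happens for INTERRUPTED is an assumption, since the message only says that it does not abort the job.

```java
// Hypothetical model of the behavior described above; names are not from the ManifoldCF source.
final class ErrorDispatchSketch {
    enum ErrorType { REPOSITORY_CONNECTION_ERROR, INTERRUPTED, OTHER }
    enum Action { RETRY_LATER, STOP_THREAD, ABORT_JOB }

    /** Maps an exception type to what the framework does with the job. */
    static Action actionFor(ErrorType type) {
        switch (type) {
            case REPOSITORY_CONNECTION_ERROR:
                // Reported as a WARN "service interruption"; the crawl continues.
                return Action.RETRY_LATER;
            case INTERRUPTED:
                // Assumption: interruption unwinds the worker without aborting the job.
                return Action.STOP_THREAD;
            default:
                // Every other type aborts the job.
                return Action.ABORT_JOB;
        }
    }
}
```

Seen this way, the WARN in the log is the retry branch being taken, which is why the job kept running instead of halting.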

Karl

On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote:
 Your choice of exception would have been fine if this was a repository
 connector, but output connectors do not have the same ability to abort
 jobs via ManifoldCFExceptions at this time.  (You can create a ticket
 if you think this is how it should work).  But if you want the job to
 abort, you probably want to throw a ServiceInterruption exception,
 with zero retries.  You have a choice of skip or abort job as
 actions.  I recently made this work, so let me know if you encounter
 any problems.

 http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

 Karl


Re: Exception Handling

2011-06-03 Thread Karl Wright
I remember now.
The problem was that the LiveLink API code, under certain conditions,
lied about the error it got back from the server.  Under these
conditions, therefore, a job would sometimes abort if a transient
error occurred.  The fix for this problem was made at the framework
level because the CIFS connector also suffered from the same kind of
problem, where a network glitch could cause a job to incorrectly abort
for connection reasons.

In both cases, the underlying problems were eventually resolved by
other means: in the case of LiveLink, by periodically restarting the
LiveLink server, and in the case of CIFS, by fixing a too-short
timeout in jcifs.  So, in theory, this retry logic could be removed.

I'll create a ticket to research this further.

Karl

On Fri, Jun 3, 2011 at 1:29 PM, Karl Wright daddy...@gmail.com wrote:
 Actually, looking at the code, the REPOSITORY_CONNECTION type
 ManifoldCFException error is retried very specifically in this way for
 both repository and output connectors.  Any other ManifoldCFException
 type (except INTERRUPTED) will cause the job to abort.  The reason for
 this special behavior for this ManifoldCFException type I'm having a
 hard time recollecting; but I seem to recall vaguely it had something
 to do with the LiveLink connector.  I'll post later if it comes back
 to me.

 Karl

 On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote:
 Your choice of exception would have been fine if this was a repository
 connector, but output connectors do not have the same ability to abort
 jobs via ManifoldCFExceptions at this time.  (You can create a ticket
 if you think this is how it should work).  But if you want the job to
 abort, you probably want to throw a ServiceInterruption exception,
 with zero retries.  You have a choice of skip or abort job as
 actions.  I recently made this work, so let me know if you encounter
 any problems.

 http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

 Karl


[jira] [Created] (CONNECTORS-207) ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second retry, but should probably abort the job instead

2011-06-03 Thread Karl Wright (JIRA)
ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second 
retry, but should probably abort the job instead
--

 Key: CONNECTORS-207
 URL: https://issues.apache.org/jira/browse/CONNECTORS-207
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Affects Versions: ManifoldCF 0.2, ManifoldCF 0.1, ManifoldCF 0.3
Reporter: Karl Wright


The way a worker thread treats ManifoldCFException type 
REPOSITORY_CONNECTION_ERROR is no longer correct.  It should probably just 
allow the job to be aborted with no retries.




[jira] [Updated] (CONNECTORS-207) ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-minute retry, but may want to abort the job instead

2011-06-03 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-207:
---

Description: 
The way a worker thread treats ManifoldCFException type 
REPOSITORY_CONNECTION_ERROR is to wait 5 minutes and retry.  It might want to 
just allow the job to be aborted with no retries.  The current behavior is not 
actually *wrong*, but the circumstances under which it was added were the 
result of severe problems at various sites that were unrelated to ManifoldCF.



  was:
The way a worker thread treats ManifoldCFException type 
REPOSITORY_CONNECTION_ERROR is no longer correct.  It should probably just 
allow the job to be aborted with no retries.


   Priority: Minor  (was: Major)
Summary: ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a 
five-minute retry, but may want to abort the job instead  (was: 
ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-second 
retry, but should probably abort the job instead)

 ManifoldCFException type REPOSITORY_CONNECTION_ERROR causes a five-minute 
 retry, but may want to abort the job instead
 --

 Key: CONNECTORS-207
 URL: https://issues.apache.org/jira/browse/CONNECTORS-207
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2, ManifoldCF 0.3
Reporter: Karl Wright
Priority: Minor

 The way a worker thread treats ManifoldCFException type 
 REPOSITORY_CONNECTION_ERROR is to wait 5 minutes and retry.  It might want to 
 just allow the job to be aborted with no retries.  The current behavior is 
 not actually *wrong*, but the circumstances under which it was added were the 
 result of severe problems at various sites that were unrelated to ManifoldCF.
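
A minimal sketch of the two behaviors under discussion may help when reading the issue. It is not ManifoldCF's WorkerThread code: WorkerPolicySketch, ErrorType, Action and their members are hypothetical names, and the handling of INTERRUPTED is an assumption, since the Exception Handling thread only says that type is excluded from the abort rule.

```java
// Illustrative only: hypothetical names, not ManifoldCF's actual WorkerThread code.
enum ErrorType { REPOSITORY_CONNECTION_ERROR, INTERRUPTED, OTHER }

enum Action { RETRY_LATER, ABORT_JOB, PROPAGATE }

final class WorkerPolicySketch {
    // Five minutes, as stated in the updated issue description.
    static final long RETRY_DELAY_MS = 5L * 60L * 1000L;

    // Behavior as described in the issue: connection errors are retried,
    // every other type aborts the job.
    static Action currentPolicy(ErrorType type) {
        switch (type) {
            case REPOSITORY_CONNECTION_ERROR:
                return Action.RETRY_LATER;
            case INTERRUPTED:
                return Action.PROPAGATE; // assumption: handled separately
            default:
                return Action.ABORT_JOB;
        }
    }

    // Alternative raised in the issue: abort on connection errors too.
    static Action proposedPolicy(ErrorType type) {
        return type == ErrorType.INTERRUPTED ? Action.PROPAGATE : Action.ABORT_JOB;
    }

    // Earliest time (ms) at which a retried document would be attempted again.
    static long nextAttemptTime(long nowMs) {
        return nowMs + RETRY_DELAY_MS;
    }

    public static void main(String[] args) {
        for (ErrorType t : ErrorType.values()) {
            System.out.println(t + ": current=" + currentPolicy(t)
                + ", proposed=" + proposedPolicy(t));
        }
    }
}
```

The only difference between the two policies is the REPOSITORY_CONNECTION_ERROR branch, which is why the issue can describe the current behavior as not actually wrong.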



Re: Exception Handling

2011-06-03 Thread Karl Wright
CONNECTORS-207 describes the situation.
Karl

On Fri, Jun 3, 2011 at 1:41 PM, Karl Wright daddy...@gmail.com wrote:
 I remember now.
 The problem was that the LiveLink API code, under certain conditions,
 lied about the error it got back from the server.  Under these
 conditions, therefore, a job would sometimes abort if a transient
 error occurred.  The fix for this problem was made at the framework
 level because the CIFS connector also suffers from this same kind of
 problem, where a network glitch could cause a job to incorrectly abort
 for connection reasons.

 In both cases, the underlying problems were resolved eventually by
 other means - in the case of LiveLink, by periodically restarting the
 LiveLink server, and in the case of CIFS, by fixing a too-short
 timeout in jcifs.  So, in theory, this retry logic could be removed.

 I'll create a ticket to research this further.

 Karl

 On Fri, Jun 3, 2011 at 1:29 PM, Karl Wright daddy...@gmail.com wrote:
 Actually, looking at the code, a ManifoldCFException of type
 REPOSITORY_CONNECTION_ERROR is retried very specifically in this way for
 both repository and output connectors.  Any other ManifoldCFException
 type (except INTERRUPTED) will cause the job to abort.  I'm having a
 hard time recollecting the reason for this special behavior, but I
 vaguely recall it had something to do with the LiveLink connector.
 I'll post later if it comes back to me.

 Karl

 On Fri, Jun 3, 2011 at 1:11 PM, Karl Wright daddy...@gmail.com wrote:
 Your choice of exception would have been fine if this was a repository
 connector, but output connectors do not have the same ability to abort
 jobs via ManifoldCFExceptions at this time.  (You can create a ticket
 if you think this is how it should work).  But if you want the job to
 abort, you probably want to throw a ServiceInterruption exception,
 with zero retries.  You have a choice of skip or abort job as
 actions.  I recently made this work, so let me know if you encounter
 any problems.

 http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

 Karl

 On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote:
 So my output connector connects to another repository.  If I can't log in to
 that repository, I execute the following line:

   throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(),
     e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR);

 ManifoldCF continues the crawl and actually puts out a WARN message.  I
 expected ManifoldCF to halt the job and show the error in the UI; at least
 that is my desired outcome.  Do I need to throw a different exception type
 besides Repository Connection Error?  Here is what I get in the log file:

  WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) -
 Connection service interruption reported for job 1306961303236 connection
 'FileShare': txn [login] failed with error
 org.apache.http.conn.HttpHostConnectException: Connection to
 http://valadbld:34544 refused
 org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login]
 failed with error org.apache.http.conn.HttpHostConnectException: Connection
 to http://valadbld:34544 refused
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134)
    at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261)
    at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
    at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
    at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)
 Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
 http://valadbld:34544 refused
    at
 

Re: Exception Handling

2011-06-03 Thread Farzad Valad
Got it working.  I have a question about retryTime and failTime: from your
reply I got the impression that the user gets the choice to skip or abort,
so what do you set these parameters to?  0?  Thanks!
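
For readers following along, here is a rough sketch of the shape of throw being discussed: an interruption that carries retryTime and failTime plus a zero retry count and an abort flag. InterruptionSketch is a stand-in written for this example, not the real org.apache.manifoldcf.agents.interfaces.ServiceInterruption; the failRetryCount and abortOnFail names, and setting both times to "now", are assumptions. The real constructor and the meaning of each value should be checked against ServiceInterruption.java, linked in the quoted reply below.

```java
import java.net.ConnectException;

// Stand-in for illustration only; NOT ManifoldCF's ServiceInterruption class.
final class InterruptionSketch extends Exception {
    final long retryTime;      // when a retry would next be attempted (ms)
    final long failTime;       // when retrying would be given up (ms)
    final int failRetryCount;  // hypothetical: retries allowed before failing
    final boolean abortOnFail; // hypothetical: true = abort job, false = skip document

    InterruptionSketch(String message, Throwable cause, long retryTime,
                       long failTime, int failRetryCount, boolean abortOnFail) {
        super(message, cause);
        this.retryTime = retryTime;
        this.failTime = failTime;
        this.failRetryCount = failRetryCount;
        this.abortOnFail = abortOnFail;
    }
}

final class OutputLoginSketch {
    // Simulates the connector's login step; "reachable" stands in for the
    // real HTTP call to the target repository.
    static void login(boolean reachable, long nowMs) throws InterruptionSketch {
        if (!reachable) {
            Exception cause = new ConnectException("Connection refused");
            // Zero retries and abort-on-fail: the job should stop rather than
            // continue with a warning. Using nowMs for both times is an assumption.
            throw new InterruptionSketch(
                "txn [login] failed with error " + cause,
                cause, nowMs, nowMs, 0, true);
        }
    }

    public static void main(String[] args) {
        try {
            login(false, System.currentTimeMillis());
        } catch (InterruptionSketch e) {
            System.out.println(e.getMessage() + " (retries=" + e.failRetryCount
                + ", abort=" + e.abortOnFail + ")");
        }
    }
}
```

Whether 0 is the right value for the two time parameters in the real class is exactly the open question above, so the sketch should not be read as an answer to it.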


On 6/3/2011 12:11 PM, Karl Wright wrote:

Your choice of exception would have been fine if this was a repository
connector, but output connectors do not have the same ability to abort
jobs via ManifoldCFExceptions at this time.  (You can create a ticket
if you think this is how it should work).  But if you want the job to
abort, you probably want to throw a ServiceInterruption exception,
with zero retries.  You have a choice of skip or abort job as
actions.  I recently made this work, so let me know if you encounter
any problems.

http://svn.apache.org/repos/asf/incubator/lcf/trunk/framework/agents/src/main/java/org/apache/manifoldcf/agents/interfaces/ServiceInterruption.java

Karl

On Fri, Jun 3, 2011 at 1:02 PM, Farzad Valad ho...@farzad.net wrote:

So my output connector connects to another repository.  If I can't log in to
that repository, I execute the following line:

  throw new ManifoldCFException("txn [" + txn + "] failed with error " + e.toString(),
    e, ManifoldCFException.REPOSITORY_CONNECTION_ERROR);

ManifoldCF continues the crawl and actually puts out a WARN message.  I
expected ManifoldCF to halt the job and show the error in the UI; at least
that is my desired outcome.  Do I need to throw a different exception type
besides Repository Connection Error?  Here is what I get in the log file:

  WARN 2011-06-01 15:51:42,276 [Worker thread '27'] (WorkerThread.java:855) -
Connection service interruption reported for job 1306961303236 connection
'FileShare': txn [login] failed with error
org.apache.http.conn.HttpHostConnectException: Connection to
http://valadbld:34544 refused
org.apache.manifoldcf.core.interfaces.ManifoldCFException: txn [login]
failed with error org.apache.http.conn.HttpHostConnectException: Connection
to http://valadbld:34544 refused
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:266)
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:318)
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:314)
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.Login(CIConnector.java:134)
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.initialize(CIConnector.java:114)
at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.getSession(DupFinderConnector.java:261)
at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:137)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
http://valadbld:34544 refused
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:149)
at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:561)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
at org.apache.manifoldcf.agents.output.dupfinder.CIConnector.sendTxn(CIConnector.java:202)
... 13 more
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
at

[jira] [Commented] (CONNECTORS-204) Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it

2011-06-03 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044007#comment-13044007
 ] 

Karl Wright commented on CONNECTORS-204:


r1131177 has part of the code.

 Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to 
 test it
 

 Key: CONNECTORS-204
 URL: https://issues.apache.org/jira/browse/CONNECTORS-204
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Build
Reporter: Karl Wright
Assignee: Karl Wright

 The latest HSQLDB fixes and features make it an attractive alternative to 
 Derby.  But we need a test target that exercises it.



Strange Exception

2011-06-03 Thread Farzad Valad
I've been trying to figure this out for days now and am still not even
close.  I'm getting this in the log file:


FATAL 2011-06-03 14:26:39,188 [Worker thread '22'] (WorkerThread.java:955) - Error tossed: null

java.lang.NullPointerException
 at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:153)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1433)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:418)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:313)
 at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1565)
 at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
 at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:564)


When I go to line 153 of DupFinderConnector I'm calling:

boolean isDuplicate = dataManager.insertData(timeStamp, 
rcdCounter++, documentURI, outputDescription, authorityNameString, 
document.getBinaryLength(), 1, hashsumHexValue, inputStream);


I added a log statement to print out all the parameters.  The only null one
is authorityNameString.  Also, this error shows up in only a few worker
threads, not all of them, and not in every crawl job.  Any explanation or
clue as to what I should be looking for?


Thanks!
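
One observation that may help narrow this down, offered as a sketch rather than a diagnosis: the top frame of the trace is DupFinderConnector.addOrReplaceDocument itself, not insertData. In Java, passing a null argument does not throw at the call site, so a null authorityNameString on its own would normally fail inside insertData and show that method as the top frame. An exception raised on line 153 itself suggests something dereferenced on that line is null, for example the dataManager or document reference, or a boxed value being unboxed. The message does not show those declarations, so which of these applies, if any, is an assumption. The names below (NpeSiteSketch, DataManager, LenientManager, StrictManager) are hypothetical stand-ins.

```java
final class NpeSiteSketch {
    // Hypothetical stand-in for the connector's data manager.
    interface DataManager {
        boolean insertData(String documentURI, String authorityName);
    }

    // Tolerates a null authority name.
    static final class LenientManager implements DataManager {
        public boolean insertData(String documentURI, String authorityName) {
            return authorityName == null;
        }
    }

    // Dereferences the authority name, so a null value fails inside insertData.
    static final class StrictManager implements DataManager {
        public boolean insertData(String documentURI, String authorityName) {
            return authorityName.isEmpty();
        }
    }

    // Plays the role of the call on line 153: a null receiver fails here,
    // a null argument does not.
    static boolean callInsert(DataManager dataManager, String uri, String authorityName) {
        return dataManager.insertData(uri, authorityName);
    }

    // Returns the method name of the top stack frame of any
    // NullPointerException, or "none" if the call completed.
    static String npeTopFrame(DataManager dataManager, String authorityName) {
        try {
            callInsert(dataManager, "file:/doc", authorityName);
            return "none";
        } catch (NullPointerException e) {
            return e.getStackTrace()[0].getMethodName();
        }
    }

    public static void main(String[] args) {
        System.out.println(npeTopFrame(new LenientManager(), null));
        System.out.println(npeTopFrame(new StrictManager(), null));
        System.out.println(npeTopFrame(null, null));
    }
}
```

Only the third call, with a null receiver, reports the calling method as the top frame, which is the pattern in the trace above. If dataManager is a field that is set up lazily per connector instance, that would also be consistent with the error appearing in only some worker threads, though that is speculation without the surrounding code.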