Re: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org)

2011-06-02 Thread Tommaso Teofili
It sounds good to me; any others?
Tommaso

2011/6/1 Karl Wright daddy...@gmail.com

 Here's my proposed text:

 ManifoldCF

 --Description--

 ManifoldCF is an incremental crawler framework and set of connectors
 designed to pull documents from various kinds of repositories into
 search engine indexes or other targets. The current bevy of connectors
 includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio
 (Autonomy), SharePoint (Microsoft), RSS feeds, and web content.
 ManifoldCF also provides components for individual document security
 within a target search engine, so that repository security access
 conventions can be enforced in the search results.

 ManifoldCF has been in incubation since January, 2010. It was
 originally a planned subproject of Lucene but is now a likely
 top-level project.

 --A list of the three most important issues to address in the move
 towards graduation--

 1. We need at least one additional active committer, as well as
 additional users and repeat contributors
 2. We may want another release before graduating
 3. We'd like to see long-term contributions for project testing,
 especially infrastructure access

 --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to
 be aware of?--

 All issues have been addressed to our satisfaction at this time.

 --How has the community developed since the last report?--

 A book has been completed and is becoming available in early-release
 form from Manning Publications.  We have signed up two new
 committers and one new mentor.  We continue to have user community
 interest.  We've had a number of extremely helpful bug reports and
 contributions from the field.

 --How has the project developed since the last report?--

 A 0.1 release was made on January 31, 2011, and a 0.2 release
 followed on May 17, 2011.  Another release is being considered.

 Signed off by mentor:



 Karl

 On Wed, Jun 1, 2011 at 10:23 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  I think the successful release should be mentioned too :-)
  Tommaso
 
  2011/6/1 Karl Wright daddy...@gmail.com
 
  The March report looked like this:
 
  ManifoldCF
 
  --Description--
 
  ManifoldCF is an incremental crawler framework and set of connectors
  designed to pull documents from various kinds of repositories into
  search engine indexes or other targets. The current bevy of connectors
  includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio
  (Autonomy), SharePoint (Microsoft), RSS feeds, and web content.
  ManifoldCF also provides components for individual document security
  within a target search engine, so that repository security access
  conventions can be enforced in the search results.
 
  ManifoldCF has been in incubation since January, 2010. It was
  originally a planned subproject of Lucene but is now a likely
  top-level project.
 
  --A list of the three most important issues to address in the move
  towards graduation--
 
  1. We need at least three additional active committers, as well as
  additional users and repeat contributors
  2. We should have at least one or two more releases before graduating
  3. We'd like to see long-term contributions for project testing,
  especially infrastructure access
 
  --Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to
  be aware of?--
 
  All issues have been addressed to our satisfaction at this time.
 
  --How has the community developed since the last report?--
 
  A book is being written and has entered the early-release phase; it is
  available from Manning Publications.  We continue to have user community
  interest.  We've had a number of extremely helpful bug reports and
  contributions from the field.  The active committer list remains
  short, however.
 
  --How has the project developed since the last report?--
 
  A 0.1 release was made on January 31, 2011, and another release is
  being considered.  Contributions extending the FileNet connector have
  been made, as well as contributions to the Solr connector.
 
  Signed off by mentor: Grant Ingersoll
 
  I'd like to mention our new committers and mentor, and the completion
  of the book.  Anything else that should be added?
  Karl
 
 
  -- Forwarded message --
  From:  no-re...@apache.org
  Date: Wed, Jun 1, 2011 at 10:00 AM
  Subject: Incubator PMC/Board report for June 2011
  (connectors-dev@incubator.apache.org)
  To: connectors-dev@incubator.apache.org
 
 
  Dear ManifoldCF Developers,
 
  This email was sent by an automated system on behalf of the Apache
  Incubator PMC.  It is an initial reminder to give you plenty of time to
  prepare your quarterly board report.

  The board meeting is scheduled for Wed, 15 June 2011, 10 am Pacific.  The
  report for your podling will form a part of the Incubator PMC report.  The
  Incubator PMC requires your report to be submitted one week before the
  board meeting, to allow sufficient time for review.
 
  

[jira] [Commented] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby

2011-06-02 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042655#comment-13042655
 ] 

Karl Wright commented on CONNECTORS-110:


HSQLDB is now also in roughly the same situation, although I've worked out a
rough outline of a way to make this work using temporary tables.  It is as
follows:

SELECT *
FROM (SELECT DISTINCT customerid FROM invoice) AS i_one,
     LATERAL (SELECT id, total FROM invoice
              WHERE customerid = i_one.customerid
              ORDER BY total DESC LIMIT 1) AS i_two

... where invoice would be a temporary table created on the fly, as follows:


DECLARE LOCAL TEMPORARY TABLE T AS (SELECT statement) [ON COMMIT {
PRESERVE | DELETE } ROWS]

For example:

DECLARE LOCAL TEMPORARY TABLE invoice AS (SELECT * FROM whatever)
ON COMMIT DELETE ROWS WITH DATA

then perform the kind of query I suggested.

The issue is that this does not fit into our single-query abstraction metaphor
at all.  Maybe a (different but identically named) stored procedure could be
generated on all three databases that would do the trick.  Alternatively, all
databases could go the temporary-table route, but then PostgreSQL would be
unnecessarily crippled.
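For contrast, the same "top row per group" result that DISTINCT ON or the LATERAL trick produces can also be written as a correlated subquery, which is plain SQL-92 and should run on all three databases. A minimal sketch against an in-memory SQLite database (the invoice table and its data are invented for illustration):

```python
import sqlite3

# Hypothetical invoice table, mirroring the i_one/i_two example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER, customerid INTEGER, total REAL)")
conn.executemany("INSERT INTO invoice VALUES (?, ?, ?)",
                 [(1, 100, 9.0), (2, 100, 25.0), (3, 200, 7.5)])

# Largest invoice per customer, with no DISTINCT ON and no LATERAL:
rows = conn.execute("""
    SELECT i.customerid, i.id, i.total
    FROM invoice i
    WHERE i.total = (SELECT MAX(i2.total) FROM invoice i2
                     WHERE i2.customerid = i.customerid)
    ORDER BY i.customerid
""").fetchall()
print(rows)  # [(100, 2, 25.0), (200, 3, 7.5)]
```

One caveat: unlike ORDER BY ... LIMIT 1, the MAX() form returns every row that ties for the maximum, so the two are only equivalent when the ordering column is unique per group.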




 Max activity and Max bandwidth reports don't work properly under Derby
 --

 Key: CONNECTORS-110
 URL: https://issues.apache.org/jira/browse/CONNECTORS-110
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Reporter: Karl Wright

 The reason for the failure is that the queries used rely on the
 PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support.
 Unfortunately, there does not seem to be a way in Derby at present to do
 anything similar to DISTINCT ON (xxx), and the queries really can't be done
 without it.
 One option is to introduce a getCapabilities() method into the database 
 implementation, which would allow ACF to query the database capabilities 
 before even presenting the report in the navigation menu in the UI.  Another 
 alternative is to do a sizable chunk of resultset processing within ACF, 
 which would require not only the DISTINCT ON() implementation, but also the 
 enclosing sort and limit stuff.  It's the latter that would be most 
 challenging, because of the difficulties with i18n etc.
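The first option, a capabilities query gating the UI, could look something like this minimal sketch; the class and method names here are invented for illustration, and the real implementation would of course be Java rather than Python:

```python
# Hypothetical capability flag; the real set would live in the database layer.
CAP_DISTINCT_ON = "distinct_on"

class PostgresqlDriver:
    def get_capabilities(self):
        return {CAP_DISTINCT_ON}

class DerbyDriver:
    def get_capabilities(self):
        return set()  # no DISTINCT ON equivalent

def available_reports(driver):
    # Only offer reports the underlying database can actually execute.
    reports = ["Simple history", "Document status"]
    if CAP_DISTINCT_ON in driver.get_capabilities():
        reports += ["Max activity", "Max bandwidth"]
    return reports

print(available_reports(DerbyDriver()))
print(available_reports(PostgresqlDriver()))
```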

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB

2011-06-02 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-110:
---

Summary: Max activity and Max bandwidth reports don't work properly under 
Derby or HSQLDB  (was: Max activity and Max bandwidth reports don't work 
properly under Derby)

 Max activity and Max bandwidth reports don't work properly under Derby or 
 HSQLDB
 

 Key: CONNECTORS-110
 URL: https://issues.apache.org/jira/browse/CONNECTORS-110
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Reporter: Karl Wright

 The reason for the failure is that the queries used rely on the
 PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support.
 Unfortunately, there does not seem to be a way in Derby at present to do
 anything similar to DISTINCT ON (xxx), and the queries really can't be done
 without it.
 One option is to introduce a getCapabilities() method into the database 
 implementation, which would allow ACF to query the database capabilities 
 before even presenting the report in the navigation menu in the UI.  Another 
 alternative is to do a sizable chunk of resultset processing within ACF, 
 which would require not only the DISTINCT ON() implementation, but also the 
 enclosing sort and limit stuff.  It's the latter that would be most 
 challenging, because of the difficulties with i18n etc.



[jira] [Created] (CONNECTORS-204) Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it

2011-06-02 Thread Karl Wright (JIRA)
Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to 
test it


 Key: CONNECTORS-204
 URL: https://issues.apache.org/jira/browse/CONNECTORS-204
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Build
Reporter: Karl Wright


The latest HSQLDB fixes and features make it an attractive alternative to 
Derby.  But we need a test target that exercises it.




[jira] [Commented] (CONNECTORS-203) Consider porting ManifoldCF to Java 1.5 code standards

2011-06-02 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042724#comment-13042724
 ] 

Karl Wright commented on CONNECTORS-203:


I've merged all the major interface changes into trunk in r1130475.  The
branch can now go away, and further changes can be made incrementally on trunk.


 Consider porting ManifoldCF to Java 1.5 code standards
 --

 Key: CONNECTORS-203
 URL: https://issues.apache.org/jira/browse/CONNECTORS-203
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Active Directory authority, Authority Service, Build, 
 Documentation, Documentum connector, File system connector, FileNet 
 connector, Framework agents process, Framework core, Framework crawler agent, 
 GTS connector, JCIFS connector, JDBC connector, LiveLink connector, 
 Lucene/SOLR connector, Meridio connector, RSS connector, SharePoint 
 connector, Web connector
Affects Versions: ManifoldCF 0.3
Reporter: Karl Wright

 Consider porting ManifoldCF to Java 1.5 standards.  This includes (but is not
 limited to):
 - build files
 - removing use of enum as a variable name
 - introducing generics in both implementation code and interfaces (cautiously)



[RESULT][VOTE] Adopt Java 1.5 as the minimum Java release for ManifoldCF

2011-06-02 Thread Karl Wright
Although it hasn't been quite the required 3 days, this vote isn't
binding anyway, so I'm going to declare it closed and commit the code.

Karl

On Mon, May 30, 2011 at 7:32 PM, Karl Wright daddy...@gmail.com wrote:
 Please have a look at CONNECTORS-203 and vote +1 if you think it's
 time to move beyond Java 1.4 at the source level, to Java 1.5.  I did
 some work on this over the weekend and managed to convince myself that
 a migration to a newer Java version will have no obvious ill effects.
 But I'd like your thoughts.

 Especially interesting will be whether or not we try to maintain
 backwards compatibility in the connector interfaces: IConnector,
 IRepositoryConnector, IAuthorityConnector, and IOutputConnector.  I
 have some ideas how I could bring these into the modern world in a
 relatively painless manner, should the community require it.  But I'd
 like to hear your views as to whether we *should* work towards that
 end, or be willing to disrupt our early adopters in this process.

 Karl



[jira] [Created] (CONNECTORS-205) Database DISTINCT ON abstraction needs to include ordering information in order to work for HSQLDB

2011-06-02 Thread Karl Wright (JIRA)
Database DISTINCT ON abstraction needs to include ordering information in order 
to work for HSQLDB
--

 Key: CONNECTORS-205
 URL: https://issues.apache.org/jira/browse/CONNECTORS-205
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework core, Framework crawler agent
Reporter: Karl Wright


The constructDistinctOnClause database method cannot support HSQLDB because it 
presumes that the ORDER BY clause is already part of the base query.  This 
blocks us from using the HSQLDB WITH/LATERAL temporary table solution for the 
functionality.

Adding ORDER BY information to the abstraction should work for all databases.
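A sketch of what carrying ordering information in the abstraction might look like (the function name echoes constructDistinctOnClause, but this signature and the Python rendering are hypothetical):

```python
def construct_distinct_on(base_query, distinct_cols, order_specs, dialect="postgresql"):
    # order_specs: list of (column, direction) pairs, e.g. [("total", "DESC")].
    # Passing the ordering explicitly, instead of assuming it is already
    # embedded in base_query, is what lets other backends emulate DISTINCT ON.
    order_by = ", ".join(f"{col} {direction}" for col, direction in order_specs)
    if dialect == "postgresql":
        distinct = ", ".join(distinct_cols)
        return (f"SELECT DISTINCT ON ({distinct}) * FROM ({base_query}) t"
                f" ORDER BY {distinct}, {order_by}")
    # A non-PostgreSQL backend would use the same inputs to build its own
    # plan, e.g. the temporary-table/LATERAL approach for HSQLDB.
    raise NotImplementedError(dialect)

sql = construct_distinct_on("SELECT * FROM invoice",
                            ["customerid"], [("total", "DESC")])
print(sql)
```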



CrawlerCommons & ManifoldCF

2011-06-02 Thread Julien Nioche
Hi guys,

I'd just like to mention Crawler Commons, which is an effort among the
committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
pool some basic functionality. We currently have mostly a top-level domain
finder and a sitemap parser, but we are definitely planning to add other
things there as well, e.g. a robots.txt parser, protocol handlers, etc.

Would you like to get involved? There are quite a few things that the
crawler in Manifold could reuse or contribute to.

Best,

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: CrawlerCommons & ManifoldCF

2011-06-02 Thread Karl Wright
Absolutely!
We're a bit thin on active committers at the moment, which will
probably limit our ability to take any highly active roles in your
development process.  But we do have a pile of code which you might be
able to leverage, and once there is common functionality available I
think we'd all prefer to use that rather than home-grown code.

How would you prefer that we proceed?

Karl


On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
 Hi guys,

 I'd just like to mention Crawler Commons, which is an effort among the
 committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
 pool some basic functionality. We currently have mostly a top-level domain
 finder and a sitemap parser, but we are definitely planning to add other
 things there as well, e.g. a robots.txt parser, protocol handlers, etc.

 Would you like to get involved? There are quite a few things that the
 crawler in Manifold could reuse or contribute to.

 Best,

 Julien

 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com



Re: CrawlerCommons & ManifoldCF

2011-06-02 Thread Julien Nioche
Hi Karl,

Maybe a good start would be to identify which parts of your crawler could be
shared and would not take too much effort to make generic. I haven't looked
at the crawler's code in great detail, but do you think the robots parser
would be a good candidate?

Julien

On 2 June 2011 16:23, Karl Wright daddy...@gmail.com wrote:

 Absolutely!
 We're a bit thin on active committers at the moment, which will
 probably limit our ability to take any highly active roles in your
 development process.  But we do have a pile of code which you might be
 able to leverage, and once there is common functionality available I
 think we'd all prefer to use that rather than home-grown code.

 How would you prefer that we proceed?

 Karl


 On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
  Hi guys,
 
  I'd just like to mention Crawler Commons, which is an effort among the
  committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
  pool some basic functionality. We currently have mostly a top-level domain
  finder and a sitemap parser, but we are definitely planning to add other
  things there as well, e.g. a robots.txt parser, protocol handlers, etc.
 
  Would you like to get involved? There are quite a few things that the
  crawler in Manifold could reuse or contribute to.
 
  Best,
 
  Julien
 
  --
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
 




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Commented] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB

2011-06-02 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042861#comment-13042861
 ] 

Karl Wright commented on CONNECTORS-110:


r1130644 implements this for HSQLDB.  Unfortunately, performance is extremely 
slow, even when the number of rows in the temporary table is only a few dozen.


 Max activity and Max bandwidth reports don't work properly under Derby or 
 HSQLDB
 

 Key: CONNECTORS-110
 URL: https://issues.apache.org/jira/browse/CONNECTORS-110
 Project: ManifoldCF
  Issue Type: Bug
  Components: Framework crawler agent
Reporter: Karl Wright

 The reason for the failure is that the queries used rely on the
 PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support.
 Unfortunately, there does not seem to be a way in Derby at present to do
 anything similar to DISTINCT ON (xxx), and the queries really can't be done
 without it.
 One option is to introduce a getCapabilities() method into the database 
 implementation, which would allow ACF to query the database capabilities 
 before even presenting the report in the navigation menu in the UI.  Another 
 alternative is to do a sizable chunk of resultset processing within ACF, 
 which would require not only the DISTINCT ON() implementation, but also the 
 enclosing sort and limit stuff.  It's the latter that would be most 
 challenging, because of the difficulties with i18n etc.



Re: CrawlerCommons & ManifoldCF

2011-06-02 Thread Karl Wright
I don't think it would be hard to peel out the robots parser, although
obviously it would need refactoring to live in a more standard library
environment.  If you want to look at it, it is in:

https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

Look for the static class RobotsData, around line 299.

Karl
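As an aside, the kind of generic API a shared robots library would expose is roughly what Python's standard-library parser already offers; the snippet below is purely an illustration of that interface (robots.txt contents invented), not ManifoldCF code:

```python
from urllib import robotparser

# Parse a small robots.txt supplied as a list of lines.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask whether a given agent may fetch a given URL.
print(rp.can_fetch("AnyCrawler", "http://example.com/private/page"))  # False
print(rp.can_fetch("AnyCrawler", "http://example.com/public/page"))   # True
```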



On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
 Hi Karl,

 Maybe a good start would be to identify which parts of your crawler could be
 shared and would not take too much effort to make generic. I haven't looked
 at the crawler's code in great detail, but do you think the robots parser
 would be a good candidate?

 Julien

 On 2 June 2011 16:23, Karl Wright daddy...@gmail.com wrote:

 Absolutely!
 We're a bit thin on active committers at the moment, which will
 probably limit our ability to take any highly active roles in your
 development process.  But we do have a pile of code which you might be
 able to leverage, and once there is common functionality available I
 think we'd all prefer to use that rather than home-grown code.

 How would you prefer that we proceed?

 Karl


 On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
  Hi guys,
 
  I'd just like to mention Crawler Commons, which is an effort among the
  committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
  pool some basic functionality. We currently have mostly a top-level domain
  finder and a sitemap parser, but we are definitely planning to add other
  things there as well, e.g. a robots.txt parser, protocol handlers, etc.
 
  Would you like to get involved? There are quite a few things that the
  crawler in Manifold could reuse or contribute to.
 
  Best,
 
  Julien
 
  --
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
 




 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com



RE: CrawlerCommons & ManifoldCF

2011-06-02 Thread Fuad Efendi
I'd like to join this project but can't find a join button :)
Thanks!

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search

-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: June-02-11 11:11 AM
To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com
Subject: CrawlerCommons & ManifoldCF

Hi guys,

I'd just like to mention Crawler Commons, which is an effort among the
committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
pool some basic functionality. We currently have mostly a top-level domain
finder and a sitemap parser, but we are definitely planning to add other
things there as well, e.g. a robots.txt parser, protocol handlers, etc.

Would you like to get involved? There are quite a few things that the
crawler in Manifold could reuse or contribute to.

Best,

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com



RE: CrawlerCommons & ManifoldCF

2011-06-02 Thread Fuad Efendi
I mean the join button at http://code.google.com/p/crawler-commons/
I am well familiar with Bixo and Droids; it will be hard to make even minor
changes in ManifoldCF... although it's possible (without the crawler part,
only the robots rules parser)...
-Fuad


-Original Message-
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: June-02-11 7:05 PM
To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com
Subject: RE: CrawlerCommons & ManifoldCF

I'd like to join this project but can't find a join button :) Thanks!

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search

-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: June-02-11 11:11 AM
To: connectors-dev@incubator.apache.org; crawler-comm...@googlegroups.com
Subject: CrawlerCommons & ManifoldCF

Hi guys,

I'd just like to mention Crawler Commons, which is an effort among the
committers of various crawl-related projects (Nutch, Bixo, and Heritrix) to
pool some basic functionality. We currently have mostly a top-level domain
finder and a sitemap parser, but we are definitely planning to add other
things there as well, e.g. a robots.txt parser, protocol handlers, etc.

Would you like to get involved? There are quite a few things that the
crawler in Manifold could reuse or contribute to.

Best,

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com