Re: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org)
it sounds good to me, any others? Tommaso

2011/6/1 Karl Wright daddy...@gmail.com

Here's my proposed text:

ManifoldCF

--Description--
ManifoldCF is an incremental crawler framework and set of connectors designed to pull documents from various kinds of repositories into search engine indexes or other targets. The current bevy of connectors includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio (Autonomy), SharePoint (Microsoft), RSS feeds, and web content. ManifoldCF also provides components for individual document security within a target search engine, so that repository security access conventions can be enforced in the search results. ManifoldCF has been in incubation since January 2010. It was originally a planned subproject of Lucene but is now a likely top-level project.

--A list of the three most important issues to address in the move towards graduation--
1. We need at least one additional active committer, as well as additional users and repeat contributors
2. We may want another release before graduating
3. We'd like to see long-term contributions for project testing, especially infrastructure access

--Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be aware of?--
All issues have been addressed to our satisfaction at this time.

--How has the community developed since the last report?--
A book has been completed and is available in early-release form from Manning Publishing. We have signed up two new committers and one new mentor. We continue to have user community interest. We've had a number of extremely helpful bug reports and contributions from the field.

--How has the project developed since the last report?--
A 0.1 release was made on January 31, 2011, and a 0.2 release occurred on May 17, 2011. Another release is being considered.
Signed off by mentor: Karl

On Wed, Jun 1, 2011 at 10:23 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

I think the successful release should be mentioned too :-) Tommaso

2011/6/1 Karl Wright daddy...@gmail.com

The March report looked like this:

ManifoldCF

--Description--
ManifoldCF is an incremental crawler framework and set of connectors designed to pull documents from various kinds of repositories into search engine indexes or other targets. The current bevy of connectors includes Documentum (EMC), FileNet (IBM), LiveLink (OpenText), Meridio (Autonomy), SharePoint (Microsoft), RSS feeds, and web content. ManifoldCF also provides components for individual document security within a target search engine, so that repository security access conventions can be enforced in the search results. ManifoldCF has been in incubation since January 2010. It was originally a planned subproject of Lucene but is now a likely top-level project.

--A list of the three most important issues to address in the move towards graduation--
1. We need at least three additional active committers, as well as additional users and repeat contributors
2. We should have at least one or two more releases before graduating
3. We'd like to see long-term contributions for project testing, especially infrastructure access

--Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be aware of?--
All issues have been addressed to our satisfaction at this time.

--How has the community developed since the last report?--
A book is being written, and has entered the early-release phase, available from Manning Publishing. We continue to have user community interest. We've had a number of extremely helpful bug reports and contributions from the field. The active committer list remains short, however.

--How has the project developed since the last report?--
A 0.1 release was made on January 31, 2011, and another release is being considered.
Contributions extending the FileNet connector have been made, as well as contributions to the Solr connector.

Signed off by mentor: Grant Ingersoll

I'd like to mention our new committers and mentor, and the completion of the book. Anything else that should be added? Karl

-- Forwarded message --
From: no-re...@apache.org
Date: Wed, Jun 1, 2011 at 10:00 AM
Subject: Incubator PMC/Board report for June 2011 (connectors-dev@incubator.apache.org)
To: connectors-dev@incubator.apache.org

Dear ManifoldCF Developers,

This email was sent by an automated system on behalf of the Apache Incubator PMC. It is an initial reminder to give you plenty of time to prepare your quarterly board report. The board meeting is scheduled for Wed, 15 June 2011, 10 am Pacific. The report for your podling will form a part of the Incubator PMC report. The Incubator PMC requires your report to be submitted one week before the board meeting, to allow sufficient time for review.
[jira] [Commented] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby
[ https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042655#comment-13042655 ]

Karl Wright commented on CONNECTORS-110:

HSQLDB is now also in roughly the same situation, although I've gotten a rough outline of a way to make this work involving temporary tables. It is as follows:

SELECT * FROM (SELECT DISTINCT customerid FROM invoice) AS i_one,
  LATERAL (SELECT id, total FROM invoice WHERE customerid = i_one.customerid ORDER BY total DESC LIMIT 1) AS i_two ...

where invoice would be a temporary table created on the fly, as follows:

DECLARE LOCAL TEMPORARY TABLE T AS (SELECT statement) [ON COMMIT { PRESERVE | DELETE } ROWS]

For example:

DECLARE LOCAL TEMPORARY TABLE invoice AS (SELECT * FROM whatever) ON COMMIT DELETE ROWS WITH DATA

then perform the kind of query I suggested. The issue is that this does not fit our single-query abstraction metaphor at all. Maybe a (different but identically named) stored procedure could be generated on all three databases that would do the trick. Alternatively, all databases could go the temporary table route, but then PostgreSQL would be unnecessarily crippled.

Max activity and Max bandwidth reports don't work properly under Derby
--
Key: CONNECTORS-110
URL: https://issues.apache.org/jira/browse/CONNECTORS-110
Project: ManifoldCF
Issue Type: Bug
Components: Framework crawler agent
Reporter: Karl Wright

The reason for the failure is that the queries used are doing the PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support. Unfortunately, there does not seem to be a way in Derby at present to do anything similar to DISTINCT ON (xxx), and the queries really can't be done without that. One option is to introduce a getCapabilities() method into the database implementation, which would allow ACF to query the database capabilities before even presenting the report in the navigation menu in the UI.
Another alternative is to do a sizable chunk of resultset processing within ACF, which would require not only the DISTINCT ON() implementation, but also the enclosing sort and limit stuff. It's the latter that would be most challenging, because of the difficulties with i18n etc. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
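As a rough illustration of the resultset-processing alternative mentioned above, the core of DISTINCT ON (one "best" row per key, by a sort column) can be emulated in application code. The class name and row layout below are hypothetical, for illustration only; this is not ManifoldCF's actual database abstraction:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of emulating PostgreSQL's
//   SELECT DISTINCT ON (customerid) ... ORDER BY customerid, total DESC
// in application code: keep, for each key, the row with the largest value.
public class DistinctOnEmulator {
  // rows are (key, numeric value) pairs; returns the max value per distinct key.
  public static Map<String, Long> topPerKey(List<String[]> rows) {
    Map<String, Long> best = new LinkedHashMap<String, Long>();
    for (String[] row : rows) {
      String key = row[0];
      long value = Long.parseLong(row[1]);
      Long current = best.get(key);
      if (current == null || value > current.longValue())
        best.put(key, value);
    }
    return best;
  }
}
```

This handles the grouping itself, but as noted above, the enclosing sort and LIMIT (and locale-sensitive ordering) are the genuinely hard part of moving this logic out of the database.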
[jira] [Updated] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB
[ https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-110:
---
Summary: Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB (was: Max activity and Max bandwidth reports don't work properly under Derby)

Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB
--
Key: CONNECTORS-110
URL: https://issues.apache.org/jira/browse/CONNECTORS-110
Project: ManifoldCF
Issue Type: Bug
Components: Framework crawler agent
Reporter: Karl Wright

The reason for the failure is that the queries used are doing the PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support. Unfortunately, there does not seem to be a way in Derby at present to do anything similar to DISTINCT ON (xxx), and the queries really can't be done without that. One option is to introduce a getCapabilities() method into the database implementation, which would allow ACF to query the database capabilities before even presenting the report in the navigation menu in the UI. Another alternative is to do a sizable chunk of resultset processing within ACF, which would require not only the DISTINCT ON() implementation, but also the enclosing sort and limit stuff. It's the latter that would be most challenging, because of the difficulties with i18n etc.
[jira] [Created] (CONNECTORS-204) Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it
Now that HSQLDB functions with ManifoldCF, write a test-hsqldb ant target to test it
--
Key: CONNECTORS-204
URL: https://issues.apache.org/jira/browse/CONNECTORS-204
Project: ManifoldCF
Issue Type: Improvement
Components: Build
Reporter: Karl Wright

The latest HSQLDB fixes and features make it an attractive alternative to Derby. But we need a test target that exercises it.
[jira] [Commented] (CONNECTORS-203) Consider porting ManifoldCF to Java 1.5 code standards
[ https://issues.apache.org/jira/browse/CONNECTORS-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042724#comment-13042724 ]

Karl Wright commented on CONNECTORS-203:

I've merged all the major interface changes into trunk in r1130475. The branch can now go away, and further changes can be made incrementally on trunk.

Consider porting ManifoldCF to Java 1.5 code standards
--
Key: CONNECTORS-203
URL: https://issues.apache.org/jira/browse/CONNECTORS-203
Project: ManifoldCF
Issue Type: Improvement
Components: Active Directory authority, Authority Service, Build, Documentation, Documentum connector, File system connector, FileNet connector, Framework agents process, Framework core, Framework crawler agent, GTS connector, JCIFS connector, JDBC connector, LiveLink connector, Lucene/SOLR connector, Meridio connector, RSS connector, SharePoint connector, Web connector
Affects Versions: ManifoldCF 0.3
Reporter: Karl Wright

Consider porting ManifoldCF to Java 1.5 standards. This includes (but is not limited to):
- build files
- removing use of enum as a variable name
- introducing generics in both implementation code and interfaces (cautiously)
[RESULT][VOTE] Adopt Java 1.5 as the minimum Java release for ManifoldCF
Although it hasn't quite been the required 3 days, this vote isn't binding anyway, so I'm going to declare it closed and commit the code. Karl

On Mon, May 30, 2011 at 7:32 PM, Karl Wright daddy...@gmail.com wrote:

Please have a look at CONNECTORS-203 and vote +1 if you think it's time to move beyond Java 1.4 at the source level to Java 1.5. I did some work on this over the weekend and managed to convince myself that a migration to a newer Java version will have no obvious ill effects. But I'd like your thoughts. Especially interesting will be whether or not we try to maintain backwards compatibility in the connector interfaces: IConnector, IRepositoryConnector, IAuthorityConnector, and IOutputConnector. I have some ideas how I could bring these into the modern world in a relatively painless manner, should the community require it. But I'd like to hear your views as to whether we *should* work towards that end, or be willing to disrupt our early adopters in this process. Karl
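To make the scope of such a migration concrete, here is a minimal before/after sketch of the central Java 1.5 change, introducing generics. The class and method names are illustrative only, not actual ManifoldCF interfaces:

```java
import java.util.List;

// Illustrative only: a raw Java 1.4-style collection versus the same method
// using Java 1.5 generics. These are not ManifoldCF interfaces.
public class GenericsExample {
  // Java 1.4 style: raw List, a cast is required at every use site.
  public static String firstRaw(List names) {
    return (String) names.get(0);
  }

  // Java 1.5 style: the element type is checked at compile time, no cast.
  public static String firstTyped(List<String> names) {
    return names.get(0);
  }
}
```

The backward-compatibility question above is exactly about this kind of signature change: generifying an interface method is a source-compatible change for callers using raw types, but implementers of the old interface must be recompiled against the new signatures.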
[jira] [Created] (CONNECTORS-205) Database DISTINCT ON abstraction needs to include ordering information in order to work for HSQLDB
Database DISTINCT ON abstraction needs to include ordering information in order to work for HSQLDB
--
Key: CONNECTORS-205
URL: https://issues.apache.org/jira/browse/CONNECTORS-205
Project: ManifoldCF
Issue Type: Bug
Components: Framework core, Framework crawler agent
Reporter: Karl Wright

The constructDistinctOnClause database method cannot support HSQLDB because it presumes that the ORDER BY clause is already part of the base query. This blocks us from using the HSQLDB WITH/LATERAL temporary table solution for the functionality. Adding ORDER BY information to the abstraction should work for all databases.
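One possible shape for the extended abstraction is to pass the ordering column and direction explicitly, so the PostgreSQL implementation can emit a complete query while other databases fall back to the temporary-table route. The method name and string construction below are a hypothetical sketch, not the real constructDistinctOnClause signature:

```java
// Hypothetical sketch of a DISTINCT ON abstraction that carries ORDER BY
// information; not the actual ManifoldCF constructDistinctOnClause method.
public class DistinctOnClauseSketch {
  public static String buildPostgresQuery(String table, String distinctColumn,
      String orderColumn, boolean descending) {
    // PostgreSQL requires the DISTINCT ON column to lead the ORDER BY list;
    // carrying the order column in the abstraction makes that possible here,
    // and gives the HSQLDB implementation what it needs for LATERAL.
    return "SELECT DISTINCT ON (" + distinctColumn + ") * FROM " + table
        + " ORDER BY " + distinctColumn + ", " + orderColumn
        + (descending ? " DESC" : " ASC");
  }
}
```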
CrawlerCommons & ManifoldCF
Hi guys, I'd just like to mention Crawler Commons, which is an effort between the committers of various crawl-related projects (Nutch, Bixo, or Heritrix) to put some basic functionality in common. We currently have mostly a top-level domain finder and a sitemap parser, but are definitely planning to have other things there as well, e.g. a robots.txt parser, protocol handler, etc. Would you like to get involved? There are quite a few things that the crawler in Manifold could reuse or contribute to.

Best, Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: CrawlerCommons & ManifoldCF
Absolutely! We're a bit thin on active committers at the moment, which will probably limit our ability to take any highly active roles in your development process. But we do have a pile of code which you might be able to leverage, and once there is common functionality available I think we'd all prefer to use that rather than home-grown code. How would you prefer that we proceed? Karl
Re: CrawlerCommons & ManifoldCF
Hi Karl, Maybe a good start would be to identify which parts of your crawler could be shared and would not take too much effort to make generic. I haven't looked at the code of the crawler in great detail, but do you think the robots parser would be a good candidate? Julien
[jira] [Commented] (CONNECTORS-110) Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB
[ https://issues.apache.org/jira/browse/CONNECTORS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042861#comment-13042861 ]

Karl Wright commented on CONNECTORS-110:

r1130644 implements this for HSQLDB. Unfortunately, performance is extremely slow, even when the number of rows in the temporary table is only a few dozen.

Max activity and Max bandwidth reports don't work properly under Derby or HSQLDB
--
Key: CONNECTORS-110
URL: https://issues.apache.org/jira/browse/CONNECTORS-110
Project: ManifoldCF
Issue Type: Bug
Components: Framework crawler agent
Reporter: Karl Wright

The reason for the failure is that the queries used are doing the PostgreSQL DISTINCT ON (xxx) syntax, which Derby does not support. Unfortunately, there does not seem to be a way in Derby at present to do anything similar to DISTINCT ON (xxx), and the queries really can't be done without that. One option is to introduce a getCapabilities() method into the database implementation, which would allow ACF to query the database capabilities before even presenting the report in the navigation menu in the UI. Another alternative is to do a sizable chunk of resultset processing within ACF, which would require not only the DISTINCT ON() implementation, but also the enclosing sort and limit stuff. It's the latter that would be most challenging, because of the difficulties with i18n etc.
Re: CrawlerCommons & ManifoldCF
I don't think it would be hard to peel out the robots parser, although obviously it would need refactoring to live in a more standard library environment. If you want to look at it, it is in:

https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java

Look for the static class RobotsData, around line 299. Karl
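For a sense of what a shared robots component might cover, here is a minimal, hypothetical sketch of Disallow-prefix handling for the wildcard user-agent. It is not the ManifoldCF RobotsManager implementation, which does considerably more (fetching, caching, agent-specific matching):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hypothetical robots.txt sketch: collect Disallow prefixes from the
// "User-agent: *" section and test paths against them. Not ManifoldCF code.
public class RobotsSketch {
  private final List<String> disallowed = new ArrayList<String>();

  public RobotsSketch(String robotsTxt) {
    boolean applies = false;  // true while inside the "*" user-agent section
    for (String rawLine : robotsTxt.split("\n")) {
      String line = rawLine.trim();
      String lower = line.toLowerCase();
      if (lower.startsWith("user-agent:"))
        applies = line.substring("user-agent:".length()).trim().equals("*");
      else if (applies && lower.startsWith("disallow:")) {
        String path = line.substring("disallow:".length()).trim();
        if (path.length() > 0)
          disallowed.add(path);
      }
    }
  }

  // A path is allowed unless it starts with some disallowed prefix.
  public boolean isAllowed(String path) {
    for (String prefix : disallowed)
      if (path.startsWith(prefix))
        return false;
    return true;
  }
}
```

A shared library would also need to settle questions this sketch dodges: Allow directives, longest-match precedence, crawl-delay, and which user-agent section wins, which is precisely the kind of behavior worth standardizing in one place.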
RE: CrawlerCommons & ManifoldCF
I'd like to join this project but can't find a join button :) Thanks!

Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay
Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search
RE: CrawlerCommons & ManifoldCF
I mean the join button at http://code.google.com/p/crawler-commons/

I am well familiar with BIXO and Droids; it will be hard to make minor changes in ManifoldCF... although it's possible (without the crawler part, only the robots rules parser)... -Fuad