+1

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: Grant Ingersoll <gsing...@apache.org>
> To: general@incubator.apache.org
> Sent: Fri, January 8, 2010 8:51:57 AM
> Subject: [VOTE] Incubate Lucene Connector Framework
>
> Hi,
>
> Given the lack of response on the proposal, I'll assume lazy consensus and call a vote.
>
> On behalf of the Lucene PMC, I'd like to propose incubation for a new Lucene subproject called the Lucene Connector Framework (LCF). I think we have all the necessary bits in place for the proposal to go forward.
>
> Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
>
> [] +1. Accept LCF into the Incubator.
> [] 0. Don't care.
> [] -1. Do not accept (and why.)
>
> Here's my +1.
>
> Thanks,
> Grant Ingersoll
>
> ------ Wiki Text Copied Below -----
>
> Lucene Connector Framework
>
> Abstract
>
> Many, many search engines, as well as other applications, have a need to connect with content repositories (SharePoint, CMS, Documentum, etc.) in a standard manner. The Lucene Connector Framework (LCF) is a project aimed at building out these connectors in open source under the Apache brand.
>
> Proposal
>
> The goal of LCF is to create a viable Lucene subproject aimed at delivering a best of breed connector framework under the Apache Lucene name. As a framework, the project will not only provide a way to connect to individual repositories, but also a mechanism for plugging in new connectors or custom connectors in a straightforward manner.
>
> A connector framework is vital for search engines and other tools that need to access data located in corporate repositories. By abstracting the problem into a framework, applications can code to a set of well-defined interfaces instead of having to use a different interface for each connector.
>
> Connector Framework is an extendible incremental crawler, which uses a database to manage configuration and crawl history, and provides reasonably high performance in accessing content in multiple repositories for the main purpose of search engine indexing. Connector Framework also establishes a repository-specific security model which can be used to limit search user access to repository content based on a user's identity. Connector Framework also includes existing connectors and authorities for:
>
> • File system
> • Windows shares
> • JDBC-supported databases
> • RSS feeds
> • General websites
> • LiveLink [from OpenText]
> • Documentum [from EMC]
> • SharePoint [from Microsoft]
> • Meridio [from Meridio]
> • Memex [from Memex]
> • FileNet [from IBM]
>
> Key design points for Connector Framework are as follows:
>
> • Extendability - you can add new connectors for new repositories, and new authorities for specific repository security models
> • Incrementality - the ability to process only what changed between crawls, in a repository-specific manner
> • Restartability - using a database with ACID properties to ensure that crawls are safe against process interruption or machine shutdown
> • Security - establishing a model of security tokens that allows a search engine to enforce a repository's security model
> • Limited footprint - ability to operate reliably within a fixed amount of process memory, regardless of configuration
> • Performance - management of connector-specific resources to maximize overall throughput
> • Transparency - ability to generate reports on the activity of all crawls and repository connections
>
> Background
>
> MetaCarta originally approached Grant Ingersoll from the Lucene PMC about donating their existing connector framework to the Lucene PMC. After some discussion about accepting it as a software grant, the PMC decided it would be best to incubate the project first.
>
> Rationale
>
> The Connector Framework fills an often significant gap in the Lucene experience, namely, how to get content locked away in a content repository into Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other tools (search engines and others) will also have this same problem. A Connector Framework would also be useful for someone wishing to migrate between content repositories, too.
>
> Current Status
>
> Connector Framework has been under development and in use in the field for close to five years, deployed on a MetaCarta search appliance. Almost all development of the project has been done by Karl Wright (kwri...@metacarta.com). Some individual connectors were developed initially by contractors hired by MetaCarta, Inc., but maintenance and further development is currently handled by the MetaCarta team.
>
> Development of Connector Framework can therefore be viewed as core framework development, plus development of individual connectors. Core framework development is currently not a terribly collaborative process, as there are no maintainers of the core functionality other than Mr. Wright. Development of new connectors has been done in the past in a much more collaborative way by supplying a developer with a "development kit", and then integrating the resulting connector (with whatever changes might have been necessary) into the source tree.
>
> Reasonable efforts have been made to maintain the generality of the code base during the time that MetaCarta has owned it. Nevertheless, certain MetaCarta-specific changes have been made which may require review and modification. The following areas probably need to be addressed in the code before graduation can occur:
>
> • Branding. The UI brands it as a MetaCarta project.
>
> • Package names. Package names would have to be changed.
>
> • How Connector Framework handles document delivery needs to be generalized, at least for a single, configurable target output connector, and perhaps for multiple, independently-configurable targets. Simple example output connectors need to be written. Work in this direction is currently underway at MetaCarta and may or may not be complete at the time of the code handover.
>
> • Connector Framework-specific dependent package modifications need to be addressed somehow. For instance, the following projects that Connector Framework depends upon have been modified, but the modifications have not been accepted upstream: commons-httpclient NTLMv2 and NTLM2 support [RSS, Web, SharePoint, Meridio, and Livelink connectors]; commons-httpclient custom HTTPS protocol factory support [Web, SharePoint, Meridio, and Livelink connectors]; xerces ability to handle non-legal RSS feeds [RSS and Web connectors]
>
> • MetaCarta-specific features, like document templates, are explicitly handled by the UI and the infrastructure. These features should be generalized so that they are controlled by the choice of output connector.
>
> • Some specific hooks, namely support for configuration change notification, and for database maintenance notification, may need to be made more generic.
>
> • Share Connector has a "fingerprinting" feature, which prefilters documents based on a document type it surmises using a document inspection technique. This feature is only viable at the moment for very basic document types. It should either be removed, or generalized significantly to be much more flexible.
>
> • Documentation needs to be fleshed out, including javadoc and overall usage documents.
>
> • Tests need to be written and/or ported from MetaCarta's test suite.
>
> Longer term, the project will likely grow into a more distributed crawler, where multiple machines might well be involved in coordinated crawling activity.
>
> Meritocracy
>
> Building the community using a meritocratic approach is very important to the success of LCF. We know many, many people in the search space (and otherwise) have either written their own connectors or are in need of connectors. Thus, we expect a meritocratic community will lead to widespread participation.
>
> Community
>
> Our hope is that our existing code, features and capabilities will attract a large community of both developers and users. We also believe that other organizations will find this project interesting and relevant, and contribute resources.
>
> The user community of LCF would be similar to that of the other Lucene projects, and in many cases they would overlap.
>
> Core Developers
>
> See the initial committer list below.
>
> Alignment
>
> We expect LCF will align quite well with the existing Lucene community and will also provide significant value to other ASF and non-ASF projects as well as many companies and individuals looking to access their content repositories in a programmatic fashion.
>
> Known Risks
>
> Orphaned Products
>
> The Connector Framework is an important piece of any search engine, including MetaCarta's, as it provides the primary mechanism for getting content out of a repository and into the search engine's index. Thus, we don't expect it will be orphaned anytime soon. Once the project is established and the code is available, we expect to attract not only other search companies, but others with similar needs.
>
> Inexperience with Open Source
>
> Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the majority of the experience with Open Source at the ASF, but all of the initial committers are familiar with Open Source and have contributed to other open source projects.
>
> Homogeneous Developers
>
> The current committers are mostly members of either the MetaCarta or Lucid Imagination developer team, but several are not.
> Additionally, we are actively recruiting other developers.
>
> Reliance on Salaried Developers
>
> We have a variety of committers represented. Some are being paid to work on the project and some are not.
>
> Cryptography
>
> Connector Framework itself has no real cryptography component, although it does currently obfuscate passwords it saves to the database or to a configuration file using a proprietary algorithm. The algorithm is present simply to avoid using cleartext and is not secure in any sense other than by obscurity.
>
> Various connectors, such as Share Connector, Web Connector, RSS Connector, SharePoint Connector, LiveLink Connector, and Meridio Connector make use of cryptographic principles via secondary libraries. Specifically, these connectors support NTLM, NTLMv2, and NTLM2 Session authentication via commons-httpclient and jCIFS. The changes to commons-httpclient necessary to support these varieties of Windows protocols have not yet been accepted upstream by the Apache httpclient project.
>
> It is unknown at this time exactly to what degree the Oracle JDBC driver, the jtds JDBC driver, or the Postgresql JDBC driver uses cryptography. Also, the FileNet API class, the Memex API classes, the OpenText LAPI API classes, and the Documentum DFC classes all may or may not use cryptography.
>
> Legal Concerns
>
> Some of the connectors in the existing framework require paid licenses to use. We will need to evaluate each connector to see what can be appropriately included. For those connectors that require a paid license, we will need to determine a plan for including the wrapper code without the underlying bindings in a legal manner. We expect we can provide the wrapper code without the binding and that the code will thus only be compilable by someone who has access to the binding. (This is what Google has done for their individual connectors).
> Longer term, we expect to demonstrate to the companies with proprietary connectors why it is more valuable for them to open up their specific connector pieces to give broader access to people looking to leverage their content in the repository.
>
> Trademark
>
> The project is being rebranded from a MetaCarta internal name to the Lucene Connector Framework, which will be an ASF mark.
>
> Relationships with Other Apache Products
>
> We expect almost all of the Apache Lucene ecosystem will benefit from having a standard way of connecting to content repositories. Additionally, users of UIMA should also benefit. We also see an especially tight connection with Tika, as much of the content in these types of repositories consists of "rich" document types, which will then need their content extracted.
>
> An Excessive Fascination with the Apache Brand
>
> All of us are familiar with the value that Apache brings to a project in building out a community. We also are all significant users of Apache Lucene and related tools (Solr, Nutch, Mahout, Tika) and expect a close relationship with those projects will help significantly grow the LCF community.
>
> Documentation
>
> MetaCarta has end-user documentation for Lucene Connector Framework, which might function as the core of the open-source end-user documentation. The documentation is in LaTeX form, and thus usable sources can readily be extracted. Any ownership issues for the documentation as it stands still need to be researched.
>
> The existing Javadoc for the code, while fairly extensive, needs review and perhaps augmentation to ensure it meets the needs of an ASF project. Significant attention was paid to maintaining its accuracy during MetaCarta's ownership of the code base.
>
> Initial Source
>
> All initial sources will be coming from MetaCarta, Inc., with the goal of folding in changes from others shortly thereafter.
>
> Source and Intellectual Property Submission Plan
>
> Code IP grants need to be made from MetaCarta, Inc. But, in addition, several connectors (notably Documentum, LiveLink, Memex, and FileNet) rely directly on client APIs in order to be compiled. Another connector (JDBC) relies on the existence of the Oracle JDBC Driver in the classpath in order to enable crawls against Oracle databases.
>
> It is unlikely that EMC, OpenText, Memex, or IBM would grant Apache-license-compatible use of these client libraries. Thus, the expectation is that users of these connectors obtain the necessary client libraries from the owners prior to building or using the corresponding connector. An alternative would be to undertake a clean-room implementation of the client APIs, which may well yield suitable results in some cases (LiveLink, Memex, FileNet), while being out of reach in others (Documentum). Conditional compilation, for the short term, is thus likely to be a necessity.
>
> Other external dependencies, such as jCIFS for the Share Connector, are licensed with LGPL, and thus may need to be treated in a manner similar to the closed APIs even though they are open source. These include the postgresql JDBC driver, and JTDS.
>
> The Lucene Connector Framework core and individual connectors are completely separable, and many of the connectors require no third party licenses. Therefore, there is significant utility for this project even in the absence of any third-party software grants, or clean-room engineering.
>
> The software grant will be faxed to the Apache Software Foundation if and when the proposal herein described is accepted. MetaCarta patents are not infringed by this grant. Also, MetaCarta trademarks are not included in this grant.
>
> External Dependencies
>
> The project dependencies, other than on other Apache projects, are as follows:
>
> The ConnectorFramework core currently uses the Bitmechanic JDBC pool driver, which is BSD licensed, and the Postgresql JDBC driver, which is also BSD licensed.
>
> The LiveLink Connector relies on LAPI, which is privately licensed by OpenText. The Documentum Connector relies on DFC, which is privately licensed by EMC. The Share Connector relies on jCIFS, which is LGPL. The Memex Connector relies on privately licensed java libraries from Memex. The FileNet Connector relies on privately licensed java libraries from IBM.
>
> Required Resources
>
> • Mailing lists
>   • connectors-private (with moderated subscriptions)
>   • connectors-user@
>   • connectors-dev@
>   • connectors-commit@
> • Subversion directory
>   • https://svn.apache.org/repos/asf/incubator/connectors
> • Website
>   • Confluence (CONNECTORS)
> • Issue Tracking
>   • JIRA (CONNECTORS)
>
> Initial Committers
>
> Names of initial committers with affiliation and current ASF status:
>
> • Karl Wright (kwright at metacarta)
> • Josiah Strandberg (jstrandberg at metacarta)
> • Ken Baker (bakerkj at metacarta)
> • Marc Meadows (mam at metacarta)
> • Grant Ingersoll (gsing...@a.o, Lucid Imagination, ASF Member)
> • Brian Pinkerton (brian.pinkerton at Lucid Imagination)
> • Simon Willnauer (simonw at apache org, Committer on Lucene Java and Lucene Open Relevance Project)
> • Ryan McKinley (ryan at apache org, Committer on Lucene and Solr)
> • Robert Muir (rmuir at apache org, Committer on Lucene and Open Relevance)
> • Sami Siren (si...@a.o, Committer on Nutch and Tika)
> • Otis Gospodnetic (o...@a.o, Committer on Lucene, Solr, Nutch, Mahout, and Open Relevance Project)
> • Shalin Shekhar Mangar (sha...@a.o, AOL, Committer on Apache Solr)
> • Noble Paul (no...@a.o, AOL, Committer on Apache Solr)
> • George Aroush (george at aroush.net, Committer on Lucene.Net)
>
> Sponsors
>
> Champion
>
> • Grant Ingersoll
>
> Nominated Mentors
>
> • Grant Ingersoll
> • Jukka Zitting
> • Gianugo Rabellino
>
> Sponsoring Entity
>
> • Apache Lucene PMC: Message ID: af7e...@gmail.com in priv...@lucene.a.o