RE: [RESULT] [VOTE] Tika - a content analysis toolkit
Reports due: April, May, June, and then quarterly. --- Noel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[RESULT] [VOTE] Tika - a content analysis toolkit
Hi,

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)

The vote passes with 9 binding +1 and 3 non-binding +1 votes.

The binding votes were:

+1 Bertrand Delacretaz
+1 Brett Porter
+1 Davanum Srinivas
+1 Doug Cutting
+1 J Aaron Farr
+1 Jukka Zitting
+1 Niclas Hedhman
+1 Robert Burrell Donkin
+1 Yoav Shapira

The non-binding votes were:

+1 Jeremias Maerki
+1 Marshall Schor
+1 Tony Ambrozie

Thanks for voting! I'll proceed to request the relevant infrastructure and to include Tika in the Incubator books.

BR,

Jukka Zitting
Re: [VOTE] Tika - a content analysis toolkit
Hi,

On 3/19/07, Petar Tahchiev <[EMAIL PROTECTED]> wrote:
> Although I am not part of the jakarta organisation (so I have no right to vote)

Only the votes from Incubator PMC members are binding, but this certainly doesn't mean that others aren't allowed to participate in the vote. In fact it's encouraged for people to cast their non-binding votes (see how many people have voted with a "non-binding" qualifier also in this thread) in whichever Apache votes they have an interest in. Often the opinions and concerns of interested community members are just as or even more important than those of the official decision makers.

> I think that the proposal is more than interesting, so I am willing to help
> with whatever I can, once this project is being incubated. :-)

Excellent, thanks for the interest!

BR,

Jukka Zitting
Re: [VOTE] Tika - a content analysis toolkit
Hi,

On 3/18/07, Jeremias Maerki <[EMAIL PROTECTED]> wrote:
> I would like to make the Tika people aware that we've recently started a
> little XMP framework as part of the XML Graphics Project. XMP is used with
> a number of document formats, with PDF its most prominent format. It could
> be interesting to work together on this.

That's very interesting, thanks for bringing this up!

BR,

Jukka Zitting
Re: [VOTE] Tika - a content analysis toolkit
On 3/19/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jukka Zitting wrote:
> > Please vote on the proposal that follows. The vote is open for the
> > next 72 hours and only votes from the Incubator PMC are binding.
> >
> > [ ] +1 Accept Tika as a new podling
> > [ ] -1 Do not accept the new podling (provide reason, please)
> >
> > The proposal can be found at
> > http://wiki.apache.org/incubator/TikaProposal and is included below
> > for archival purposes.
>
> +1
>
> Doug

Although I am not part of the jakarta organisation (so I have no right to vote), I think that the proposal is more than interesting, so I am willing to help with whatever I can, once this project is being incubated. :-)

--
Regards, Petar!
Karlovo, Bulgaria.

Public PGP Key at: http://keyserver.linux.it/pks/lookup?op=get&search=0x1A15B53B761500F9
Key Fingerprint: AA16 8004 AADD 9C76 EF5B 4210 1A15 B53B 7615 00F9
Re: [VOTE] Tika - a content analysis toolkit
Jukka Zitting wrote:
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)
>
> The proposal can be found at
> http://wiki.apache.org/incubator/TikaProposal and is included below
> for archival purposes.

+1

Doug
Re: [VOTE] Tika - a content analysis toolkit
On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I would like to call the Incubator PMC to vote to incubate the
> proposed Tika project. I posted the proposal draft for review a while
> ago, and the final proposal text is included below. The only changes in
> the proposal text are the addition of Bertrand Delacretaz as the third
> mentor and marking Apache Lucene as the sponsor based on a recent
> Lucene PMC vote.
>
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.

[X] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)

- robert
Re: [VOTE] Tika - a content analysis toolkit
[+1] Accept Tika as a new podling

-Bertrand
Re: [VOTE] Tika - a content analysis toolkit
On 18/03/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

[X] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)
Re: [VOTE] Tika - a content analysis toolkit
"Jukka Zitting" <[EMAIL PROTECTED]> writes:
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.

[X] +1 Accept Tika as a new podling

Good luck!

--
jaaron
Re: [VOTE] Tika - a content analysis toolkit
+1 (non-binding) - alignment with existing standards (such as Dublin Core, etc) will be important...
Re: [VOTE] Tika - a content analysis toolkit
Here's my non-binding +1:

[ X ] +1 Accept Tika as a new podling

-Marshall Schor
Re: [VOTE] Tika - a content analysis toolkit
non-binding +1 from me.

On 18.03.2007 10:51:37 Jukka Zitting wrote:
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)

> Instead of implementing its own document parsers, Tika will use existing
> parser libraries like Jakarta POI [1] and PDFBox [2].

I would like to make the Tika people aware that we've recently started a little XMP framework as part of the XML Graphics Project. XMP is used with a number of document formats, with PDF its most prominent format. It could be interesting to work together on this. I've also been in contact with Ben Litchfield, author of PDFBox, about possibly joining forces on the topic. However, not much has happened.

At the moment, the XMP code can only cover what is necessary to implement the very basics of the PDF/A-1b specification. But I'm sure it can be easily enhanced to fit a wider audience. I already see the need to take the code a step further in order to cover extension schemas that are mandated by the PDF/A-1 standard. Finally, the code doesn't absolutely have to stay within XML Graphics, I guess, but that's only me speaking.

Links:
http://xmlgraphics.apache.org/commons/
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/

Jeremias Maerki (watching with interest)
Re: [VOTE] Tika - a content analysis toolkit
On Sunday 18 March 2007 17:51, Jukka Zitting wrote:
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)

+1

Cheers
Niclas Hedhman
Re: [VOTE] Tika - a content analysis toolkit
Hola,

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

[ X ] +1 Accept Tika as a new podling

Good luck,

Yoav
Re: [VOTE] Tika - a content analysis toolkit
+1 Accept Tika as a new podling
[VOTE] Tika - a content analysis toolkit
Hi,

I would like to call the Incubator PMC to vote to incubate the proposed Tika project. I posted the proposal draft for review a while ago, and the final proposal text is included below. The only changes in the proposal text are the addition of Bertrand Delacretaz as the third mentor and marking Apache Lucene as the sponsor based on a recent Lucene PMC vote.

Please vote on the proposal that follows. The vote is open for the next 72 hours and only votes from the Incubator PMC are binding.

[ ] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)

The proposal can be found at http://wiki.apache.org/incubator/TikaProposal and is included below for archival purposes.

Here's my +1

BR,

Jukka Zitting

Tika, a content analysis toolkit

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting the content types, character encodings, languages, and other characteristics of existing documents and for extracting structured text content from the documents. The toolkit is targeted especially for search engines and other content indexing and analysis tools, but will be useful also for other applications that need to extract meaningful information from documents that might be presented as nothing else than binary streams.

Instead of implementing its own document parsers, Tika will use existing parser libraries like Jakarta POI [1] and PDFBox [2].

Background
----------

The initial idea for the Tika project was voiced in April 2006 by Jérôme Charron and Chris A. Mattman on the Nutch mailing list. The Nutch parser framework and other content analysis features were seen as value-added components that would benefit also other projects. The idea received positive feedback, but lacked the momentum.
The idea was revisited in August 2006 when Jukka Zitting from the Jackrabbit project contacted Nutch for possible cooperation with similar ideas. The original Tika idea gained extra momentum and a Google Code project was set up as a staging area for prototype code before deciding how to best handle the setup of a new project. After a few initial commits the activity again declined.

In January 2007 the idea started gaining more momentum when Rida Benjelloun offered to contribute the Lius project [3] to Apache Lucene and when Mark Harwood also started looking for a generic toolkit like Tika.

This proposal is the result of the above efforts and related discussions both in private and on various public forums. Some alternatives to incubation, like Apache Labs [4] or Jakarta Commons [5], came up during the discussions but we believe that taking the project to the Incubator is the best way to start growing a viable community to sustain the Tika toolkit.

Rationale
---------

There is ever more demand for tools that automatically analyze and index documents in various formats. Search engines, content repositories, and other tools often need to extract metadata and text content from documents given as nothing or little else than a simple octet stream.

While there are a number of existing parser libraries for various document types, each of them comes with a custom API and there are no generic tools for automatically determining which parser to use for which documents. Currently many projects end up creating their custom content analysis and extraction tools. The Tika project attempts to remove this duplication of efforts.

We believe that by pooling the efforts of multiple projects we will be able to create a generic toolkit that exceeds the capabilities and quality of the custom solutions of any single project. A generic toolkit project will also provide common ground for the developers of parser libraries and content applications to interact.
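
The Rationale notes that each parser library has its own API and that nothing generic decides which parser to use for a given octet stream. As a rough illustration only (this is not Tika code, and it is written in Python for brevity rather than the Java the proposal targets), detection followed by dispatch could look something like the sketch below. The names `detect_type`, `ParserRegistry`, `build_default_registry` and the signature table are hypothetical, and the PDF parser is a placeholder for a call into a real library such as PDFBox.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signature table: leading "magic" bytes -> media type.
# A real detector would know far more formats than these two.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_type(data: bytes) -> str:
    """Guess a media type from the first bytes of an octet stream."""
    for magic, media_type in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return media_type
    # No signature matched: treat valid UTF-8 as plain text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


# A parser takes the raw bytes and returns extracted text.
Parser = Callable[[bytes], str]


class ParserRegistry:
    """Maps media types to parsers behind one common call."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def register(self, media_type: str, parser: Parser) -> None:
        self._parsers[media_type] = parser

    def parse(self, data: bytes) -> Tuple[str, str]:
        """Detect the type, pick the matching parser, return (type, text)."""
        media_type = detect_type(data)
        parser = self._parsers.get(media_type)
        if parser is None:
            raise LookupError(f"no parser registered for {media_type}")
        return media_type, parser(data)


def parse_plain_text(data: bytes) -> str:
    return data.decode("utf-8")


def parse_pdf_placeholder(data: bytes) -> str:
    # Placeholder: a real toolkit would delegate to a PDF library here.
    return "<PDF text would be extracted here>"


def build_default_registry() -> ParserRegistry:
    registry = ParserRegistry()
    registry.register("text/plain", parse_plain_text)
    registry.register("application/pdf", parse_pdf_placeholder)
    return registry
```

A caller would only see `build_default_registry().parse(data)`; which library handles the bytes stays hidden behind the registry, which is the duplication the Rationale hopes to remove.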
Initial Goals
-------------

The initial goals of the proposed project are:

* Viable community around the Tika codebase
* Active relationships and possible cooperation with related projects and communities
* Generic parser API for extracting structured text content from various document formats
* Flexible metadata detection and extraction API
* Java implementations of the metadata standards mentioned below

Current Status
==============

Meritocracy
-----------

All the initial committers are familiar with the meritocracy principles of Apache, and have already worked on the various source codebases. We will follow the normal meritocracy rules also with other potential contributors.

Community
---------

There is not yet a clear Tika community. Instead we have a number of people and related projects with an understanding that a shared toolkit project would best serve everyone's interests. The primary goal of the incubating project is to build a self-sustaining community