RE: [RESULT] [VOTE] Tika - a content analysis toolkit

2007-03-22 Thread Noel J. Bergman
Reports due: April, May, June, and then quarterly.

--- Noel



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[RESULT] [VOTE] Tika - a content analysis toolkit

2007-03-22 Thread Jukka Zitting

Hi,

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)


The vote passes with 9 binding +1 and 3 non-binding +1 votes.

The binding votes were:

   +1 Bertrand Delacretaz
   +1 Brett Porter
   +1 Davanum Srinivas
   +1 Doug Cutting
   +1 J Aaron Farr
   +1 Jukka Zitting
   +1 Niclas Hedhman
   +1 Robert Burrell Donkin
   +1 Yoav Shapira

The non-binding votes were:

   +1 Jeremias Maerki
   +1 Marshall Schor
   +1 Tony Ambrozie

Thanks for voting! I'll proceed to request the relevant infrastructure
and to include Tika in the Incubator books.

BR,

Jukka Zitting




Re: [VOTE] Tika - a content analysis toolkit

2007-03-19 Thread Jukka Zitting

Hi,

On 3/19/07, Petar Tahchiev <[EMAIL PROTECTED]> wrote:

> Although I am not part of the jakarta organisation (so I have no right to
> vote)


Only the votes from Incubator PMC members are binding, but this
certainly doesn't mean that others aren't allowed to participate in
the vote. In fact it's encouraged for people to cast their non-binding
votes (see how many people have voted with a "non-binding" qualifier
also in this thread) in whichever Apache votes they have an interest
in. Often the opinions and concerns of interested community members
are just as or even more important than those of the official decision
makers.


> I think that the proposal is more than interesting, so I am willing to help
> with whatever I can, once this project is being incubated. :-)


Excellent, thanks for the interest!

BR,

Jukka Zitting




Re: [VOTE] Tika - a content analysis toolkit

2007-03-19 Thread Jukka Zitting

Hi,

On 3/18/07, Jeremias Maerki <[EMAIL PROTECTED]> wrote:

> I would like to make the Tika people aware that we've recently started a
> little XMP framework as part of the XML Graphics Project. XMP is used
> with a number of document formats, with PDF its most prominent format.
> It could be interesting to work together on this.


That's very interesting, thanks for bringing this up!

BR,

Jukka Zitting




Re: [VOTE] Tika - a content analysis toolkit

2007-03-19 Thread Petar Tahchiev

On 3/19/07, Doug Cutting <[EMAIL PROTECTED]> wrote:


Jukka Zitting wrote:
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)
>
> The proposal can be found at
> http://wiki.apache.org/incubator/TikaProposal and is included below
> for archival purposes.

+1

Doug




Although I am not part of the jakarta organisation (so I have no right to
vote), I think that the proposal is more than interesting, so I am willing
to help with whatever I can, once this project is being incubated. :-)

--
Regards, Petar!
Karlovo, Bulgaria.

Public PGP Key at:
http://keyserver.linux.it/pks/lookup?op=get&search=0x1A15B53B761500F9
Key Fingerprint: AA16 8004 AADD 9C76 EF5B  4210 1A15 B53B 7615 00F9


Re: [VOTE] Tika - a content analysis toolkit

2007-03-19 Thread Doug Cutting

Jukka Zitting wrote:

> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)
>
> The proposal can be found at
> http://wiki.apache.org/incubator/TikaProposal and is included below
> for archival purposes.


+1

Doug




Re: [VOTE] Tika - a content analysis toolkit

2007-03-19 Thread robert burrell donkin

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

Hi,

I would like to call the Incubator PMC to vote to incubate the
proposed Tika project. I posted the proposal draft for review a while
ago, and the final proposal text is included below. The only changes
in the proposal text are the addition of Bertrand Delacretaz as the
third mentor and marking Apache Lucene as the sponsor based on a
recent Lucene PMC vote.

Please vote on the proposal that follows. The vote is open for the
next 72 hours and only votes from the Incubator PMC are binding.

[X] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)


- robert




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Bertrand Delacretaz

[+1]  Accept Tika as a new podling


-Bertrand




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Brett Porter

On 18/03/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

[X] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)





Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread J Aaron Farr
"Jukka Zitting" <[EMAIL PROTECTED]> writes:

> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.

[X] +1 Accept Tika as a new podling

Good luck!

-- 
  jaaron




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Tony Ambrozie

+1 (non-binding) - alignment with existing standards (such as Dublin Core,
etc) will be important...

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:


> Hi,
>
> I would like to call the Incubator PMC to vote to incubate the
> proposed Tika project. I posted the proposal draft for review a while
> ago, and the final proposal text is included below. The only changes
> in the proposal text are the addition of Bertrand Delacretaz as the
> third mentor and marking Apache Lucene as the sponsor based on a
> recent Lucene PMC vote.
>
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)
>
> The proposal can be found at
> http://wiki.apache.org/incubator/TikaProposal and is included below
> for archival purposes.
>
> Here's my +1
>
> BR,
>
> Jukka Zitting




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Marshall Schor

Here's my non-binding +1:

[ X ] +1 Accept Tika as a new podling

-Marshall Schor





Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Jeremias Maerki
non-binding +1 from me.

On 18.03.2007 10:51:37 Jukka Zitting wrote:

> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)

> Instead of implementing its own document parsers, Tika will use existing
> parser libraries like Jakarta POI [1] and PDFBox [2].

I would like to make the Tika people aware that we've recently started a
little XMP framework as part of the XML Graphics Project. XMP is used
with a number of document formats, with PDF its most prominent format.
It could be interesting to work together on this. I've also been in
contact with Ben Litchfield, author of PDFBox, about possibly joining
forces on the topic. However, not much has happened. At the moment, the
XMP code can only cover what is necessary to implement the very basics
of the PDF/A-1b specification. But I'm sure it can be easily enhanced to
fit a wider audience. I already see the need to take the code a step
further in order to cover extension schemas that are mandated by the
PDF/A-1 standard. Finally, the code doesn't absolutely have to stay
within XML Graphics, I guess, but that's only me speaking.

Links:
http://xmlgraphics.apache.org/commons/
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/



Jeremias Maerki (watching with interest)





Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Niclas Hedhman
On Sunday 18 March 2007 17:51, Jukka Zitting wrote:

> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)

+1

Cheers
Niclas Hedhman




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Yoav Shapira

Hola,

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

[ X ] +1 Accept Tika as a new podling


Good luck,

Yoav




Re: [VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Davanum Srinivas

+1 Accept Tika as a new podling

On 3/18/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I would like to call the Incubator PMC to vote to incubate the
> proposed Tika project. I posted the proposal draft for review a while
> ago, and the final proposal text is included below. The only changes
> in the proposal text are the addition of Bertrand Delacretaz as the
> third mentor and marking Apache Lucene as the sponsor based on a
> recent Lucene PMC vote.
>
> Please vote on the proposal that follows. The vote is open for the
> next 72 hours and only votes from the Incubator PMC are binding.
>
> [ ] +1 Accept Tika as a new podling
> [ ] -1 Do not accept the new podling (provide reason, please)
>
> The proposal can be found at
> http://wiki.apache.org/incubator/TikaProposal and is included below
> for archival purposes.
>
> Here's my +1
>
> BR,
>
> Jukka Zitting




[VOTE] Tika - a content analysis toolkit

2007-03-18 Thread Jukka Zitting

Hi,

I would like to call the Incubator PMC to vote to incubate the
proposed Tika project. I posted the proposal draft for review a while
ago, and the final proposal text is included below. The only changes
in the proposal text are the addition of Bertrand Delacretaz as the
third mentor and marking Apache Lucene as the sponsor based on a
recent Lucene PMC vote.

Please vote on the proposal that follows. The vote is open for the
next 72 hours and only votes from the Incubator PMC are binding.

[ ] +1 Accept Tika as a new podling
[ ] -1 Do not accept the new podling (provide reason, please)

The proposal can be found at
http://wiki.apache.org/incubator/TikaProposal and is included below
for archival purposes.

Here's my +1

BR,

Jukka Zitting



Tika, a content analysis toolkit
================================

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other characteristics
of existing documents and for extracting structured text content from
the documents.

The toolkit is targeted especially at search engines and other content
indexing and analysis tools, but will also be useful for other applications
that need to extract meaningful information from documents that might
be presented as nothing more than binary streams.

Instead of implementing its own document parsers, Tika will use existing
parser libraries like Jakarta POI [1] and PDFBox [2].

Background
----------

The initial idea for the Tika project was voiced in April 2006 by
Jérôme Charron and Chris A. Mattmann on the Nutch mailing list. The Nutch
parser framework and other content analysis features were seen as
value-added components that would also benefit other projects. The idea
received positive feedback, but lacked momentum.

The idea was revisited in August 2006 when Jukka Zitting from the
Jackrabbit project contacted Nutch for possible cooperation with similar
ideas. The original Tika idea gained extra momentum and a Google Code
project was set up as a staging area for prototype code before deciding
how to best handle the setup of a new project. After a few initial
commits the activity again declined.

In January 2007 the idea started gaining more momentum when Rida Benjelloun
offered to contribute the Lius project [3] to Apache Lucene and when Mark
Harwood also started looking for a generic toolkit like Tika.

This proposal is the result of the above efforts and related discussions
both in private and on various public forums. Some alternatives to
incubation, like Apache Labs [4] or Jakarta Commons [5], came up during
the discussions but we believe that taking the project to the Incubator
is the best way to start growing a viable community to sustain the Tika
toolkit.

Rationale
---------

There is ever more demand for tools that automatically analyze and index
documents in various formats. Search engines, content repositories, and
other tools often need to extract metadata and text content from documents
given as little or nothing more than a simple octet stream. While there
are a number of existing parser libraries for various document types,
each of them comes with a custom API and there are no generic tools for
automatically determining which parser to use for which documents.
Currently many projects end up creating their own custom content analysis
and extraction tools.

The Tika project attempts to remove this duplication of effort. We
believe that by pooling the efforts of multiple projects we will be able
to create a generic toolkit that exceeds the capabilities and quality of
the custom solutions of any single project. A generic toolkit project
will also provide common ground for the developers of parser libraries
and content applications to interact.

Initial Goals
-------------

The initial goals of the proposed project are:

  * Viable community around the Tika codebase

  * Active relationships and possible cooperation with related
    projects and communities

  * Generic parser API for extracting structured text content from
    various document formats

  * Flexible metadata detection and extraction API

  * Java implementations of the metadata standards mentioned below


Current Status
==============

Meritocracy
-----------

All the initial committers are familiar with the meritocracy principles
of Apache, and have already worked on the various source codebases. We will
also follow the normal meritocracy rules with other potential contributors.

Community
---------

There is not yet a clear Tika community. Instead we have a number of people
and related projects with an understanding that a shared toolkit project
would best serve everyone's interests. The primary goal of the incubating
project is to build a self-sustaining community