Re: [Proposal] New Lucene sub-project

2006-04-24 Thread Doug Cutting

Jérôme Charron wrote:

> we think it would be a good idea to split Nutch into a new sub-project based
> on content analysis and manipulation. The components we have identified are:
>
> 1. MimeType Repository
> 2. Language Identifier
> 3. Content Signature (MD5Signature / TextProfileSignature / ...)
> (4. Generic Meta Data Infrastructure)
> (5. Charset Detector)
> (6. Parse Plugins Framework)
>
> The idea is to expose these pieces of code as a standalone lib, since we
> are convinced they could be useful in many projects other than Nutch.


This sounds like it could arguably be six new projects.  Perhaps another 
way to approach this is as a build process.  Perhaps Nutch, like Lucene 
Java, should start providing more than a single jar file.  Perhaps a 
release (both nightly and numbered) should consist of both a composite 
Nutch tar file and also a suite of sub tar files?


That said, if you're convinced that these components form a coherent, 
independently useful subset of Nutch, and that you have a sustainable 
set of committers who will maintain and regularly release this, then 
please submit a proposal to the Lucene PMC (pmc at lucene.a.o).  The PMC 
can discuss it and, eventually, vote to decide.


Doug


Re: [Proposal] New Lucene sub-project

2006-04-24 Thread Dawid Weiss


I also think it makes sense -- we use the language identifier component in 
Carrot2 and we'd love to just have a single library for this 
functionality. As always, some extra managerial effort is unfortunately 
needed to drive a stand-alone project.


D.


RE: [Proposal] New Lucene sub-project

2006-04-24 Thread Chris Mattmann
Hi Otis,

> This thread seems to have gotten very little attention.
> Jérôme - I'm all for extracting sub-libraries that can really live on their
> own and are substantial enough to warrant "their own identity".
> 
> Personally, I'm most interested in the Language Identifier plugin becoming
> a standalone, Nutch-independent piece.  Doug had suggested we move it to
> Lucene's contrib section.  If you think it makes sense to have some of
> these things lumped together, that's fine, too.  It looks like Language
> Identifier and Charset Detector may go well together.
> 
> Is this something you want/will push for and make happen?

Just to add to this, it's something that I would push for wholeheartedly.
In addition to Jérôme, I would be happy to dedicate time to this
sub-project, and feel it's quite worthy of being its own stand-alone
library.

Just my two cents, thanks!

Cheers,
  Chris






Re: [Proposal] New Lucene sub-project

2006-04-24 Thread ogjunk-nutch
This thread seems to have gotten very little attention.
Jérôme - I'm all for extracting sub-libraries that can really live on their own 
and are substantial enough to warrant "their own identity".

Personally, I'm most interested in the Language Identifier plugin becoming a 
standalone, Nutch-independent piece.  Doug had suggested we move it to Lucene's 
contrib section.  If you think it makes sense to have some of these things 
lumped together, that's fine, too.  It looks like Language Identifier and 
Charset Detector may go well together.

Is this something you want/will push for and make happen?

Otis






Re: [Proposal] New Lucene sub-project

2006-04-10 Thread Jérôme Charron
> I found your idea very interesting. I would be interested in contributing to
> the Parse Plugins Framework. I have developed a similar one using Lucene.
> The project name is Lius.

Hi Rida,

Yes, I know Lius.
It looks very interesting, and I think it would be great if we could merge
our efforts into a common Lucene sub-project
(but for the moment, it seems that the Tika project isn't attracting much
interest...?)

> If you are interested please let me know.

If nutch-dev is interested in creating such a project, you are welcome.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Proposal] New Lucene sub-project

2006-04-07 Thread Rida Benjelloun
Hi Jérôme,

I found your idea very interesting. I would be interested in contributing to
the Parse Plugins Framework. I have developed a similar one using Lucene. The
project name is Lius.

If you are interested please let me know.





[Proposal] New Lucene sub-project

2006-04-07 Thread Jérôme Charron
Hi all,

While chatting with Chris Mattmann, it seems to be evident to us that there
is a need for a new sub-project within Lucene.

For now, Lucene's sub-projects used in Nutch are :
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.

Since Nutch contains some value-added pieces of code that focus on content
analysis, we think it would be a good idea to split Nutch into a new
sub-project based on content analysis and manipulation. The components we
have identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)

The idea is to expose these pieces of code as a standalone lib, since we
are convinced they could be useful in many projects other than Nutch.
The benefit would be code that is more widely used, tested, and
contributed to.
If this proposal is accepted, we have a candidate name for this new project:
Tika (the name comes from my son ;-) )

Any comments are welcome.

Jérôme