Re: [Proposal] New Lucene sub-project
Jérôme Charron wrote:
> we think it would be a good idea to split Nutch into a new sub-project based on content analysis manipulation. The components we have identified are:
>
> 1. MimeType Repository
> 2. Language Identifier
> 3. Content Signature (MD5Signature / TextProfileSignature / ...)
> (4. Generic Meta Data Infrastructure)
> (5. Charset Detector)
> (6. Parse Plugins Framework)
>
> The idea is to expose these pieces of code into a standalone lib, since we are convinced they could be useful in many other projects than Nutch.

This sounds like it could arguably be six new projects.

Perhaps another way to approach this is as a build process. Perhaps Nutch, like Lucene Java, should start providing more than a single jar file. Perhaps a release (both nightly and numbered) should consist of both a composite Nutch tar file and also a suite of sub tar files?

That said, if you're convinced that these components form a coherent, independently useful subset of Nutch, and that you have a sustainable set of committers who will maintain and regularly release this, then please submit a proposal to the Lucene PMC (pmc at lucene.a.o). The PMC can discuss it and, eventually, vote to decide.

Doug
Re: [Proposal] New Lucene sub-project
I also think it makes sense -- we use the language identifier component in Carrot2 and we'd love to just have a single library for this functionality. As always, some extra managerial effort is unfortunately needed to drive a stand-alone project.

D.

Chris Mattmann wrote:
> Hi Otis,
>
> > This thread seems to have gotten very little attention.
> > Jérôme - I'm all for extracting sub-libraries that can really live on their own and are substantial enough to warrant "their own identity".
> >
> > Personally, I'm the most interested in the Language Identifier plugin becoming a standalone, Nutch-independent piece. Doug had suggested we move it to Lucene's contrib section. If you think it makes sense to have some of these things lumped together, that's fine, too. It looks like Language Identifier and Charset Detector may go well together.
> >
> > Is this something you want/will push for and make happen?
>
> Just to add to this, it's something that I would push for whole-heartedly. In addition to Jerome, I would be happy to dedicate time to this sub-project, and feel it's quite worthy of being its own stand-alone library.
>
> Just my two cents, thanks!
>
> Cheers,
>   Chris
>
> > Otis
RE: [Proposal] New Lucene sub-project
Hi Otis,

> This thread seems to have gotten very little attention.
> Jérôme - I'm all for extracting sub-libraries that can really live on their own and are substantial enough to warrant "their own identity".
>
> Personally, I'm the most interested in the Language Identifier plugin becoming a standalone, Nutch-independent piece. Doug had suggested we move it to Lucene's contrib section. If you think it makes sense to have some of these things lumped together, that's fine, too. It looks like Language Identifier and Charset Detector may go well together.
>
> Is this something you want/will push for and make happen?

Just to add to this, it's something that I would push for whole-heartedly. In addition to Jerome, I would be happy to dedicate time to this sub-project, and feel it's quite worthy of being its own stand-alone library.

Just my two cents, thanks!

Cheers,
  Chris

> Otis
Re: [Proposal] New Lucene sub-project
This thread seems to have gotten very little attention.
Jérôme - I'm all for extracting sub-libraries that can really live on their own and are substantial enough to warrant "their own identity".

Personally, I'm the most interested in the Language Identifier plugin becoming a standalone, Nutch-independent piece. Doug had suggested we move it to Lucene's contrib section. If you think it makes sense to have some of these things lumped together, that's fine, too. It looks like Language Identifier and Charset Detector may go well together.

Is this something you want/will push for and make happen?

Otis

----- Original Message -----
From: Jérôme Charron <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Friday, April 7, 2006 4:26:54 AM
Subject: [Proposal] New Lucene sub-project

Hi all,

While chatting with Chris Mattmann, it seems to be evident to us that there is a need for a new sub-project within Lucene.

For now, Lucene's sub-projects used in Nutch are:
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.

Since Nutch contains some value-added pieces of code that focus on content analysis, we think it would be a good idea to split Nutch into a new sub-project based on content analysis manipulation. The components we have identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)

The idea is to expose these pieces of code into a standalone lib, since we are convinced they could be useful in many other projects than Nutch.
The benefits will be to have some code more widely used / tested / contributed.
If this proposal is accepted, we have a candidate name for this new project: Tika (comes from my son ;-) )

Any comment is welcome.

Jérôme
Re: [Proposal] New Lucene sub-project
> I found your idea very interesting. I will be interested to contribute to
> the Parse Plugins Framework. I have developed a similar one using Lucene.
> The project name is Lius.

Hi Rida,

Yes, I know Lius. It seems very interesting, and I think it would be very interesting too if we could merge our efforts into a common Lucene sub-project (but for the moment, it seems that the Tika project hasn't attracted much interest...?)

> If you are interested please let me know.

If nutch-dev is interested in creating such a project, you are welcome.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: [Proposal] New Lucene sub-project
Hi Jérôme,

I found your idea very interesting. I will be interested to contribute to the Parse Plugins Framework. I have developed a similar one using Lucene. The project name is Lius. If you are interested please let me know.

On 4/7/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> While chatting with Chris Mattmann, it seems to be evident to us that there is a need for a new sub-project within Lucene.
>
> For now, Lucene's sub-projects used in Nutch are:
> 1. Lucene-java - The basis for search technology
> 2. Hadoop - The distributed computing platform
> 3. Nutch - The search engine that relies on Lucene and Hadoop.
>
> Since Nutch contains some value-added pieces of code that focus on content analysis, we think it would be a good idea to split Nutch into a new sub-project based on content analysis manipulation. The components we have identified are:
>
> 1. MimeType Repository
> 2. Language Identifier
> 3. Content Signature (MD5Signature / TextProfileSignature / ...)
> (4. Generic Meta Data Infrastructure)
> (5. Charset Detector)
> (6. Parse Plugins Framework)
>
> The idea is to expose these pieces of code into a standalone lib, since we are convinced they could be useful in many other projects than Nutch.
> The benefits will be to have some code more widely used / tested / contributed.
> If this proposal is accepted, we have a candidate name for this new project: Tika (comes from my son ;-) )
>
> Any comment is welcome.
>
> Jérôme
[Proposal] New Lucene sub-project
Hi all,

While chatting with Chris Mattmann, it seems to be evident to us that there is a need for a new sub-project within Lucene.

For now, Lucene's sub-projects used in Nutch are:
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.

Since Nutch contains some value-added pieces of code that focus on content analysis, we think it would be a good idea to split Nutch into a new sub-project based on content analysis manipulation. The components we have identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)

The idea is to expose these pieces of code into a standalone lib, since we are convinced they could be useful in many other projects than Nutch.
The benefits will be to have some code more widely used / tested / contributed.
If this proposal is accepted, we have a candidate name for this new project: Tika (comes from my son ;-) )

Any comment is welcome.

Jérôme