Hi Jukka,

  Thanks for your email. Jerome Charron and I proposed a project with a
similar goal, which we wanted to dub "Tika". Tika would effectively be a
Lucene sub-project, and would factor out of Nutch some of the capabilities
you mention below, including:

1. MimeType repository
2. Parser interface and Parser plugins
3. Metadata infrastructure
4. LanguageIdentifier

And a few others. Here is the mailing list thread where we discussed this a
few months back:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200604.mbox/%3cc82
[EMAIL PROTECTED]

Jerome and I have been quite busy lately, however, and we haven't had a
chance to draft the proposal to send to the Lucene PMC. Doug (and a few
others) told us that if we garner enough support and feel that the project
would make a significant contribution as its own Lucene sub-project, we
should email the PMC and see what happens. If you're interested in this
idea, it might be good to contact Jerome and me off-list, and we could get
going on a proposal.

Thanks!

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
> Sent: Monday, July 24, 2006 11:29 AM
> To: [email protected]
> Subject: Re: Library for extracting text content from binaries
> 
> Hi,
> 
> Any interest in this? If not, is there some other Lucene project that
> I should approach?
> 
> BR,
> 
> Jukka Zitting
> 
> On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm a committer of the Apache Jackrabbit project, and I've recently
> > been working on improving the full text indexing support in
> > Jackrabbit. We've used standard Lucene Java as the embedded full text
> > search engine in Jackrabbit, but created our own set of parsers for
> > extracting text content from binary files. So far our parser interface
> > TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
> > proposal, TextExtractor, [2] aims for a generic solution that converts
> > a generic InputStream into a Reader for passing to Lucene Java.
> >
> > Before coming up with the proposal I tried looking for similar
> > solutions, but couldn't find any that would have satisfied my
> > requirement of no external dependencies other than the JRE. Your
> > o.a.nutch.parse.Parser interface however came quite close, and you
> > already have an extensive set of existing implementations, so I'd like
> > to leverage your work with the Parser implementations while finding a
> > way to avoid the full Nutch and Hadoop dependencies. I believe that
> > there are a number of other Lucene users who have similar needs.
> >
> > Thus I'd like to ask if there would be interest in making your Parser
> > interface and implementations more easily accessible to external
> > projects, perhaps as a separate library. If you're interested, I'd be
> > happy to participate in such an effort.
> >
> > [1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org/apache/jackrabbit/core/query/TextFilter.java?view=markup
> > [2] http://issues.apache.org/jira/browse/JCR-415
> >
> >
> > BR,
> >
> > Jukka Zitting
> >
> > --
> > Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
> > Software craftsmanship, JCR consulting, and Java development
> >

