Hi Jukka,

Thanks for your email. Jerome Charron and I proposed a project with a
similar goal in mind, which we wanted to dub "Tika". Tika would effectively
be a Lucene sub-project, and would factor out of Nutch some of the
capabilities you mention below, including:

1. MimeType repository
2. Parser interface and Parser plugins
3. Metadata infrastructure
4. LanguageIdentifier

And a few others. Here is the mailing list thread from the discussion that
we had a few months back:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200604.mbox/%3cc82 [EMAIL PROTECTED]

Jerome and I have been quite busy lately, however, and we haven't had a
chance to draft the proposal to send to the Lucene PMC. Doug (and a few
others) told us that if we garner enough support and feel that the project
would make a significant contribution as its own Lucene sub-project, we
should email the PMC and see what happens.

If you're interested in this idea, it might be a good idea to contact
Jerome and me off-list, and maybe we could get going on a proposal.

Thanks!

Cheers,
Chris

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
> Sent: Monday, July 24, 2006 11:29 AM
> To: [email protected]
> Subject: Re: Library for extracting text content from binaries
>
> Hi,
>
> Any interest in this? If not, is there some other Lucene project that
> I should approach?
>
> BR,
>
> Jukka Zitting
>
> On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm a committer of the Apache Jackrabbit project, and I've recently
> > been working on improving the full text indexing support in
> > Jackrabbit.
> > We've used standard Lucene Java as the embedded full text
> > search engine in Jackrabbit, but created our own set of parsers for
> > extracting text content from binary files. So far our parser interface
> > TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
> > proposal, TextExtractor, [2] aims for a generic solution that converts
> > a generic InputStream into a Reader for passing to Lucene Java.
> >
> > Before coming up with the proposal I tried looking for similar
> > solutions, but couldn't find any that would have satisfied my
> > requirement of no external dependencies other than the JRE. Your
> > o.a.nutch.parse.Parser interface however came quite close, and you
> > already have an extensive set of existing implementations, so I'd like
> > to leverage your work with the Parser implementations while finding a
> > way to avoid the full Nutch and Hadoop dependencies. I believe that
> > there are a number of other Lucene users who have similar needs.
> >
> > Thus I'd like to ask if there would be interest in making your Parser
> > interface and implementations more easily accessible to external
> > projects, perhaps as a separate library. If you're interested, I'd be
> > happy to participate in such an effort.
> >
> > [1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org/apache/jackrabbit/core/query/TextFilter.java?view=markup
> > [2] http://issues.apache.org/jira/browse/JCR-415
> >
> >
> > BR,
> >
> > Jukka Zitting
> >
> > --
> > Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
> > Software craftsmanship, JCR consulting, and Java development

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
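
For readers following the thread: the quoted message describes an extractor
that turns a generic InputStream into a Reader that can be handed to Lucene
Java, with no dependencies beyond the JRE. The sketch below illustrates that
shape only. The names SimpleTextExtractor, PlainTextExtractor and
TextExtractorDemo are hypothetical, made up for this example; they are not
the TextExtractor from JCR-415 or the o.a.nutch.parse.Parser interface, whose
actual signatures are not given in this thread.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical interface for illustration; not the JCR-415 or Nutch API.
interface SimpleTextExtractor {
    // True if this extractor can handle the given MIME type.
    boolean supports(String contentType);

    // Converts a binary stream into a character stream for indexing.
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Trivial JRE-only implementation: decodes text/plain with the given charset.
final class PlainTextExtractor implements SimpleTextExtractor {
    @Override
    public boolean supports(String contentType) {
        return "text/plain".equalsIgnoreCase(contentType);
    }

    @Override
    public Reader extractText(InputStream stream, String contentType, String encoding) {
        // Assumption: fall back to UTF-8 when no encoding is supplied.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new BufferedReader(new InputStreamReader(stream, charset));
    }
}

final class TextExtractorDemo {
    // Drains a Reader into a String; stands in for the indexing step.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        SimpleTextExtractor extractor = new PlainTextExtractor();
        InputStream binary = new ByteArrayInputStream(
                "Hello, Lucene".getBytes(StandardCharsets.UTF_8));
        if (extractor.supports("text/plain")) {
            Reader text = extractor.extractText(binary, "text/plain", "UTF-8");
            System.out.println(readAll(text));
        }
    }
}
```

In a real setup the returned Reader would be passed to the indexer instead of
being drained into a String, and extractors for other formats would be
selected by MIME type through supports().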
