Re: Library for extracting text content from binaries

Jukka Zitting Mon, 24 Jul 2006 11:28:57 -0700

Hi,

Any interest in this? If not, is there some other Lucene project that
I should approach?


BR,

Jukka Zitting

On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:

Hi,

I'm a committer of the Apache Jackrabbit project, and I've recently
been working on improving the full text indexing support in
Jackrabbit. We've used standard Lucene Java as the embedded full text
search engine in Jackrabbit, but created our own set of parsers for
extracting text content from binary files. So far our parser interface
TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
proposal, TextExtractor [2], aims for a generic solution that converts
a generic InputStream into a Reader for passing to Lucene Java.
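
As a rough illustration of the idea only (this is not the interface
attached to JCR-415), an InputStream-to-Reader contract with no
dependencies beyond the JRE might look like the sketch below. The names
SimpleTextExtractor and PlainTextExtractor, the contentType parameter,
and the UTF-8 assumption are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical contract: turn a binary stream into a character stream
// that an indexer could consume. Not the actual JCR-415 interface.
interface SimpleTextExtractor {
    Reader extractText(InputStream stream, String contentType) throws IOException;
}

// Trivial implementation for plain text. A real extractor for PDF, Word,
// etc. would parse the format here instead of just decoding bytes.
class PlainTextExtractor implements SimpleTextExtractor {
    @Override
    public Reader extractText(InputStream stream, String contentType) throws IOException {
        if (!"text/plain".equals(contentType)) {
            throw new IOException("Unsupported content type: " + contentType);
        }
        // Assumption: the content is UTF-8 encoded.
        return new InputStreamReader(stream, StandardCharsets.UTF_8);
    }
}
```

The returned Reader could then be handed to whatever indexing code
accepts character streams; the caller stays unaware of the underlying
binary format.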

Before coming up with the proposal I tried looking for similar
solutions, but couldn't find any that would have satisfied my
requirement of no external dependencies other than the JRE. Your
o.a.nutch.parse.Parser interface, however, came quite close, and you
already have an extensive set of existing implementations, so I'd like
to leverage your work with the Parser implementations while finding a
way to avoid the full Nutch and Hadoop dependencies. I believe that
there are a number of other Lucene users who have similar needs.

Thus I'd like to ask if there would be interest in making your Parser
interface and implementations more easily accessible to external
projects, perhaps as a separate library. If you're interested, I'd be
happy to participate in such an effort.

[1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org/apache/jackrabbit/core/query/TextFilter.java?view=markup
[2] http://issues.apache.org/jira/browse/JCR-415


BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
