DO NOT REPLY [Bug 29842] New: - A lucene-based text indexer and extractors for popular binary formats

bugzilla Mon, 28 Jun 2004 06:57:36 -0700

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=29842>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://issues.apache.org/bugzilla/show_bug.cgi?id=29842

A lucene-based text indexer and extractors for popular binary formats

           Summary: A lucene-based text indexer and extractors for popular
                    binary formats
           Product: Slide
           Version: Nightly
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Search
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


This is a text indexer and a set of text extractors for popular binary file 
formats.

When a document is created or updated, the indexer uses the ExtractorManager to 
obtain a list of extractors for a given NodeDescriptor.  The indexer then uses 
those extractors to pull the text out of the document and uses Lucene to index 
the text for optimized searching.
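
The create/update flow described above can be sketched roughly as follows.
This is a minimal, self-contained illustration and not the attached code: the
TextExtractor interface, the lookup keyed by content type (rather than by
NodeDescriptor), and the in-memory inverted index standing in for Lucene are
all assumptions made for the example.  None of the names below are Slide or
Lucene APIs.

```java
import java.util.*;

// Hypothetical extractor contract: turns raw document bytes into plain text.
interface TextExtractor {
    String extract(byte[] content);
}

// Hypothetical stand-in for the ExtractorManager: content type -> extractors.
class SketchExtractorManager {
    private final Map<String, List<TextExtractor>> byType = new HashMap<>();

    void register(String contentType, TextExtractor extractor) {
        byType.computeIfAbsent(contentType, k -> new ArrayList<>()).add(extractor);
    }

    List<TextExtractor> getExtractors(String contentType) {
        return byType.getOrDefault(contentType, Collections.emptyList());
    }
}

// Toy inverted index standing in for Lucene: term -> set of document URIs.
class SketchIndexer {
    private final SketchExtractorManager manager;
    private final Map<String, Set<String>> postings = new HashMap<>();

    SketchIndexer(SketchExtractorManager manager) {
        this.manager = manager;
    }

    // Called when a document is created or updated.
    void index(String uri, String contentType, byte[] content) {
        // Drop postings left over from an earlier revision of this document.
        for (Set<String> uris : postings.values()) {
            uris.remove(uri);
        }
        for (TextExtractor extractor : manager.getExtractors(contentType)) {
            String text = extractor.extract(content).toLowerCase(Locale.ROOT);
            for (String term : text.split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(uri);
                }
            }
        }
    }

    // Returns the URIs of documents whose extracted text contained the term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

In this sketch a document whose content type has no registered extractor is
simply not indexed, and re-indexing a document replaces its earlier terms.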

DASL searches that use the contains clause are handled by 
TextContainsExpression and TextContainsExpressionFactory.
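
As a rough illustration of how a contains clause could be turned into an index
lookup, the sketch below pairs an expression object with a factory.  The class
names carry a Sketch prefix to make clear they are hypothetical; how the real
TextContainsExpression and TextContainsExpressionFactory are implemented is
not shown in this report.  The term-to-URI map is an assumed stand-in for the
Lucene index, and requiring every word of the search text to match is a
simplification chosen for the example.

```java
import java.util.*;

// Hypothetical expression: matches documents containing every word of the search text.
class SketchContainsExpression {
    private final List<String> terms;

    SketchContainsExpression(List<String> terms) {
        this.terms = terms;
    }

    // 'index' is an assumed term -> document URI map standing in for the Lucene index.
    Set<String> execute(Map<String, Set<String>> index) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> hits = index.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);   // first term seeds the result
            } else {
                result.retainAll(hits);         // later terms narrow it (logical AND)
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}

// Hypothetical factory: builds an expression from the text of a contains clause.
class SketchContainsExpressionFactory {
    SketchContainsExpression create(String containsText) {
        List<String> terms = new ArrayList<>();
        for (String t : containsText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return new SketchContainsExpression(terms);
    }
}
```

Keeping the parsing in the factory and the lookup in the expression mirrors
the two-class split named above, but the division of work shown here is only a
guess at one reasonable arrangement.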

Four extractors are included, one for each of the four most popular binary 
file formats.  With the exception of PowerPoint, I used available libraries 
(MIT/BSD) to handle the actual extraction.  I used the textmining library, a 
POI wrapper, to extract text from Word (POI's Word library doesn't strip the 
formatting tags).  I used the PDFBox library to extract text from PDF files.  
I used the high-level Excel library in POI to extract text from Excel, and I 
used POI's low-level OLE library to extract text from PowerPoint.

I'm going to attach the JARs that are not already included with Slide.  I'm 
also attaching log4j.jar, which PDFBox needs.  I don't understand why the 
log4j JAR included with Slide doesn't work; I just put both in my WAR and it 
worked.

