http://issues.apache.org/bugzilla/show_bug.cgi?id=29842

           Summary: A lucene-based text indexer and extractors for popular binary formats
           Product: Slide
           Version: Nightly
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Search
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]

This is a text indexer and a set of text extractors for popular binary file formats. When a document is created or updated, the indexer uses the ExtractorManager to obtain a list of extractors for a given NodeDescriptor. The indexer extracts text from the document and uses Lucene to index the text for optimized searching. DASL searches that use the contains clause are handled by TextContainsExpression and TextContainsExpressionFactory.

Four extractors are included, one for each of the four most popular binary file formats. With the exception of PowerPoint, I used available libraries (MIT/BSD) to handle the actual extraction. I used the textmining library, a POI wrapper, to extract text from Word (POI's Word library doesn't strip the formatting tags). I used the PDFBox library to extract text from PDF files. I used the high-level Excel library in POI to extract text from Excel, and I used POI's low-level OLE library to extract text from PowerPoint.

I'm going to attach the JARs that are not already included with Slide. I'm also attaching the file log4j.jar, which is needed by PDFBox. I don't understand why the log4j JAR included with Slide doesn't work; I just put both in my WAR and it worked.
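To make the flow described above easier to picture, here is a minimal sketch of the extract-then-index step. It is not the attached code and does not use Slide's or Lucene's APIs: TextExtractor, ExtractorRegistry, ToyIndex and ExtractorSketch are hypothetical names, the registry is keyed by a plain content-type string rather than a NodeDescriptor, and a small in-memory map stands in for the Lucene index.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// All names below are hypothetical; none of this is Slide's or Lucene's API.

/** Turns the raw bytes of one document into plain text. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Maps a content type to the extractors registered for it. */
class ExtractorRegistry {
    private final Map<String, List<TextExtractor>> byType = new HashMap<>();

    void register(String contentType, TextExtractor extractor) {
        byType.computeIfAbsent(contentType, k -> new ArrayList<>()).add(extractor);
    }

    List<TextExtractor> extractorsFor(String contentType) {
        return byType.getOrDefault(contentType, Collections.<TextExtractor>emptyList());
    }
}

/** Toy inverted index (term -> document URIs), standing in for Lucene. */
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    void index(String uri, String text) {
        // Drop postings from any earlier version so an update replaces it.
        for (Set<String> uris : postings.values()) {
            uris.remove(uri);
        }
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(uri);
            }
        }
    }

    /** Single-term lookup, loosely analogous to a "contains" search. */
    Set<String> contains(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<String>emptySet());
    }
}

class ExtractorSketch {
    /** Called on create or update: run every matching extractor, then index the text. */
    static void indexDocument(ExtractorRegistry registry, ToyIndex index,
                              String uri, String contentType, byte[] content) {
        StringBuilder text = new StringBuilder();
        for (TextExtractor extractor : registry.extractorsFor(contentType)) {
            text.append(extractor.extract(content)).append(' ');
        }
        index.index(uri, text.toString());
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // A trivial extractor for plain text; binary formats would plug in here.
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

        ToyIndex index = new ToyIndex();
        indexDocument(registry, index, "/files/a.txt", "text/plain",
                "Lucene indexes text".getBytes(StandardCharsets.UTF_8));
        System.out.println(index.contains("lucene"));
    }
}
```

In this sketch a document whose content type has no registered extractor simply contributes no terms, so it never matches a search. Whether the real indexer behaves that way is not stated in this report.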
