Chris, Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc? Thanx, - Dmitry
________________________________ From: Christiaan Fluit [mailto:[EMAIL PROTECTED] Sent: Thu 2/9/2006 4:09 AM To: java-user@lucene.apache.org Subject: Re: Word files & Build vs. Buy? Hello all, I'm replying to two threads at once as what I have to say relates to both. My company recently started an open source project called Aperture (http://sourceforge.net/projects/aperture), together with the German DFKI institute. The project is still very much in alpha stage, but I do believe we already have some code parts that could help people here. Basically, it's a framework for crawling information sources (file systems, mail folders, websites, ...) and extracting as much information from it as possible. Besides full-text extraction, we also put a lot of effort in extraction and modeling of the metadata occurring in these sources and document formats. Both parties have some proprietary code lying on the shelf that is being open sourced and ported to the Aperture architecture. Now on to the raised questions: [EMAIL PROTECTED] wrote: > WordDocument wd = new WordDocument(is); [EMAIL PROTECTED] wrote: > MS Word - I know that POI exists, but development on the Word portion > seems to have stopped, and there are a lot of nasty looking bugs in > their DB. Since we're involved in dealing with contracts, many of our > Word files are large and complicated. How has everyone's experience > with POI's Word parsing been? My experience is that the WordDocument class crashes on about 25% of the documents, i.e. it throws some sort of Exception. I've tested POI 2.5.1-final as well as the current code in CVS, but both produce this result. I even suspect the output to be 100% the same, but I haven't verified this. Another reason I don't like this class is that it operates on an InputStream and internally creates a POIFSFileSystem which you cannot access, so that it becomes hard to extract document metadata as well (for which you need the PFSFS) without buffering the entire InputStream. The same applies to TextMining's WordExtractor, which also operates on top of lower level POI components. I've recently committed a WordExtractor to Aperture that uses its own code operating on these lower level POI datastructures, which works a lot better, failing only 5% of my 300 test docs. I don't pretend to understand all the internals of the POI APIs, but it Works For Me. When POI throws an exception, the WordExtractor will revert to applying a heuristic string extraction algorithm to extract as much human-readable text as possible from the binary stream, which works quite well on MS Office files, i.e. the output is reasonably well for indexing purposes. Be sure to checkout Aperture from CVS as this code isn't part of the alpha 1 release. A next official release is expected in a month. [EMAIL PROTECTED] wrote: > RTF - javax.swing looks fine, we use those classes already. Swing's RTFEditorKit does indeed work surpringly well. "Surprisingly" because in the past I had many issues with it, typically throwing exceptions on 25-50% of my test documents. Recently I haven't seen a single one (using Java 1.5.0), so either I am now feeding it a more optimal document set or the Swing people have worked on the implementation. In that case people using Java 1.4.x may see different results. > Word Perfect - There doesn't seem to be any converters for this format? I'm actively working on this :) We have some proprietary code that will become part of Aperture. Right now I cannot say how well it performs in practice though, although we've never had complaints with our proprietary apps. The code uses a heuristic string extraction algorithm tuned for WordPerfect documents. This may be an issue, e.g. when you also want to display the extraction results to end users. If you're interested: one way you can help me get the most out of it is by sending me some example WordPerfect documents because I hardly have those on my hard drive. Fake documents made with very new or old WordPerfect versions are also most welcome. Regards, Chris http://aduna.biz -- --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]