Hi Thomas,

We encountered the exact same problem. I ran some unit tests, and org.textmining.text.extraction.WordExtractor does not work very well. As you described, it omits whole sections of documents (apparently triggered by certain formatting fields present in the document).
I noticed that the latest 3.0 alpha1 build of POI (checked out from svn) contains a new WordExtractor class in the scratchpad area: org.apache.poi.hwpf.extractor.WordExtractor. This class has an almost identical API to the org.textmining equivalent. I did some preliminary testing, and the new class works much better at text extraction. All my Word documents are getting indexed now.

I created my own MsWordTextFilter using this alternate class, and it is working well, but I need to do more testing (especially on the other POI-based filters, to make sure the new POI JAR files didn't break them).

Hope this information is helpful.

Regards,
Carina

-----Original Message-----
From: thomasg [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 2:52 AM
To: [email protected]
Subject: MsWordTextFilter Problem

Has anyone encountered problems with this text filter? I am testing the text extraction of quite a large document (6MB worth of Thinking In Java by captain Bruce Eckel). Searching was not producing the expected results. I took the Reader object generated by the MsWordTextFilter, converted it into a String, and wrote it to a file. Inspection shows that most of the document has been omitted. The missing part is in the middle of the file, and there are no particularly unusual contents that mark the start of the missing section. I've tested larger docs that work fine, so it's a bit of a mystery.

Cheers,
Thomas
--
View this message in context: http://www.nabble.com/MsWordTextFilter-Problem-t1626136.html#a4406009
Sent from the Jackrabbit - Dev forum at Nabble.com.
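
For anyone who wants to repeat the check described in the original message (dumping the text filter's Reader to a file and inspecting it by hand), here is a minimal sketch that uses only the JDK. The class and method names (ReaderDump, readAll, dump) are made up for illustration, and it assumes Java 7 or later and UTF-8 output. How the Reader is obtained from MsWordTextFilter is deliberately left out, since that depends on the Jackrabbit version in use.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Jackrabbit or POI.
final class ReaderDump {

    /** Drains the reader into a String and closes the reader afterwards. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader r = reader) {
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /**
     * Writes everything the reader produces to the target file as UTF-8
     * and returns the number of characters extracted, which is handy
     * for comparing two extractors on the same document.
     */
    static int dump(Reader reader, Path target) throws IOException {
        String text = readAll(reader);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        return text.length();
    }
}
```

Running dump once with the Reader from the existing filter and once with the Reader from a filter built on the alternate extractor, then comparing the two output files, should show quickly whether a section in the middle of the document is being dropped.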
