RE: MsWordTextFilter Problem

Lansing, Carina S Wed, 07 Jun 2006 16:21:47 -0700

Hi Thomas,

We encountered the exact same problem.  I did some unit tests, and the
org.textmining.text.extraction.WordExtractor does not work very well.
As you described, it omits whole sections of documents (apparently
triggered by certain formatting fields present in the document).


I noticed in the latest 3.0 alpha1 build of POI (checked out from svn),
that it contains a new WordExtractor class under the scratchpad area:
org.apache.poi.hwpf.extractor.WordExtractor.  This class has an almost
identical API to the org.textmining equivalent.  I did some preliminary
testing, and this new class works much better at text extraction.  All
my Word documents are getting indexed now.  I created my own
MsWordTextFilter using this alternate class, and it is working well, but
I need to do more testing (especially on the other POI-based filters, to
make sure they didn't break from the new POI jarfiles).  Hope this
information is helpful.
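
For the further testing mentioned above, a check along the following
lines might help catch omitted sections. It is only a sketch: the class
name ExtractionCheck and its methods are hypothetical, and it assumes
you have already obtained the extracted text as a String (from either
WordExtractor) and know a few phrases that occur at the start, middle
and end of the source document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Jackrabbit or POI.
public class ExtractionCheck {

    // Collapse runs of whitespace and lowercase, so that line breaks and
    // case differences in the extracted text do not cause false alarms.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Return the expected phrases that do NOT occur in the extracted text.
    static List<String> missingPhrases(String extracted, List<String> expected) {
        String haystack = normalize(extracted);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder text standing in for the extractor output.
        String extracted = "Preface ...\nChapter 1 ...\nIndex";
        List<String> expected = Arrays.asList("preface", "chapter 7", "index");
        System.out.println("Missing phrases: " + missingPhrases(extracted, expected));
    }
}
```

An empty result does not prove the extraction is complete, but a
non-empty one points at the part of the document that went missing.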

Regards,
Carina

-----Original Message-----
From: thomasg [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 16, 2006 2:52 AM
To: [email protected]
Subject: MsWordTextFilter Problem


Has anyone encountered problems with this text filter? I am testing the
text extraction of quite a large document (6MB worth of Thinking In Java
by captain Bruce Eckel). Searching was not producing the expected
results, so I took the Reader object generated by the MsWordTextFilter,
converted it into a String and wrote it to a file. Inspection shows that
most of the document has been omitted. The missing part is in the middle
of the file, and there are no particularly unusual contents that mark the
start of the missing section. I've tested larger docs that work fine, so
it's a bit of a mystery.
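
The inspection step described above (Reader to String to file) can be
sketched roughly as follows. This is an illustration only: ReaderDump
and its methods are hypothetical names, and a StringReader stands in for
the Reader that the text filter would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Jackrabbit.
public class ReaderDump {

    // Drain the Reader into a String, reading in fixed-size chunks.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Write the extracted text to a UTF-8 file and return its length in
    // characters, which is a quick first hint of whether text is missing.
    static int dump(Reader reader, Path out) throws IOException {
        String text = readAll(reader);
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return text.length();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use this Reader comes from the text filter.
        Reader extracted = new StringReader("chapter one ... chapter two");
        Path out = Files.createTempFile("extracted", ".txt");
        int chars = dump(extracted, out);
        System.out.println(chars + " characters written to " + out);
    }
}
```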

Cheers, Thomas
--
View this message in context:
http://www.nabble.com/MsWordTextFilter-Problem-t1626136.html#a4406009
Sent from the Jackrabbit - Dev forum at Nabble.com.
