Re: my experiences - Re: Parsing Word Docs

David Spencer Thu, 06 Mar 2003 10:27:59 -0800

Eric Anderson wrote:

Ok. Thanks for the tip.

I downloaded and compiled Antiword, and would like to now add it to my indexing class. However, I'm not sure how the application would be called,


How? You exec passing the file name and it prints the ascii text to stdout.
This method takes the file name (e.g. "c:/dir1/dir2/foo.doc") and returns
the output from antitext as one big string:

public static String getAntiText( String fn) throws Throwable { Process p = null; InputStream is = null; DataInputStream dis = null; try { p = rt.exec( new String[] { anti, fn}); is = p.getInputStream(); dis = new DataInputStream( is); String line; StringBuffer sb = new StringBuffer(); while ( ( line = dis.readLine()) != null) { //o.println( "READ: " +line); sb.append( line); sb.append( " "); } return sb.toString(); } finally { try { dis.close(); } catch( Throwable t) { } try { is.close(); } catch( Throwable t) { } try { p.waitFor(); } catch( Throwable t) { } try { p.destroy(); } catch( Throwable t) { } } } private static String anti = "c:/antiword/antiword.exe";

and from where it would be called.

From where? If the file is a word doc e.g. name ends with ".doc".

How will I have the class parse the document through Antiword to create the keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?

Hmmm not sure what the exact issue is but is this the answer:

doc.add( Field.Text( "contents", new StringReader( getAntiText( file_name_of_word_file))));

Your assistance is greatly appreciated.
Eric Anderson
815-505-6132
Quoting David Spencer <[EMAIL PROTECTED]>:
FYI I tried the textmining.org/poi combo and on a collection of 350 word
docs people have developed here over the years, and it failed on 33% of
them
with exceptions being thrown about the formats being invalid.
I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free *.exe, and it worked great ( well it seemed to process all the files fine).

I've had similar experiences with PDF - I tried the 3 or so freeware/java PDF text extractors and they were not as good as the exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/).

Not satisfying to a java developer but these work better than anything else I can find.

You get source and I use them on windows & linux, no prob.

Eric Anderson wrote:

I'm interested in using the textmining/textextraction utilities using Apache

POI, that Ryan was discussing. However, I'm having some difficulty

determining

what the insertion point would be to replace the default parser with the

word
parser.

Any assistance would be appreciated.
LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: my experiences - Re: Parsing Word Docs

Reply via email to