Re: Need advice: what Word/Excel/PowerPoint lib to use?
Their API is amazing. However, you run into the same problems that you do when you automate MS Office using VBA: instability, and everything is single-threaded. You are basically automating a GUI application.

-Ryan

----- Original Message -----
From: Genty Jean-Paul [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, October 25, 2004 1:17 PM
Subject: Re: Need advice: what Word/Excel/PowerPoint lib to use?

At 19:42 25/10/2004, you wrote:
At 17:05 25/10/2004, you wrote:

Of course POI, for open source. There are some commercial products based on POI also. For Word, consider textmining.org. For XLS, POI does anything you need. For PowerPoint there is one commercial product (it's about $1000), but you can also find some source code in the archives.

And what do you think about using OpenOffice's UNO APIs?

I didn't know about them. Are they implemented in Java?

Yes. Check out http://api.openoffice.org/ . They have good examples, and I can also provide you my small test. You can do some amazing things with their API.

Do they support all MS Office formats (97/2000/XP)?

Check http://www.openoffice.org/product/docs/OOoFlyer11s.pdf

Jean-Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Textmining.org IS NOT POI (was Re: worddoucments search)
Go to http://www.textmining.org for a platform-independent library to extract text from Word documents.

I wrote 99.99% of the Word component of POI and all of the textmining.org library. I have seen several discussions and web pages that point to textmining.org and say I simply wrap POI classes (for example, the JGuru FAQ, http://www.jguru.com). This is totally false.

* The textmining.org library is optimized for extracting text. POI is not.
* The textmining.org library supports extracting text from Word 6/95. POI does not.
* The textmining.org library does not extract deleted text that is still in the document for the purposes of revision marking. POI does not handle this.

-Ryan Ackley

----- Original Message -----
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 7:31 AM
Subject: Re: worddoucments search

Please look at the Apache POI project: http://jakarta.apache.org
Word documents can be extracted using the POI APIs and later indexed.

regards

----- Original Message -----
From: Santosh [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 6:00 PM
Subject: worddoucments search

Can Lucene search Word documents? If so, please give me information about it.

regards
Santosh kumar

---SOFTPRO DISCLAIMER---
Information contained in this E-MAIL and any attachments are confidential being proprietary to SOFTPRO SYSTEMS is 'privileged' and 'confidential'. If you are not an intended or authorised recipient of this E-MAIL or have received it in error, You are notified that any use, copying or dissemination of the information contained in this E-MAIL in any manner whatsoever is strictly prohibited. Please delete it immediately and notify the sender by E-MAIL. In such a case reading, reproducing, printing or further dissemination of this E-MAIL is strictly prohibited and may be unlawful.
SOFTPRO SYSTEMS does not REPRESENT or WARRANT that an attachment hereto is free from computer viruses or other defects. The opinions expressed in this E-MAIL and any ATTACHMENTS may be those of the author and are not necessarily those of SOFTPRO SYSTEMS.
Re: worddoucments search
Otis,

Why didn't you use the textmining.org library? You even asked me to fix a bug for the book, which I did. Also, the code would have been about three lines.

-Ryan

----- Original Message -----
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 7:41 AM
Subject: Re: worddoucments search

For Lucene in Action, Erik and I wrote a little extensible framework for indexing various documents, including MS Word. We used POI, so the solution works on Winblows, UNIX/Linux, and OS X. I think the code is a bit too big for the list, but the book will be out soon. Erik and I are going through copy and tech editing right now.

POI: http://jakarta.apache.org/poi

Otis

--- Don Vaillancourt [EMAIL PROTECTED] wrote:

I could be wrong, but I don't think that there is an indexer for Word documents. There's a Python version of Lucene called Lupy with a Python indexer for all sorts of document types (http://www.methods.co.nz/docindexer/). Would anyone be willing to port those over? However, the MS Word indexer only works on MS Windows, and you may need MS Word for it to work. Man, that's no good. I think that we'd need to ask the OpenOffice people for help on this.

Santosh wrote:

Can Lucene search Word documents? If so, please give me information about it.

regards
Santosh kumar

--
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
phone: 416-815-2000 ext. 245
fax: 416-815-2001
email: [EMAIL PROTECTED]
web: http://www.web-impact.com

This email message is intended only for the addressee(s) and contains information that may be confidential and/or copyright. If you are not the intended recipient please notify the sender by reply email and immediately delete this email. Use, disclosure or reproduction of this email by anyone other than the intended recipient(s) is strictly prohibited. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
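The framework from the book is not shown anywhere in this thread, so the following is only a rough sketch of what "extensible" could mean here: a registry that picks a text extractor by file extension. The names `TextExtractor`, `ExtractorRegistry` and `plainText` are hypothetical and are not taken from Lucene in Action, POI or textmining.org; a Word extractor would be registered under "doc" in the same way as the plain-text one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical contract: turn one document stream into plain text. */
interface TextExtractor {
    String extract(InputStream in) throws IOException;
}

/** Hypothetical registry that chooses an extractor by file extension. */
final class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "doc" (case-insensitive). */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Looks up the extractor for fileName and runs it on the stream. */
    String extract(String fileName, InputStream in) throws IOException {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(extension);
        if (extractor == null) {
            throw new IOException("no extractor registered for " + fileName);
        }
        return extractor.extract(in);
    }

    /** Trivial extractor used for the demo: reads the stream as UTF-8 text. */
    static TextExtractor plainText() {
        return in -> {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        };
    }

    public static void main(String[] args) throws IOException {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("txt", plainText());
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.extract("notes.TXT", in)); // prints hello
    }
}
```

The extracted string is what would then be handed to the indexer as the document's contents field.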
Re: worddoucments search
Code example for the textmining.org library:

    FileInputStream in = new FileInputStream("test.doc");
    WordExtractor extractor = new WordExtractor();
    String str = extractor.extractText(in);

----- Original Message -----
From: Natarajan.T [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 8:11 AM
Subject: RE: worddoucments search

Hi Santhosh,

Try out the code below. (POI.jar should be in your class path.)

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import org.apache.poi.poifs.filesystem.DocumentEntry;
    import org.apache.poi.poifs.filesystem.DocumentInputStream;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;
    import org.apache.poi.util.LittleEndian;

    // Note: findText(...) and WordTextPiece are not included in this post.
    public String getContent(InputStream reader) throws IOException {
        ArrayList text = new ArrayList();
        POIFSFileSystem fsys = new POIFSFileSystem(reader);
        DocumentEntry headerProps = (DocumentEntry) fsys.getRoot().getEntry("WordDocument");
        DocumentInputStream din = fsys.createDocumentInputStream("WordDocument");
        byte[] header = new byte[headerProps.getSize()];
        din.read(header);
        din.close();

        // Get the information we need from the header
        int info = LittleEndian.getShort(header, 0xa);
        boolean useTable1 = (info & 0x200) != 0;

        // Get the location of the piece table
        int complexOffset = LittleEndian.getInt(header, 0x1a2);
        String tableName = null;
        if (useTable1) {
            tableName = "1Table";
        } else {
            tableName = "0Table";
        }
        DocumentEntry table = (DocumentEntry) fsys.getRoot().getEntry(tableName);
        byte[] tableStream = new byte[table.getSize()];
        din = fsys.createDocumentInputStream(tableName);
        din.read(tableStream);
        din.close();
        din = null;
        fsys = null;
        table = null;
        headerProps = null;

        int multiple = findText(tableStream, complexOffset, text);
        StringBuffer sb = new StringBuffer();
        int size = text.size();
        tableStream = null;
        WordTextPiece nextPiece = null;
        int start;
        int length;
        String toStr = "";
        for (int x = 0; x < size; x++) {
            nextPiece = (WordTextPiece) text.get(x);
            start = nextPiece.getStart();
            length = nextPiece.getLength();
            boolean unicode = nextPiece.usesUnicode();
            if (unicode) {
                toStr = new String(header, start, length * multiple, "UTF-16LE");
            } else {
                toStr = new String(header, start, length, "ISO-8859-1");
            }
            // Collect every piece; the code as posted kept only the last one.
            sb.append(toStr);
        }
        reader.close();
        return sb.toString();
    }

Regards,
Natarajan.

-----Original Message-----
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 5:46 PM
To: Lucene Users List
Subject: worddoucments search

Can Lucene search Word documents? If so, please give me information about it.

regards
Santosh kumar
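The last step of getContent, turning one text piece into a string, can be tried out in isolation with nothing but the JDK. This is a minimal sketch and not code from POI or textmining.org: `PieceDecoder` and `decode` are hypothetical names, and it assumes (as the posted code suggests, where the length is scaled by `multiple`) that a Unicode piece takes two bytes per character and a non-Unicode piece takes one.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper mirroring the unicode / non-unicode branch of getContent. */
final class PieceDecoder {

    /**
     * Decodes charCount characters starting at byte offset start.
     * Assumption: Unicode pieces are UTF-16LE (2 bytes per character),
     * all other pieces are single-byte ISO-8859-1.
     */
    static String decode(byte[] stream, int start, int charCount, boolean unicode) {
        return unicode
                ? new String(stream, start, charCount * 2, StandardCharsets.UTF_16LE)
                : new String(stream, start, charCount, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Hi" stored as UTF-16LE, followed by "ok" stored as single bytes.
        byte[] stream = {'H', 0, 'i', 0, 'o', 'k'};
        StringBuilder sb = new StringBuilder();
        sb.append(decode(stream, 0, 2, true));
        sb.append(decode(stream, 4, 2, false));
        System.out.println(sb); // prints Hiok
    }
}
```

Appending each decoded piece to one buffer is what lets the whole document text come back, rather than only the final piece.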
Re: Index MSOffice Documents
Thanks Sergiu,

You should also post to the Lucene Users list.

-Ryan

----- Original Message -----
From: Sergiu Gordea [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: POI Users List [EMAIL PROTECTED]
Sent: Friday, June 25, 2004 8:42 AM
Subject: Index MSOffice Documents

Hi all,

I'm working on a project in which we are building a knowledge management platform. We are using Turbine/Velocity as the framework and Lucene for search. We want the search to be able to index MS Office documents, so I was looking for ways to extract the text from these documents.

I found some examples based on the POI library (http://jakarta.apache.org/poi) and I adapted them to our needs. I think the extraction of text from XLS files is reliable (the POI development community did a great job with the package that works with XLS files). The examples that extract the text from DOC and PPT files are not very general; I think they have problems with documents written in special charsets, but they work well on the documents I use. I hope someone who has more experience than I have will improve this and provide better source code.

Congratulations to all people involved in the development of the Jakarta project and its subprojects,

Sergiu Gordea

PS: ExeConverteImpl uses an external stand-alone application (like antiword or pdf2txt) to extract the text.

    /* @(#) CWK 1.4 07.06.2004
     *
     * Copyright 2003-2005 ConfigWorks Informationssysteme Consulting GmbH
     * Universitätsstr. 94/7 9020 Klagenfurt Austria
     * www.configworks.com
     * All rights reserved.
     */
    package com.configworks.cwk.be.search.converters;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;

    /**
     * Class description
     *
     * @author sergiu
     * @version 1.0
     * @since CWK 1.5
     */
    public class XLSConverterImpl extends JavaDocumentConverter {
        private Log logger = null;
        File dest = null;

        public boolean extractText(InputStream reader, BufferedWriter writer)
                throws FileNotFoundException, IOException {
            HSSFWorkbook workbook = new HSSFWorkbook(reader);
            for (int k = 0; k < workbook.getNumberOfSheets(); k++) {
                HSSFSheet sheet = workbook.getSheetAt(k);
                if (sheet != null) {
                    int rows = sheet.getLastRowNum();
                    // I don't know why the last row = sheet.getRow(rows) and first row = sheet.getRow(0)
                    for (int r = 0; r <= rows; r++) {
                        HSSFRow row = sheet.getRow(r);
                        if (row != null) {
                            int cells = row.getLastCellNum();
                            for (int c = 0; c <= cells; c++) {
                                HSSFCell cell = row.getCell((short) c);
                                String value = null;
                                if (cell != null) {
                                    switch (cell.getCellType()) {
                                        case HSSFCell.CELL_TYPE_FORMULA:
                                            value = cell.getCellFormula();
                                            break;
                                        case HSSFCell.CELL_TYPE_STRING:
                                            value = cell.getStringCellValue();
                                            break;
                                        case HSSFCell.CELL_TYPE_NUMERIC:
                                            value = "" + cell.getNumericCellValue();
                                            break;
                                        default:
                                            value = cell.getStringCellValue();
                                    }
                                }
                                if (value != null) {
                                    // Separate cell values with a space
                                    writer.write(value + " ");
                                }
                            } // cells
                        }
                    } // rows
                }
            } // sheets
            // If no exception was thrown, consider the conversion successful
            return true;
        }

        /**
         * @return Returns the logger.
         */
        public Log getLogger() {
            if (logger == null) logger = LogFactory.getLog(XLSConverterImpl.class);
            return logger;
        }
    }

    package com.configworks.cwk.be.search.converters;

    import com.configworks.cwk.share.Utils;
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.apache.commons.logging.Log;
    import
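One detail in the numeric branch of XLSConverterImpl is worth a small illustration: concatenating a Java double onto an empty string writes 42 as "42.0" and large values in scientific notation, which may not match what a user types into a search box. The sketch below is only one possible way to render such values for indexing; `NumericCellText` and `format` are hypothetical names, and nothing here is part of POI.

```java
import java.math.BigDecimal;

/** Hypothetical helper: renders a numeric cell value as index-friendly text. */
final class NumericCellText {

    /** Returns "42" for 42.0, "1.5" for 1.5, and an empty string for NaN or infinity. */
    static String format(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return "";
        }
        // valueOf uses the double's canonical string form, so 1.5 stays 1.5;
        // toPlainString avoids exponent notation after the zeros are stripped.
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println("" + 42.0);       // prints 42.0
        System.out.println(format(42.0));    // prints 42
        System.out.println(format(1.0E10));  // prints 10000000000
    }
}
```

Whether "42" or "42.0" is the better token depends on how the field is analyzed and queried, so treat this as an option rather than a fix.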
Fw: PowerPoint to Text
I haven't tested this out, but I thought it would be of interest to Lucene users. I may eventually add this to the textmining.org libraries.

-Ryan

----- Original Message -----
From: Koundinya (Sudhakar Chavali) [EMAIL PROTECTED]
To: POI Users List [EMAIL PROTECTED]; Ryan Ackley [EMAIL PROTECTED]
Sent: Friday, March 26, 2004 12:32 AM
Subject: PowerPoint to Text

Hi All,

We have done the initial groundwork for extracting text from PowerPoint. We would like to say thanks to the POI group. Though the base work is rough, we are able to extract the text from PowerPoint. Sorry for the bad programming, but we hope this will help more efficient developers build a good program from this scratch version. Here is the sample. Whenever there are modifications, we will post the information.

    import java.io.*;
    import java.util.*;
    import org.apache.poi.hpsf.*;
    import org.apache.poi.poifs.eventfilesystem.*;
    import org.apache.poi.util.HexDump;
    import org.apache.poi.util.LittleEndian;

    public class PPT2Text {
        public static void main(String[] args) throws IOException {
            final String filename = args[0];
            POIFSReader r = new POIFSReader();
            /* Register a listener for *all* documents. */
            r.registerListener(new MyPOIFSReaderListener());
            r.read(new FileInputStream(filename));
        }

        static class MyPOIFSReaderListener implements POIFSReaderListener {
            // Counter used to name the output files 1.txt, 2.txt, ...
            static int filename = 1;

            public void processPOIFSReaderEvent(POIFSReaderEvent event) {
                PropertySet ps = null;
                try {
                    org.apache.poi.poifs.filesystem.DocumentInputStream dis = null;
                    System.out.println("\n\n");
                    System.out.println(event.getPath() + event.getName());
                    dis = event.getStream();
                    /*
                    byte btoWrite[] = new byte[12];
                    dis.read(btoWrite);
                    System.out.println("Version :" + LittleEndian.getUnsignedByte(btoWrite, 0));
                    System.out.println("Instance :" + LittleEndian.getUShort(btoWrite, 0));
                    System.out.println("Type :" + LittleEndian.getUShort(btoWrite, 2));
                    System.out.println("Len :" + LittleEndian.getLong(btoWrite, 4));
                    */
                    FileOutputStream fos = new FileOutputStream("" + filename + ".txt");
                    byte btoWrite[] = new byte[dis.available()];
                    dis.read(btoWrite, 0, dis.available());
                    for (int i = 0; i < btoWrite.length - 20; i++) {
                        //System.out.println("Version :" + LittleEndian.getUnsignedByte(btoWrite, i + 0));
                        //System.out.println("Instance :" + LittleEndian.getUShort(btoWrite, i + 0));
                        //System.out.println("Type :" + LittleEndian.getUShort(btoWrite, i + 2));
                        //System.out.println("Len :" + LittleEndian.getUInt(btoWrite, i + 4));
                        long type = LittleEndian.getUShort(btoWrite, i + 2);
                        long size = LittleEndian.getUInt(btoWrite, i + 4);
                        // Record type 4008 holds the text bytes
                        if (type == 4008) {
                            fos.write(btoWrite, i + 4 + 1, (int) size + 3);
                        }
                    }
                    // Close the output file (missing in the code as posted)
                    fos.close();
                    filename++;
                    //System.out.println(event.getStream().toString());
                    //ps = PropertySetFactory.create(event.getStream());
                } catch (Exception ex) {
                    //System.out.println("No property set stream: \"" + event.getPath() +
                    //    event.getName() + "\"");
                    System.out.println(ex);
                    return;
                }
            }
        }
    }

thanks,
Sudhakar

=
"No one can earn a million dollars honestly." - William Jennings Bryan (1860-1925)
"Make everything as simple as possible, but not simpler." - Albert Einstein (1879-1955)
"It is dangerous to be sincere unless you are also stupid." - George Bernard Shaw (1856-1950)
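The scanning loop in PPT2Text can be reproduced on a plain byte array without POI, which makes it easier to experiment with. This is a hedged sketch of the same heuristic, not a real PowerPoint parser: it assumes, as the posted code does, an 8-byte record header with a little-endian type at offset 2 and a little-endian length at offset 4, and it assumes that type 4008 marks single-byte text (I believe this is the TextBytesAtom record, but that name does not appear in the post). `PptTextScan` and `scan` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical byte-level scanner for text records, following the PPT2Text heuristic. */
final class PptTextScan {
    static final int TEXT_BYTES_TYPE = 4008; // record type tested in the original post
    static final int HEADER_SIZE = 8;        // 2 bytes version/instance, 2 bytes type, 4 bytes length

    /** Returns the payload of every plausible type-4008 record found in the stream. */
    static List<String> scan(byte[] stream) {
        List<String> found = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        while (i + HEADER_SIZE <= stream.length) {
            int type = buf.getShort(i + 2) & 0xffff;
            long length = buf.getInt(i + 4) & 0xffffffffL;
            boolean fits = length > 0 && length <= stream.length - i - HEADER_SIZE;
            if (type == TEXT_BYTES_TYPE && fits) {
                // The payload starts right after the 8-byte header.
                found.add(new String(stream, i + HEADER_SIZE, (int) length, StandardCharsets.ISO_8859_1));
                i += HEADER_SIZE + (int) length; // skip past the record we just consumed
            } else {
                i++; // no record here, slide forward one byte like the original loop
            }
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] sample = {
            0x12, 0x34,                      // unrelated bytes
            0x00, 0x00, (byte) 0xA8, 0x0F,   // version/instance, type 4008 (0x0FA8)
            0x05, 0x00, 0x00, 0x00,          // payload length 5
            'H', 'e', 'l', 'l', 'o',
            0x7f
        };
        System.out.println(scan(sample)); // prints [Hello]
    }
}
```

Because it slides byte by byte instead of walking the record tree, this can report false matches on data that merely looks like a header, which is the same limitation the original sample has.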
New Word Document text extractor released
Version 0.4 of the TextMining.org text extraction library has been released!

I have finally gotten around to releasing a new version of the textmining.org text extractor. This is a pure Java library for extracting text from Word 6.0/97/2000/XP/2003.

Some highlights from this release:

- I removed support for PDF documents. I was only wrapping the excellent PDFBox (http://www.pdfbox.org) library with a few lines of code.
- I added support for Word 6.0 documents.
- The extractor will no longer extract text that has been deleted but is still in the document because of revision tracking.
- I added two exceptions, PasswordProtectedException and FastSavedException, for more graceful failures.
- Fixed bugs.
- Updated the license to Apache 2.0.

A special thanks to BeeText Inc. (http://www.beetext.com). They are a software company that is on the cutting edge of software development for translation professionals. Besides that, they sponsored all of the above changes. Remember... support companies that support open source!

-Ryan Ackley
Re: Word Documents
I have written a library located at http://textmining.org that will extract text from Word documents. I am the author of the Word library in POI, by the way. This is just a lightweight version because I got sick of everyone asking how to extract text from a Word document.

If it doesn't work, it's because the document is *not* from Word 97 or later, or the file was fast-saved. Every time somebody has problems they send me their files, and they turn out to be RTF or Word 95 documents. You can check the format by opening the file in Word and then going to Save As. The format of the document will be shown in the "Save as type" dropdown. At least it is in my version of Word.

-Ryan

----- Original Message -----
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 9:19 AM
Subject: RE: Word Documents

Hi,

we had some problems using the POI Word filter. In one document set, everything would work fine; in another, more than 50% of the documents refused to work with it (they were not indexed). I am not an OLE2 pro and cannot see any apparent difference in the documents between the different sets. The version used was Word 97 in almost all the docs. For the moment, I have switched to a native converter (which does not process metadata and must be run using Runtime.exec(), though) until I have time to revisit the problem.

I do not want to advise against the POI filters; it's a very cool idea. Please do try your particular document set with it. For a quick test, you can use the Docco personal search tool by Peter Becker and colleagues (available from SourceForge). It has a current version of POI included as a plugin and Lucene running as the indexing backend. So you don't have to write code to get answers...

Cheers,
gregor

-----Original Message-----
From: Pleasant, Tracy [mailto:[EMAIL PROTECTED]
Sent: Monday, December 15, 2003 2:58 PM
To: Lucene Users List
Subject: Word Documents

As a spinoff, I was wondering if anyone has been happy with indexing and searching Word docs. What about reading the contents? Any problems?

-----Original Message-----
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:59 PM
To: Zhou, Oliver; Lucene Users List
Subject: Re: textmining: document title

Check out Jakarta POI (http://jakarta.apache.org/poi), particularly the HPSF API. It allows you to extract metadata like Title, Author, etc. from OLE documents.

-Ryan

----- Original Message -----
From: Zhou, Oliver [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:26 PM
Subject: textmining: document title

Ryan,

I'm using textmining and Lucene to index Word documents but don't know how to get the Word document title. Your advice on this matter is appreciated.

Thanks,
Oliver Zhou
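Since files named .doc so often turn out to be RTF, a cheap check of the first few bytes can sort them out before any extractor is called. The sketch below is an assumption-laden illustration rather than part of textmining.org or POI: `DocSniffer` is a hypothetical name, and it relies on two well-known signatures, the OLE2 compound-file header bytes and the "{\rtf" prefix of RTF files. As far as I know Word 6/95 files are also OLE2 containers, so this check cannot tell Word 95 from Word 97; it only separates RTF and other non-OLE2 files.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: classifies a file by the signature in its first bytes. */
final class DocSniffer {
    enum Kind { OLE2, RTF, UNKNOWN }

    // Signature at the start of OLE2 compound files (the container used by binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF files are text and begin with {\rtf
    private static final byte[] RTF_MAGIC = {'{', '\\', 'r', 't', 'f'};

    /** Classifies the leading bytes of a file; pass at least the first 8 bytes. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, RTF_MAGIC)) return Kind.RTF;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(rtf)); // prints RTF
    }
}
```

An indexer could log and skip anything that is not OLE2 instead of letting the Word extractor throw on it.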
Re: textmining: document title
Check out Jakarta POI (http://jakarta.apache.org/poi), particularly the HPSF API. It allows you to extract metadata like Title, Author, etc. from OLE documents.

-Ryan

----- Original Message -----
From: Zhou, Oliver [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:26 PM
Subject: textmining: document title

Ryan,

I'm using textmining and Lucene to index Word documents but don't know how to get the Word document title. Your advice on this matter is appreciated.

Thanks,
Oliver Zhou
Re: Exotic format indexing?
Finally, a while back, somebody on this list mentioned quite a different approach: simply read the raw binary document and go fishing for what looks like text. I would like to try that :)

I have tried that approach and it works OK. You end up with a bunch of junk in with the useful stuff. It can clutter up your index and make searching slower. There are also a lot of file formats that don't store all of the text as sequential text, so it won't work for them. PDF is one; I know that PowerPoint is another.
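For anyone who wants to try the "go fishing" approach, here is a minimal sketch in plain Java, similar in spirit to the Unix strings tool. `RawTextFisher` and `fish` are hypothetical names, and the rule used (keep runs of at least minLen printable ASCII bytes) is just one simple choice. The sample also shows one of the weaknesses: text stored as two bytes per character is broken up by the zero bytes and is missed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pulls runs of printable ASCII out of arbitrary bytes. */
final class RawTextFisher {

    /** Returns every run of at least minLen printable ASCII bytes, in file order. */
    static List<String> fish(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        int runStart = -1; // -1 means "not currently inside a run"
        for (int i = 0; i <= data.length; i++) {
            // Bytes above 0x7e are negative as Java bytes and fail the first test.
            boolean printable = i < data.length && data[i] >= 0x20 && data[i] <= 0x7e;
            if (printable) {
                if (runStart < 0) runStart = i;
            } else if (runStart >= 0) {
                if (i - runStart >= minLen) {
                    runs.add(new String(data, runStart, i - runStart, StandardCharsets.US_ASCII));
                }
                runStart = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // "Hello" as single bytes, a short junk run "xy", then "Word" as UTF-16LE.
        byte[] sample = {0x00, 0x01, 'H', 'e', 'l', 'l', 'o', 0x00, 'x', 'y', (byte) 0xff,
                         'W', 0, 'o', 0, 'r', 0, 'd', 0};
        System.out.println(fish(sample, 4)); // prints [Hello]
    }
}
```

Lowering minLen finds more real text but also lets more junk such as "xy" into the index, which is the trade-off described above.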
Re: parser
XLS is handled by POI, another Jakarta project.

----- Original Message -----
From: Daniel Hunziker [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 20, 2003 10:48 PM
Subject: parser

Are there any parsers for the following formats?
- doc
- xls
- ppt
- pdf

Thanks for help
Daniel
Re: my experiences - Re: Parsing Word Docs
David,

The textmining.org stuff only works on Word 97 and above. It should work with no exceptions on any Word 97 doc. If you have any problems, then the file is from an earlier version (most likely Word 6.0) or it's not a Word document. If this isn't the case, you need to email me so I can fix it and make it better for the benefit of everyone. I plan on adding support for Word 6 in the future.

Ryan Ackley

----- Original Message -----
From: David Spencer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs

FYI, I tried the textmining.org/POI combo on a collection of 350 Word docs people have developed here over the years, and it failed on 33% of them with exceptions being thrown about the formats being invalid.

I tried antiword (http://www.winfield.demon.nl/), a free native *.exe, and it worked great (well, it seemed to process all the files fine).

I've had similar experiences with PDF. I tried the 3 or so freeware/Java PDF text extractors and they were not as good as the exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/).

Not satisfying to a Java developer, but these work better than anything else I can find. You get source, and I use them on Windows and Linux, no problem.

Eric Anderson wrote:

I'm interested in using the textmining/textextraction utilities using Apache POI that Ryan was discussing. However, I'm having some difficulty determining what the insertion point would be to replace the default parser with the Word parser. Any assistance would be appreciated.

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget
Re: my experiences - Re: Parsing Word Docs
Eric,

The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code. If you link your Java code with native code, you have lost one of the biggest benefits of Java: platform independence.

I would suggest you use the library at http://textmining.org. Contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have literally indexed hundreds of thousands of unique documents using my library.

Ryan Ackley

----- Original Message -----
From: Eric Anderson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs

OK. Thanks for the tip. I downloaded and compiled Antiword, and would now like to add it to my indexing class. However, I'm not sure how the application would be called, and from where it would be called. How do I have the class pass the document through Antiword to create the keyword index while leaving the DOC intact, as Mr. Litchfield did with PDFBox?

Your assistance is greatly appreciated.

Eric Anderson
815-505-6132

Quoting David Spencer [EMAIL PROTECTED]:

FYI, I tried the textmining.org/POI combo on a collection of 350 Word docs people have developed here over the years, and it failed on 33% of them with exceptions being thrown about the formats being invalid.

I tried antiword (http://www.winfield.demon.nl/), a free native *.exe, and it worked great (well, it seemed to process all the files fine).

I've had similar experiences with PDF. I tried the 3 or so freeware/Java PDF text extractors and they were not as good as the exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/).

Not satisfying to a Java developer, but these work better than anything else I can find. You get source, and I use them on Windows and Linux, no problem.

Eric Anderson wrote:

I'm interested in using the textmining/textextraction utilities using Apache POI that Ryan was discussing. However, I'm having some difficulty determining what the insertion point would be to replace the default parser with the Word parser. Any assistance would be appreciated.

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget
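On Eric's question of how the application would be called: another message in this archive mentions running a native converter through Runtime.exec(), which launches it as a separate process rather than linking it in through JNI. Below is a minimal sketch of that route using ProcessBuilder. `ExternalConverter` and `run` are hypothetical names; the demo assumes an `antiword` binary on the PATH that writes the extracted text to standard output when given a file name, and the UTF-8 charset is a guess that depends on how the converter is configured. The source document is only read by the converter, so it stays intact.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: runs an external converter and captures what it prints. */
final class ExternalConverter {

    /** Runs the command, returns its standard output, and fails on a non-zero exit code. */
    static String run(List<String> command, Charset charset) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the tool's error messages to our own stderr so they never end up in the index.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        process.getOutputStream().close(); // the converter gets no input on stdin

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = stdout.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("converter exited with status " + exit + ": " + command);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: antiword is installed and args[0] is the path of a .doc file.
        String text = run(Arrays.asList("antiword", args[0]), StandardCharsets.UTF_8);
        System.out.println(text);
    }
}
```

The returned string can be added to the index like any other extracted text. The cost is the one Ryan points out: the indexer now depends on a platform-specific binary being installed.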
Re: Word doc parser
Go to http://www.textmining.org

----- Original Message -----
From: Pinky Iyer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 28, 2003 3:44 PM
Subject: Word doc parser

Does anybody know of a good Word document parser? Thanks!

P Iyer
THIS IS HOW YOU INDEX WORD DOCUMENTS
I wrote the Apache POI HDF (Word library) stuff. I also wrote a light version that just does text extraction. You can download it at http://www.textmining.org.

Ryan Ackley
Re: How to index a Word document
POI is correct, but it uses OLE. If your application is running under Unix, POI is incorrect...

This isn't true. POI is written in 100% pure Java and will work on any platform that supports Java. It uses no native libraries.