Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Ryan Ackley
Their API is amazing. However, you run into the same problems that you do
when you automate MS Office using VBA: instability, and everything is
single-threaded. You are basically automating a GUI application.
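
One common way to live with a single-threaded automation backend is to funnel every call through one worker thread. The following is only a sketch of that idea using the JDK; SerializedConverter is a hypothetical name, and no UNO or Office calls are shown because the job passed in is whatever conversion code you already have.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: serializes all conversion jobs onto a single thread.
final class SerializedConverter {
    // One worker thread, so at most one job talks to the office process at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a job; jobs run one after another in submission order. */
    <T> Future<T> submit(Callable<T> job) {
        return worker.submit(job);
    }

    /** Stops accepting new jobs; already queued jobs still run. */
    void shutdown() {
        worker.shutdown();
    }
}
```

Callers on any thread can submit work and wait on the returned Future. This does nothing for the stability problem; it only keeps concurrent callers from hitting the backend at the same time.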

-Ryan
- Original Message - 
From: Genty Jean-Paul [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, October 25, 2004 1:17 PM
Subject: Re: Need advice: what Word/Excel/PowerPoint lib to use?


At 19:42 25/10/2004, you wrote:
At 17:05 25/10/2004, you wrote:
POI, of course, for open source.
There are also some commercial products based on POI.
For Word, consider textmining.org.
For XLS, POI does anything you need.
For PowerPoint there is one commercial product (about $1000), but you can
also find some source code in the archives.

 And what do you think about using OpenOffice's UNO APIs?
I didn't know about them. Are they implemented in Java?
Yes.
 Check out http://api.openoffice.org/ . They have good examples, and I can
also provide you with my small test.
 You can do some amazing things with their API.

Do they support all MSOffice formats (97/2000/XP)?
Check http://www.openoffice.org/product/docs/OOoFlyer11s.pdf
Jean-Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Textmining.org IS NOT POI (was Re: worddoucments search)

2004-08-24 Thread Ryan Ackley
Go to http://www.textmining.org for a platform independent library to
extract text from Word documents. I wrote 99.99% of the Word component of
POI and all of the textmining.org library.

 I have seen several discussions and web pages that point to textmining.org
and say I simply wrap POI classes (for example, the JGuru FAQ at
http://www.jguru.com ). This is totally false.

* The textmining.org library is optimized for extracting text. POI is not.
* The textmining.org library supports extracting text from Word 6/95. POI
does not.
* The textmining.org library does not extract deleted text that is still in
the document for the purposes of revision marking. POI does not handle this.

-Ryan Ackley

- Original Message - 
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 7:31 AM
Subject: Re: worddoucments search


 please look at Apache POI project.
 http://jakarta.apache.org

 Words documents can be extracted using POI apis and later can be indexed.

 regards

 - Original Message - 
 From: Santosh [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, August 24, 2004 6:00 PM
 Subject: worddoucments search


 Can Lucene search Word documents? If so, please give me
 information about it.

 regards
 Santosh kumar


 ---SOFTPRO DISCLAIMER--

 Information contained in this E-MAIL and any attachments are
 confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
 and 'confidential'.

 If you are not an intended or authorised recipient of this E-MAIL or
 have received it in error, You are notified that any use, copying or
 dissemination  of the information contained in this E-MAIL in any
 manner whatsoever is strictly prohibited. Please delete it immediately
 and notify the sender by E-MAIL.

 In such a case reading, reproducing, printing or further dissemination
 of this E-MAIL is strictly prohibited and may be unlawful.

 SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
 hereto is free from computer viruses or other defects.

 The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
 those of the author and are not necessarily those of SOFTPRO SYSTEMS.
 






Re: worddoucments search

2004-08-24 Thread Ryan Ackley
Otis,

Why didn't you use the textmining.org library? You even asked me to fix a
bug for the book, which I did. Also, the code would have been about three
lines.

-Ryan

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 7:41 AM
Subject: Re: worddoucments search


 For Lucene in Action, Erik and I wrote a little extensible framework for
 indexing various documents, including MS Word.  We used POI, so the
 solution works on Winblows, UNIX/Linux, and OS X.  I think the code is a
 bit too big for the list, but the book will be out soon.  Erik and I
 are going through copy and tech editing right now.  POI:
 http://jakarta.apache.org/poi .
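
 To give a rough idea of what an "extensible framework" for this can look like, here is a minimal sketch. It is not the code from the book: ExtractorRegistry and TextExtractor are hypothetical names, and the real extraction (POI, textmining.org, and so on) would live behind the interface.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: maps a file extension to the code that extracts its text.
final class ExtractorRegistry {
    /** Hypothetical callback: turns raw file bytes into plain text. */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "doc" or "xls". */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extractor for a file name, or null when the extension is unknown. */
    TextExtractor forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return byExtension.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

 Supporting a new format then means registering one more extractor; the indexing loop itself does not change.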

 Otis


 --- Don Vaillancourt [EMAIL PROTECTED] wrote:

  I could be wrong, but I don't think that there is an indexer for Word
  documents.

  There's a Python version of Lucene called Lupy with a Python indexer for
  all sorts of document types (http://www.methods.co.nz/docindexer/).
  Would anyone be willing to port those over?  Although the MSWord indexer
  only works on MS Windows and you may need MSWord for it to work.  Man,
  that's no good.

  I think that we'd need to ask the OpenOffice people for help on this.
 
 
  Santosh wrote:
 
  Can Lucene search Word documents? If so, please give me
  information about it.
  
  regards
  Santosh kumar
  
  
 
 
  
  
  
 
 
  -- 
  *Don Vaillancourt
  Director of Software Development
  *
  *WEB IMPACT INC.*
  phone: 416-815-2000 ext. 245
  fax: 416-815-2001
  email: [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
  web: http://www.web-impact.com
 
 
 
  / This email message is intended only for the addressee(s)
  and contains information that may be confidential and/or
  copyright. If you are not the intended recipient please
  notify the sender by reply email and immediately delete
  this email. Use, disclosure or reproduction of this email
  by anyone other than the intended recipient(s) is strictly
  prohibited. No representation is made that this email or
  any attachments are free of viruses. Virus scanning is
  recommended and is the responsibility of the recipient.
  /
  



Re: worddoucments search

2004-08-24 Thread Ryan Ackley
Code example for textmining.org library:

FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();

String str = extractor.extractText(in);


- Original Message - 
From: Natarajan.T [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 8:11 AM
Subject: RE: worddoucments search


 Hi Santhosh,
 
 Try out the code below (POI.jar should be on your classpath).
 
 
 public String getContent(InputStream reader) throws IOException {
     // Note: findText() and WordTextPiece are not shown in this post.
     ArrayList text = new ArrayList();
     POIFSFileSystem fsys = new POIFSFileSystem(reader);

     DocumentEntry headerProps =
         (DocumentEntry) fsys.getRoot().getEntry("WordDocument");
     DocumentInputStream din =
         fsys.createDocumentInputStream("WordDocument");
     byte[] header = new byte[headerProps.getSize()];

     din.read(header);
     din.close();

     // Get the information we need from the header
     int info = LittleEndian.getShort(header, 0xa);
     boolean useTable1 = (info & 0x200) != 0;

     // Get the location of the piece table
     int complexOffset = LittleEndian.getInt(header, 0x1a2);

     String tableName = null;
     if (useTable1) {
         tableName = "1Table";
     }
     else {
         tableName = "0Table";
     }

     DocumentEntry table =
         (DocumentEntry) fsys.getRoot().getEntry(tableName);
     byte[] tableStream = new byte[table.getSize()];
     din = fsys.createDocumentInputStream(tableName);
     din.read(tableStream);
     din.close();

     din = null;
     fsys = null;
     table = null;
     headerProps = null;

     int multiple = findText(tableStream, complexOffset, text);

     StringBuffer sb = new StringBuffer();
     int size = text.size();
     tableStream = null;

     WordTextPiece nextPiece = null;
     int start;
     int length;
     String toStr;
     for (int x = 0; x < size; x++) {
         nextPiece = (WordTextPiece) text.get(x);
         start = nextPiece.getStart();
         length = nextPiece.getLength();

         // Decode the piece with the charset it was stored in
         boolean unicode = nextPiece.usesUnicode();
         if (unicode) {
             toStr = new String(header, start, length * multiple, "UTF-16LE");
         }
         else {
             toStr = new String(header, start, length, "ISO-8859-1");
         }
         // Collect the text of every piece
         sb.append(toStr).append(" ");
     }

     reader.close();
     return sb.toString();
 }
 
 
 Regards,
 Natarajan.
 
 
 
 -Original Message-
 From: Santosh [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, August 24, 2004 5:46 PM
 To: Lucene Users List
 Subject: worddoucments search
 
 Can Lucene search Word documents? If so, please give me
 information about it.
 
 regards
 Santosh kumar
 
 



Re: Index MSOffice Documents

2004-06-25 Thread Ryan Ackley
Thanks Sergiu,

You should also post to the Lucene Users list.

-Ryan

- Original Message - 
From: Sergiu Gordea [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED];
[EMAIL PROTECTED]
Cc: POI Users List [EMAIL PROTECTED]
Sent: Friday, June 25, 2004 8:42 AM
Subject: Index MSOffice Documents


 Hi all,

  I'm working on a project in which we are building a knowledge
 management platform. We are using Turbine/Velocity
 as the framework and Lucene for search.

  We want the search to be able to index MSOffice documents,
 so I was looking for ways to extract the text from these
 documents. I found some examples based on the POI library
 (http://jakarta.apache.org/poi) and adapted them to our needs.
 I think the extraction of text from XLS files is reliable
 (the POI development community did a great job with the package that
 works with XLS files). The examples that extract text from DOC and
 PPT files are not very general; I think they have problems with
 documents written in special charsets, but they work well on the
 documents I use. I hope someone with more experience than I have
 will improve this and post better source code.

  Congratulations to all the people involved in the development of the
 Jakarta project and its subprojects,

  Sergiu Gordea

 PS: ExeConverteImpl uses an external stand-alone application (like
 antiword or pdf2txt) to extract the text.







 /* @(#) CWK 1.4 07.06.2004
  *
  * Copyright 2003-2005 ConfigWorks Informationssysteme  Consulting GmbH
  * Universitätsstr. 94/7 9020 Klagenfurt Austria
  * www.configworks.com
  * All rights reserved.
  */

 package com.configworks.cwk.be.search.converters;

 import java.io.BufferedWriter;
 import java.io.File;
 import java.io.FileNotFoundException;
 import java.io.IOException;
 import java.io.InputStream;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.poi.hssf.usermodel.HSSFCell;
 import org.apache.poi.hssf.usermodel.HSSFRow;
 import org.apache.poi.hssf.usermodel.HSSFSheet;
 import org.apache.poi.hssf.usermodel.HSSFWorkbook;

 /**
  * Class description
  *
  * @author sergiu
  * @version 1.0
  * @since CWK 1.5
  */
 public class XLSConverterImpl extends JavaDocumentConverter {

 private Log logger = null;
 File dest = null;



     public boolean extractText(InputStream reader, BufferedWriter writer)
             throws FileNotFoundException, IOException {

         HSSFWorkbook workbook = new HSSFWorkbook(reader);

         for (int k = 0; k < workbook.getNumberOfSheets(); k++) {
             HSSFSheet sheet = workbook.getSheetAt(k);

             if (sheet != null) {
                 int rows = sheet.getLastRowNum();
                 // getLastRowNum() is a 0-based index: the last row is
                 // sheet.getRow(rows) and the first row is sheet.getRow(0)
                 for (int r = 0; r <= rows; r++) {
                     HSSFRow row = sheet.getRow(r);
                     if (row != null) {
                         int cells = row.getLastCellNum();
                         for (int c = 0; c <= cells; c++) {
                             HSSFCell cell = row.getCell((short) c);
                             String value = null;
                             if (cell != null) {
                                 switch (cell.getCellType()) {
                                     case HSSFCell.CELL_TYPE_FORMULA:
                                         value = cell.getCellFormula();
                                         break;
                                     case HSSFCell.CELL_TYPE_STRING:
                                         value = cell.getStringCellValue();
                                         break;
                                     case HSSFCell.CELL_TYPE_NUMERIC:
                                         value = "" + cell.getNumericCellValue();
                                         break;
                                     default:
                                         value = cell.getStringCellValue();
                                 }
                             }
                             if (value != null) {
                                 // Separate cell values with a space
                                 writer.write(value + " ");
                             }
                         } // cells
                     }
                 } // rows
             }
         } // sheets

         // If no exception was thrown, consider the conversion successful
         return true;
     }

 /**
  * @return Returns the logger.
  */
 public Log getLogger() {
 if (logger == null)
 logger = LogFactory.getLog(XLSConverterImpl.class);
 return logger;
 }

 }









 package com.configworks.cwk.be.search.converters;

 import com.configworks.cwk.share.Utils;
 import java.io.BufferedInputStream;
 import java.io.File;
 import java.io.FileReader;
 import java.io.IOException;
 import java.io.Reader;
 import org.apache.commons.logging.Log;
 import 

Fw: PowerPoint to Text

2004-03-26 Thread Ryan Ackley

I haven't tested this out but I thought this would be of interest to Lucene
users. I may eventually add this to the textmining.org libraries.

-Ryan

- Original Message - 
From: Koundinya (Sudhakar Chavali) [EMAIL PROTECTED]
To: POI Users List [EMAIL PROTECTED]; Ryan Ackley
[EMAIL PROTECTED]
Sent: Friday, March 26, 2004 12:32 AM
Subject: PowerPoint to Text


 Hi All,

 We have done the initial groundwork for extracting text from PowerPoint.
 We would like to thank the POI group. Though the base
 work is rough, we are able to extract the text from PowerPoint.

 Sorry for the bad programming, but we hope this will help efficient
 developers build a good program from this scratch version.


 Here is the sample. Whenever there are modifications, we will
 post the information.


 import java.io.*;
 import java.util.*;
 import org.apache.poi.hpsf.*;
 import org.apache.poi.poifs.eventfilesystem.*;
 import org.apache.poi.util.HexDump;
 import org.apache.poi.util.LittleEndian;

 public class PPT2Text
 {
     public static void main(String[] args)
         throws IOException
     {
         final String filename = args[0];
         POIFSReader r = new POIFSReader();

         /* Register a listener for *all* documents. */
         r.registerListener(new MyPOIFSReaderListener());
         r.read(new FileInputStream(filename));
     }

     static class MyPOIFSReaderListener implements POIFSReaderListener
     {
         // Counter used to name the output files: 1.txt, 2.txt, ...
         static int filename = 1;

         public void processPOIFSReaderEvent(POIFSReaderEvent event)
         {
             PropertySet ps = null;
             FileOutputStream fos = null;

             try
             {
                 org.apache.poi.poifs.filesystem.DocumentInputStream dis = null;

                 System.out.println("\n\n");
                 System.out.println(event.getPath() + event.getName());
                 dis = event.getStream();
                 /*
                 byte btoWrite[] = new byte[12];

                 dis.read(btoWrite);

                 System.out.println("Version :" + LittleEndian.getUnsignedByte(btoWrite, 0));
                 System.out.println("Instance :" + LittleEndian.getUShort(btoWrite, 0));
                 System.out.println("Type :" + LittleEndian.getUShort(btoWrite, 2));
                 System.out.println("Len :" + LittleEndian.getLong(btoWrite, 4));
                 */

                 fos = new FileOutputStream("" + filename + ".txt");

                 byte btoWrite[] = new byte[dis.available()];
                 dis.read(btoWrite, 0, dis.available());
                 for (int i = 0; i < btoWrite.length - 20; i++)
                 {
                     //System.out.println("Version :" + LittleEndian.getUnsignedByte(btoWrite, i + 0));
                     //System.out.println("Instance :" + LittleEndian.getUShort(btoWrite, i + 0));
                     //System.out.println("Type :" + LittleEndian.getUShort(btoWrite, i + 2));
                     //System.out.println("Len :" + LittleEndian.getUInt(btoWrite, i + 4));

                     long type = LittleEndian.getUShort(btoWrite, i + 2);
                     long size = LittleEndian.getUInt(btoWrite, i + 4);
                     // Record type 4008 is a TextBytesAtom, which holds text
                     if (type == 4008)
                     {
                         fos.write(btoWrite, i + 4 + 1, (int) size + 3);
                     }
                 }

                 filename++;
                 //System.out.println(event.getStream().toString());
                 //ps = PropertySetFactory.create(event.getStream());
             }
             catch (Exception ex)
             {
                 //System.out.println("No property set stream: \"" +
                 //    event.getPath() + event.getName() + "\"");
                 System.out.println(ex);
                 return;
             }
             finally
             {
                 // Always release the output file, even after an error
                 if (fos != null)
                 {
                     try { fos.close(); } catch (IOException ignored) { }
                 }
             }
         }
     }
 }






 thanks,
 Sudhakar







New Word Document text extractor released

2004-03-03 Thread Ryan Ackley
Version 0.4 of the TextMining.org text extraction library has been released!

I have finally gotten around to releasing a new version of the
textmining.org text extractor. This is a pure java library for extracting
text from Word 6.0/97/2000/XP/2003.

Some highlights from this release:

-I removed support for PDF documents. I was only wrapping the excellent
PDFBox (http://www.pdfbox.org) library with a few lines of code.
-I added support for Word 6.0 documents.
-The extractor will no longer extract text that has been deleted but is
still in the document because of revision tracking
-I added two exceptions, PasswordProtectedException and FastSavedException,
for more graceful failures.
-Fixed bugs
-Updated the license to Apache 2.0

A special thanks to BeeText Inc. (http://www.beetext.com)  They are a
software company that is on the cutting edge of software development for
translation professionals. Besides that, they sponsored all of the above
changes. Remember...support companies that support open source!

-Ryan Ackley





Re: Word Documents

2003-12-15 Thread Ryan Ackley
I have written a library located at http://textmining.org that will extract
text from Word documents. I am the author of the Word library in POI, by the
way. This is just a lightweight version, because I got sick of everyone asking
how to extract text from a Word document. If it doesn't work, it's because the
document is *not* from Word 97 or later, or the file was fast-saved.
Every time somebody has problems they send me their files, and they turn out
to be RTF or Word 95 documents. You can check the format by opening the file
in Word and then going to Save As. The format of the document will be in the
Save as Type dropdown. At least it is in my version of Word.
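
If you would rather check a batch of files without opening each one in Word, a quick look at the first few bytes catches the RTF case. This is only a sketch with hypothetical names (DocSniffer, sniff); it relies on two well-known signatures: OLE2 compound files start with D0 CF 11 E0 A1 B1 1A E1, and RTF files start with "{\rtf". Word 95 files normally use the same OLE2 container as Word 97, so an "OLE2" result does not by itself prove the file is Word 97 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: classifies a file by its leading bytes.
final class DocSniffer {
    // Signature that opens every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF files are plain text that begins with "{\rtf".
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Classifies the first bytes of a file as "OLE2", "RTF", or "UNKNOWN". */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, RTF_MAGIC)) {
            return "RTF";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Read the first eight or so bytes of each file and pass them to sniff(). Anything reported as "RTF" or "UNKNOWN" is not a native Word 97+ binary document, whatever its extension says.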

-Ryan

- Original Message - 
From: Gregor Heinrich [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 15, 2003 9:19 AM
Subject: RE: Word Documents


 Hi,

 we had some problems using the POI Word filter. In one document set,
 everything would work fine; in another, more than 50% of the documents
 refused to work with it (they were not indexed). I am not an OLE2 pro and
 cannot see any apparent difference in the documents between the different
 sets. The version used was Word 97 in almost all the docs. For the moment,
 I have switched to a native converter (which does not process metadata and
 must be run using Runtime.exec(), though) until I have time to revisit the
 problem.
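
 For anyone wondering what running a native converter from Java involves, here is a minimal sketch. It assumes a converter such as antiword is on the PATH, takes the document path as its only argument, and writes the text to standard output as UTF-8; ExternalConverter and its method names are made up for illustration. It uses ProcessBuilder, the newer JDK alternative to Runtime.exec().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around a command-line text converter.
final class ExternalConverter {
    /** Builds the command line; "antiword" being on the PATH is an assumption. */
    static List<String> command(String docPath) {
        return Arrays.asList("antiword", docPath);
    }

    /** Reads a stream to the end and decodes it as UTF-8 (an assumed encoding). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Runs the converter and returns its output, or throws on a non-zero exit code. */
    static String extractText(String docPath) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command(docPath));
        // Merge stderr into stdout so the child cannot block on an unread pipe.
        pb.redirectErrorStream(true);
        Process p = pb.start();
        String output;
        try (InputStream out = p.getInputStream()) {
            output = readAll(out);
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("converter exited with code " + exit + ": " + output);
        }
        return output;
    }
}
```

 The returned string can be handed to the indexer like any other extracted text. The source document is only read by the converter, so it stays untouched on disk.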

 I do not want to discourage use of the POI filters; it's a very cool idea.
 Please do try your particular document set with it. For a quick test, you
 can use the Docco personal search tool by Peter Becker and colleagues
 (available from SourceForge). It has a current version of POI included as a
 plugin and Lucene running as the indexing backend. So you don't have to write
 code to get answers...

 Cheers, gregor

 -Original Message-
 From: Pleasant, Tracy [mailto:[EMAIL PROTECTED]
 Sent: Monday, December 15, 2003 2:58 PM
 To: Lucene Users List
 Subject: Word Documents


 As a spinoff, I was wondering if anyone has been happy with indexing and
 searching Word docs. What about reading the contents? Any problems?


 -Original Message-
 From: Ryan Ackley [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 12, 2003 5:59 PM
 To: Zhou, Oliver; Lucene Users List
 Subject: Re: textmining: document title


 Check out Jakarta POI (http://jakarta.apache.org/poi ), particularly the
 HPSF API. It allows you to extract metadata like Title, Author, etc. from
 OLE documents.

 -Ryan

 - Original Message - 
 From: Zhou, Oliver [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Friday, December 12, 2003 5:26 PM
 Subject: textmining: document title


  Ryan,
 
  I'm using textmining and Lucene to index Word documents but don't know
  how to get the Word document title.  Your advice on this matter is
  appreciated.
 
  Thanks,
  Oliver Zhou
 
 





Re: textmining: document title

2003-12-12 Thread Ryan Ackley
Check out jakarta POI (http://jakarta.apache.org/poi ) particularly the HPSF
API. It allows you to extract metadata like Title, Author, etc. from OLE
documents.

-Ryan

- Original Message - 
From: Zhou, Oliver [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, December 12, 2003 5:26 PM
Subject: textmining: document title


 Ryan,

 I'm using textmining and lucene to index word documents but don't know how
 to get word document title.  Your advice on this matter is appreciated.

 Thanks,
 Oliver Zhou







Re: Exotic format indexing?

2003-10-30 Thread Ryan Ackley
 Finally, a while back, somebody on this list mentioned quite a
 different approach: simply read the raw binary document and go fishing
 for what looks like text. I would like to try that :)

I have tried that approach and it works OK, but you end up with a bunch of junk
in with the useful stuff. It can clutter up your index and make searching
slower. There are also a lot of file formats that don't store all of their text
as sequential text, so it won't work for them. PDF is one; I know that
PowerPoint is another.
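
For reference, the "go fishing" approach can be sketched in a few lines. This is an illustration only, with hypothetical names (RawTextFisher, printableRuns): it keeps every run of printable ASCII bytes of at least a minimum length, which is roughly what the Unix strings tool does, and it shares the weaknesses described above (junk in the output, nothing useful from formats that compress or scatter their text, and no handling of 16-bit text).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pulls text-looking byte runs out of arbitrary binary data.
final class RawTextFisher {
    /** Returns every run of at least minLen printable ASCII bytes found in data. */
    static List<String> printableRuns(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                // Printable ASCII: extend the current run
                current.append((char) c);
            } else {
                // Anything else ends the run; keep it only if it is long enough
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Raising minLen cuts down on junk at the cost of dropping short words that sit between binary bytes.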





Re: parser

2003-03-21 Thread Ryan Ackley
xls is done by POI, another jakarta project.

- Original Message - 
From: Daniel Hunziker [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 20, 2003 10:48 PM
Subject: parser


 Are there any parsers for the following formats:
 - doc
 - xls
 - ppt
 - pdf
  
 Thanks for help
 Daniel
 




Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Ryan Ackley
David,

The textmining.org stuff only works on Word 97 and above. It should work with
no exceptions on any Word 97 doc. If you have any problems, then the file is
from an earlier version (most likely Word 6.0) or it's not a Word document. If
this isn't the case, you need to email me so I can fix it and make it better
for the benefit of everyone. I plan on adding support for Word 6 in the
future.

Ryan Ackley

- Original Message -
From: David Spencer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs


 FYI I tried the textmining.org/POI combo on a collection of 350 Word
 docs people have developed here over the years, and it failed on 33% of
 them with exceptions being thrown about the formats being invalid.

 I tried antiword ( http://www.winfield.demon.nl/ ), a native & free
 *.exe, and it worked great (well, it seemed to process all the files fine).

 I've had similar experiences with PDF - I tried the 3 or so
 freeware/Java PDF text extractors and they were not as good as the exe,
 pdftotext, from foolabs (http://www.foolabs.com/xpdf/).

 Not satisfying to a Java developer, but these work better than anything
 else I can find.

 You get source and I use them on Windows & Linux, no prob.



 Eric Anderson wrote:

 I'm interested in using the textmining/textextraction utilities using
 Apache POI that Ryan was discussing. However, I'm having some difficulty
 determining what the insertion point would be to replace the default
 parser with the Word parser.
 
 Any assistance would be appreciated.
 
 
 
 
 
 LanRx Network Solutions, Inc.
 Providing Enterprise Level Solutions...On A Small Business Budget
 



Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread Ryan Ackley
Eric,

The problem with antiword is that it is a native application. You must write
a class that uses JNI to access the native code. If you link your Java code
with native code you have lost one of the biggest benefits of Java: platform
independence. I would suggest you use the library at http://textmining.org.
Contrary to what David Spencer says, it should work on all documents created
with Word 97 or above. I have literally indexed hundreds of thousands of
unique documents using my library.

Ryan Ackley

- Original Message -
From: Eric Anderson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs


 Ok. Thanks for the tip.

 I downloaded and compiled Antiword, and would now like to add it to my
 indexing class. However, I'm not sure how the application would be called,
 or from where.

 How would I have the class pass the document through Antiword to create
 the keyword index while leaving the DOC intact, as Mr. Litchfield did with
 PDFBox?

 Your assistance is greatly appreciated.

 Eric Anderson
 815-505-6132





Re: Word doc parser

2003-03-01 Thread Ryan Ackley
Go to http://www.textmining.org 

- Original Message - 
From: Pinky Iyer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 28, 2003 3:44 PM
Subject: Word doc parser


 
  Does anybody know of a good Word document parser?
 Thanks !
 P Iyer
 
 
 



THIS IS HOW YOU INDEX WORD DOCUMENTS

2003-01-31 Thread Ryan Ackley
I wrote the apache POI HDF (Word library) stuff. I wrote a light version
that just does text extraction. You can download it at
http://www.textmining.org.

Ryan Ackley






Re: How to index a Word document

2003-01-31 Thread Ryan Ackley
POI is correct, but it uses OLE.
If your application is running under Unix, POI is incorrect...

This isn't true. POI is written in 100% pure Java and will work on any
platform that supports Java. It uses no native libraries.

