You can use text extractors for the document formats you mentioned.
Lucene itself does not handle text extraction; it only indexes the text
you pass to it.
These are the extractors we generally use:
PDF             -> PDFBox, a Java API for reading PDF documents:
                   http://www.pdfbox.org
WORD            -> Antiword: http://www.winfield.demon.nl/
TXT             -> Read the content with the Java IO classes and index
                   it.
MSG             -> We currently use the strings utility in Solaris,
                   which reads printable characters from files.
XLS             -> Apache POI has classes for reading Excel files, so
                   you can use that.
PPT/PPS         -> Apache POI's PowerPointExtractor
RTF             -> Java Swing has RTFEditorKit, which we use to read RTF
                   documents.
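
To illustrate two of the entries above, here is a minimal sketch using only
JDK classes. It is not our production code: the class name TextExtractors
and both method names are made up for this example, and printableStrings is
only a rough Java stand-in for the Solaris strings utility (printable ASCII
runs only). The RTF part uses RTFEditorKit as described in the list.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class; the names are illustrative only.
final class TextExtractors {

    // RTF: let Swing's RTFEditorKit parse the markup into a document,
    // then read the plain text back out of that document.
    static String readRtf(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // MSG and other binary files: collect runs of printable ASCII bytes
    // that are at least minLength long, roughly what strings does.
    static List<String> printableStrings(InputStream in, int minLength) throws IOException {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the stream.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample data; real code would open a (buffered) file stream.
        byte[] rtf = "{\\rtf1\\ansi Hello world}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readRtf(new ByteArrayInputStream(rtf)));

        byte[] binary = {0, 1, 'S', 'u', 'b', 'j', 'e', 'c', 't', 0, 'a', 'b', 2};
        System.out.println(printableStrings(new ByteArrayInputStream(binary), 4));
    }
}
```

Whatever string comes back from an extractor like these is what you would
then hand to Lucene as the content of a field; that indexing step is not
shown here.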

Krovi.

-----Original Message-----
From: Shajahan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 13, 2006 1:19 PM
To: java-user@lucene.apache.org
Subject: Lucene Help



Hi all,

I am new to Lucene. I want to index PDF, Word, and TXT files. Can anyone
tell me how to do indexing with Lucene? Please give some information.

Thanking you
shaik
--
View this message in context:
http://www.nabble.com/Lucene-Help-t1442764.html#a3896122
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

