You can use text extractors for the document formats you mentioned. Lucene as such does not deal with this text extraction process. Following are the extractors we generally use: PDF -> PDFBox: Java API to read PDF documents http://www.pdfbox.org. WORD -> Antiword: http://www.winfield.demon.nl/ TXT -> You can read the content using Java IO classes and index them. MSG -> We currently using strings utility in Solaris that reads printable characters from files. XLS -> Apache POI utils has classes to read Excel files. so you can use that. PPT/PPS -> Apache POI's PowerPointExtractor RTF -> Java Swing has RTFEditorKit which we use to read RTF documents.
Krovi. -----Original Message----- From: Shajahan [mailto:[EMAIL PROTECTED] Sent: Thursday, April 13, 2006 1:19 PM To: java-user@lucene.apache.org Subject: Lucene Help Hi all, i am new to Lucene. i want to work indexing for PDF,word,txt files. can any one tell me how to dun indexing by Lucene. please give some informetion. Thanking you shaik -- View this message in context: http://www.nabble.com/Lucene-Help-t1442764.html#a3896122 Sent from the Lucene - Java Users forum at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]