If you can extract token stream from those files already, you can simply use different analyzers to analyze those token stream appropriately. Check out Lucen-contrib analyzers at http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
heybluez wrote: > > I know how to do english text with POI and PDFBox and so on. Now, I want > to start indexing non-english language such as french and spanish. Which > extraction libs are available for me? > > I want to do: > > Excel > Word > PowerPoint > PDF > HTML > RTF > > Thanks! > Michael > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/extracting-non-english-text-from-word%2C-pdf%2C-etc....---tf4198171.html#a11962580 Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]