Check out this Lucene FAQ entry: http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
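
On the analysis half of the question (the "different analyzer per language" advice in the quoted thread below), here is a minimal sketch of the idea using only the JDK. It assumes your extractor already hands you the document text as a Java String. The class name, the tiny stop-word lists and the language codes are all made up for illustration; this is a stand-in, not Lucene's Analyzer API. In Lucene itself the same role would be played by the language-specific analyzers in the contrib directory linked in the quoted message.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for per-language analysis; NOT Lucene's Analyzer API.
final class LanguageAnalysisSketch {

    // Tiny illustrative stop-word lists (assumed); real analyzers ship larger ones.
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "a", "of", "and")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "les", "de", "et")));
        STOP_WORDS.put("es", new HashSet<>(Arrays.asList("el", "la", "los", "de", "y")));
    }

    // Splits already-extracted text into lower-cased word tokens and drops
    // the stop words configured for the given language code.
    static List<String> analyze(String text, String languageCode) {
        Set<String> stopWords = STOP_WORDS.get(languageCode);
        if (stopWords == null) {
            throw new IllegalArgumentException("No analysis configured for: " + languageCode);
        }
        Locale locale = Locale.forLanguageTag(languageCode);
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end).toLowerCase(locale);
            // Segments between boundaries include whitespace and punctuation; keep words only.
            boolean isWord = Character.isLetterOrDigit(candidate.codePointAt(0));
            if (isWord && !stopWords.contains(candidate)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }
}
```

For example, analyze("Le café et la bibliothèque", "fr") should give [café, bibliothèque]: the accented words come through untouched and only the French stop words are dropped. The point is that the extraction step stays the same for every language and only the analysis step is switched.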

heybluez wrote:
>
> Yea, I have seen those. I guess the question is what do you all use to
> extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and
> so on? This is what I use now to extract English.
>
> Thanks,
> Michael
>
> testn wrote:
>> If you can extract a token stream from those files already, you can
>> simply use different analyzers to analyze those token streams
>> appropriately. Check out the Lucene-contrib analyzers at
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>>
>> heybluez wrote:
>>
>>> I know how to do English text with POI and PDFBox and so on. Now, I
>>> want to start indexing non-English languages such as French and
>>> Spanish. Which extraction libs are available for me?
>>>
>>> I want to do:
>>>
>>> Excel
>>> Word
>>> PowerPoint
>>> PDF
>>> HTML
>>> RTF
>>>
>>> Thanks!
>>> Michael

--
View this message in context: http://www.nabble.com/extracting-non-english-text-from-word%2C-pdf%2C-etc....---tf4198171.html#a11964422
Sent from the Lucene - Java Users mailing list archive at Nabble.com.