Re: extracting non-english text from word, pdf, etc....??

Grant Ingersoll Thu, 02 Aug 2007 07:18:57 -0700

Hey Michael,

Have you given it a try? I would think they would work, but I haven't actually done it. Set up a small test that reads in a PDF in French or Spanish and give it a try. You might have to worry about encodings or something, but the structure of the files should be the same, i.e. they are valid Word, etc. documents.
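
On the encoding worry, here is a minimal sketch in plain Java (no POI or PDFBox involved; the class name EncodingCheck and the method decode are hypothetical, made up for illustration). It assumes the extracted bytes are ISO-8859-1 and shows that decoding them with the wrong charset damages the accented characters, which is the kind of thing a small French or Spanish test would surface:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Decode raw bytes with an explicitly chosen charset instead of
    // relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "été français", written with escapes so the source file's own
        // encoding cannot affect the result.
        String french = "\u00e9t\u00e9 fran\u00e7ais";

        // Assumption: the extractor handed us ISO-8859-1 bytes.
        byte[] latin1Bytes = french.getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);

        System.out.println("matching charset round-trips: " + right.equals(french));
        System.out.println("mismatched charset round-trips: " + wrong.equals(french));
    }
}
```

If the mismatched case fails to round-trip in your own test, the fix is usually to find out which charset the extraction step produces and pass it explicitly.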


-Grant

On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:

Yea, I have seen those. I guess the question is what do you all use to extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and so on? This is what I use now to extract English.
Thanks,
Michael

testn wrote:
If you can extract a token stream from those files already, you can simply use different analyzers to analyze those token streams appropriately. Check out the
Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
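
To illustrate the idea of analyzing the same extracted text differently per language, here is a rough sketch in plain Java. It is not the Lucene analyzer API: the class PerLanguageTokens, the method analyze, and the tiny stop-word lists are all hypothetical stand-ins for what the contrib analyzers do far more thoroughly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PerLanguageTokens {
    // Tiny illustrative stop-word lists keyed by language code.
    // Assumption: real analyzers ship much larger, curated lists.
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "fr", new HashSet<>(Arrays.asList("le", "la", "les", "de", "et")),
        "es", new HashSet<>(Arrays.asList("el", "la", "los", "de", "y")));

    // Lowercase the text, split it into tokens, and drop the stop words
    // for the given language. Unknown languages get no stop-word filtering.
    static List<String> analyze(String text, String lang) {
        Set<String> stops = STOP_WORDS.getOrDefault(lang, Set.of());
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a Unicode letter or digit, so
        // accented characters stay inside their tokens.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!tok.isEmpty() && !stops.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Le chat et la souris", "fr"));
        System.out.println(analyze("El gato y el rat\u00f3n", "es"));
    }
}
```

The point is only that extraction and analysis are separate steps: the text pulled out of the file is the same either way, and the language-specific handling happens afterwards.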
heybluez wrote:
I know how to do English text with POI and PDFBox and so on. Now, I want to start indexing non-English languages such as French and Spanish. Which
extraction libs are available for me?

I want to do:

Excel
Word
PowerPoint
PDF
HTML
RTF

Thanks!
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


