In terms of PDF documents...
PDFBox should work just fine with any latin based languages; at this
time certain PDFs that have CJK characters can pose some issues. In
general english/french/spanish should be fine.
Some PDFs use custom encodings that make it impossible to extract text
and it comes out as gibberish. As a simple test if Acrobat can
extract the text then PDFBox should be able to as well.
Ben
Quoting Grant Ingersoll <[EMAIL PROTECTED]>:
Hey Michael,
Have you given it a try? I would think they would work, but haven't
actually done it. Setup a small test that reads in a PDF in French or
Spanish and give it a try. You might have to worry about encodings or
something, but the structure of the files should be the same, i.e. they
are valid Word, etc. documents.
-Grant
On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:
Yea, I have seen those. I guess the question is what do you all
use to extract text from Word, Excel, PPT and PDF? Can I use POI,
PDFBox and so on? This is what I use now to extract english.
Thanks,
Michael
testn wrote:
If you can extract token stream from those files already, you can
simply use
different analyzers to analyze those token stream appropriately. Check out
Lucen-contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
heybluez wrote:
I know how to do english text with POI and PDFBox and so on. Now, I want
to start indexing non-english language such as french and spanish. Which
extraction libs are available for me?
I want to do:
Excel
Word
PowerPoint
PDF
HTML
RTF
Thanks!
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]