Re: extracting non-english text from word, pdf, etc....??

testn Thu, 02 Aug 2007 06:23:10 -0700

Check out this Lucene FAQ entry:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
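
As far as I know, POI and PDFBox hand back ordinary Java Strings, which are Unicode, so the extraction code you already use for English should in most cases return French or Spanish text as well (assuming the source document itself is encoded sanely). The language-specific part is what happens after extraction. Below is a rough, dependency-free sketch of that idea. It is not Lucene code: the class name, the tiny stop-word lists and the tokenize method are all made up for illustration, and a real setup would more likely use one of the contrib analyzers instead.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a per-language analyzer; not part of Lucene.
final class LanguageAwareTokenizer {

    // Tiny illustrative stop-word lists (assumption: real lists are far larger).
    private static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "fr", Set.of("le", "la", "les", "de", "des", "et", "un", "une"),
        "es", Set.of("el", "la", "los", "las", "de", "y", "un", "una"),
        "en", Set.of("the", "a", "an", "of", "and"));

    // Splits already-extracted text into lowercase tokens and drops
    // the stop words registered for the given language code.
    static List<String> tokenize(String text, String lang) {
        Set<String> stop = STOP_WORDS.getOrDefault(lang, Set.of());
        Locale locale = Locale.forLanguageTag(lang);
        List<String> tokens = new ArrayList<>();

        // NFC composes "e" + combining accent into a single accented letter,
        // so both spellings of the same word produce the same token.
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);

        // Split on anything that is not a Unicode letter or digit,
        // which keeps accented characters inside their words.
        for (String raw : normalized.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(locale);
            if (!stop.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // \u00e9 is e-acute, \u00f1 is n-tilde.
        System.out.println(tokenize("Les donn\u00e9es et le caf\u00e9", "fr"));
        System.out.println(tokenize("El ni\u00f1o y la se\u00f1ora", "es"));
    }
}
```

The only thing that changes between languages here is the lang argument, which is the same shape as swapping one analyzer for another at index time while leaving the extraction step alone.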




heybluez wrote:
> 
> Yeah, I have seen those.  I guess the question is: what do you all use to
> extract text from Word, Excel, PPT and PDF?  Can I use POI, PDFBox and
> so on?  That is what I use now to extract English.
> 
> Thanks,
> Michael
> 
> testn wrote:
>> If you can already extract a token stream from those files, you can simply
>> use different analyzers to analyze those token streams appropriately. Check
>> out the Lucene contrib analyzers at
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>>
>>
>>
>> heybluez wrote:
>>   
>>> I know how to do English text with POI and PDFBox and so on.  Now I want
>>> to start indexing non-English languages such as French and Spanish.
>>> Which extraction libs are available for me?
>>>
>>> I want to do:
>>>
>>> Excel
>>> Word
>>> PowerPoint
>>> PDF
>>> HTML
>>> RTF
>>>
>>> Thanks!
>>> Michael
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]

-- 
View this message in context: 
http://www.nabble.com/extracting-non-english-text-from-word%2C-pdf%2C-etc....---tf4198171.html#a11964422
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

