On Thu, 2 Aug 2012, Alexander Cougarman wrote:
Hi. Does the latest version of Tika index text in these file types?
- Office 2007/2010 file types of DOCX, XLSX, PPTX

Yes (though the few tiny bits of new functionality introduced in 2010 will be skipped over)

- MHT file (MHTML Document)

Not sure. How close is this to a regular HTML file?

This page helped on many of the file formats, but wanted to clarify: http://tika.apache.org/1.2/formats.html

Often the best way to check is to grab the tika-app jar and try a few sample files with it.
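
As a rough sketch of that check, the small Java program below shells out to the tika-app jar once per sample file and asks for plain-text output with its --text option. The class name, the jar path and the sample file names are placeholders made up for illustration, so adjust them to whatever you downloaded.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: runs tika-app's command-line text extraction over sample files.
// The jar location and the file names are assumptions, not taken from the thread.
class TikaAppCheck {

    // Builds the command: java -jar <tika-app jar> --text <sample file>
    static List<String> buildCommand(String tikaAppJar, String sampleFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(tikaAppJar);
        cmd.add("--text");   // tika-app option for plain-text output
        cmd.add(sampleFile);
        return cmd;
    }

    // Runs the command with the child's output sent to this process's
    // stdout/stderr, and returns the child's exit code.
    static int extract(String tikaAppJar, String sampleFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(tikaAppJar, sampleFile))
                .inheritIO()
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java TikaAppCheck <tika-app jar> <file> [<file> ...]");
            return;
        }
        for (int i = 1; i < args.length; i++) {
            System.out.println("=== " + args[i] + " ===");
            System.out.flush();
            int exit = extract(args[0], args[i]);
            if (exit != 0) {
                System.err.println("tika-app exited with code " + exit + " for " + args[i]);
            }
        }
    }
}
```

For example: java TikaAppCheck tika-app-1.2.jar sample.docx sample.xlsx sample.pptx sample.mht. If a file comes back empty or garbled, that suggests the format is not handled well by the version you are testing.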

Nick
